Training CatBoost on the Titanic dataset
source:https://github.com/catboost/tutorials/blob/master/python_tutorial.ipynb
Importing the data
CatBoost ships with a few built-in datasets.
from catboost.datasets import titanic
import numpy as np
train_df,test_df = titanic()
print(train_df.head())
Feature engineering
- Check for null values
null_value_stats = train_df.isnull().sum(axis = 0)
print(null_value_stats[null_value_stats!=0])
Result:
Age 177
Cabin 687
Embarked 2
dtype: int64
- Fill in the missing values
Approach:
Fill them with a number that lies outside the feature's normal distribution of values.
Purpose:
This makes it easier for the model to tell the imputed values apart from the real ones.
train_df.fillna(-999,inplace=True)
test_df.fillna(-999,inplace=True)
P.S. a note for me, since I can never remember which axis is which: axis=0 aggregates down the rows (one result per column), axis=1 across the columns (one result per row).
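Since the axis convention is easy to forget, here is a minimal sketch (with made-up data, not the Titanic frame) of what axis=0 versus axis=1 means for isnull().sum():

```python
import pandas as pd

# Tiny frame with a single missing value (hypothetical data for illustration)
df = pd.DataFrame({"a": [1.0, None], "b": [3, 4]})

# axis=0 aggregates down the rows: one null count per column
per_column = df.isnull().sum(axis=0)
print(per_column["a"], per_column["b"])  # 1 0

# axis=1 aggregates across the columns: one null count per row
per_row = df.isnull().sum(axis=1)
print(per_row[0], per_row[1])  # 0 1
```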
- Split features and labels
X = train_df.drop('Survived',axis=1)
y = train_df.Survived
Features come in various types. First check which types are present, then hand the string features (such as the lithology feature in my own project) over to CatBoost to handle.
Inspect the column types:
print(X.dtypes)
# Output:
PassengerId int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
As the output shows, Name and Sex (the object columns) can be handed to CatBoost as categorical features.
Split the dataset
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.75,random_state=42)
One extra step: pick out the indices of the features to treat as categorical, ignoring the float columns (otherwise CatBoost errors out, since float features cannot be passed as categorical).
cate_features_index = np.where(X.dtypes != float)[0]
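As a sanity check on what that np.where line picks out, a small sketch with a made-up frame (note that int columns are selected too, not just object columns):

```python
import numpy as np
import pandas as pd

# Stand-in for X with one column of each kind (hypothetical data)
X_demo = pd.DataFrame({
    "Pclass": [1, 3],        # int64   -> selected
    "Name": ["A", "B"],      # object  -> selected
    "Age": [22.0, 38.0],     # float64 -> skipped
})

idx = np.where(X_demo.dtypes != float)[0]
print(idx.tolist())  # [0, 1]
```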
CatBoost basics
- Imports
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score
- Model training
Now create the model, using default parameters. The tutorial's author considers the defaults a good starting point, so only the metric-related parameters are set here.
Build the model
# incorrect example
model = CatBoostClassifier(
    custom_loss=['Accuracy'],  # custom_loss only adds an extra metric to track; it does not set the training loss
    random_seed=42,
    logging_level='Silent'
)
# correct example
model = CatBoostClassifier(eval_metric='Accuracy', use_best_model=True, random_seed=42)
CatBoost has several parameters to control verbosity: verbose, silent, and logging_level.
By default logging is verbose, so you see the loss value on every iteration. If you want less logging, use one of these parameters. It is not allowed to set two of them simultaneously.
silent has two possible values: True and False.
verbose can also be True or False, but it can additionally be an integer N, in which case logging is printed out on every N-th iteration.
logging_level can be 'Silent', 'Verbose', 'Info', or 'Debug':
'Silent' means no output to stdout (except for important warnings) and is the same as silent=True or verbose=False.
'Verbose' is the default logging mode. It is the same as verbose=True or silent=False.
'Info' prints out the trees that are selected on every iteration.
'Debug' prints a lot of debug info.
These parameters can be set in two places:
1) model creation;
2) fitting of the created model.
Train your model
model.fit(X_train,y_train,cat_features=cate_features_index,eval_set=(X_test,y_test))
Output:
bestTest = 0.8295964126
bestIteration = 53
Shrink model to first 54 iterations.
Validate your model with cross-validation
cv_params = model.get_params()
cv_params.update({
'loss_function': 'Logloss'
})
cv_data = cv(Pool(X,y,cat_features=cate_features_index),cv_params,plot=True)
The usage of Pool can be examined in more detail later.