
Training CatBoost on the Titanic Dataset

程序员文章站 2022-05-02 15:13:19

source:https://github.com/catboost/tutorials/blob/master/python_tutorial.ipynb

Importing the data

CatBoost ships with a few built-in datasets:

from catboost.datasets import titanic
import numpy as np

train_df, test_df = titanic()
print(train_df.head())

Feature engineering

  • Check for missing values
null_value_stats = train_df.isnull().sum(axis=0)
print(null_value_stats[null_value_stats != 0])

Result:

Age         177
Cabin       687
Embarked      2
dtype: int64
  • Fill in the missing values

    Approach:
    fill with a number that lies outside the feature's real distribution.
    Purpose:
    to make it easier for the model to tell the filled-in values apart from genuine ones.

train_df.fillna(-999, inplace=True)
test_df.fillna(-999, inplace=True)
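To illustrate the sentinel-fill idea, here is a minimal sketch on toy data (not the Titanic frame): because -999 lies far outside the real distribution of the feature, a tree-based model can isolate the "was missing" cases with a single split.

```python
import numpy as np
import pandas as pd

# Toy frame with missing ages.
df = pd.DataFrame({'Age': [22.0, np.nan, 35.0, np.nan]})

# Replace missing values with a sentinel far outside the real range.
df.fillna(-999, inplace=True)
print(df['Age'].tolist())  # [22.0, -999.0, 35.0, -999.0]
```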

PS, for those of us who can never remember the axis convention: in pandas, `axis=0` aggregates down each column and `axis=1` aggregates across each row.

  • Split features and labels
X = train_df.drop('Survived', axis=1)
y = train_df.Survived

Features come in various types. First check what dtypes are present, then hand the string (object) features, such as the lithology feature in my own project, over to CatBoost as categorical features.
Inspect the dtypes:

print(X.dtypes)
# Output:
PassengerId      int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64

As you can see, Name and Sex (the object columns) can be handed to CatBoost as categorical features.

Split the dataset

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=42)

One more step: collect the indices of the features to treat as categorical, skipping the float columns (otherwise CatBoost raises an error).

cate_features_index = np.where(X.dtypes != float)[0]
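On a toy frame, this index trick looks like the following: comparing each column's dtype against the Python `float` type keeps every non-float column, since pandas stores floating-point columns as float64.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'PassengerId': [1, 2],   # int64
    'Name': ['A', 'B'],      # object
    'Age': [22.0, 35.0],     # float64
})

# Positions of the non-float columns (candidate categorical features).
cate_idx = np.where(df.dtypes != float)[0]
print(cate_idx)  # [0 1]
```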

CatBoost basics

  • Imports
from catboost import CatBoostClassifier
from catboost import Pool
from catboost import cv
from sklearn.metrics import accuracy_score
  • Training the model
    Now create the model using the default parameters. The author considers the defaults already a good baseline, so only the evaluation metric is set here.

Build the model

# Wrong example
model = CatBoostClassifier(
    custom_loss=['Accuracy'],  # custom_loss only adds extra reported metrics; it does not change the training loss
    random_seed=42,
    logging_level='Silent'
)
# Correct example
# Note: use_best_model=True requires an eval_set to be passed to fit()
model = CatBoostClassifier(eval_metric='Accuracy', use_best_model=True, random_seed=42)

A note on logging_level:

CatBoost has several parameters to control verbosity: verbose, silent and logging_level.
By default logging is verbose, so you see the loss value on every iteration. If you want to see less logging, use one of these parameters; setting two of them simultaneously is not allowed.
silent has two possible values: True and False.
verbose can also be True or False, but it can additionally be an integer: if it is an integer N, logging is printed on every N-th iteration.
logging_level can be 'Silent', 'Verbose', 'Info' or 'Debug':
'Silent' means no output to stdout (except for important warnings) and is the same as silent=True or verbose=False.
'Verbose' is the default logging mode; it is the same as verbose=True or silent=False.
'Info' prints out the trees that are selected on every iteration.
'Debug' prints a lot of debug info.
This parameter can be set in two places:
1) model creation
2) fitting of the created model.

Train the model

model.fit(X_train, y_train, cat_features=cate_features_index, eval_set=(X_test, y_test))

Output:

bestTest = 0.8295964126
bestIteration = 53
Shrink model to first 54 iterations.

Validate the model with cross-validation

cv_params = model.get_params()
cv_params.update({
    'loss_function': 'Logloss'
})
cv_data = cv(Pool(X, y, cat_features=cate_features_index), cv_params, plot=True)
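cv() returns a DataFrame of per-iteration statistics whose columns follow a 'test-&lt;metric&gt;-mean' naming pattern (an assumption worth checking against your CatBoost version). Since reproducing the real cv_data requires a full training run, here is a sketch on a mock frame showing how the best score and iteration could be pulled out:

```python
import numpy as np
import pandas as pd

# Mock of the per-iteration frame returned by cv(); the column name
# 'test-Logloss-mean' follows CatBoost's naming pattern (assumption).
mock_cv_data = pd.DataFrame({'test-Logloss-mean': [0.62, 0.48, 0.51]})

# Logloss is minimized, so the best iteration has the smallest mean value.
best_value = np.min(mock_cv_data['test-Logloss-mean'])
best_iter = np.argmin(mock_cv_data['test-Logloss-mean'])
print('Best mean Logloss: {:.2f} on iteration {}'.format(best_value, best_iter))
```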

The usage of Pool is worth a closer look later.

Reference article:
讯飞广告反欺诈赛的王牌模型catboost介绍