Training CatBoost on the Titanic dataset
source:https://github.com/catboost/tutorials/blob/master/python_tutorial.ipynb
Importing the data
CatBoost ships with a few built-in datasets.
from catboost.datasets import titanic
import numpy as np
train_df,test_df = titanic()
print(train_df.head())
Feature engineering
- Check for null values
null_value_stats = train_df.isnull().sum(axis = 0)
print(null_value_stats[null_value_stats!=0])
Result:
Age 177
Cabin 687
Embarked 2
dtype: int64
- Fill in the missing values
Approach:
Fill them with a number that lies outside the feature's normal distribution of values.
Purpose:
This makes it easier for the model to tell the imputed values apart from the real ones.
train_df.fillna(-999,inplace=True)
test_df.fillna(-999,inplace=True)
P.S. a note for me, since I can never remember which axis is which: axis=0 aggregates down the rows (one result per column), axis=1 across the columns (one result per row).
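Since the axis convention is easy to forget, here is a minimal sketch (with made-up data, not the Titanic frame) of what axis=0 versus axis=1 means for isnull().sum():

```python
import pandas as pd

# Tiny frame with a single missing value (hypothetical data for illustration)
df = pd.DataFrame({"a": [1.0, None], "b": [3, 4]})

# axis=0 aggregates down the rows: one null count per column
per_column = df.isnull().sum(axis=0)
print(per_column["a"], per_column["b"])  # 1 0

# axis=1 aggregates across the columns: one null count per row
per_row = df.isnull().sum(axis=1)
print(per_row[0], per_row[1])  # 0 1
```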
- Split features and labels
X = train_df.drop('Survived',axis=1)
y = train_df.Survived
Features come in various types. First check which types are present, then hand the string features (such as the lithology feature in my own project) over to CatBoost to handle.
Inspect the column types:
print(X.dtypes)
# Output:
PassengerId int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
As the output shows, Name and Sex (the object columns) can be handed to CatBoost as categorical features.
Split the dataset
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.75,random_state=42)
One extra step: pick out the indices of the features to treat as categorical, ignoring the float columns (otherwise CatBoost errors out, since float features cannot be passed as categorical).
cate_features_index = np.where(X.dtypes != float)[0]
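As a sanity check on what that np.where line picks out, a small sketch with a made-up frame (note that int columns are selected too, not just object columns):

```python
import numpy as np
import pandas as pd

# Stand-in for X with one column of each kind (hypothetical data)
X_demo = pd.DataFrame({
    "Pclass": [1, 3],        # int64   -> selected
    "Name": ["A", "B"],      # object  -> selected
    "Age": [22.0, 38.0],     # float64 -> skipped
})

idx = np.where(X_demo.dtypes != float)[0]
print(idx.tolist())  # [0, 1]
```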
CatBoost basics
- Imports
from catboost import CatBoostClassifier, Pool, cv
from sklearn.metrics import accuracy_score
- Model training
Now create the model, using default parameters. The tutorial's author considers the defaults a good starting point, so only the metric-related parameters are set here.
Build the model
# incorrect example
model = CatBoostClassifier(
    custom_loss=['Accuracy'],  # custom_loss only adds an extra metric to track; it does not set the training loss
    random_seed=42,
    logging_level='Silent'
)
# correct example
model = CatBoostClassifier(eval_metric='Accuracy', use_best_model=True, random_seed=42)
CatBoost has several parameters to control verbosity: verbose, silent, and logging_level.
By default logging is verbose, so you see the loss value on every iteration. If you want less logging, use one of these parameters. It is not allowed to set two of them simultaneously.
silent has two possible values: True and False.
verbose can also be True or False, but it can additionally be an integer N, in which case logging is printed out on every N-th iteration.
logging_level can be 'Silent', 'Verbose', 'Info', or 'Debug':
'Silent' means no output to stdout (except for important warnings) and is the same as silent=True or verbose=False.
'Verbose' is the default logging mode. It is the same as verbose=True or silent=False.
'Info' prints out the trees that are selected on every iteration.
'Debug' prints a lot of debug info.
These parameters can be set in two places:
1) model creation;
2) fitting of the created model.
Train your model
model.fit(X_train,y_train,cat_features=cate_features_index,eval_set=(X_test,y_test))
Output:
bestTest = 0.8295964126
bestIteration = 53
Shrink model to first 54 iterations.
Validate your model with cross-validation
cv_params = model.get_params()
cv_params.update({
'loss_function': 'Logloss'
})
cv_data = cv(Pool(X,y,cat_features=cate_features_index),cv_params,plot=True)
The usage of Pool can be examined in more detail later.