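Otto Product Classification: Decision Tree Model
Contents
Approach: original features + tfidf features
Training
We use the data from the Otto Group Product Classification Challenge, held on Kaggle in 2015, as an example, and train CART twice: once with default parameters and once with GridSearchCV for hyperparameter tuning.
1. Setup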
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
2. Reading the data
# Read the data
dpath='./data/'
# Use the original features + tf_idf features
train1=pd.read_csv(dpath+"Otto_FE_train_org.csv")
train2=pd.read_csv(dpath+"Otto_FE_train_tfidf.csv")
# Drop the duplicate id and target columns
train2=train2.drop(["id","target"],axis=1)
train=pd.concat([train1,train2],axis=1,ignore_index=False)
train.head()
del train1
del train2
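The column-wise merge above can be checked on a toy example. This is a minimal sketch with made-up frames and column names (feat_1, feat_1_tfidf), not the real Otto files; it assumes, as the code above does, that both files list the same rows in the same order.

```python
import pandas as pd

# Toy stand-ins for the two feature files (names and values are illustrative).
org = pd.DataFrame({"id": [1, 2, 3],
                    "feat_1": [0, 2, 1],
                    "target": ["Class_1", "Class_2", "Class_1"]})
tfidf = pd.DataFrame({"id": [1, 2, 3],
                      "feat_1_tfidf": [0.0, 0.7, 0.3],
                      "target": ["Class_1", "Class_2", "Class_1"]})

# axis=1 concatenation aligns on the row index, so check the ids agree first.
assert (org["id"] == tfidf["id"]).all()

# Drop the columns that would otherwise appear twice, then join side by side.
tfidf = tfidf.drop(["id", "target"], axis=1)
combined = pd.concat([org, tfidf], axis=1)
print(combined)
```

If the duplicate columns were left in place, a later train['target'] would return a two-column DataFrame instead of a Series.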
3. Preparing the data
y_train=train['target']
X_train=train.drop(["id","target"],axis=1)
# Keep the feature names for later use
feat_names=X_train.columns
# Convert to a sparse matrix
from scipy.sparse import csr_matrix
X_train=csr_matrix(X_train)
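To see what the sparse conversion buys, here is a small sketch on a made-up count matrix. The array is illustrative only; the real Otto features are counts that are mostly zero, which is why a compressed format is worth using.

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny count matrix where most entries are zero (illustrative values).
dense = np.array([[0, 0, 3],
                  [1, 0, 0],
                  [0, 0, 0]])

sparse = csr_matrix(dense)

# Only the non-zero entries are stored explicitly.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])
print("stored entries:", sparse.nnz)
print("density: %.3f" % density)

# The conversion is lossless: converting back recovers the original array.
roundtrip_ok = bool((sparse.toarray() == dense).all())
```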
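4. Decision tree with default parameters
from sklearn.tree import DecisionTreeClassifier
DT1=DecisionTreeClassifier()
# Cross-validation is used to evaluate model performance and to tune parameters (model selection)
# For classification tasks, cross-validation uses StratifiedKFold by default
# The dataset is fairly large, so use 3-fold cross-validation
from sklearn.model_selection import cross_val_score
loss=cross_val_score(DT1,X_train,y_train,cv=3,scoring="neg_log_loss")
print('logloss of each fold is:',-loss)
print('cv logloss is:',-loss.mean())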
logloss of each fold is: [10.1700857 9.86630808 9.74333791]
cv logloss is: 9.926577231188997
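A cv logloss near 10 is far worse than uniform guessing over nine classes (ln 9 ≈ 2.20). One likely explanation is that a fully grown tree ends in pure leaves, so its predicted probabilities are 0 or 1 and every misclassified sample receives a very large penalty. The sketch below illustrates the effect on synthetic data; the dataset and the max_depth=3, min_samples_leaf=50 settings are assumptions chosen for illustration, not values from the Otto experiment.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class data with label noise (not the Otto data).
X_toy, y_toy = make_classification(
    n_samples=1500, n_features=20, n_informative=6,
    n_classes=3, flip_y=0.2, random_state=0)

def cv_logloss(model):
    # scoring="neg_log_loss" returns the negated loss, so flip the sign back.
    scores = cross_val_score(model, X_toy, y_toy, cv=3, scoring="neg_log_loss")
    return -scores.mean()

# Fully grown tree: pure leaves, hard 0/1 probabilities.
loss_default = cv_logloss(DecisionTreeClassifier(random_state=0))
# Size-limited tree: larger leaves give smoother class-frequency estimates.
loss_limited = cv_logloss(
    DecisionTreeClassifier(max_depth=3, min_samples_leaf=50, random_state=0))

print("default tree logloss: %.3f" % loss_default)
print("limited tree logloss: %.3f" % loss_limited)
```

This is why limiting the tree size matters so much when the metric is logloss rather than accuracy.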
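5. Tuning decision tree hyperparameters
The decision tree hyperparameters are:
- max_depth (depth of the tree) or max_leaf_nodes (number of leaf nodes)
- min_samples_leaf (minimum number of samples in a leaf node), min_samples_split (minimum number of samples an internal node needs before it can be split)
- min_weight_fraction_leaf (minimum fraction of the total sample weight required in a leaf node)
- min_impurity_split (minimum impurity threshold, which can also be tuned; recent scikit-learn releases have removed it in favor of min_impurity_decrease)
- max_features (maximum number of features considered at each split)
In the sklearn framework, parameter tuning follows the same steps for every learner:
- Set the parameter search range
- Create a GridSearchCV instance (with those parameters)
- Call the fit method of GridSearchCV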
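The three steps above can be sketched on a small synthetic dataset. The data and the grid values here are assumptions picked so the example runs in a second or two; they are not the ranges used for Otto.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data.
X_toy, y_toy = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

# Step 1: set the parameter search range.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [5, 15]}

# Step 2: create the GridSearchCV instance.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=3, scoring="neg_log_loss")

# Step 3: call fit, which cross-validates every combination in the grid.
search.fit(X_toy, y_toy)

print("Best logloss: %f using %s" % (-search.best_score_, search.best_params_))
```

Swapping DecisionTreeClassifier for another estimator only changes the estimator and the grid keys; the three steps stay the same.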
from sklearn.model_selection import GridSearchCV
# Parameters to tune
max_depth=range(10,100,10)
min_samples_leaf=range(1,10,2)
tuned_parameters=dict(max_depth=max_depth,min_samples_leaf=min_samples_leaf)
DT2=DecisionTreeClassifier()
grid=GridSearchCV(DT2,tuned_parameters,cv=3,scoring="neg_log_loss")
grid.fit(X_train,y_train)
print('Best score: %f using %s'%(-grid.best_score_,grid.best_params_))
Output:
test_means=-grid.cv_results_['mean_test_score']
test_scores=np.array(test_means).reshape(len(max_depth),len(min_samples_leaf))
# One curve per max_depth value
for i,value in enumerate(max_depth):
    plt.plot(min_samples_leaf,test_scores[i],label='test_max_depth:'+str(value))
plt.legend()
plt.xlabel('min_samples_leaf')
plt.ylabel('logloss')
plt.show()
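The reshape in the plotting code relies on the order in which GridSearchCV enumerates the candidates: parameter names are sorted alphabetically and the last one varies fastest, so each row of test_scores corresponds to one max_depth value. A minimal sketch that checks this with ParameterGrid, using the same ranges as the first search:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

max_depth = range(10, 100, 10)
min_samples_leaf = range(1, 10, 2)

# ParameterGrid yields candidates in the same order GridSearchCV evaluates them.
grid_points = list(ParameterGrid(
    dict(max_depth=max_depth, min_samples_leaf=min_samples_leaf)))

shape = (len(max_depth), len(min_samples_leaf))
depths = np.array([p["max_depth"] for p in grid_points]).reshape(shape)
leaves = np.array([p["min_samples_leaf"] for p in grid_points]).reshape(shape)

# Row 0 holds every candidate with max_depth=10, one per min_samples_leaf value.
print(depths[0])
print(leaves[0])
```

With parameter names whose alphabetical order differs from the order used in the reshape, the rows and columns would be swapped, so a check like this is cheap insurance.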
Result:
max_depth=10 looks best. Next, take a closer look at how performance changes with min_samples_leaf when max_depth is 10.
plt.plot(min_samples_leaf,test_scores[0],label='test_max_depth'+str(10))
plt.show()
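Output:
Performance improves as min_samples_leaf increases (possibly because the number of samples is fairly large). The next step is to decrease max_depth further while increasing min_samples_leaf.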
max_depth=range(3,10,2)
min_samples_leaf=range(11,20,2)
tuned_parameters=dict(max_depth=max_depth,min_samples_leaf=min_samples_leaf)
DT2=DecisionTreeClassifier()
grid=GridSearchCV(DT2,tuned_parameters,cv=3,scoring='neg_log_loss')
grid.fit(X_train,y_train)
print('Best score:%f using %s'%(-grid.best_score_,grid.best_params_))
Score:
Best score:1.206972 using {'max_depth': 10, 'min_samples_leaf': 9}
test_means=-grid.cv_results_['mean_test_score']
test_scores=np.array(test_means).reshape(len(max_depth),len(min_samples_leaf))
# One curve per max_depth value
for i,value in enumerate(max_depth):
    plt.plot(min_samples_leaf,test_scores[i],label='test_max_depth:'+str(value))
plt.legend()
plt.xlabel('min_samples_leaf')
plt.ylabel('logloss')
plt.show()
Output:
plt.plot(min_samples_leaf,test_scores[3],label='test_max_depth:'+str(9))
plt.show()
Output:
Increase max_depth and min_samples_leaf
from sklearn.model_selection import GridSearchCV
# Parameters to tune
max_depth=range(10,20,2)
min_samples_leaf=range(20,30,2)
tuned_parameters=dict(max_depth=max_depth,min_samples_leaf=min_samples_leaf)
DT2=DecisionTreeClassifier()
grid=GridSearchCV(DT2,tuned_parameters,cv=3,scoring='neg_log_loss')
grid.fit(X_train,y_train)
print('Best score:%f using %s'%(-grid.best_score_,grid.best_params_))
Output:
From the results, max_depth can be fixed at 10, but min_samples_leaf still needs further tuning.
test_means=-grid.cv_results_['mean_test_score']
test_scores=np.array(test_means).reshape(len(max_depth),len(min_samples_leaf))
# One curve per max_depth value
for i,value in enumerate(max_depth):
    plt.plot(min_samples_leaf,test_scores[i],label='test_max_depth:'+str(value))
plt.legend()
plt.xlabel('min_samples_leaf')
plt.ylabel('logloss')
plt.show()
Output:
Fix max_depth at 10 and increase min_samples_leaf
from sklearn.model_selection import GridSearchCV
# Parameters to tune
# max_depth=10
min_samples_leaf=range(30,40,2)
tuned_parameters=dict(min_samples_leaf=min_samples_leaf)
DT2=DecisionTreeClassifier(max_depth=10)
grid=GridSearchCV(DT2,tuned_parameters,cv=3,scoring='neg_log_loss')
grid.fit(X_train,y_train)
test_means=-grid.cv_results_['mean_test_score']
plt.plot(min_samples_leaf,test_means,label='test_max_depth:'+str(10))
plt.legend()
plt.xlabel('min_samples_leaf')
plt.ylabel('logloss')
plt.show()
Plot:
# Print the best score
print('Best score: %f using %s'%(-grid.best_score_,grid.best_params_))
Score:
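Other CART model-complexity parameters:
- max_leaf_nodes is similar to max_depth; tuning either one is enough.
- min_samples_split and min_samples_leaf are usually related as well; tuning either one is enough.
- min_weight_fraction_leaf: since no class or sample weights are set in this task, min_weight_fraction_leaf plays much the same role as min_samples_leaf and does not need tuning.
- max_features: in principle, larger is better; all features are used here, so no further tuning is needed.
If computing resources allow, these parameters can be tuned as well, but the expected gain from further tuning is small.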
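Save the model for later testing
import pickle
# cPickle exists only in Python 2; in Python 3 the pickle module replaces it
with open("Otto_CART_org_tfidf.pkl",'wb') as f:
    pickle.dump(grid.best_estimator_,f)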
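A quick way to confirm that a saved model survives the round trip is to reload it and compare predictions. This is a minimal sketch with a toy tree and a temporary file; the file name toy_tree.pkl and the synthetic data are illustrative assumptions.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy model standing in for grid.best_estimator_.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# Write to a temporary location; the with-blocks close the files reliably.
path = os.path.join(tempfile.mkdtemp(), "toy_tree.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored tree should reproduce the original probabilities exactly.
same = bool((model.predict_proba(X_toy) == restored.predict_proba(X_toy)).all())
print("identical predictions:", same)
```

Pickled estimators are generally only safe to load with the same scikit-learn version that wrote them.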
Feature importances:
DT3=grid.best_estimator_
df=pd.DataFrame({"columns":list(feat_names),"importance":
list(DT3.feature_importances_.T)})
df=df.sort_values(by=['importance'],ascending=False)
print(df)
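For reference, the importances of a fitted tree are normalized to sum to 1, so each value can be read as a share of the total impurity reduction. A minimal sketch on synthetic data, where the feature names feat_1 to feat_6 are made up for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a few informative features.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0)
names = ["feat_%d" % (i + 1) for i in range(X_toy.shape[1])]

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_toy, y_toy)

# A Series keeps each importance attached to its feature name while sorting.
importances = pd.Series(tree.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)
print(importances)
print("sum of importances:", importances.sum())
```

Features never used in a split get an importance of exactly 0, which is common for a shallow tree over many features.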
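Testing: decision tree
# Read the data
dpath='./data/'
# Use the original features + tf_idf features
# (the test file names are assumed by analogy with the training files)
test1=pd.read_csv(dpath+"Otto_FE_test_org.csv")
test2=pd.read_csv(dpath+"Otto_FE_test_tfidf.csv")
# Drop the duplicate id (the test files are assumed to have no target column)
test2=test2.drop(["id"],axis=1)
test=pd.concat([test1,test2],axis=1,ignore_index=False)
test.head()
del test1
del test2
# Keep the ids for the submission file
test_id=test['id']
X_test=test.drop(["id"],axis=1)
# Keep the feature names for later use
feat_names=X_test.columns
# Convert to a sparse matrix
from scipy.sparse import csr_matrix
X_test=csr_matrix(X_test)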
# Load the trained model
import pickle
with open("Otto_CART_org_tfidf.pkl",'rb') as f:
    CART_best=pickle.load(f)
# Predict the probability of each class
y_test_pred=CART_best.predict_proba(X_test)
print(y_test_pred.shape)
# Build the submission file
out_df=pd.DataFrame(y_test_pred)
columns=np.empty(9,dtype=object)
for i in range(9):
    columns[i]='Class_'+str(i+1)
out_df.columns=columns
out_df=pd.concat([test_id,out_df],axis=1)
out_df.to_csv("CART_org_tfidf.csv",index=False)
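The submission code names the probability columns by position, which is only right if the model's class order matches Class_1 through Class_9. A more defensive variant reads the column names from the fitted model's classes_ attribute. The sketch below assumes the targets are the strings Class_1 to Class_9 (the feature-engineered files may encode them differently) and uses random toy data.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data with nine string labels, mimicking the Otto class names.
rng = np.random.RandomState(0)
X_toy = rng.rand(90, 4)
y_toy = np.array(["Class_%d" % (i % 9 + 1) for i in range(90)])
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)

# predict_proba returns one column per class, ordered as in clf.classes_.
proba = clf.predict_proba(X_toy[:5])
submission = pd.DataFrame(proba, columns=clf.classes_)
submission.insert(0, "id", np.arange(1, 6))
print(submission.columns.tolist())
```

Taking the names from classes_ keeps each probability under the right header even if the label encoding changes.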
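Results with the original and tfidf features
Logistic regression: score of 0.59817 on the Kaggle Private Leaderboard (rank 2243)
RBF-kernel SVM (tfidf features only): 0.48947 (rank 1254)
CART: 1.07144 (the test error estimated by cross-validation was fairly accurate, but the performance itself is poor); a single decision tree does not perform well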