Fixing "ValueError: Length of feature_names, 4 does not match number of features, 10" when building a decision tree
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Load the data and one-hot encode the four categorical columns
ball_data = pd.read_csv('ball.csv')
cat_features = ['Outlook', 'Temperature', 'Humidity', 'Windy']
ball_data_onehot = pd.get_dummies(ball_data, columns=cat_features)
ball_data_onehot['Play'] = ball_data_onehot['Play'].map({'No': 0, 'Yes': 1}).astype(int)

# Separate the input features from the class label
ball_data_tr_in = ball_data_onehot.drop(columns=['Play'])
ball_data_tr_out = ball_data_onehot['Play']
X_train, X_test, y_train, y_test = train_test_split(
    ball_data_tr_in, ball_data_tr_out, test_size=0.3, random_state=1)

# Train on the training split and evaluate on the test split
clf = tree.DecisionTreeClassifier(criterion='entropy')
model = clf.fit(X_train, y_train)
res = model.predict(X_test)
print(res)  # model predictions
print(y_test)  # actual values
print(sum(res == y_test) / len(res))  # accuracy

# Train a second tree on the full data set
clf1 = tree.DecisionTreeClassifier()
clf1 = clf1.fit(ball_data_tr_in, ball_data_tr_out)
print('clf1:' + str(clf1))  # prints the estimator's repr

# Predict one new sample: 10 values, one per one-hot column
A = [[0, 1, 0, 0, 0, 1, 0, 1, 0, 1]]
predict_result = clf1.predict(A)
print(predict_result)  # 0 = No, 1 = Yes
import graphviz

dot_data = tree.export_graphviz(model, out_file=None,
                                feature_names=['Outlook', 'Temperature', 'Humidity', 'Windy'],
                                class_names=['No', 'Yes'],
                                filled=True, rounded=True,
                                special_characters=True)
graph = graphviz.Source(dot_data)
The call that raises the error:
dot_data = tree.export_graphviz(model, out_file=None,
                                feature_names=['Outlook', 'Temperature', 'Humidity', 'Windy'],
                                class_names=['No', 'Yes'],
                                filled=True, rounded=True,
                                special_characters=True)
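The error comes from the argument feature_names=['Outlook','Temperature','Humidity','Windy']. The feature_names parameter lets each node in the rendered tree show the name of the feature it splits on (I am a beginner, so this description is not very detailed; for more, look up other material on the parameters of export_graphviz()). Its value must correspond to the feature names (also called attribute names) of the data set being analyzed: the same number of names, in the same order. Do not include the class label in it, because the class label is the final result of the analysis, not an input.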
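To make the length requirement concrete, here is a minimal sketch on made-up toy data. The values in `X` and `y` and the names 'a', 'b', 'c' are hypothetical and chosen only for illustration. It relies on the fitted estimator's `n_features_in_` attribute, which should be available in reasonably recent scikit-learn versions.

```python
from sklearn import tree

# Toy data with 3 feature columns (hypothetical values, illustration only)
X = [[0, 0, 1], [1, 0, 0], [0, 1, 1], [1, 1, 0]]
y = [0, 1, 0, 1]
clf = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Number of columns the tree saw during fit()
print(clf.n_features_in_)

# One name per column, in column order: this works
dot_ok = tree.export_graphviz(clf, out_file=None,
                              feature_names=['a', 'b', 'c'],
                              class_names=['No', 'Yes'])

# Only two names for three columns: this raises the ValueError
try:
    tree.export_graphviz(clf, out_file=None, feature_names=['a', 'b'])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)
```

Comparing `len(feature_names)` with `clf.n_features_in_` before calling export_graphviz() is a quick way to catch this kind of mismatch early.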
The data before encoding:
Outlook | Temperature | Humidity | Windy | Play |
sunny | hot | high | no | No |
sunny | hot | high | yes | No |
overcast | hot | high | no | Yes |
rain | mild | high | no | Yes |
rain | cool | normal | no | Yes |
rain | cool | normal | yes | No |
overcast | cool | normal | yes | Yes |
sunny | mild | high | yes | No |
sunny | cool | normal | no | Yes |
rain | mild | normal | no | Yes |
sunny | mild | normal | yes | Yes |
overcast | mild | high | yes | Yes |
overcast | hot | normal | no | Yes |
rain | mild | high | yes | No |
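From the table, the feature names are Outlook, Temperature, Humidity, Windy (in that order), and Play is the class label. At the start of the original code we defined the list cat_features = ['Outlook','Temperature','Humidity','Windy'] to hold the names of the columns to be one-hot encoded (that is, the feature names), and ball_data_onehot = pd.get_dummies(ball_data,columns=cat_features) performs the one-hot encoding. Understanding what one-hot encoding actually does is essential here; look up other material on it if needed.

As mentioned at the start, the error occurs at feature_names, because feature_names must match the feature names of the data. Yet comparing the table above with cat_features, both the names and their order agree. So why do we get "Length of feature_names, 4 does not match number of features, 10" (the number of entries in feature_names, 4, does not match the number of features in the data, 10)?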
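A little knowledge of one-hot encoding answers the question. pd.get_dummies() replaces each categorical column with one 0/1 column per distinct value, so the structure of the data changes considerably: 3 + 3 + 2 + 2 = 10 columns instead of 4. The tree is trained on these 10 columns, which is why a feature_names list with 4 entries no longer matches. After one-hot encoding the data in the table above, it becomes: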
Outlook_overcast | Outlook_rain | Outlook_sunny | Temperature_cool | Temperature_hot | Temperature_mild | Humidity_high | Humidity_normal | Windy_no | Windy_yes |
0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |
0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 |
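One way to resolve the mismatch, sketched here under a few assumptions, is to pass the names of the encoded columns instead of the original four. The rows from the first table are inlined so the example runs without ball.csv; the variable names `rows`, `X`, `y`, and `encoded_names` are my own and not from the original code. Note that newer pandas versions may show True/False rather than 0/1 in the encoded columns, which does not affect the result.

```python
import pandas as pd
from sklearn import tree

# The 14 rows from the first table, inlined so ball.csv is not needed
rows = [
    ('sunny', 'hot', 'high', 'no', 'No'),
    ('sunny', 'hot', 'high', 'yes', 'No'),
    ('overcast', 'hot', 'high', 'no', 'Yes'),
    ('rain', 'mild', 'high', 'no', 'Yes'),
    ('rain', 'cool', 'normal', 'no', 'Yes'),
    ('rain', 'cool', 'normal', 'yes', 'No'),
    ('overcast', 'cool', 'normal', 'yes', 'Yes'),
    ('sunny', 'mild', 'high', 'yes', 'No'),
    ('sunny', 'cool', 'normal', 'no', 'Yes'),
    ('rain', 'mild', 'normal', 'no', 'Yes'),
    ('sunny', 'mild', 'normal', 'yes', 'Yes'),
    ('overcast', 'mild', 'high', 'yes', 'Yes'),
    ('overcast', 'hot', 'normal', 'no', 'Yes'),
    ('rain', 'mild', 'high', 'yes', 'No'),
]
cat_features = ['Outlook', 'Temperature', 'Humidity', 'Windy']
ball_data = pd.DataFrame(rows, columns=cat_features + ['Play'])

# One-hot encode, then split into inputs and class label
ball_data_onehot = pd.get_dummies(ball_data, columns=cat_features)
X = ball_data_onehot.drop(columns=['Play'])
y = ball_data_onehot['Play'].map({'No': 0, 'Yes': 1})

# 4 original features versus 10 encoded columns
print(len(cat_features), X.shape[1])

clf = tree.DecisionTreeClassifier(criterion='entropy', random_state=1).fit(X, y)

# Take the names from the encoded frame so count and order always match
encoded_names = list(X.columns)
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=encoded_names,
                                class_names=['No', 'Yes'],
                                filled=True, rounded=True,
                                special_characters=True)
```

With the graphviz package installed, graphviz.Source(dot_data) can then render the tree as in the original code. Reading the names from X.columns rather than typing them by hand keeps feature_names in sync if the encoding ever changes.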