欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页

[机器学习-Sklearn]决策树DecisionTreeClassifier学习

程序员文章站 2022-03-30 18:55:27
...

1. 什么是决策树

决策树是什么,我们来“决策树”这个词进行分词,那么就会是决策/树。大家不妨思考一下,重点是决策还是树呢?其实啊,决策树的关键点在上。

我们平时写代码的那一串一串的If Else其实就是决策树的思想了。看下面的图是不是觉得很熟悉呢?

[机器学习-Sklearn]决策树DecisionTreeClassifier学习

2. 决策树介绍

决策树之所以叫决策树,就是因为它的结构是树形状的,如果你之前没了解过树这种数据结构,那么你至少要知道以下几个名词是什么意思。

  • 根节点:最顶部的那个节点
  • 叶子节点:每条路径最末尾的那个节点,也就是最外层的节点
  • 非叶子节点:一些条件的节点,下面会有更多分支,也叫做分支节点
  • 分支:也就是分叉

[机器学习-Sklearn]决策树DecisionTreeClassifier学习

3. 信息熵

[机器学习-Sklearn]决策树DecisionTreeClassifier学习

4. 例子1

安装panda 和 scikit-learn 如果你没有安装的话
conda install pandas
conda install scikit-learn

1. 准备数据及读取

季节 时间已过 8 点 风力情况 要不要赖床
spring no breeze yes
winter no no wind yes
autumn yes breeze yes
winter no no wind yes
summer no breeze yes
winter yes breeze yes
winter no gale yes
winter no no wind yes
spring yes no wind no
summer yes gale no
summer no gale no
autumn yes breeze no
spring,no,breeze,1
winter,no,no wind,1
autumn,yes,breeze,1
winter,no,no wind,1
summer,no,breeze,1
winter,yes,breeze,1
winter,no,gale,1
winter,no,no wind,1
spring,yes,no wind,0
summer,yes,gale,0
summer,no,gale,0
autumn,yes,breeze,0

2. 决策树的特征向量化

sklearn的DictVectorizer能对字典进行向量化。什么叫向量化呢?比如说你有季节这个属性有[春,夏,秋,冬]四个可选值,那么如果是春季,就可以用[1,0,0,0]表示,夏季就可以用[0,1,0,0]表示。不过在调用DictVectorizer它会将这些属性打乱,不会按照我们的思路来运行,但我们也可以一个方法查看,我们看看代码就明白了

通过DictVectorizer,我们就能够把字符型的数据,转化成0 1的矩阵,方便后面进行运算。额外说一句,这种转换方式其实就是one-hot编码。

import pandas as pd
import sklearn as sklearn
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree

# pandas 读取 csv 文件,header = None 表示不将首行作为列
data = pd.read_csv('data/laic.csv', header=None)
# 指定列
data.columns = ['season', 'after 8', 'wind', 'lay bed']

# sparse=False意思是不产生稀疏矩阵
vec = DictVectorizer(sparse=False)
# 先用 pandas 对每行生成字典,然后进行向量化
feature = data[['season', 'after 8', 'wind']]

X_train = vec.fit_transform(feature.to_dict(orient='record'))
# 打印各个变量
print('show feature\n', feature)
print('show vector\n', X_train)
print('show vector name\n', vec.get_feature_names())
print('show vector name\n', vec.vocabulary_)

执行结果

show feature
     season after 8     wind
0   spring      no   breeze
1   winter      no  no wind
2   autumn     yes   breeze
3   winter      no  no wind
4   summer      no   breeze
5   winter     yes   breeze
6   winter      no     gale
7   winter      no  no wind
8   spring     yes  no wind
9   summer     yes     gale
10  summer      no     gale
11  autumn     yes   breeze
show vector
 [[1. 0. 0. 1. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 1. 0. 0. 1.]
 [0. 1. 1. 0. 0. 0. 1. 0. 0.]
 [1. 0. 0. 0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 0. 1. 0. 1. 0. 0.]
 [0. 1. 0. 0. 0. 1. 1. 0. 0.]
 [1. 0. 0. 0. 0. 1. 0. 1. 0.]
 [1. 0. 0. 0. 0. 1. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0. 0. 1.]
 [0. 1. 0. 0. 1. 0. 0. 1. 0.]
 [1. 0. 0. 0. 1. 0. 0. 1. 0.]
 [0. 1. 1. 0. 0. 0. 1. 0. 0.]]
show vector name
  ['after 8=no', 'after 8=yes', 'season=autumn', 'season=spring', 'season=summer', 'season=winter', 'wind=breeze', 'wind=gale', 'wind=no wind']
show vector name
 {'season=spring': 3, 'after 8=no': 0, 'wind=breeze': 6, 'season=winter': 5, 'wind=no wind': 8, 'season=autumn': 2, 'after 8=yes': 1, 'season=summer': 4, 'wind=gale': 7}

3. 决策树训练

Y_train = data['lay bed']
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, Y_train)

4. 决策树可视化

当完成一棵树的训练的时候,我们也可以让它可视化展示出来,不过sklearn没有提供这种功能,它仅仅能够让训练的模型保存到dot文件中。但我们可以借助其他工具让模型可视化,先看保存到dot的代码:

with open("out.dot", 'w') as f :
    f = tree.export_graphviz(clf, out_file = f,
            feature_names = vec.get_feature_names())

5 预测结果

result = clf.predict([[1., 0.,  0. ,1. , 0. , 0. , 1. , 0. , 0.]])
print(result)
[1]

然后可以执行下面命令生成一个out.pdf

dot out.dot -T pdf -o out.pdf

[机器学习-Sklearn]决策树DecisionTreeClassifier学习

after 8=no after 8=yes season=autumn season=spring season=summer season=winter wind=breeze wind=gale wind=no wind lay bed
1. 0. 0. 1. 0. 0. 1. 0. 0. 1
1. 0. 0. 0. 0. 1. 0. 0. 1. 1
0. 1. 1. 0. 0. 0. 1. 0. 0. 1
1. 0. 0. 0. 0. 1. 0. 0. 1. 1
1. 0. 0. 0. 1. 0. 1. 0. 0. 1
0. 1. 0. 0. 0. 1. 1. 0. 0. 1
1. 0. 0. 0. 0. 1. 0. 1. 0. 1
1. 0. 0. 0. 0. 1. 0. 0. 1. 1
0. 1. 0. 1. 0. 0. 0. 0. 1. 0
0. 1. 0. 0. 1. 0. 0. 1. 0. 0
1. 0. 0. 0. 1. 0. 0. 1. 0. 0
0. 1. 1. 0. 0. 0. 1. 0. 0. 0

6. 自己算验证熵的结果

import math
root_node_entropy = -(8/12)*(math.log(8/12, 2)) - (4/12)*(math.log(4/12, 2))
node1_left = (-(3/7)*(math.log(3/7, 2)) - (4/7)*(math.log(4/7,2)))
#node1_right =  (-(5/5)*(math.log(5/5, 2)) - (0/5)*(math.log(0/5,2)))
node1_right =  (-(5/5)*(0) - 0)
#node2_left =  -(3/3)*(math.log(3/3, 2)) - (0/3)*(math.log(0/3, 2))
node2_left =  -(3/3)*(0) - 0
node2_right = -(3/4)*(math.log(3/4, 2)) - (1/4)*(math.log(1/4, 2))

print('Entropy of season=winter ', root_node_entropy)
print('Entropy of wind=breeze ', node1_left)
print('Entropy of wind=breeze ', node1_right)
print('Entropy of node2_left', node2_left)
print('Entropy of node2_right', node2_right)
Entropy of season=winter  0.9182958340544896
Entropy of wind=breeze  0.9852281360342516
Entropy of wind=breeze  -0.0
Entropy of node2_left -0.0
Entropy of node2_right 0.8112781244591328

7. 把数据集全部改成数字不用DictVectorizer做向量化

spring :1 , summer : 2, spring : 3 , winter : 4
时间已过 8 点-no : 0
时间已过 8 点-yes :1
breeze : 1 , no wind : 2 , gale :3

laic1.csv 文件

1,0,1,1
4,0,2,1
3,1,1,1
4,0,2,1
2,0,1,1
4,1,1,1
4,0,3,1
4,0,2,1
1,1,2,0
2,1,3,0
2,0,3,0
3,1,1,0

代码

import pandas as pd
from sklearn import tree
data = pd.read_csv('data/laic1.csv', header=None)
# 指定列
data.columns = ['season', 'after 8', 'wind', 'lay bed']
X_train = data[['season', 'after 8', 'wind']]
Y_train = data['lay bed']
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, Y_train)
with open("out1.dot", 'w') as f :
    f = tree.export_graphviz(clf, out_file = f,
            feature_names =['season', 'after 8', 'wind'])

结果, 可以看到决策树图其实都一样的。
[机器学习-Sklearn]决策树DecisionTreeClassifier学习
预测结果

result = clf.predict([[1,1,1]])
print('Predict result:', result)
Predict result: [0]

可能遇到问题

如果你这个graphvis 的问题 (GraphViz’s executables not found), 可以根据下面这个link解决它

https://blog.csdn.net/qq_40304090/article/details/88594813