[Machine Learning - Sklearn] Learning the DecisionTreeClassifier Decision Tree
1. What Is a Decision Tree
What is a decision tree? If we split the term "decision tree" into its parts, we get "decision" and "tree". Which of the two do you think is the key part? In fact, the essence of a decision tree lies in the tree.
The chains of if/else we write in everyday code already embody the idea of a decision tree. Doesn't the figure below look familiar?
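For instance, a hand-written version of the "should I stay in bed" decision used later in this post could look like the sketch below (the rules here are made up purely for illustration; the real tree is learned from data):

# A "decision tree" written by hand as plain if/else rules.
# The rules are illustrative only -- later sections learn them from data instead.
def lay_in_bed(season: str, after_8: bool, wind: str) -> bool:
    if season == 'winter':        # first decision: is it winter?
        return True               # winter: always stay in bed
    if not after_8:               # not past 8 o'clock yet
        return wind != 'gale'     # stay in bed unless a gale is blowing
    return False                  # past 8 and not winter: get up

print(lay_in_bed('winter', after_8=False, wind='no wind'))  # True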
2. Decision Tree Basics
A decision tree is called a tree because its structure has the shape of a tree. If you have never come across the tree data structure before, you should at least know what the following terms mean.
- Root node: the node at the very top
- Leaf node: the node at the end of each path, i.e. the outermost node
- Non-leaf node: a node that holds a condition and has further branches below it; also called a branch (internal) node
- Branch: a fork leading from one node to the next
3. Information Entropy
Information entropy measures how impure (mixed) a set of samples is: for a node whose samples fall into classes with proportions p1, ..., pk, the entropy is H = -(p1*log2(p1) + ... + pk*log2(pk)). It is 0 when every sample belongs to the same class and largest when the classes are evenly mixed. A tree built with criterion='entropy' picks, at each node, the split that reduces this value the most (the information gain); section 6 below verifies the numbers by hand, and a small helper is sketched right below.
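As a minimal sketch (not part of the original post), here is a reusable entropy helper, applied to the 12-row example used below, where 8 samples are "yes" and 4 are "no":

import math

# Entropy in bits of a class distribution given as a list of counts
def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

print(entropy([8, 4]))  # ~0.918, the root-node entropy recomputed in section 6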
4. Example 1
Install pandas and scikit-learn if you do not have them yet:
conda install pandas
conda install scikit-learn
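If you use pip instead of conda, the equivalent command is:
pip install pandas scikit-learn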
1. Prepare and read the data
Season | Past 8 o'clock | Wind | Stay in bed |
---|---|---|---|
spring | no | breeze | yes |
winter | no | no wind | yes |
autumn | yes | breeze | yes |
winter | no | no wind | yes |
summer | no | breeze | yes |
winter | yes | breeze | yes |
winter | no | gale | yes |
winter | no | no wind | yes |
spring | yes | no wind | no |
summer | yes | gale | no |
summer | no | gale | no |
autumn | yes | breeze | no |
Saved as data/laic.csv, with "yes" encoded as 1 and "no" as 0 in the last column:
spring,no,breeze,1
winter,no,no wind,1
autumn,yes,breeze,1
winter,no,no wind,1
summer,no,breeze,1
winter,yes,breeze,1
winter,no,gale,1
winter,no,no wind,1
spring,yes,no wind,0
summer,yes,gale,0
summer,no,gale,0
autumn,yes,breeze,0
2. Feature vectorization for the decision tree
sklearn's DictVectorizer can vectorize dictionaries. What does vectorization mean? Say the season attribute has four possible values [spring, summer, autumn, winter]; spring can then be represented as [1,0,0,0] and summer as [0,1,0,0]. Note that DictVectorizer arranges these attributes in its own order rather than the one we wrote them in, but it gives us a way to inspect the mapping; the code below makes this clear.
Through DictVectorizer we turn the string-valued data into a 0/1 matrix that is convenient for the computation that follows. As a side note, this kind of transformation is exactly one-hot encoding.
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn import tree

# pandas reads the csv file; header=None means the first row is data, not column names
data = pd.read_csv('data/laic.csv', header=None)
# Name the columns
data.columns = ['season', 'after 8', 'wind', 'lay bed']

# sparse=False means a dense array is returned instead of a sparse matrix
vec = DictVectorizer(sparse=False)
# First turn each row into a dict with pandas, then vectorize
feature = data[['season', 'after 8', 'wind']]
X_train = vec.fit_transform(feature.to_dict(orient='records'))

# Print the intermediate results
# (on scikit-learn >= 1.0, vec.get_feature_names_out() replaces vec.get_feature_names())
print('show feature\n', feature)
print('show vector\n', X_train)
print('show vector name\n', vec.get_feature_names())
print('show vector name\n', vec.vocabulary_)
Output:
show feature
season after 8 wind
0 spring no breeze
1 winter no no wind
2 autumn yes breeze
3 winter no no wind
4 summer no breeze
5 winter yes breeze
6 winter no gale
7 winter no no wind
8 spring yes no wind
9 summer yes gale
10 summer no gale
11 autumn yes breeze
show vector
[[1. 0. 0. 1. 0. 0. 1. 0. 0.]
[1. 0. 0. 0. 0. 1. 0. 0. 1.]
[0. 1. 1. 0. 0. 0. 1. 0. 0.]
[1. 0. 0. 0. 0. 1. 0. 0. 1.]
[1. 0. 0. 0. 1. 0. 1. 0. 0.]
[0. 1. 0. 0. 0. 1. 1. 0. 0.]
[1. 0. 0. 0. 0. 1. 0. 1. 0.]
[1. 0. 0. 0. 0. 1. 0. 0. 1.]
[0. 1. 0. 1. 0. 0. 0. 0. 1.]
[0. 1. 0. 0. 1. 0. 0. 1. 0.]
[1. 0. 0. 0. 1. 0. 0. 1. 0.]
[0. 1. 1. 0. 0. 0. 1. 0. 0.]]
show vector name
['after 8=no', 'after 8=yes', 'season=autumn', 'season=spring', 'season=summer', 'season=winter', 'wind=breeze', 'wind=gale', 'wind=no wind']
show vector name
{'season=spring': 3, 'after 8=no': 0, 'wind=breeze': 6, 'season=winter': 5, 'wind=no wind': 8, 'season=autumn': 2, 'after 8=yes': 1, 'season=summer': 4, 'wind=gale': 7}
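To double-check how a row of the matrix maps back to the original attributes, DictVectorizer also provides inverse_transform (not used in the original code):

# Each row comes back as a dict of the features that are set to 1
print(vec.inverse_transform(X_train[:1]))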
3. Training the decision tree
# Labels: whether to stay in bed (1 = yes, 0 = no)
Y_train = data['lay bed']
# Split on information entropy instead of the default Gini impurity
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, Y_train)
4. Visualizing the decision tree
Once the tree has been trained, we can also visualize it. sklearn does not draw the picture for us here; it only exports the trained model to a dot file, so we rely on another tool to render it. First, the code that saves the model to a dot file:
with open("out.dot", 'w') as f :
f = tree.export_graphviz(clf, out_file = f,
feature_names = vec.get_feature_names())
5. Prediction
# [1,0,0,1,0,0,1,0,0] corresponds to after 8=no, season=spring, wind=breeze
result = clf.predict([[1., 0., 0., 1., 0., 0., 1., 0., 0.]])
print(result)
[1]
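Instead of writing the one-hot vector by hand, you can also let the fitted DictVectorizer encode a plain dict for you; this is a small convenience on top of the original code:

# Encode a new sample with exactly the column layout the model was trained on
sample = vec.transform([{'season': 'spring', 'after 8': 'no', 'wind': 'breeze'}])
print(clf.predict(sample))  # same prediction as above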
Back to the visualization step: running the command below converts out.dot into an out.pdf file.
dot -Tpdf out.dot -o out.pdf
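If you would rather not install Graphviz, newer scikit-learn versions can also print or draw the tree directly; a minimal sketch (these helpers are not used in the original post, and matplotlib is needed for the plot):

from sklearn import tree
import matplotlib.pyplot as plt

# Feature names from the DictVectorizer (use vec.get_feature_names() on older versions)
names = list(vec.get_feature_names_out())

# Text view of the tree in the console
print(tree.export_text(clf, feature_names=names))

# Graphical view drawn with matplotlib
tree.plot_tree(clf, feature_names=names, filled=True)
plt.show()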
For reference, the full vectorized training set together with the label column:
after 8=no | after 8=yes | season=autumn | season=spring | season=summer | season=winter | wind=breeze | wind=gale | wind=no wind | lay bed |
---|---|---|---|---|---|---|---|---|---|
1. | 0. | 0. | 1. | 0. | 0. | 1. | 0. | 0. | 1 |
1. | 0. | 0. | 0. | 0. | 1. | 0. | 0. | 1. | 1 |
0. | 1. | 1. | 0. | 0. | 0. | 1. | 0. | 0. | 1 |
1. | 0. | 0. | 0. | 0. | 1. | 0. | 0. | 1. | 1 |
1. | 0. | 0. | 0. | 1. | 0. | 1. | 0. | 0. | 1 |
0. | 1. | 0. | 0. | 0. | 1. | 1. | 0. | 0. | 1 |
1. | 0. | 0. | 0. | 0. | 1. | 0. | 1. | 0. | 1 |
1. | 0. | 0. | 0. | 0. | 1. | 0. | 0. | 1. | 1 |
0. | 1. | 0. | 1. | 0. | 0. | 0. | 0. | 1. | 0 |
0. | 1. | 0. | 0. | 1. | 0. | 0. | 1. | 0. | 0 |
1. | 0. | 0. | 0. | 1. | 0. | 0. | 1. | 0. | 0 |
0. | 1. | 1. | 0. | 0. | 0. | 1. | 0. | 0. | 0 |
6. Verifying the entropy values by hand
The snippet below recomputes the entropy of the root node and of the child nodes that appear in the exported tree:
import math

# Root node: 8 of the 12 samples are "yes" and 4 are "no"
root_node_entropy = -(8/12)*(math.log(8/12, 2)) - (4/12)*(math.log(4/12, 2))

# First split: the 7 non-winter samples (3 "yes", 4 "no") ...
node1_left = (-(3/7)*(math.log(3/7, 2)) - (4/7)*(math.log(4/7, 2)))
# ... and the 5 winter samples, all "yes". The 0*log(0) term is taken as 0,
# so the entropy is 0; the commented line would raise a math domain error.
#node1_right = (-(5/5)*(math.log(5/5, 2)) - (0/5)*(math.log(0/5,2)))
node1_right = (-(5/5)*(0) - 0)

# Second split of the non-winter branch: a pure node of 3 samples (all "no") ...
#node2_left = -(3/3)*(math.log(3/3, 2)) - (0/3)*(math.log(0/3, 2))
node2_left = -(3/3)*(0) - 0
# ... and a node of 4 samples (3 "yes", 1 "no")
node2_right = -(3/4)*(math.log(3/4, 2)) - (1/4)*(math.log(1/4, 2))

print('Entropy of season=winter ', root_node_entropy)
print('Entropy of wind=breeze ', node1_left)
print('Entropy of wind=breeze ', node1_right)
print('Entropy of node2_left', node2_left)
print('Entropy of node2_right', node2_right)
Entropy of season=winter 0.9182958340544896
Entropy of wind=breeze 0.9852281360342516
Entropy of wind=breeze -0.0
Entropy of node2_left -0.0
Entropy of node2_right 0.8112781244591328
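As a quick cross-check that is not in the original post, the information gain of the first split is the root entropy minus the weighted average of the two child entropies:

# 7 of the 12 samples go to the left child and 5 to the right child
gain = root_node_entropy - (7/12) * node1_left - (5/12) * node1_right
print('Information gain of the first split:', gain)  # ~0.344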
7. Encoding the dataset as integers instead of using DictVectorizer
season: spring: 1, summer: 2, autumn: 3, winter: 4
after 8: no: 0, yes: 1
wind: breeze: 1, no wind: 2, gale: 3
The laic1.csv file:
1,0,1,1
4,0,2,1
3,1,1,1
4,0,2,1
2,0,1,1
4,1,1,1
4,0,3,1
4,0,2,1
1,1,2,0
2,1,3,0
2,0,3,0
3,1,1,0
Code:
import pandas as pd
from sklearn import tree

data = pd.read_csv('data/laic1.csv', header=None)
# Name the columns
data.columns = ['season', 'after 8', 'wind', 'lay bed']

# The features are already numeric, so no DictVectorizer is needed
X_train = data[['season', 'after 8', 'wind']]
Y_train = data['lay bed']

clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, Y_train)

with open("out1.dot", 'w') as f:
    f = tree.export_graphviz(clf, out_file=f,
                             feature_names=['season', 'after 8', 'wind'])
Result: the resulting decision tree diagram is essentially the same as before.
Prediction:
# [1, 1, 1] means season=spring, after 8=yes, wind=breeze
result = clf.predict([[1, 1, 1]])
print('Predict result:', result)
Predict result: [0]
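As an extra check that is not in the original post, you can also score the classifier on its own training data to see how well the tree fits it:

# Fraction of training rows the tree classifies correctly
print('Training accuracy:', clf.score(X_train, Y_train))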
Possible problems
If you run into the Graphviz problem (GraphViz's executables not found), you can fix it by following this link:
https://blog.csdn.net/qq_40304090/article/details/88594813