Decision Trees

程序员文章站, 2022-05-21 23:45:45


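Original source: https://blog.csdn.net/m0epNwstYk4/article/details/81437498
Decision trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.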

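For instance, in the example below, a decision tree learns from the data to approximate a sine curve with a set of if-then-else decision rules. The deeper the tree, the more complex the decision rules and the more complex the model.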
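To make the if-then-else rules concrete, the following sketch fits a shallow regression tree to noise-free sine data and prints the learned rules with sklearn.tree.export_text (available in scikit-learn 0.21 and later). The data, the feature name "x", and the depth are illustrative assumptions, not taken from the original article.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Illustrative data (an assumption of this sketch): 80 sorted points on a sine curve.
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()

# A depth-2 tree has at most 4 leaves, i.e. at most 4 distinct constant predictions.
shallow = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# export_text renders the tree as nested threshold tests on the feature.
rules = export_text(shallow, feature_names=["x"])
print(rules)
```

Each indented line of the printout is one threshold test, and each "value" line is the constant predicted in that leaf.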

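1D regression with a decision tree.
A decision tree is used to fit a sine curve with additional noisy observations. As a result, it learns a piecewise-constant approximation of the sine curve. We can see that if the maximum depth of the tree (controlled by the max_depth parameter) is set too high, the decision tree learns overly fine details of the training data and learns from the noise, i.e., it overfits.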

print(__doc__)

# Import the necessary modules and libraries
import numpy as np
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
# Add noise to every fifth target value
y[::5] += 3 * (0.5 - rng.rand(16))

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_1.fit(X, y)
regr_2.fit(X, y)

# Predict
X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)

# Plot the results
plt.figure()
plt.scatter(X, y, s=20, edgecolor="black",
            c="darkorange", label="data")
plt.plot(X_test, y_1, color="cornflowerblue",
         label="max_depth=2", linewidth=2)
plt.plot(X_test, y_2, color="yellowgreen", label="max_depth=5", linewidth=2)
plt.xlabel("data")
plt.ylabel("target")
plt.title("Decision Tree Regression")
plt.legend()
plt.show()
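The effect of max_depth can also be checked numerically rather than by eye. The sketch below reuses the noisy sine data from the example above and compares the error on the training points with the error against the noise-free curve. Using the clean sine curve as a reference and including an unlimited tree (max_depth=None) are assumptions added for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Same noisy sine data as in the example above.
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

# Noise-free grid used as a stand-in for the true function (an assumption of this sketch).
X_grid = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
y_grid = np.sin(X_grid).ravel()

errors = {}
for depth in (2, 5, None):  # None lets the tree grow until every leaf is pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X, y)
    errors[depth] = (
        mean_squared_error(y, model.predict(X)),            # error on the training data
        mean_squared_error(y_grid, model.predict(X_grid)),  # error against the clean curve
    )
    print(depth, model.get_n_leaves(), errors[depth])
```

The training error can only shrink as the depth grows, so a falling first number combined with a rising second number is the signature of a tree that has started to fit the noise.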

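Some advantages of decision trees are:

Simple to understand and to interpret. Trees can be visualized.

Requires little data preparation. Other techniques often require data normalization, creation of dummy variables, and removal of blank values. Note, however, that this module does not support missing values.

The cost of using the tree (i.e., predicting data) is logarithmic in the number of data points used to train the tree.

Able to handle both numerical and categorical data. Other techniques are usually specialized in analyzing datasets that have only one type of variable.

Able to handle multi-output problems.

Uses a white box model. If a given situation is observable in a model, the explanation for the condition is easily expressed in Boolean logic. By contrast, in a black box model (e.g., an artificial neural network), results may be more difficult to interpret.

Possible to validate a model using statistical tests. That makes it possible to account for the reliability of the model.

Performs well even if its assumptions are somewhat violated by the true model from which the data were generated.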
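The white box property from the list above can be illustrated by reading a single prediction back as a conjunction of threshold tests. This is a minimal sketch, assuming a depth-3 classifier on the iris dataset; it relies on decision_path, apply, and the tree_ attribute of a fitted scikit-learn tree.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

sample = iris.data[:1]                        # first flower in the dataset
node_ids = clf.decision_path(sample).indices  # nodes visited on the way to the leaf
leaf_id = clf.apply(sample)[0]                # index of the leaf the sample ends in

explanation = []
for node in node_ids:
    if node == leaf_id:
        continue  # the leaf holds the prediction, not a test
    feature = clf.tree_.feature[node]
    threshold = clf.tree_.threshold[node]
    op = "<=" if sample[0, feature] <= threshold else ">"
    explanation.append(f"{iris.feature_names[feature]} {op} {threshold:.2f}")

predicted = iris.target_names[clf.predict(sample)[0]]
print(" AND ".join(explanation), "->", predicted)
```

The number of tests in the explanation is bounded by the depth of the tree, which is also why prediction is cheap.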

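The disadvantages of decision trees include:

Decision-tree learners can create over-complex trees that do not generalize the data well. This is called overfitting. Mechanisms such as pruning (not supported at the time of writing), setting the minimum number of samples required at a leaf node, or setting the maximum depth of the tree are necessary to avoid this problem.

Decision trees can be unstable because small variations in the data might result in a completely different tree being generated. This problem is mitigated by using decision trees within an ensemble.

The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm, where locally optimal decisions are made at each node. Such algorithms cannot guarantee to return the globally optimal decision tree. This can be mitigated by training multiple trees in an ensemble learner, where the features and samples are randomly sampled with replacement.

Some concepts are hard to learn because decision trees do not express them easily, such as XOR, parity, or multiplexer problems.

Decision-tree learners create biased trees if some classes dominate. It is therefore recommended to balance the dataset prior to fitting the decision tree.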
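The two overfitting controls named in the list above, a maximum depth and a minimum number of samples per leaf, can be sketched as follows. The specific values (max_depth=4, min_samples_leaf=5) are arbitrary choices for illustration, and the data is generated the same way as in the 1D regression example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Noisy sine data, generated the same way as in the 1D regression example.
rng = np.random.RandomState(1)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel()
y[::5] += 3 * (0.5 - rng.rand(16))

unconstrained = DecisionTreeRegressor(random_state=0).fit(X, y)
constrained = DecisionTreeRegressor(
    max_depth=4,         # cap on the number of successive tests
    min_samples_leaf=5,  # every leaf must cover at least 5 training samples
    random_state=0,
).fit(X, y)

# Leaves are the nodes without children (children_left == -1).
is_leaf = constrained.tree_.children_left == -1
smallest_leaf = constrained.tree_.n_node_samples[is_leaf].min()

print("leaves without limits:", unconstrained.get_n_leaves())
print("leaves with limits:   ", constrained.get_n_leaves())
print("smallest constrained leaf:", smallest_leaf)
```

Without limits the tree keeps splitting until each leaf is pure, which on continuous targets means roughly one leaf per training point.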

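Classification
DecisionTreeClassifier is a class capable of performing multi-class classification on a dataset. As with other classifiers, DecisionTreeClassifier takes two arrays as input: an array X, sparse or dense, of size [n_samples, n_features] holding the training samples, and an array Y of integer values, of size [n_samples], holding the class labels for the training samples:

from sklearn import tree
X = [[0, 0], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X, Y)

After being fitted, the model can then be used to predict the class of samples: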

clf.predict([[2., 2.]])

array([1])

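Alternatively, the probability of each class can be predicted, which is the fraction of training samples of that class in the leaf: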

clf.predict_proba([[2., 2.]])

array([[ 0., 1.]])
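The "fraction of training samples in the leaf" reading becomes visible when a leaf is not pure. The toy data below is hypothetical and chosen so that a depth-1 tree has exactly one possible split, leaving a mixed leaf on one side.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): a single binary feature, so only one split is possible.
X = [[0], [0], [0], [1], [1], [1], [1]]
y = [0, 0, 0, 0, 1, 1, 1]

stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)

# The leaf for x = 1 holds four training samples: one of class 0, three of class 1.
proba = stump.predict_proba([[1]])
print(proba)  # [[0.25 0.75]]
```

The predicted class for x = 1 is the majority class of that leaf, here class 1.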

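DecisionTreeClassifier is capable of both binary classification (where the labels are [-1, 1]) and multi-class classification (where the labels are [0, ..., K-1]). Using the Iris dataset, we can construct a tree as follows: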

from sklearn.datasets import load_iris

from sklearn import tree

iris = load_iris()

clf = tree.DecisionTreeClassifier()

clf = clf.fit(iris.data, iris.target)

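Once trained, we can export the tree in Graphviz format using the export_graphviz exporter. If you use the conda package manager, the graphviz binaries and the Python package can be installed with conda install python-graphviz. Alternatively, binaries for graphviz can be downloaded from the graphviz project homepage, and the Python wrapper can be installed from PyPI with pip install graphviz.

Below is an example graphviz export of the above tree trained on the entire iris dataset; the result is saved in an output file iris.pdf: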

import graphviz

dot_data = tree.export_graphviz(clf, out_file=None)

graph = graphviz.Source(dot_data)

graph.render("iris")

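The export_graphviz exporter also supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these plots inline automatically: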

dot_data = tree.export_graphviz(clf, out_file=None,

                    feature_names=iris.feature_names,  

                    class_names=iris.target_names,  

                    filled=True, rounded=True,  

                    special_characters=True)

graph = graphviz.Source(dot_data)

graph
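The DOT source can also be inspected without the graphviz Python package, since export_graphviz returns a plain string when out_file=None. The sketch below assumes a depth limit of 2 purely to keep the output short; that limit is not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn import tree

iris = load_iris()
clf = tree.DecisionTreeClassifier(max_depth=2, random_state=0)
clf = clf.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)

# Show the beginning of the DOT source: the graph header and the first node.
for line in dot_data.splitlines()[:4]:
    print(line)
```

Saved to a file (for example tree.dot, a hypothetical name), the string can be rendered with the Graphviz command line tool: dot -Tpdf tree.dot -o tree.pdf.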


After being fitted, the model can then be used to predict the class of samples:

clf.predict(iris.data[:1, :])

array([0])

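Alternatively, the probability of each class can be predicted, which is the fraction of training samples of that class in the leaf: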

clf.predict_proba(iris.data[:1, :])

array([[ 1., 0., 0.]])

Example: plot the decision surface of a decision tree on the iris dataset

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Parameters
n_classes = 3
plot_colors = "ryb"
plot_step = 0.02

# Load data
iris = load_iris()

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

    # Predict the class of every grid point and draw filled contours
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    cmap=plt.cm.RdYlBu, edgecolor='black', s=15)

plt.suptitle("Decision surface of a decision tree using paired features")
plt.legend(loc='lower right', borderpad=0, handletextpad=0)
plt.axis("tight")
plt.show()

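Multi-output problems
A multi-output problem is a supervised learning problem with several outputs to predict, that is, when Y is a 2D array of size [n_samples, n_outputs]. When there is no correlation between the outputs, a very simple way to solve this kind of problem is to build n independent models, one for each output, and then use those models to independently predict each of the n outputs. However, because the output values related to the same input are likely to be correlated themselves, an often better way is to build a single model capable of predicting all n outputs simultaneously. First, it requires less training time, since only a single estimator is built. Second, the generalization accuracy of the resulting estimator may often be increased.

With regard to decision trees, this strategy can readily be used to support multi-output problems. This requires the following changes:

Store n output values in the leaves, instead of 1.

Use splitting criteria that compute the average reduction across all n outputs.

This module offers support for multi-output problems by implementing this strategy in both DecisionTreeClassifier and DecisionTreeRegressor. If a decision tree is fit on an output array Y of size [n_samples, n_outputs], then the resulting estimator will:

Output n_outputs values upon predict.

Output a list of n_outputs arrays of class probabilities upon predict_proba.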
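The output shapes described above can be checked on a small synthetic problem. The data and target definitions below are made up for illustration; only the shapes of the results matter.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(60, 3)  # 60 samples, 3 features (illustrative)

# Two regression targets per sample, so Y has shape (n_samples, n_outputs).
Y_reg = np.column_stack([X[:, 0] + X[:, 1], X[:, 1] * X[:, 2]])
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, Y_reg)
pred = reg.predict(X[:5])
print(pred.shape)  # (5, 2)

# Two classification targets: a binary one and a three-class one.
Y_clf = np.column_stack([
    (X[:, 0] > 0.5).astype(int),
    np.digitize(X[:, 1], [1 / 3, 2 / 3]),
])
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, Y_clf)
labels = clf.predict(X[:5])
proba = clf.predict_proba(X[:5])

# predict_proba returns one array per output, each with one column per class.
print(labels.shape, len(proba), [p.shape for p in proba])
```

A single tree handles both targets at once; each leaf stores one value (or one class distribution) per output.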

Example: multi-output decision tree regression.
In this example, the input X is a single real value and the outputs Y are the sine and cosine of X.

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor

# Create a random dataset
rng = np.random.RandomState(1)
X = np.sort(200 * rng.rand(100, 1) - 100, axis=0)
y = np.array([np.pi * np.sin(X).ravel(), np.pi * np.cos(X).ravel()]).T
# Add noise to every fifth sample, on both outputs
y[::5, :] += (0.5 - rng.rand(20, 2))

# Fit regression model
regr_1 = DecisionTreeRegressor(max_depth=2)
regr_2 = DecisionTreeRegressor(max_depth=5)
regr_3 = DecisionTreeRegressor(max_depth=8)
regr_1.fit(X, y)
regr_2.fit(X, y)
regr_3.fit(X, y)

# Predict
X_test = np.arange(-100.0, 100.0, 0.01)[:, np.newaxis]
y_1 = regr_1.predict(X_test)
y_2 = regr_2.predict(X_test)
y_3 = regr_3.predict(X_test)

# Plot the results
plt.figure()
s = 25
plt.scatter(y[:, 0], y[:, 1], c="navy", s=s,
            edgecolor="black", label="data")
plt.scatter(y_1[:, 0], y_1[:, 1], c="cornflowerblue", s=s,
            edgecolor="black", label="max_depth=2")
plt.scatter(y_2[:, 0], y_2[:, 1], c="red", s=s,
            edgecolor="black", label="max_depth=5")
plt.scatter(y_3[:, 0], y_3[:, 1], c="orange", s=s,
            edgecolor="black", label="max_depth=8")
plt.xlim([-6, 6])
plt.ylim([-6, 6])
plt.xlabel("target 1")
plt.ylabel("target 2")
plt.title("Multi-output Decision Tree Regression")
plt.legend(loc="best")
plt.show()

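Example: face completion with multi-output estimators
In this example, the input X is the pixels of the upper half of faces and the outputs Y are the pixels of the lower half of those faces.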

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.utils.validation import check_random_state

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV

# Load the faces datasets
data = fetch_olivetti_faces()
targets = data.target

data = data.images.reshape((len(data.images), -1))
train = data[targets < 30]
test = data[targets >= 30]  # Test on independent people

# Test on a subset of people
n_faces = 5
rng = check_random_state(4)
face_ids = rng.randint(test.shape[0], size=(n_faces, ))
test = test[face_ids, :]

n_pixels = data.shape[1]
# Upper half of the faces
X_train = train[:, :(n_pixels + 1) // 2]
# Lower half of the faces
y_train = train[:, n_pixels // 2:]
X_test = test[:, :(n_pixels + 1) // 2]
y_test = test[:, n_pixels // 2:]

# Fit estimators
ESTIMATORS = {
    "Extra trees": ExtraTreesRegressor(n_estimators=10, max_features=32,
                                       random_state=0),
    "K-nn": KNeighborsRegressor(),
    "Linear regression": LinearRegression(),
    "Ridge": RidgeCV(),
}

y_test_predict = dict()
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)

# Plot the completed faces
image_shape = (64, 64)

n_cols = 1 + len(ESTIMATORS)
plt.figure(figsize=(2. * n_cols, 2.26 * n_faces))
plt.suptitle("Face completion with multi-output estimators", size=16)

for i in range(n_faces):
    # First column: the true face (upper half plus true lower half)
    true_face = np.hstack((X_test[i], y_test[i]))

    if i:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1)
    else:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1,
                          title="true faces")

    sub.axis("off")
    sub.imshow(true_face.reshape(image_shape),
               cmap=plt.cm.gray,
               interpolation="nearest")

    # Remaining columns: upper half plus the lower half predicted by each estimator
    for j, est in enumerate(sorted(ESTIMATORS)):
        completed_face = np.hstack((X_test[i], y_test_predict[est][i]))

        if i:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j)
        else:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j,
                              title=est)

        sub.axis("off")
        sub.imshow(completed_face.reshape(image_shape),
                   cmap=plt.cm.gray,
                   interpolation="nearest")

plt.show()
