Reading notes on O'Reilly's Hands-On Machine Learning with Scikit-Learn and TensorFlow

Chapter 6 Decision Trees

Like SVMs, Decision Trees are versatile ML algorithms that can perform both classification and regression tasks, and even multioutput tasks.

Decision Trees are also the fundamental components of Random Forests, which are among the most powerful ML algorithms.

6.1 Training and Visualizing a Decision Tree

Training a DecisionTreeClassifier on the iris dataset:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Note: iris['data'] and iris.data are equivalent (a Bunch supports both kinds of
# access); differing trees between runs come from random tie-breaking inside the
# classifier, so set random_state for reproducible results.
X = iris.data[:, 2:]  # petal length and width
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

In your working directory, manually create directories of the following structure: ./images/decision_trees
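Alternatively, a short snippet (not from the book, just a convenience) can create that directory for you:

import os

os.makedirs(os.path.join(".", "images", "decision_trees"), exist_ok=True)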

You can visualize the trained Decision Tree by using the export_graphviz() function to output a graph definition file called iris_tree.dot.

from sklearn.tree import export_graphviz
import os
# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "decision_trees"

def image_path(fig_id):
    return os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID, fig_id)

export_graphviz(
    tree_clf,
    out_file=image_path("iris_tree.dot"),
    feature_names=iris.feature_names[2:],
    class_names=iris.target_names,
    rounded=True,
    filled=True
)

Then you can convert this .dot file to a variety of formats such as PDF or PNG using the dot command-line tool from the graphviz package.

$ dot -Tpng iris_tree.dot -o iris_tree.png
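If the dot tool is not installed, recent versions of Scikit-Learn also provide sklearn.tree.plot_tree, which draws the tree directly with Matplotlib. A minimal sketch using the tree_clf trained above:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(8, 6))
plot_tree(tree_clf,
          feature_names=iris.feature_names[2:],
          class_names=list(iris.target_names),
          rounded=True, filled=True)
plt.show()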


6.2 Making Predictions

One of the many qualities of Decision Trees is that they require very little data preparation. In particular, they don’t require feature scaling or centering at all.

A node's samples attribute counts how many training instances it applies to.

A node's value attribute tells you how many training instances of each class this node contains.

A node's gini attribute measures its impurity: a node is "pure" (gini=0) if all training instances it applies to belong to the same class.

Equation 6-1. Gini impurity
$$G_i = 1 - \sum_{k=1}^{n} p_{i,k}^{2}$$

  • $p_{i,k}$ is the ratio of class $k$ instances among the training instances in the $i$-th node.

For example, the depth-2 left node has a gini score equal to $1 - (0/54)^2 - (49/54)^2 - (5/54)^2 \approx 0.168$.
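As a quick check of that number, here is a small sketch of Equation 6-1 (the counts [0, 49, 5] are the node's value attribute):

import numpy as np

def gini(counts):
    # Gini impurity of a node, given its per-class training-instance counts
    p = np.asarray(counts) / np.sum(counts)
    return 1 - np.sum(p ** 2)

print(gini([0, 49, 5]))  # ~0.168, matching the depth-2 left node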

Scikit-Learn uses the CART algorithm, which produces only binary trees. Other algorithms such as ID3 can produce Decision Trees with nodes that have more than two children.

Model Interpretation: White Box Versus Black Box

White box models, such as Decision Trees, are fairly intuitive and their decisions are easy to interpret.

In contrast, with black box models such as Random Forests and neural networks, it is usually hard to explain in simple terms why a prediction was made.

6.3 Estimating Class Probabilities

tree_clf.predict_proba([[5, 1.5]])
# array([[0.        , 0.90740741, 0.09259259]])
tree_clf.predict([[5, 1.5]])
# array([1])
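These probabilities are simply the ratios of training instances of each class in the leaf node that [5, 1.5] ends up in (the depth-2 left node with value [0, 49, 5]):

print(0/54, 49/54, 5/54)  # 0.0, 0.907..., 0.0925... -- matches predict_proba above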

6.4 The CART Training Algorithm

Scikit-Learn uses the CART (Classification And Regression Tree) algorithm to train Decision Trees (this process is also called "growing" trees). The idea is simple: the algorithm first splits the training set into two subsets using a single feature $k$ and a threshold $t_k$ (e.g., "petal length $\le$ 2.45 cm").

How does it choose $k$ and $t_k$?

It searches for the pair $(k, t_k)$ that produces the purest subsets (weighted by their size). The cost function that the algorithm tries to minimize is given by Equation 6-2.

Equation 6-2. CART cost function for classification
$$J(k, t_k) = \frac{m_{\text{left}}}{m} G_{\text{left}} + \frac{m_{\text{right}}}{m} G_{\text{right}}$$
$$\text{where } \begin{cases} G_{\text{left/right}} \text{ measures the impurity of the left/right subset,}\\ m_{\text{left/right}} \text{ is the number of instances in the left/right subset.} \end{cases}$$
Termination conditions: it stops recursing once it reaches the maximum depth (defined by the max_depth hyperparameter), or if it cannot find a split that will reduce impurity. A few other hyperparameters control additional stopping conditions (min_samples_split, min_samples_leaf, min_weight_fraction_leaf, and max_leaf_nodes).

The CART algorithm is a greedy algorithm: it greedily searches for an optimum split at the top level, then repeats the process at each level. It does not check whether or not the split will lead to the lowest possible impurity several levels down. A greedy algorithm often produces a reasonably good solution, but it is not guaranteed to be the optimal solution.
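To make Equation 6-2 and this greedy search concrete, here is a minimal sketch of a single split search (the helper names gini_of_labels and best_split are made up for illustration; Scikit-Learn's real implementation is optimized Cython and uses midpoints between sorted feature values as candidate thresholds):

import numpy as np

def gini_of_labels(labels):
    # Gini impurity of a set of class labels
    _, counts = np.unique(labels, return_counts=True)
    p = counts / len(labels)
    return 1 - np.sum(p ** 2)

def best_split(X, y):
    # Exhaustive search for the (feature k, threshold t_k) minimizing Equation 6-2
    m = len(y)
    best_k, best_t, best_J = None, None, np.inf
    for k in range(X.shape[1]):
        for t in np.unique(X[:, k]):
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            if len(left) == 0 or len(right) == 0:
                continue  # skip splits that leave one side empty
            J = len(left) / m * gini_of_labels(left) + len(right) / m * gini_of_labels(right)
            if J < best_J:
                best_k, best_t, best_J = k, t, J
    return best_k, best_t, best_J

# On the iris petal features above, best_split(X, y) splits on petal length,
# separating Setosa from the rest (Scikit-Learn reports the equivalent
# midpoint threshold of 2.45 cm).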

Finding the optimal tree is known to be an NP-Complete problem.

6.5 Computational Complexity

Traversing the Decision Tree requires going through roughly $O(\log_2(m))$ nodes. Since each node requires checking the value of one feature, the overall prediction complexity is just $O(\log_2(m))$, independent of the number of features.

However, the training algorithm compares all features (or fewer if max_features is set) on all samples at each node. This results in a training complexity of $O(n \times m \log_2(m))$.
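Purely for illustration, plugging the iris numbers used above ($m = 150$ training instances, $n = 2$ features) into these formulas:

import numpy as np

m, n = 150, 2                # iris training set size and number of features used above
print(np.log2(m))            # ~7.2 -> roughly the number of nodes traversed per prediction
print(n * m * np.log2(m))    # ~2170 -> rough order of magnitude of the training cost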

For small training sets (less than a few thousand instances), Scikit-Learn can speed up training by presorting the data (set presort=True), but this slows down training considerably for larger training sets. (Note that the presort parameter has been removed in recent versions of Scikit-Learn.)

6.6 Gini Impurity or Entropy?

Gini impurity is the default impurity measure, but you can use the entropy impurity measure instead by setting the criterion hyperparameter to "entropy".

Equation 6-3. Entropy
$$H_i = -\sum_{\substack{k=1\\ p_{i,k} \neq 0}}^{n} p_{i,k} \log p_{i,k}$$
So should you use Gini impurity or entropy? The truth is, most of the time it does not make a big difference: they lead to similar trees. Gini impurity is slightly faster to compute, so it is a good default. However, when they differ, Gini impurity tends to isolate the most frequent class in its own branch of the tree, while entropy tends to produce slightly more balanced trees.
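For comparison, a minimal sketch computing the entropy of the same depth-2 left node (counts [0, 49, 5]); the log base is a convention, base 2 is used here:

import numpy as np

def entropy(counts):
    # Entropy of a node given its per-class instance counts (log base 2)
    p = np.asarray(counts) / np.sum(counts)
    p = p[p > 0]  # the sum in Equation 6-3 skips classes with p_{i,k} = 0
    return -np.sum(p * np.log2(p))

print(entropy([0, 49, 5]))  # ~0.445 with log base 2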

6.7 Regularization Hyperparameters

Decision Trees make very few assumptions about the training data (as opposed to linear models, which obviously assume that the data is linear, for example).

nonparametric model: the number of parameters is not determined prior to training, so the model structure is free to stick closely to the data.

parametric model: has a predetermined number of parameters, so its degree of freedom is limited, reducing the risk of overfitting (but increasing the risk of underfitting).

Regularization:

  • max_depth (maximum depth of the Decision Tree)
  • min_samples_split (the minimum number of samples a node must have before it can be split)
  • min_samples_leaf (the minimum number of samples a leaf node must have)
  • min_weight_fraction_leaf (same as min_samples_leaf but expressed as a fraction of the total number of weighted instances)
  • max_leaf_nodes (maximum number of leaf nodes)
  • max_features (maximum number of features that are evaluated for splitting at each node)

Increasing the min_* hyperparameters or reducing the max_* hyperparameters will regularize the model (i.e., constrain it so it cannot grow too complex).
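As an illustration of these knobs, here is a small sketch comparing an unrestricted tree with a regularized one (the make_moons toy dataset and min_samples_leaf=4 are illustrative choices, not taken from the notes above):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

Xm, ym = make_moons(n_samples=1000, noise=0.25, random_state=42)
Xm_train, Xm_test, ym_train, ym_test = train_test_split(Xm, ym, random_state=42)

unrestricted = DecisionTreeClassifier(random_state=42).fit(Xm_train, ym_train)
regularized = DecisionTreeClassifier(min_samples_leaf=4, random_state=42).fit(Xm_train, ym_train)

# The unrestricted tree typically fits the training set perfectly but
# generalizes worse than the regularized one on the test set.
print(unrestricted.score(Xm_train, ym_train), unrestricted.score(Xm_test, ym_test))
print(regularized.score(Xm_train, ym_train), regularized.score(Xm_test, ym_test))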

Other algorithms work by first training the Decision Tree without restrictions, then pruning (deleting) unnecessary nodes. Whether a node is unnecessary is evaluated using standard statistical tests, such as the $\chi^2$ test.

6.8 Regression

# Quadratic training set + noise
import numpy as np
np.random.seed(42)
m = 200
X = np.random.rand(m, 1)
y = 4 * (X - 0.5) ** 2
y = y + np.random.randn(m, 1) / 10
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(max_depth=2)
tree_reg.fit(X, y)
export_graphviz(
    tree_reg,
    out_file=image_path("reg_tree.dot"),
    feature_names=['x1'],
    rounded=True,
    filled=True
)

$ dot -Tpng reg_tree.dot -o reg_tree.png


The main difference is that, given an $x_1$ value, instead of predicting a class in each node, the tree predicts a value. For example, to predict a new instance with $x_1 = 0.6$, the algorithm traverses the tree above, reaches the second-to-right leaf node, and finally obtains the value 0.111. This prediction is simply the average target value of the 110 training instances associated with this leaf node, and it results in an MSE (Mean Squared Error) equal to 0.015 over these 110 instances.
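A quick check of that prediction, assuming the tree_reg trained above (the exact value depends on the random seed used when generating the data):

print(tree_reg.predict([[0.6]]))  # ~0.111: the average target value in that leaf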

The CART algorithm works mostly the same way as earlier, except that instead of trying to split the training set in a way that minimizes impurity, it now tries to split the training set in a way that minimizes the MSE.

Equation 6-4. CART cost function for regression
$$J(k, t_k) = \frac{m_{\text{left}}}{m}\,\text{MSE}_{\text{left}} + \frac{m_{\text{right}}}{m}\,\text{MSE}_{\text{right}}$$
$$\text{where } \begin{cases} \text{MSE}_{\text{node}} = \sum_{i \in \text{node}} \left(\hat{y}_{\text{node}} - y^{(i)}\right)^2\\ \hat{y}_{\text{node}} = \dfrac{1}{m_{\text{node}}} \sum_{i \in \text{node}} y^{(i)} \end{cases}$$
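A small sanity check of the $\hat{y}_{\text{node}}$ and MSE definitions above, using the regressor's apply() method to find which leaf each training instance falls into (assuming X, y, and tree_reg from the regression code above):

import numpy as np

leaf_ids = tree_reg.apply(X)              # leaf index of every training instance
query_leaf = tree_reg.apply([[0.6]])[0]   # leaf reached by x1 = 0.6
in_leaf = leaf_ids == query_leaf

y_hat = y[in_leaf].mean()                         # average target value in the leaf
mse = np.mean((y[in_leaf] - y_hat) ** 2)          # per-instance MSE in the leaf (~0.015)
print(y_hat, tree_reg.predict([[0.6]])[0], mse)   # the first two values should match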

6.9 Instability

Limitations:

  • Decision Trees love orthogonal decision boundaries (all splits are perpendicular to an axis), which makes them sensitive to training set rotation. One way to limit this problem is to use PCA, which often results in a better orientation of the training data (a sketch follows this list).
  • Decision Trees are very sensitive to small variations in the training data: removing a single instance, or even retraining on the same data without setting random_state, can produce a very different tree. Random Forests limit this instability by averaging predictions over many trees.
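A hedged sketch of that PCA workaround, chaining a rotation onto the tree with a Pipeline (the hyperparameter choices are illustrative):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_iris, y_iris = iris.data[:, 2:], iris.target

# Rotate the data onto its principal axes first, so the tree's axis-aligned
# splits line up better with the data's orientation.
pca_tree_clf = make_pipeline(PCA(n_components=2),
                             DecisionTreeClassifier(max_depth=2, random_state=42))
pca_tree_clf.fit(X_iris, y_iris)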