Classification Algorithms - Decision Tree
Introduction to Decision Tree
In general, decision tree analysis is a predictive modelling tool that can be applied across many areas. Decision trees can be constructed by an algorithmic approach that splits the dataset in different ways based on different conditions. Decision trees are among the most powerful algorithms that fall under the category of supervised algorithms.
They can be used for both classification and regression tasks. The two main entities of a tree are decision nodes, where the data is split, and leaves, where we get the outcomes. An example of a binary tree for predicting whether a person is fit or unfit, given information such as age, eating habits and exercise habits, is given below −
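Since the original figure is not reproduced here, the following is a minimal sketch of such a tree written as nested if/else logic. The attribute names and thresholds (age, eats_junk_food, exercises) are hypothetical and only illustrate how decision nodes lead to leaf outcomes.

# Hypothetical fit/unfit decision tree expressed as nested if/else;
# each "if" is a decision node and each returned string is a leaf.
def is_fit(age, eats_junk_food, exercises):
    if age < 30:
        if eats_junk_food:
            return "unfit"
        return "fit"
    else:
        if exercises:
            return "fit"
        return "unfit"

print(is_fit(age=25, eats_junk_food=False, exercises=True))  # fit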
In the above decision tree, the questions are decision nodes and the final outcomes are leaves. We have the following two types of decision trees −
- Classification decision trees − In this kind of decision tree, the decision variable is categorical. The above decision tree is an example of a classification decision tree.
- Regression decision trees − In this kind of decision tree, the decision variable is continuous.
Implementing Decision Tree Algorithm
Gini Index
Gini index is the name of the cost function that is used to evaluate binary splits in the dataset; it works with a categorical target variable such as "Success" or "Failure".
The lower the value of the Gini index, the higher the homogeneity. A perfect Gini index value is 0 and the worst is 0.5 (for a two-class problem). The Gini index for a split can be calculated with the help of the following steps −
- First, calculate the Gini index for each sub-node using the formula 1 − (p^2 + q^2), where p^2 + q^2 is the sum of the squared probabilities of success and failure.
- Next, calculate the Gini index for the split using the weighted Gini score of each node of that split.
The Classification and Regression Tree (CART) algorithm uses the Gini method to generate binary splits.
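The two steps above can be written as a short helper function. The sketch below is not taken from the text; it assumes each group is a list of rows whose last element is the class label, and simply illustrates the weighted Gini calculation used by CART.

def gini_index(groups, classes):
    # Weighted Gini index for a candidate binary split.
    n_instances = float(sum(len(group) for group in groups))
    gini = 0.0
    for group in groups:
        size = float(len(group))
        if size == 0:                      # avoid division by zero for empty groups
            continue
        score = 0.0
        for class_val in classes:
            p = [row[-1] for row in group].count(class_val) / size
            score += p * p                 # p^2 + q^2 for a two-class problem
        gini += (1.0 - score) * (size / n_instances)   # 1 - (p^2 + q^2), weighted by group size
    return gini

# A pure split scores 0.0, a 50/50 split scores 0.5
print(gini_index([[[1, 0], [1, 0]], [[1, 1], [1, 1]]], classes=[0, 1]))  # 0.0
print(gini_index([[[1, 1], [1, 0]], [[1, 1], [1, 0]]], classes=[0, 1]))  # 0.5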
Split Creation
A split basically consists of an attribute in the dataset and a value for that attribute. We can create a split in the dataset with the help of the following three parts −
- Part 1: Calculating the Gini score − We have just discussed this part in the previous section.
- Part 2: Splitting a dataset − It may be defined as separating a dataset into two lists of rows, given the index of an attribute and a split value for that attribute. After getting the two groups - right and left - from the dataset, we can calculate the value of the split by using the Gini score calculated in the first part. The split value decides in which group a row will reside.
- Part 3: Evaluating all splits − The next part, after finding the Gini score and splitting the dataset, is the evaluation of all splits. For this purpose, we must first check every value associated with each attribute as a candidate split. Then we need to find the best possible split by evaluating the cost of the split. The best split will be used as a node in the decision tree. A minimal from-scratch sketch of these parts follows this list.
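The sketch below illustrates the three parts under simple assumptions: the dataset is a list of rows with the class label in the last position, the helper names test_split and get_split are our own, and gini_index is the function sketched in the previous section.

def test_split(index, value, dataset):
    # Part 2: separate rows into left/right groups based on an attribute value.
    left = [row for row in dataset if row[index] < value]
    right = [row for row in dataset if row[index] >= value]
    return left, right

def get_split(dataset):
    # Part 3: try every value of every attribute as a candidate split
    # and keep the one with the lowest Gini cost.
    class_values = list(set(row[-1] for row in dataset))
    best_index, best_value, best_score, best_groups = None, None, float("inf"), None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)      # Part 1
            if gini < best_score:
                best_index, best_value, best_score, best_groups = index, row[index], gini, groups
    return {"index": best_index, "value": best_value, "groups": best_groups}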
Building a Tree
As we know, a tree has a root node and terminal nodes. After creating the root node, we can build the tree by following the two parts below −
Part 1: Terminal Node Creation
While creating terminal nodes of a decision tree, one important point is to decide when to stop growing the tree or creating further terminal nodes. It can be done by using two criteria, namely maximum tree depth and minimum node records, as follows −
- Maximum tree depth − As the name suggests, this is the maximum number of levels a tree may grow below the root node. We must stop adding terminal nodes once a tree has reached its maximum depth.
- Minimum node records − It may be defined as the minimum number of training patterns that a given node is responsible for. We must stop adding terminal nodes once a node reaches this minimum number of records or falls below it.
A terminal node is used to make a final prediction.
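A terminal node can simply store the most common class among the rows it is responsible for. A minimal sketch (the helper name to_terminal is our own):

def to_terminal(group):
    # A leaf predicts the most frequent class label among its rows.
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)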
Part 2: Recursive Splitting
Now that we understand when to create terminal nodes, we can start building our tree. Recursive splitting is a method to build the tree. In this method, once a node is created, we can create the child nodes (nodes added to an existing node) recursively on each group of data generated by splitting the dataset, by calling the same function again and again.
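The following sketch ties the previous helpers together; it stops growing the tree according to the two criteria above (max_depth and min_size). It is an illustration under the same assumptions as the earlier sketches, not a definitive implementation.

def split(node, max_depth, min_size, depth):
    left, right = node["groups"]
    del node["groups"]
    # If one side is empty, both children become the same terminal node.
    if not left or not right:
        node["left"] = node["right"] = to_terminal(left + right)
        return
    # Stopping criterion 1: maximum tree depth reached.
    if depth >= max_depth:
        node["left"], node["right"] = to_terminal(left), to_terminal(right)
        return
    # Stopping criterion 2: minimum node records; otherwise recurse on each group.
    for side, group in (("left", left), ("right", right)):
        if len(group) <= min_size:
            node[side] = to_terminal(group)
        else:
            node[side] = get_split(group)
            split(node[side], max_depth, min_size, depth + 1)

def build_tree(train, max_depth=3, min_size=1):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root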
Prediction
After building a decision tree, we need to make predictions with it. Basically, prediction involves navigating the decision tree with a specifically provided row of data.
We can make a prediction with the help of a recursive function, as above. The same prediction routine is called again with the left or the right child node.
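A minimal recursive prediction sketch over the dictionary-based tree built above; node["index"] and node["value"] are the attribute index and split value stored by get_split in the earlier sketch.

def predict(node, row):
    # Navigate left or right depending on the row's value at the split attribute.
    if row[node["index"]] < node["value"]:
        branch = node["left"]
    else:
        branch = node["right"]
    # A dict is an internal decision node; anything else is a leaf (class label).
    return predict(branch, row) if isinstance(branch, dict) else branch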
Assumptions
The following are some of the assumptions we make while creating a decision tree −
- While preparing decision trees, the training set is treated as the root node.
- The decision tree classifier prefers the feature values to be categorical. If you want to use continuous values, they must be discretized prior to model building (a small discretization sketch is given after this list).
- Based on the attributes' values, the records are distributed recursively.
- A statistical approach is used to place attributes at any node position, i.e. as the root node or an internal node.
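For the second assumption, one quick way to discretize a continuous feature is pandas.cut; the bin edges and labels below are our own illustrative choices, not something the text prescribes.

import pandas as pd

ages = pd.Series([21, 35, 50, 63])
# Bin a continuous feature into labelled categories before model building.
age_bins = pd.cut(ages, bins=[0, 30, 45, 60, 100],
                  labels=["young", "adult", "middle-aged", "senior"])
print(age_bins.tolist())  # ['young', 'adult', 'middle-aged', 'senior']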
Implementation in Python
Example
In the following example, we are going to implement a Decision Tree classifier on the Pima Indian Diabetes dataset −
First, start with importing the necessary Python packages −
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
Next, load the Pima Indian Diabetes dataset from a CSV file as follows −
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(r"C:\pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()
pregnant glucose bp skin insulin bmi pedigree age label
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
Now, split the dataset into features and target variable as follows −
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
Next, we will divide the data into train and test splits. The following code will split the dataset into 70% training data and 30% testing data −
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
Next, train the model with the help of the DecisionTreeClassifier class of sklearn as follows −
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
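By default, DecisionTreeClassifier uses the Gini criterion discussed earlier. If desired, the stopping criteria from the tree-building section can be mirrored through its constructor parameters; the values below are illustrative assumptions, not tuned settings, and the output shown later corresponds to the default classifier above.

# Optional: mirror the earlier stopping criteria; values are illustrative only.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=1)
clf = clf.fit(X_train, y_train)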
At last, we need to make predictions. It can be done with the help of the following script −
y_pred = clf.predict(X_test)
Next, we can get the accuracy score, confusion matrix and classification report as follows −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[[116 30]
[ 46 39]]
Classification Report:
precision recall f1-score support
0 0.72 0.79 0.75 146
1 0.57 0.46 0.51 85
micro avg 0.67 0.67 0.67 231
macro avg 0.64 0.63 0.63 231
weighted avg 0.66 0.67 0.66 231
Accuracy: 0.670995670995671
Visualizing Decision Tree
The above decision tree can be visualized with the help of the following code −
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six was removed in newer scikit-learn versions
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,
                special_characters=True, feature_names=feature_cols, class_names=['0', '1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Pima_diabetes_Tree.png')
Image(graph.create_png())