Python数据结构--决策树

程序员文章站 2024-01-19 10:14:58

文章目录特征选择信息增益信息熵条件信息熵基尼指数信息增益比决策树生成ID3C4.5CART决策树剪枝CART剪枝特征选择信息增益信息熵信息熵就是用来表示随机变量不确定性的度量。我们以贷款申请样本数据表为例子。ID年龄有工作有房子贷款情况类别条件信息熵基尼指数信息增益比决策树生成ID3C4.5CART决策树剪枝CART剪枝......

文章目录

特征选择

信息增益

信息熵
条件信息熵

基尼指数
信息增益比

决策树生成

ID3
C4.5
CART

特征选择

信息增益

信息熵

信息熵就是用来表示随机变量不确定性的度量。随机变脸的熵可以定义为：
$H(x) = -\sum_{i=1}^{n}p_ilogp_i \tag{1}$
我们以贷款申请样本数据表为例子。

ID	年龄	有工作	有自己的房子	信贷情况	类别
1	青年	否	否	一般	否
2	青年	否	否	好	否
3	青年	是	否	好	是
4	青年	是	是	一般	是
5	青年	否	否	一般	否
6	中年	否	否	一般	否
7	中年	否	否	好	否
8	中年	是	是	好	是
9	中年	否	是	非常好	是
10	中年	否	是	非常好	是
11	老年	否	是	非常好	是
12	老年	否	是	好	是
13	老年	是	否	好	是
14	老年	是	否	非常好	是
15	老年	否	否	一般	否

条件信息熵

$G(D,A) = H(D) - H(D|A) \tag{2}$

基尼指数

在分类任务中，假设有K个类别，样本点属于第k类的概率是 $p_k$ ,那么概率分布的基尼指数定义为：
$\begin{aligned} Gini(p) =& \sum_{k=1}^{K}p_k(1-p_k) \\ =& \sum_{k=1}^{K}p_k -\sum_{k=1}^{K}p_k^2 \\ =& 1 - \sum_{k=1}^{K}p_k^2 \tag{1} \end{aligned}$
在给定特征A的条件下，集合D的基尼指数可以定义为:
$Gini(D,A) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)$

信息增益比

以信息增益作为划分训练数据集的特征，存在偏向于选择取值较多的特征问题，使用信息增益比可以对这个问题进行校正，这是特征选择的另一准则。
特征A对训练数据集D的信息增益比 $g_R(D,A) = \frac{g(D,A)}{H_A(D)}$ ，其中 $H_A(D) = \sum_{i=1}^{n}\frac{|D_i|}{|D|}log_2\frac{|D_i|}{|D|}$

决策树生成

ID3

'''
Author: your name
Date: 2020-08-08 11:52:19
LastEditTime: 2020-08-22 16:43:41
LastEditors: Please set LastEditors
Description: In User Settings Edit
FilePath: /统计学习方法/决策树/main.py
''' # %% import random import numpy as np import pprint from math import log from collections import Counter from sklearn.model_selection import train_test_split from sklearn.datasets import load_iris import matplotlib.pyplot as plt from collections import defaultdict import math import pandas as pd from graphviz import Digraph import networkx as nx import matplotlib.pyplot as plt # 书上题目5.1 g = Digraph('决策树') def H(D: pd.DataFrame, condition=None, class_label='class') -> float: class_list = D.groupby(class_label).size().to_list() total = sum(class_list) entropy = -1 * sum([c_num / total * math.log2(c_num / total) for c_num in class_list]) if condition is not None: condition_entropy = sum(D.groupby(condition).apply( lambda x: len(x)/len(D) * H(x))) return entropy - condition_entropy else: return entropy def create_data(): datasets = [['青年', '否', '否', '一般', '否'], ['青年', '否', '否', '好', '否'], ['青年', '是', '否', '好', '是'], ['青年', '是', '是', '一般', '是'], ['青年', '否', '否', '一般', '否'], ['中年', '否', '否', '一般', '否'], ['中年', '否', '否', '好', '否'], ['中年', '是', '是', '好', '是'], ['中年', '否', '是', '非常好', '是'], ['中年', '否', '是', '非常好', '是'], ['老年', '否', '是', '非常好', '是'], ['老年', '否', '是', '好', '是'], ['老年', '是', '否', '好', '是'], ['老年', '是', '否', '非常好', '是'], ['老年', '否', '否', '一般', '否'], ] labels = [u'age', u'work', u'house', u'credit', 'class'] # 返回数据集和每个维度的名称 return datasets, labels class Node: def __init__(self, is_leaf, class_type): self.son_nodes = [] self.is_leaf = is_leaf
        self.class_type = str(class_type) if is_leaf: self.class_type += str(random.randint(1,1000)) self.edges = [] def __str__(self): return "{} {}".format(self.is_leaf, self.class_type) def create_disicion_tree_by_id3(D, A, delta, class_label='class'): class_num = len(D.groupby(class_label).size()) most_frequence_class = D[class_label].value_counts().keys()[0] if class_num == 1: return Node(True, most_frequence_class) if len(A) == 0: return Node(True, most_frequence_class) entropy_list = [] for feature in A: entropy_list.append(H(D, condition=feature, class_label=class_label)) idx = np.argmax(entropy_list) selected_feature = A[idx] max_entropy = entropy_list[idx] if max_entropy < delta: return Node(True, most_frequence_class) base = Node(False, selected_feature) for group in D.groupby(selected_feature): base.son_nodes.append(create_disicion_tree_by_id3(group[1], list( set(A) - set([selected_feature])), delta)) base.edges.append(group[0]) return base


datasets, labels = create_data() train_data = pd.DataFrame(datasets, columns=labels) train_data['class'] = train_data['class'] == '是' root = create_disicion_tree_by_id3(train_data, [ u'age', u'work', u'house', u'credit'], 0.0) G = nx.Graph() queue = [root] G.add_node(root.class_type) while len(queue) > 0: head = queue.pop(0) for son in head.son_nodes: G.add_node(son.class_type) G.add_edge(head.class_type, son.class_type) queue.append(son) print(head.class_type) pos = nx.spring_layout(G) nx.draw(G, pos,with_labels=True) plt.show() # %%

Python数据结构--决策树

C4.5

和ID3的区别在于使用信息增益比作为特征选择标准

CART

分类回归树(classification and regression tree,CART)是给定随机变量X条件下，输出随机变量Y的条件概率分布的学习方法。CART假设决策树是二叉树。也即是在计算现有特征A，对其所有可能取到的每一个值a，根据样本点对A=a的测试为”是“或者”否“，将训练数据集D切分为D1和D2，样本分割位置叫样本切分点。再利用公示X计算A=a的时候的基尼指数。

本文地址：https://blog.csdn.net/Scau_Jack/article/details/107999593

相关标签：概率图模型统计学习方法决策树 python 算法机器学习

上一篇：荐信息收集-DNS

下一篇： hadoop CHD4 HA 详细安装和配置

Python数据结构--决策树

文章目录

特征选择

信息增益

信息熵

条件信息熵

基尼指数

信息增益比

决策树生成

ID3

C4.5

CART

Python数据结构--决策树

Python实现比较扑克牌大小程序代码示例

Python高阶应用，读《Python 3面向对象编程》

Python实现数据图表可视化

python-找出数组中出现次数大于n/2的元素

Python导出文本数据字典

数据结构算法（两数之和、整数反转）

二叉树的完全性检验——Python实现（BFS）

Python算法有效符号 isValid

python多线程爬虫实例分析