Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2nd Edition) — Chapter 7: Ensemble Learning and Random Forests
In machine learning, aggregating the predictions of several models usually gives better results than any single model, even the best one among them. A group of predictors combined this way is called an ensemble.
A random forest, for example, trains many randomized decision trees on the training set and aggregates their predictions, which often works remarkably well.
Common ensemble methods include bagging, boosting, and stacking.
There are two main ways to build an ensemble: combine different types of algorithms, or train the same algorithm on different subsets of the training set.
0. Import the Required Libraries
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import sklearn
import os
for i in (np, mpl, sklearn):
    print(i.__name__, ": ", i.__version__, sep="")
Output:
numpy: 1.17.4
matplotlib: 3.1.2
sklearn: 0.21.3
1. Voting Classifiers
Suppose you have already trained several classifiers (logistic regression, an SVM, a random forest, K-nearest neighbors, and so on), each reaching about 80% accuracy. A simple way to get an even better classifier is to aggregate their predictions and output the class that receives the most votes; this is called a hard voting classifier.
heads_proba = 0.51
np.random.seed(42)
coin_tosses = (np.random.rand(10000,10) < heads_proba).astype(np.int32)
cumulative_heads_ratio = np.cumsum(coin_tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)
coin_tosses.shape, cumulative_heads_ratio.shape
Output:
((10000, 10), (10000, 10))
plt.figure(figsize=(12,5))
plt.plot(cumulative_heads_ratio)
plt.plot([0,10000],[0.51,0.51],"k--",linewidth=2, label="51%")
plt.plot([0,10000],[0.5,0.5],"k-",label="50%")
plt.axis([0,10000,0.42,0.58])
plt.xlabel("Number of coin tosses")
plt.ylabel("Heads of ratio")
plt.legend()
plt.show()
Output:
Suppose you have a slightly biased coin with a 51% chance of landing heads and a 49% chance of tails. The figure above shows the cumulative ratio of heads for 10 independent series of 10,000 tosses: as the number of tosses grows, each series converges toward 51%.
This is the law of large numbers at work, and it is why an ensemble of many weak but better-than-random predictors can be strong. Ensembles work best when the individual models are as independent as possible, for example by training them with very different algorithms, so that they tend to make different kinds of errors.
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
Output:
((375, 2), (125, 2), (375,), (125,))
If neither train_size nor test_size is specified, train_test_split splits the data 3:1 (test_size defaults to 0.25).
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
log_clf = LogisticRegression(solver="lbfgs",random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
svm_clf = SVC(gamma="scale",random_state=42)
voting_clf = VotingClassifier(estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
                              voting="hard")
voting_clf.fit(X_train, y_train)
Output:
VotingClassifier(estimators=[('lr',
LogisticRegression(C=1.0, class_weight=None,
dual=False, fit_intercept=True,
intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn',
n_jobs=None, penalty='l2',
random_state=42,
solver='lbfgs', tol=0.0001,
verbose=0, warm_start=False)),
('rf',
RandomForestClassifier(bootstrap=True,
class_weight=None,
criterion='gini',
m...
n_estimators=100,
n_jobs=None,
oob_score=False,
random_state=42, verbose=0,
warm_start=False)),
('svc',
SVC(C=1.0, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape='ovr',
degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=False,
random_state=42, shrinking=True, tol=0.001,
verbose=False))],
flatten_transform=True, n_jobs=None, voting='hard',
weights=None)
from sklearn.metrics import accuracy_score
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
Output:
LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.912
As the output above shows, the voting classifier outperforms every individual classifier.
Hard voting: each classifier votes for a class, and the class that gets the most votes is the ensemble's prediction.
Soft voting: average the predicted class probabilities over all classifiers and predict the class with the highest average probability.
Soft voting usually performs better than hard voting because it gives more weight to highly confident votes. In sklearn you choose between them with the voting="hard" or voting="soft" hyperparameter; soft voting requires every classifier in the ensemble to be able to estimate class probabilities.
When an SVC is one of the classifiers, note that it does not output probabilities by default: you must set probability=True, which makes the SVC use cross-validation to estimate class probabilities and therefore slows down training.
The soft voting version:
log_clf = LogisticRegression(solver="lbfgs",random_state=42)
rnd_clf = RandomForestClassifier(n_estimators=100,random_state=42)
svm_clf = SVC(gamma="scale",probability=True, random_state=42)
voting_clf = VotingClassifier(estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
                              voting="soft")
voting_clf.fit(X_train, y_train)
Output:
VotingClassifier(estimators=[('lr',
LogisticRegression(C=1.0, class_weight=None,
dual=False, fit_intercept=True,
intercept_scaling=1,
l1_ratio=None, max_iter=100,
multi_class='warn',
n_jobs=None, penalty='l2',
random_state=42,
solver='lbfgs', tol=0.0001,
verbose=0, warm_start=False)),
('rf',
RandomForestClassifier(bootstrap=True,
class_weight=None,
criterion='gini',
m...
n_estimators=100,
n_jobs=None,
oob_score=False,
random_state=42, verbose=0,
warm_start=False)),
('svc',
SVC(C=1.0, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape='ovr',
degree=3, gamma='scale', kernel='rbf',
max_iter=-1, probability=True,
random_state=42, shrinking=True, tol=0.001,
verbose=False))],
flatten_transform=True, n_jobs=None, voting='soft',
weights=None)
for clf in (log_clf, rnd_clf, svm_clf, voting_clf):
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print(clf.__class__.__name__, accuracy_score(y_test, y_pred))
Output:
LogisticRegression 0.864
RandomForestClassifier 0.896
SVC 0.896
VotingClassifier 0.92
As shown above, the hard voting classifier reached 91.2% accuracy while the soft voting classifier reached 92%, so under the same conditions soft voting does slightly better here.
2. Bagging and Pasting
As mentioned earlier, one way to build an ensemble is to combine different algorithms, as the voting classifier above does.
Another way is to train the same algorithm on different random subsets of the training set. When the subsets are sampled with replacement, the method is called bagging; when they are sampled without replacement, it is called pasting.
The predictions of the models trained on the different subsets are then aggregated: the statistical mode (majority vote) for classification, the average for regression.
Note: the individual models in such an ensemble can be trained in parallel, on different CPU cores or even different servers, which is why bagging and pasting scale so well.
2.1 Bagging and Pasting with sklearn
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
bag_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                            n_estimators=500, max_samples=100,
                            bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
Output:
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False,
random_state=42,
splitter='best'),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=100, n_estimators=500, n_jobs=None,
oob_score=False, random_state=42, verbose=0,
warm_start=False)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
Output:
0.904
tree_clf = DecisionTreeClassifier(random_state=42)
tree_clf.fit(X_train, y_train)
y_pred_tree = tree_clf.predict(X_test)
print(accuracy_score(y_test, y_pred_tree))
Output:
0.856
As shown above, the bagging ensemble of 500 decision trees, each trained on 100 randomly sampled training instances, reaches 90.4% accuracy, while a single decision tree reaches only 85.6%. This is the advantage of ensembling in a nutshell.
Note: BaggingClassifier automatically performs soft voting instead of hard voting whenever the base estimator can estimate class probabilities. You can also set n_jobs to the number of CPU cores to use for training and prediction; n_jobs=-1 means use all available cores. Pasting is obtained by simply setting bootstrap=False, as in the sketch below.
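The original text only runs the bagging version; the following is a minimal sketch (not from the original) of the pasting counterpart, reusing the moons split from above and parallelizing training with n_jobs=-1:
paste_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                              n_estimators=500, max_samples=100,
                              bootstrap=False,   # sampling without replacement => pasting
                              n_jobs=-1,         # use all available CPU cores
                              random_state=42)
paste_clf.fit(X_train, y_train)
print(accuracy_score(y_test, paste_clf.predict(X_test)))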
from matplotlib.colors import ListedColormap
def plot_decision_boundary(clf, X, y, axes=[-1.5, 2.45, -1, 1.5], alpha=0.5, contour=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0', '#9898ff', '#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap)
    if contour:
        custom_cmap2 = ListedColormap(['#7d7d58', '#4c4c7f', '#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    plt.plot(X[:, 0][y == 0], X[:, 1][y == 0], "yo", alpha=alpha)
    plt.plot(X[:, 0][y == 1], X[:, 1][y == 1], "bs", alpha=alpha)
    plt.axis(axes)
    plt.xlabel("$x_1$", fontsize=18)
    plt.ylabel("$x_2$", fontsize=18, rotation=0)
fig, axes = plt.subplots(ncols=2, figsize=(12, 5), sharey=True)
plt.sca(axes[0])
plot_decision_boundary(tree_clf, X, y)
plt.title("Decision Tree",fontsize=14)
plt.sca(axes[1])
plot_decision_boundary(bag_clf, X, y)
plt.ylabel("")
plt.title("Decision Trees with Bagging",fontsize=14)
plt.tight_layout()
plt.show()
Output:
As shown above, the left plot is the decision boundary of a single decision tree and the right plot is the boundary of the bagging ensemble of 500 trees. The ensemble clearly generalizes better: the single tree classifies most training instances correctly but visibly overfits.
Because bootstrapping introduces more diversity into the subsets each predictor is trained on, bagging ends up with a slightly higher bias than pasting; but that same diversity also makes the predictors less correlated, so the ensemble's variance is reduced. This is why bagging usually ends up working better than pasting overall.
2.2 Out-of-Bag Evaluation
With bagging, some training instances may be sampled several times for a given predictor, while others are never sampled at all. On average only about 63% of the training instances are sampled for each predictor; the remaining ~37% are never seen by it and are called out-of-bag (oob) instances.
Note on the 63%/37% split: if you draw N samples with replacement from a training set of size N, the probability that a given instance is never drawn is (1 - 1/N)^N. As N grows, this approaches exp(-1) ≈ 0.3679, i.e. about 37%, so the probability that an instance is drawn at least once is about 1 - 0.37 = 0.63. A quick numerical check follows.
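This small check is not in the original text; it simply evaluates (1 - 1/N)^N for increasing N to show the convergence toward exp(-1):
# Probability that a given instance is never drawn in N draws with replacement
for N in (10, 100, 1000, 10000):
    print(N, (1 - 1/N)**N)
print("exp(-1) =", np.exp(-1))   # limit value, about 0.3679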
Since the oob instances are never used to train a given predictor, they can be used to evaluate the ensemble without a separate validation set. In sklearn, setting oob_score=True requests an automatic oob evaluation after training:
bag_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                            n_estimators=500, bootstrap=True,
                            oob_score=True, random_state=40)
bag_clf.fit(X_train, y_train)
Output:
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False,
random_state=42,
splitter='best'),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=500, n_jobs=None,
oob_score=True, random_state=40, verbose=0, warm_start=False)
bag_clf.oob_score_
Output:
0.9013333333333333
bag_clf.oob_decision_function_
Output:
array([[0.31746032, 0.68253968],
       [0.34117647, 0.65882353],
       [1.        , 0.        ],
       [0.        , 1.        ],
       [0.        , 1.        ],
       [0.08379888, 0.91620112],
       ...
       [0.03108808, 0.96891192],
       [0.57291667, 0.42708333]])
(output truncated: one row of oob class probabilities per training instance, 375 rows in total)
y_pred = bag_clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
Output:
0.912
3. Random Patches and Random Subspaces
BaggingClassifier can also sample features, which is controlled by two hyperparameters, max_features and bootstrap_features; they work just like max_samples and bootstrap, but for features instead of instances.
Sampling both training instances and features is called the Random Patches method.
Keeping all training instances (bootstrap=False and max_samples=1.0) but sampling features (bootstrap_features=True and/or max_features < 1.0) is called the Random Subspaces method. A configuration sketch follows.
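Neither variant is demonstrated in the original text; the following is a minimal sketch of how the hyperparameters might be combined (the sampling fractions are arbitrary, and on the two-feature moons data feature sampling has little effect; it matters more for high-dimensional data such as images):
# Random Patches: sample both training instances and features
patches_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                n_estimators=500,
                                max_samples=0.7, bootstrap=True,            # instance sampling
                                max_features=0.7, bootstrap_features=True,  # feature sampling
                                random_state=42)
# Random Subspaces: keep all training instances, sample only features
subspace_clf = BaggingClassifier(DecisionTreeClassifier(random_state=42),
                                 n_estimators=500,
                                 max_samples=1.0, bootstrap=False,          # all instances
                                 max_features=0.7, bootstrap_features=True,
                                 random_state=42)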
4. Random Forests
A random forest is an ensemble of decision trees, generally trained via bagging, typically with max_samples set to the size of the training set.
4.1 Random Forests with sklearn
There are two ways to build a random forest in sklearn: pass a DecisionTreeClassifier to BaggingClassifier, or use the RandomForestClassifier class, which is more convenient and is optimized for decision trees.
The two roughly equivalent implementations:
# Using the BaggingClassifier class
bag_clf = BaggingClassifier(DecisionTreeClassifier(splitter="random",
                                                   max_leaf_nodes=16,
                                                   random_state=42),
                            n_estimators=500,
                            max_samples=1.0,
                            bootstrap=True,
                            random_state=42)
bag_clf.fit(X_train, y_train)
Output:
BaggingClassifier(base_estimator=DecisionTreeClassifier(class_weight=None,
criterion='gini',
max_depth=None,
max_features=None,
max_leaf_nodes=16,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False,
random_state=42,
splitter='random'),
bootstrap=True, bootstrap_features=False, max_features=1.0,
max_samples=1.0, n_estimators=500, n_jobs=None,
oob_score=False, random_state=42, verbose=0,
warm_start=False)
y_pred_bag = bag_clf.predict(X_test)
# Using the RandomForestClassifier class
from sklearn.ensemble import RandomForestClassifier
rnd_clf = RandomForestClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)
rnd_clf.fit(X_train, y_train)
Output:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=16,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_jobs=None, oob_score=False, random_state=42, verbose=0,
warm_start=False)
y_pred_rf = rnd_clf.predict(X_test)
np.sum(y_pred_bag == y_pred_rf) / len(y_pred_bag)
Output:
0.976
The output above shows that the two models make nearly identical predictions (they agree on 97.6% of the test instances).
Their test accuracies:
print(accuracy_score(y_test, y_pred_bag))
print(accuracy_score(y_test, y_pred_rf))
Output:
0.92
0.912
In this case the BaggingClassifier version happens to be slightly more accurate.
4.2 Extremely Randomized Trees (Extra-Trees)
When growing a tree, instead of searching for the best possible threshold at each node, it is possible to use a random threshold for each candidate feature. A forest of such trees is called an Extremely Randomized Trees (Extra-Trees) ensemble. It is typically much faster to train than a regular random forest, because finding the optimal threshold at every node is the most time-consuming part of growing a tree.
It is hard to tell in advance whether a RandomForestClassifier or an ExtraTreesClassifier will perform better; the only way to know is to compare them, for example with cross-validation, as in the sketch below.
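The original text shows no extra-trees code; this is a minimal sketch on the same moons split, reusing the rnd_clf from section 4.1 for the comparison:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score
ext_clf = ExtraTreesClassifier(n_estimators=500, max_leaf_nodes=16, random_state=42)
ext_clf.fit(X_train, y_train)
print(accuracy_score(y_test, ext_clf.predict(X_test)))
# Compare both ensembles with 5-fold cross-validation on the training set
print(cross_val_score(ext_clf, X_train, y_train, cv=5).mean())
print(cross_val_score(rnd_clf, X_train, y_train, cv=5).mean())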
4.3 Feature Importance
Another useful property of random forests is that they make it easy to measure the relative importance of each feature. sklearn measures a feature's importance by looking at how much the tree nodes that use that feature reduce impurity on average, weighted by the number of training samples reaching those nodes, across all trees in the forest; the results are then scaled so that the importances sum to 1.
The scores are available through the feature_importances_ attribute:
from sklearn.datasets import load_iris
iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])
Output:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=500,
n_jobs=None, oob_score=False, random_state=42, verbose=0,
warm_start=False)
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, ": ", score, sep="")
Output:
sepal length (cm): 0.11249225099876375
sepal width (cm): 0.02311928828251033
petal length (cm): 0.4410304643639577
petal width (cm): 0.4233579963547682
The output above shows the feature importances for the iris dataset: petal length and width are by far the most important, while sepal length and width matter much less.
The same idea applied to the MNIST dataset shows how important each pixel is:
from sklearn.datasets import fetch_openml
mnist = fetch_openml("mnist_784",version=1)
mnist.target = mnist.target.astype(np.uint8)
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rnd_clf.fit(mnist["data"], mnist["target"])
Output:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=100,
n_jobs=None, oob_score=False, random_state=42, verbose=0,
warm_start=False)
def plot_digit(data):
    image = data.reshape(28, 28)
    plt.imshow(image, cmap=mpl.cm.hot, interpolation="nearest")
    plt.axis("off")
plot_digit(rnd_clf.feature_importances_)
cbar = plt.colorbar(ticks=[rnd_clf.feature_importances_.min(),
                           rnd_clf.feature_importances_.max()])
cbar.ax.set_yticklabels(["Not important","Very important"])
plt.show()
Output:
As shown above, the pixels near the center of the images are the most important, while the pixels on the border carry almost no information.
5. Boosting
Boosting refers to any ensemble method that combines several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each one trying to correct its predecessor. The two most popular boosting methods are AdaBoost and Gradient Boosting.
5.1 AdaBoost
AdaBoost is short for Adaptive Boosting. One way for a new predictor to correct its predecessor is to pay more attention to the training instances the predecessor got wrong. How? By increasing the relative weights of the misclassified instances before training the next predictor.
A drawback of this sequential training is that it cannot be parallelized, so it does not scale as well as bagging or pasting.
from sklearn.ensemble import AdaBoostClassifier
ada_clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                             n_estimators=200,
                             algorithm="SAMME.R",
                             learning_rate=0.5,
                             random_state=42)
ada_clf.fit(X_train, y_train)
Output:
AdaBoostClassifier(algorithm='SAMME.R',
base_estimator=DecisionTreeClassifier(class_weight=None,
criterion='gini',
max_depth=1,
max_features=None,
max_leaf_nodes=None,
min_impurity_decrease=0.0,
min_impurity_split=None,
min_samples_leaf=1,
min_samples_split=2,
min_weight_fraction_leaf=0.0,
presort=False,
random_state=None,
splitter='best'),
learning_rate=0.5, n_estimators=200, random_state=42)
plot_decision_boundary(ada_clf, X, y)
Output:
The plot above shows the decision boundary of the trained AdaBoost model.
The ensemble uses 200 decision stumps. A decision stump is a decision tree with max_depth=1: a single root node with two leaf nodes.
SAMME (Stagewise Additive Modeling using a Multiclass Exponential loss function) is the multiclass generalization of AdaBoost; with only two classes, SAMME is equivalent to AdaBoost. If the base predictors can estimate class probabilities, sklearn can use a variant called SAMME.R (the R stands for "Real"), which relies on class probabilities rather than hard predictions and usually performs better. A comparison sketch follows.
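For comparison (not in the original text), here is the same ensemble configured with plain SAMME, which aggregates hard class predictions instead of probabilities:
ada_samme = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                               n_estimators=200, algorithm="SAMME",
                               learning_rate=0.5, random_state=42)
ada_samme.fit(X_train, y_train)
print(accuracy_score(y_test, ada_samme.predict(X_test)))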
m = len(X_train)
fig, axes = plt.subplots(ncols=2, figsize=(12, 5), sharey=True)
for subplot, learning_rate in ((0, 1), (1, 0.5)):
    sample_weight = np.ones(m)
    plt.sca(axes[subplot])
    for i in range(5):
        svm_clf = SVC(kernel="rbf", C=0.05, gamma="scale", random_state=42)
        svm_clf.fit(X_train, y_train, sample_weight=sample_weight)
        y_pred = svm_clf.predict(X_train)
        sample_weight[y_pred != y_train] *= (1 + learning_rate)
        plot_decision_boundary(svm_clf, X, y, alpha=0.2)
        plt.title("learning_rate={}".format(learning_rate), fontsize=16)
    if subplot == 0:
        plt.text(-0.6, -0.5, "1", fontsize=16)
        plt.text(-0.6, -0.1, "2", fontsize=16)
        plt.text(-0.6, 0.2, "3", fontsize=16)
        plt.text(-0.6, 0.45, "4", fontsize=16)
        plt.text(-0.6, 0.9, "5", fontsize=16)
    else:
        plt.ylabel("")
plt.tight_layout()
plt.show()
Output:
The figure above shows the decision boundaries of five SVM classifiers trained sequentially on the moons data. All training instances start with equal weights; after each round the weights of the misclassified instances are increased, so each new classifier focuses more on the hard cases and the combined boundary keeps improving. The right plot uses a smaller learning rate, so the weights grow more slowly from round to round.
This iterative reweighting is somewhat similar to gradient descent, except that instead of tweaking a single model's parameters, AdaBoost keeps adding predictors to the ensemble to gradually make it better.
5.2 Gradient Boosting
Gradient Boosting is another very popular boosting algorithm. Like AdaBoost it adds predictors to the ensemble sequentially, but instead of reweighting the training instances, each new predictor is fit to the residual errors made by its predecessor.
GBRT (Gradient Boosted Regression Trees): gradient boosting that uses decision trees as the base predictors.
Let's walk through a simple regression example:
np.random.seed(42)
X = np.random.rand(100,1) - 0.5
y = 3 * X[:,0]**2 + 0.05 * np.random.randn(100)
X.shape, y.shape
Output:
((100, 1), (100,))
from sklearn.tree import DecisionTreeRegressor
tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X,y)
Output:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=42, splitter='best')
y2 = y - tree_reg1.predict(X)  # residual errors of the first tree
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)  # train the second tree on the residuals
Output:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=42, splitter='best')
y3 = y2 - tree_reg2.predict(X)  # residual errors of the second tree
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)  # train the third tree on the residuals
Output:
DecisionTreeRegressor(criterion='mse', max_depth=2, max_features=None,
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
presort=False, random_state=42, splitter='best')
X_new = np.array([[0.8]])
y_pred = sum(tree.predict(X_new) for tree in (tree_reg1, tree_reg2, tree_reg3))
y_pred
Output:
array([0.75026781])
The value above is the ensemble's prediction for a new instance: the sum of the predictions of the three trees.
Note: the predictions are summed, not averaged, because the second and third trees were fit to the residuals rather than to the target itself.
def plot_predictions(regressors, X, y, axes, label=None, style="r-", data_style="b.", data_label=None):
    x1 = np.linspace(axes[0], axes[1], 500)
    y_pred = sum(regressor.predict(x1.reshape(-1, 1)) for regressor in regressors)
    plt.plot(X[:, 0], y, data_style, label=data_label)
    plt.plot(x1, y_pred, style, linewidth=2, label=label)
    if label or data_label:
        plt.legend(fontsize=16, loc="upper center")
    plt.axis(axes)
plt.figure(figsize=(12,12))
plt.subplot(3,2,1)
plot_predictions([tree_reg1], X, y, axes=[-0.5,0.5,-0.1,0.8],label="$h_1(x_1)$",style="g-",data_label="Training set")
plt.title("Residuals and tree predictions",fontsize=16)
plt.ylabel("$y$",fontsize=16, rotation=0)
plt.subplot(3,2,2)
plot_predictions([tree_reg1], X, y, axes=[-0.5,0.5,-0.1,0.8],label="$h(x_1)=h_1(x_1)$",data_label="Training set")
plt.title("Ensemble predictions", fontsize=16)
plt.ylabel("$y$",fontsize=16, rotation=0)
plt.subplot(3,2,3)
plot_predictions([tree_reg2],X,y2,axes=[-0.5, 0.5, -0.5, 0.5],label="$h_2(x_1)$",style="g-",data_style="k+",data_label="Residuals")
plt.ylabel("$y-h_1(x_1)$",fontsize=16)
plt.subplot(3,2,4)
plot_predictions([tree_reg1, tree_reg2],X,y,axes=[-0.5, 0.5, -0.1, 0.8],label="$h(x_1)=h_1(x_1)+h_2(x_1)$")
plt.ylabel("$y$",fontsize=16, rotation=0)
plt.subplot(3,2,5)
plot_predictions([tree_reg3],X,y3,axes=[-0.5, 0.5, -0.5, 0.5],label="$h_3(x_1)$",style="g-",data_style="k+")
plt.ylabel("$y-h_1(x_1)-h_2(x_1)$",fontsize=16)
plt.xlabel("$x_1$",fontsize=16)
plt.subplot(3,2,6)
plot_predictions([tree_reg1, tree_reg2, tree_reg3],X,y,axes=[-0.5, 0.5, -0.1, 0.8],label="$h(x_1)=h_1(x_1)+h_2(x_1)+h_3(x_1)$")
plt.xlabel("$x_1$",fontsize=16)
plt.ylabel("$y$",fontsize=16, rotation=0)
plt.tight_layout()
plt.show()
Output:
In the figure above, the left column shows the three trees trained sequentially (the second and third on the residuals), and the right column shows the corresponding cumulative ensemble predictions.
The GBRT above was built by hand; sklearn's GradientBoostingRegressor class does the same thing much more conveniently, with hyperparameters similar to those of RandomForestRegressor:
from sklearn.ensemble import GradientBoostingRegressor
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3, learning_rate=1.0, random_state=42)
gbrt.fit(X,y)
Output:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=1.0, loss='ls', max_depth=2,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=3,
n_iter_no_change=None, presort='auto',
random_state=42, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)
gbrt_slow = GradientBoostingRegressor(max_depth=2,n_estimators=200,learning_rate=0.1,random_state=42)
gbrt_slow.fit(X,y)
Output:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=2,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=200,
n_iter_no_change=None, presort='auto',
random_state=42, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)
fig, axes = plt.subplots(ncols=2, figsize=(12, 5), sharey=True)
plt.sca(axes[0])
plot_predictions([gbrt],X,y,axes=[-0.5,0.5,-0.1,0.8],label="Ensemble predictions")
plt.ylabel("$y$",fontsize=16,rotation=0)
plt.xlabel("$x_1$",fontsize=16)
plt.title("learning_rate={},n_estimators={}".format(gbrt.learning_rate,gbrt.n_estimators),fontsize=18)
plt.sca(axes[1])
plot_predictions([gbrt_slow],X,y,axes=[-0.5,0.5,-0.1,0.8],label="Ensemble predictions")
plt.xlabel("$x_1$",fontsize=16)
plt.title("learning_rate={},n_estimators={}".format(gbrt_slow.learning_rate,gbrt_slow.n_estimators),fontsize=18)
plt.tight_layout()
plt.show()
Output:
The figure above shows the two GradientBoostingRegressor models: the ensemble on the left does not have enough trees to fit the data well (it underfits), while the ensemble on the right has too many trees and clearly overfits.
How do you find the right number of trees? One option is early stopping, as in the following example, which uses staged_predict() to measure the validation error at each stage:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)
gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120, random_state=42)
gbrt.fit(X_train, y_train)
errors = [mean_squared_error(y_val, y_pred) for y_pred in gbrt.staged_predict(X_val)]
best_n_estimators = np.argmin(errors)+1
gbrt_best = GradientBoostingRegressor(max_depth=2,n_estimators=best_n_estimators, random_state=42)
gbrt_best.fit(X_train, y_train)
Output:
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
learning_rate=0.1, loss='ls', max_depth=2,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, n_estimators=56,
n_iter_no_change=None, presort='auto',
random_state=42, subsample=1.0, tol=0.0001,
validation_fraction=0.1, verbose=0, warm_start=False)
min_error = np.min(errors)
plt.figure(figsize=(12,5))
plt.subplot(121)
plt.plot(errors, "b.-")
plt.plot([best_n_estimators, best_n_estimators],[0,min_error],"k--")
plt.plot([0,120],[min_error,min_error],"k--")
plt.plot(best_n_estimators,min_error,"ro")
plt.text(best_n_estimators, min_error*1.2, "Minimum", ha="center", fontsize=14)
plt.axis([0,120,0,0.006])
plt.xlabel("Number of trees",fontsize=16)
plt.ylabel("Error",fontsize=16)
plt.title("Validation error",fontsize=18)
plt.subplot(122)
plot_predictions([gbrt_best],X,y,axes=[-0.5,0.5,-0.1,0.8])
plt.title("Best model (%d trees)" % best_n_estimators, fontsize=18)
plt.ylabel("$y$",fontsize=16,rotation=0)
plt.xlabel("$x_1$",fontsize=16)
plt.tight_layout()
plt.show()
Output:
In the figure above, the left plot shows how the validation error changes with the number of trees: the red dot marks the minimum error, reached with 56 trees, after which the error starts climbing again. The right plot shows the model retrained with that best number of trees (56).
Training a full ensemble first and then retraining with the best number of trees is a bit wasteful. Alternatively, you can set warm_start=True, which makes GradientBoostingRegressor keep its existing trees when fit() is called again, allowing incremental training; a simple loop can then stop adding trees once the validation error has not improved for several consecutive iterations:
gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True, random_state=42)
min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    y_pred = gbrt.predict(X_val)
    val_error = mean_squared_error(y_val, y_pred)
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break  # early stopping
print(gbrt.n_estimators)
Output:
61
print("Minimum validation MSE:", min_val_error)
Output:
Minimum validation MSE: 0.002712853325235463
GradientBoostingRegressor also supports a subsample hyperparameter, which specifies the fraction of training instances used to train each tree. For example, subsample=0.25 means each tree is trained on 25% of the training instances, selected at random; this speeds up training considerably and trades a little more bias for lower variance (the technique is known as Stochastic Gradient Boosting).
Gradient Boosting can also use a different loss function, selected with the loss hyperparameter. A brief sketch of both options follows.
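Not shown in the original text; a minimal illustrative configuration (the hyperparameter values are arbitrary), reusing the regression split from the early-stopping example:
# Stochastic gradient boosting: each tree sees a random 25% of the training set;
# loss="huber" swaps the default least-squares loss for a more outlier-robust one.
sgbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120,
                                  subsample=0.25, loss="huber",
                                  random_state=42)
sgbrt.fit(X_train, y_train)
print(mean_squared_error(y_val, sgbrt.predict(X_val)))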
5.3 XGBoost
The XGBoost library, originally developed by Tianqi Chen, is an optimized gradient boosting implementation whose goals are speed, scalability, and ease of use.
import xgboost
xgb_reg = xgboost.XGBRegressor(random_state=42)
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)
val_error = mean_squared_error(y_val, y_pred)
print("Validataion MSE:",val_error)
Output:
Validataion MSE: 0.00400040950714611
xgb_reg.fit(X_train, y_train,eval_set=[(X_val, y_val)],early_stopping_rounds=2)
y_pred = xgb_reg.predict(X_val)
val_error = mean_squared_error(y_val, y_pred)
print("Validataion MSE:",val_error)
Output:
[0] validation_0-rmse:0.22834
Will train until validation_0-rmse hasn't improved in 2 rounds.
[1] validation_0-rmse:0.16224
[2] validation_0-rmse:0.11843
[3] validation_0-rmse:0.08760
[4] validation_0-rmse:0.06848
[5] validation_0-rmse:0.05709
[6] validation_0-rmse:0.05297
[7] validation_0-rmse:0.05129
[8] validation_0-rmse:0.05155
[9] validation_0-rmse:0.05211
Stopping. Best iteration:
[7] validation_0-rmse:0.05129
Validataion MSE: 0.0026308690413069744
%timeit xgboost.XGBRegressor().fit(X_train, y_train)
Output:
10.8 ms ± 1.15 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit GradientBoostingRegressor().fit(X_train, y_train)
Output:
11.7 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As the timings above show, xgboost trains slightly faster than sklearn's GradientBoostingRegressor on this small dataset.
6. Stacking
Stacking (short for stacked generalization): instead of aggregating the base models' predictions with a trivial rule such as voting or averaging, train a final model (a blender, or meta learner) that takes the base models' predictions as inputs and produces the final prediction.
The sklearn version used here (0.21) has no built-in stacking tools, so you can either roll your own implementation or use an open-source library such as brew; later sklearn releases (0.22+) add StackingClassifier and StackingRegressor. A hand-rolled sketch follows.
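The original text gives no stacking code; below is a minimal hand-rolled sketch (illustrative only, not the brew API), assuming the moons classification split (X_train, X_test, y_train, y_test) from section 1: the base models are trained on one part of the training set and a blender is trained on their predictions for a held-out part.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Split the training data: one part for the base layer, a hold-out part for the blender
X_base, X_hold, y_base, y_hold = train_test_split(X_train, y_train, random_state=42)
base_models = [RandomForestClassifier(n_estimators=100, random_state=42),
               SVC(gamma="scale", probability=True, random_state=42)]
for model in base_models:
    model.fit(X_base, y_base)

# The blender is trained on the base models' predicted probabilities for the hold-out set
hold_features = np.column_stack([m.predict_proba(X_hold)[:, 1] for m in base_models])
blender = LogisticRegression(solver="lbfgs", random_state=42)
blender.fit(hold_features, y_hold)

# To predict, run the base models first, then feed their outputs to the blender
test_features = np.column_stack([m.predict_proba(X_test)[:, 1] for m in base_models])
print(accuracy_score(y_test, blender.predict(test_features)))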