[ML] Ensemble Model APIs and Examples
Ensemble methods
The idea behind ensemble models is to combine the predictions of a set of base estimators in order to improve the generalizability and robustness of the overall model. Two families of ensemble methods are commonly used.
Averaging methods: several base estimators are built independently and their predictions are averaged. On average, the combined estimator is better than any single base estimator because its variance is reduced. Representative methods include bagging methods and forests of randomized trees.
Boosting methods: base estimators are built sequentially, and each newly added estimator tries to reduce the bias of the current combined ensemble, so that several weak models can be combined into a powerful ensemble. Representative methods include AdaBoost and Gradient Tree Boosting.
Bagging meta-estimator
In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions to form a final prediction. These methods are used as a way to reduce the variance of a base estimator. As they provide a way to reduce overfitting, bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).
In scikit-learn, bagging is provided by BaggingClassifier/BaggingRegressor. The parameters max_samples and max_features control the size of the random subsets, bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement, and oob_score enables evaluating the model on the out-of-bag (held-out) samples.
The following code builds a bagging ensemble of KNN classifiers, where each base estimator is trained on at most 70% of the samples and 70% of the features:
from sklearn.ensemble import BaggingClassifier as BC
from sklearn.neighbors import KNeighborsClassifier as KN
bagging = BC(KN(), max_samples=0.7, max_features=0.7)
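A minimal usage sketch of this ensemble follows; the iris dataset is only a stand-in, and enabling the out-of-bag estimate via oob_score requires bootstrap sampling with replacement:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier as BC
from sklearn.neighbors import KNeighborsClassifier as KN

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration
# bootstrap=True (sampling with replacement) is required for the out-of-bag estimate
bagging = BC(KN(), max_samples=0.7, max_features=0.7, bootstrap=True, oob_score=True)
bagging.fit(X, y)
print(bagging.oob_score_)  # accuracy estimated on the out-of-bag samples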
Forests of randomized trees
The sklearn.ensemble module contains two averaging algorithms based on randomized decision trees: random forests and extremely randomized trees (Extra-Trees).
Random Forests
Because a single decision tree has high variance and tends to overfit, injecting randomness into a forest of trees reduces the variance and mitigates overfitting, at the cost of a slight increase in the overall bias of the prediction.
In contrast to the original publication, the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
Extremely Randomized Trees
Extremely randomized trees (Extra-Trees) share the overall structure of random forests but inject even more randomness into the model: in addition to choosing candidate features at random, the split thresholds are also drawn at random for each candidate feature instead of searching for the most discriminative threshold.
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score as CVS
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.ensemble import RandomForestClassifier as RFC, ExtraTreesClassifier as ETC

def Bagging_models():
    X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=729)
    clf = DTC(max_depth=None, min_samples_split=2, random_state=729)  # single decision tree
    scores = CVS(clf, X, y, cv=5)  # 5-fold cross-validation
    print(scores.mean())
    clf = RFC(n_estimators=10, max_depth=None, min_samples_split=2, random_state=729)  # random forest
    scores = CVS(clf, X, y, cv=5)
    print(scores.mean())
    clf = ETC(n_estimators=10, max_depth=None, min_samples_split=2, random_state=729)  # extremely randomized trees
    scores = CVS(clf, X, y, cv=5)
    print(scores.mean())
Parameters
The parameters with the largest effect on model quality are n_estimators and max_features. A common rule of thumb is to set max_features=None (consider all features at each split) for regression problems and max_features='sqrt' for classification problems.
The space complexity of the model is $\mathcal{O}(M \times N \times \log(N))$, where $M$ is the number of base estimators and $N$ is the number of samples.
Parallelization
The trees of the forest can be built in parallel and predictions computed in parallel; this is controlled by the n_jobs parameter.
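A hedged sketch of the two rules of thumb above together with n_jobs; the dataset and parameter values are only illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X, y = make_classification(n_samples=1000, n_features=20, random_state=729)  # toy classification data
# classification: max_features='sqrt' by rule of thumb; n_jobs=-1 builds trees on all CPU cores
clf = RandomForestClassifier(n_estimators=200, max_features='sqrt', n_jobs=-1, random_state=729).fit(X, y)
# regression: the rule of thumb is max_features=None (all features considered at each split)
reg = RandomForestRegressor(n_estimators=200, max_features=None, n_jobs=-1, random_state=729)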
AdaBoost
AdaBoost can be used for both classification and regression problems.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score as CVS
from sklearn.ensemble import AdaBoostClassifier as ABC

def AdaBoost_models():
    X, y = load_iris(return_X_y=True)
    clf = ABC(n_estimators=100)  # 100 weak base estimators (decision stumps by default)
    scores = CVS(clf, X, y, cv=5)  # 5-fold cross-validation
    print(scores.mean())
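A corresponding regression sketch using AdaBoostRegressor; the Friedman #1 dataset and the hyperparameters here are only illustrative:

from sklearn.datasets import make_friedman1
from sklearn.model_selection import cross_val_score as CVS
from sklearn.ensemble import AdaBoostRegressor as ABR

def AdaBoost_regression_models():
    X, y = make_friedman1(n_samples=1200, noise=1.0, random_state=729)
    reg = ABR(n_estimators=100)  # default base estimator is a depth-3 decision tree regressor
    scores = CVS(reg, X, y, cv=5)  # R^2 scores from 5-fold cross-validation
    print(scores.mean())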
Gradient Tree Boosting
GBDT generalizes boosting to arbitrary differentiable loss functions. It can be used for both classification and regression problems; the most important parameters of the model are n_estimators and learning_rate.
Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely HistGradientBoostingClassifier and HistGradientBoostingRegressor, inspired by LightGBM.
Classification
GBDT supports both binary and multi-class classification problems.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier as GBC

def GBC_models():
    X, y = make_hastie_10_2(random_state=729)
    X_train, X_test = X[:2000], X[2000:]
    y_train, y_test = y[:2000], y[2000:]
    # 100 boosting stages of depth-1 regression trees (decision stumps)
    clf = GBC(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))
For multi-class problems, n_classes regression trees have to be built at each boosting iteration, so the total number of base estimators is n_classes * n_estimators. When the number of classes is large, HistGradientBoostingClassifier is recommended instead.
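A hedged sketch of the histogram-based classifier on a multi-class dataset; the dataset and settings are illustrative, and on scikit-learn versions before 1.0 the experimental import must be enabled first:

from sklearn.datasets import make_gaussian_quantiles
# from sklearn.experimental import enable_hist_gradient_boosting  # needed on scikit-learn < 1.0
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_gaussian_quantiles(n_samples=13000, n_features=10, n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = X[:3000], X[3000:], y[:3000], y[3000:]
# a single histogram-based model handles all three classes; max_iter plays the role of n_estimators
clf = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1).fit(X_train, y_train)
print(clf.score(X_test, y_test))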
Regression
For regression problems, GBDT supports different loss functions, selected through the loss parameter.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.metrics import mean_squared_error as mse

def GBR_models():
    X, y = make_friedman1(n_samples=1200, noise=1.0, random_state=729)
    X_train, X_test = X[:200], X[200:]
    y_train, y_test = y[:200], y[200:]
    # loss='ls' is least squares; it was renamed to 'squared_error' in recent scikit-learn versions
    clf = GBR(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
    print(mse(y_test, clf.predict(X_test)))
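As a hedged variation of the example above, the loss can be switched to the Huber loss, which is more robust to outliers; the alpha value here is only illustrative. The estimator is then fitted exactly as in GBR_models:

from sklearn.ensemble import GradientBoostingRegressor as GBR

# Huber loss blends squared and absolute error; alpha sets the quantile at which they switch
reg_huber = GBR(n_estimators=100, learning_rate=0.1, max_depth=1,
                random_state=0, loss='huber', alpha=0.9)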
Case: Multi-class AdaBoosted Decision Trees
The classification dataset has 10-dimensional features drawn from a standard normal distribution and labels from 3 classes. The example compares the two built-in algorithms SAMME and SAMME.R: SAMME.R updates the additive model with class probability estimates, while SAMME uses only the predicted class labels.
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier as ABC
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.metrics import accuracy_score as ACC
import matplotlib.pyplot as plt

def multi_class_adaboost_demo():
    X, y = make_gaussian_quantiles(n_samples=13000, n_features=10, n_classes=3, random_state=1)
    n_split = 3000
    X_train, X_test = X[:n_split], X[n_split:]
    y_train, y_test = y[:n_split], y[n_split:]
    # SAMME.R (default algorithm): uses class probability estimates
    bdt_real = ABC(DTC(max_depth=2), n_estimators=600, learning_rate=1)
    # SAMME: uses only the predicted class labels
    bdt_dis = ABC(DTC(max_depth=2), n_estimators=600, learning_rate=1.5, algorithm='SAMME')
    bdt_real.fit(X_train, y_train)
    bdt_dis.fit(X_train, y_train)
    # record the test error after each boosting iteration
    real_test_errors = []
    dis_test_errors = []
    for real_test_pred, dis_test_pred in zip(bdt_real.staged_predict(X_test), bdt_dis.staged_predict(X_test)):
        real_test_errors.append(1 - ACC(real_test_pred, y_test))
        dis_test_errors.append(1 - ACC(dis_test_pred, y_test))
    n = len(real_test_errors)
    plt.plot(range(n), real_test_errors, 'r--', lw=1, alpha=1, label='SAMME.R')
    plt.plot(range(n), dis_test_errors, 'g-', lw=1, alpha=0.7, label='SAMME')
    plt.legend()
    plt.show()
References
Ensemble methods user guide
scikit-learn AdaBoost docs
Random forests and extremely randomized trees (随机森林和极端随机树)