[ML] Ensemble Model APIs and Examples
Ensemble methods
The idea behind ensemble models is to combine the predictions of a set of base estimators in order to improve the generalizability and robustness of the overall model. Two families of ensemble methods are commonly used.
Averaging methods: several base estimators are built independently and their predictions are averaged. On average, the combined estimator is better than any single base estimator because its variance is reduced. Representative methods include bagging methods and forests of randomized trees.
Boosting methods: base estimators are built sequentially, and each newly added estimator tries to reduce the bias of the current combined ensemble, so that several weak models can be combined into a powerful ensemble. Representative methods include AdaBoost and Gradient Tree Boosting.
Bagging meta-estimator
In ensemble algorithms, bagging methods form a class of algorithms which build several instances of a black-box estimator on random subsets of the original training set and then aggregate their individual predictions to form a final prediction. These methods are used as a way to reduce the variance of a base estimator. As they provide a way to reduce overfitting, bagging methods work best with strong and complex models (e.g., fully developed decision trees), in contrast with boosting methods which usually work best with weak models (e.g., shallow decision trees).
In scikit-learn, bagging is provided by BaggingClassifier/BaggingRegressor. The parameters max_samples and max_features control the size of the random subsets, bootstrap and bootstrap_features control whether samples and features are drawn with or without replacement, and oob_score enables evaluating the model on the out-of-bag (held-out) samples.
The following code builds a bagging ensemble of KNN classifiers, where each base estimator is trained on at most 70% of the samples and 70% of the features:
from sklearn.ensemble import BaggingClassifier as BC
from sklearn.neighbors import KNeighborsClassifier as KN
bagging = BC(KN(), max_samples=0.7, max_features=0.7)
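A minimal usage sketch of this ensemble follows; the iris dataset is only a stand-in, and enabling the out-of-bag estimate via oob_score requires bootstrap sampling with replacement:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier as BC
from sklearn.neighbors import KNeighborsClassifier as KN

X, y = load_iris(return_X_y=True)  # stand-in dataset for illustration
# bootstrap=True (sampling with replacement) is required for the out-of-bag estimate
bagging = BC(KN(), max_samples=0.7, max_features=0.7, bootstrap=True, oob_score=True)
bagging.fit(X, y)
print(bagging.oob_score_)  # accuracy estimated on the out-of-bag samples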
Forests of randomized trees
The sklearn.ensemble module contains two averaging algorithms based on randomized decision trees: random forests and extremely randomized trees (Extra-Trees).
Random Forests
Because a single decision tree has high variance and tends to overfit, injecting randomness into a forest of trees reduces the variance and mitigates overfitting, at the cost of a slight increase in the overall bias of the prediction.
In contrast to the original publication, the scikit-learn implementation combines classifiers by averaging their probabilistic prediction, instead of letting each classifier vote for a single class.
Extremely Randomized Trees
Extremely randomized trees (Extra-Trees) share the overall structure of random forests but inject even more randomness into the model: in addition to choosing candidate features at random, the split thresholds are also drawn at random for each candidate feature instead of searching for the most discriminative threshold.
from sklearn.datasets import make_blobs
from sklearn.model_selection import cross_val_score as CVS
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.ensemble import RandomForestClassifier as RFC, ExtraTreesClassifier as ETC

def Bagging_models():
    X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=729)
    clf = DTC(max_depth=None, min_samples_split=2, random_state=729)  # single decision tree
    scores = CVS(clf, X, y, cv=5)  # 5-fold cross-validation
    print(scores.mean())
    clf = RFC(n_estimators=10, max_depth=None, min_samples_split=2, random_state=729)  # random forest
    scores = CVS(clf, X, y, cv=5)
    print(scores.mean())
    clf = ETC(n_estimators=10, max_depth=None, min_samples_split=2, random_state=729)  # extremely randomized trees
    scores = CVS(clf, X, y, cv=5)
    print(scores.mean())
Parameters
The parameters with the largest effect on model quality are n_estimators and max_features. A common rule of thumb is to set max_features=None (consider all features at each split) for regression problems and max_features='sqrt' for classification problems.
The space complexity of the model is $\mathcal{O}(M \times N \times \log(N))$, where $M$ is the number of base estimators and $N$ is the number of samples.
Parallelization
The trees of the forest can be built in parallel and predictions computed in parallel; this is controlled by the n_jobs parameter.
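A hedged sketch of the two rules of thumb above together with n_jobs; the dataset and parameter values are only illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X, y = make_classification(n_samples=1000, n_features=20, random_state=729)  # toy classification data
# classification: max_features='sqrt' by rule of thumb; n_jobs=-1 builds trees on all CPU cores
clf = RandomForestClassifier(n_estimators=200, max_features='sqrt', n_jobs=-1, random_state=729).fit(X, y)
# regression: the rule of thumb is max_features=None (all features considered at each split)
reg = RandomForestRegressor(n_estimators=200, max_features=None, n_jobs=-1, random_state=729)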
AdaBoost
AdaBoost can be used for both classification and regression problems.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score as CVS
from sklearn.ensemble import AdaBoostClassifier as ABC

def AdaBoost_models():
    X, y = load_iris(return_X_y=True)
    clf = ABC(n_estimators=100)  # 100 weak base estimators (decision stumps by default)
    scores = CVS(clf, X, y, cv=5)  # 5-fold cross-validation
    print(scores.mean())
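A corresponding regression sketch using AdaBoostRegressor; the Friedman #1 dataset and the hyperparameters here are only illustrative:

from sklearn.datasets import make_friedman1
from sklearn.model_selection import cross_val_score as CVS
from sklearn.ensemble import AdaBoostRegressor as ABR

def AdaBoost_regression_models():
    X, y = make_friedman1(n_samples=1200, noise=1.0, random_state=729)
    reg = ABR(n_estimators=100)  # default base estimator is a depth-3 decision tree regressor
    scores = CVS(reg, X, y, cv=5)  # R^2 scores from 5-fold cross-validation
    print(scores.mean())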
Gradient Tree Boosting
GBDT generalizes boosting to arbitrary differentiable loss functions. It can be used for both classification and regression problems; the most important parameters of the model are n_estimators and learning_rate.
Scikit-learn 0.21 introduces two new experimental implementations of gradient boosting trees, namely HistGradientBoostingClassifier and HistGradientBoostingRegressor, inspired by LightGBM.
Classification
GBDT supports both binary and multi-class classification problems.
from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier as GBC

def GBC_models():
    X, y = make_hastie_10_2(random_state=729)
    X_train, X_test = X[:2000], X[2000:]
    y_train, y_test = y[:2000], y[2000:]
    # 100 boosting stages of depth-1 regression trees (decision stumps)
    clf = GBC(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0).fit(X_train, y_train)
    print(clf.score(X_test, y_test))
For multi-class problems, n_classes regression trees have to be built at each boosting iteration, so the total number of base estimators is n_classes * n_estimators. When the number of classes is large, HistGradientBoostingClassifier is recommended instead.
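A hedged sketch of the histogram-based classifier on a multi-class dataset; the dataset and settings are illustrative, and on scikit-learn versions before 1.0 the experimental import must be enabled first:

from sklearn.datasets import make_gaussian_quantiles
# from sklearn.experimental import enable_hist_gradient_boosting  # needed on scikit-learn < 1.0
from sklearn.ensemble import HistGradientBoostingClassifier

X, y = make_gaussian_quantiles(n_samples=13000, n_features=10, n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = X[:3000], X[3000:], y[:3000], y[3000:]
# a single histogram-based model handles all three classes; max_iter plays the role of n_estimators
clf = HistGradientBoostingClassifier(max_iter=100, learning_rate=0.1).fit(X_train, y_train)
print(clf.score(X_test, y_test))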
Regression
For regression problems, GBDT supports different loss functions, selected through the loss parameter.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor as GBR
from sklearn.metrics import mean_squared_error as mse

def GBR_models():
    X, y = make_friedman1(n_samples=1200, noise=1.0, random_state=729)
    X_train, X_test = X[:200], X[200:]
    y_train, y_test = y[:200], y[200:]
    # loss='ls' is least squares; it was renamed to 'squared_error' in recent scikit-learn versions
    clf = GBR(n_estimators=100, learning_rate=0.1, max_depth=1, random_state=0, loss='ls').fit(X_train, y_train)
    print(mse(y_test, clf.predict(X_test)))
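As a hedged variation of the example above, the loss can be switched to the Huber loss, which is more robust to outliers; the alpha value here is only illustrative. The estimator is then fitted exactly as in GBR_models:

from sklearn.ensemble import GradientBoostingRegressor as GBR

# Huber loss blends squared and absolute error; alpha sets the quantile at which they switch
reg_huber = GBR(n_estimators=100, learning_rate=0.1, max_depth=1,
                random_state=0, loss='huber', alpha=0.9)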
Case: Multi-class AdaBoosted Decision Trees
The classification dataset has 10-dimensional features drawn from a standard normal distribution and labels from 3 classes. The example compares the two built-in algorithms SAMME and SAMME.R: SAMME.R updates the additive model with class probability estimates, while SAMME uses only the predicted class labels.
from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier as ABC
from sklearn.tree import DecisionTreeClassifier as DTC
from sklearn.metrics import accuracy_score as ACC
import matplotlib.pyplot as plt

def multi_class_adaboost_demo():
    X, y = make_gaussian_quantiles(n_samples=13000, n_features=10, n_classes=3, random_state=1)
    n_split = 3000
    X_train, X_test = X[:n_split], X[n_split:]
    y_train, y_test = y[:n_split], y[n_split:]
    # SAMME.R (default algorithm): uses class probability estimates
    bdt_real = ABC(DTC(max_depth=2), n_estimators=600, learning_rate=1)
    # SAMME: uses only the predicted class labels
    bdt_dis = ABC(DTC(max_depth=2), n_estimators=600, learning_rate=1.5, algorithm='SAMME')
    bdt_real.fit(X_train, y_train)
    bdt_dis.fit(X_train, y_train)
    # record the test error after each boosting iteration
    real_test_errors = []
    dis_test_errors = []
    for real_test_pred, dis_test_pred in zip(bdt_real.staged_predict(X_test), bdt_dis.staged_predict(X_test)):
        real_test_errors.append(1 - ACC(real_test_pred, y_test))
        dis_test_errors.append(1 - ACC(dis_test_pred, y_test))
    n = len(real_test_errors)
    plt.plot(range(n), real_test_errors, 'r--', lw=1, alpha=1, label='SAMME.R')
    plt.plot(range(n), dis_test_errors, 'g-', lw=1, alpha=0.7, label='SAMME')
    plt.legend()
    plt.show()
References
Ensemble methods user guide
scikit-learn AdaBoost docs
Random forests and extremely randomized trees (随机森林和极端随机树)