How to Prevent Overfitting in XGBoost


Overfitting is a common problem when using complex nonlinear learning algorithms such as gradient boosting; earlier posts on this blog have covered it in detail.
This post focuses on using early stopping to avoid overfitting with XGBoost.

Dataset used in this project:
Pima Indians Diabetes Data Set
The dataset contains medical records of Pima Indians and whether each patient developed diabetes within five years. All values are numeric, and the task is binary classification (diabetes onset: 1 or 0). There are 8 input attributes and one class label with two classes (0/1); a quick sanity check of the file follows the list.
  [1] Pregnancies: number of times pregnant
  [2] Glucose: plasma glucose concentration
  [3] BloodPressure: diastolic blood pressure (mm Hg)
  [4] SkinThickness: triceps skin fold thickness (mm)
  [5] Insulin: 2-hour serum insulin (mu U/ml)
  [6] BMI: body mass index (weight in kg / (height in m)^2)
  [7] DiabetesPedigreeFunction: diabetes pedigree function
  [8] Age: age (years)
   # class label
  [9] Outcome: class variable (0 or 1)
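
As a quick sanity check (a minimal sketch, assuming the dataset is saved locally as pima-indians-diabetes.csv with no header row), we can confirm the shape and class balance before training:

# sanity-check the dataset: shape and class balance
from numpy import loadtxt, unique
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
print(X.shape, Y.shape)  # expected: (768, 8) (768,)
labels, counts = unique(Y, return_counts=True)
print(dict(zip(labels, counts)))  # roughly 500 negatives vs 268 positives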

The code below trains the model on 67% of the data and evaluates it on the remaining 33% at every boosting round. The classification error is printed each round, and the final classification accuracy is printed at the end.

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 10:28:59 2019

@author: ZQQ
"""

# monitor training performance
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")

# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]

# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

# fit the model on the training data, reporting the classification error
# on the held-out test set after every boosting round
# (note: newer XGBoost releases expect eval_metric in the XGBClassifier
# constructor rather than in fit())
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, eval_metric="error", eval_set=eval_set, verbose=True)

# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Output:
[screenshot of the per-round validation error, omitted]
Looking at the full output, we can see that towards the end of training the model's performance on the test set flattens out and even starts to get worse.

Visualizing training with learning curves:
We can retrieve the model's performance on each evaluation set and plot it, giving a clearer picture of how the learning curves evolve over the whole training run.
The Python 3 code:

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 10:45:21 2019

@author: ZQQ
"""

# plot learning curve
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)
# fit the model, evaluating on both the training and the test set each round
model = XGBClassifier()
eval_set = [(X_train, y_train), (X_test, y_test)]
model.fit(X_train, y_train, eval_metric=["error", "logloss"], eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
# retrieve the metrics recorded during training
# (validation_0 is the training set, validation_1 the test set,
# matching the order of eval_set)
results = model.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)

# plot log loss
fig1, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['logloss'], label='Train')
ax.plot(x_axis, results['validation_1']['logloss'], label='Test')
ax.legend()
pyplot.ylabel('Log Loss')
pyplot.title('XGBoost Log Loss')
pyplot.show()

# plot classification error
fig2, ax = pyplot.subplots()
ax.plot(x_axis, results['validation_0']['error'], label='Train')
ax.plot(x_axis, results['validation_1']['error'], label='Test')
ax.legend()
pyplot.ylabel('Classification Error')
pyplot.title('XGBoost Classification Error')
pyplot.show()

Output:
[screenshot of the per-round metrics, omitted]
[figure: XGBoost Log Loss, train vs. test]
[figure: XGBoost Classification Error, train vs. test]
The first plot shows the model's log loss on both datasets at each boosting round; the second shows the classification error.
From the first plot, the test log loss stops improving after roughly 20 rounds, so there appears to be an opportunity for early stopping somewhere around rounds 20 to 40.
The second plot tells a similar story: around round 40 looks reasonable, after which the test error begins to rise.
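
Rather than eyeballing the curves, we can also read the turning point straight out of the recorded metrics (a small sketch that reuses the results dict returned by model.evals_result() in the script above):

# locate the boosting round with the lowest test-set metric
import numpy as np
best_logloss_round = int(np.argmin(results['validation_1']['logloss']))
best_error_round = int(np.argmin(results['validation_1']['error']))
print("lowest test log loss at round", best_logloss_round)
print("lowest test error at round", best_error_round)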

Early Stopping in XGBoost
XGBoost supports stopping training early once a given number of rounds has passed without improvement.
In addition to the evaluation metric and evaluation dataset used each round, we must specify a window: the number of consecutive rounds with no improvement after which training stops. It is set with the early_stopping_rounds parameter.
The Python 3 code:

# -*- coding: utf-8 -*-
"""
Created on Wed Jul 17 11:13:15 2019

@author: ZQQ
"""

# early stopping
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit the model with early stopping: training halts once the test-set
# log loss has not improved for 10 consecutive rounds
model = XGBClassifier()
eval_set = [(X_test, y_test)]
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Output:
[screenshot of the early-stopping training log, omitted]
We can see that training stops at round 42, with the best result having been observed at round 32.
Setting early_stopping_rounds to a fraction of the total number of training rounds (10% in this example, given the default 100 boosting rounds), or reading it off the learning curves so that training runs past the inflection point, are both reasonable choices.
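
After early stopping, the fitted model records where it stopped; a minimal sketch (assuming an XGBoost version whose sklearn wrapper exposes the best_iteration and best_score attributes, as recent releases do):

# inspect where early stopping landed
print("best round:", model.best_iteration)      # e.g. 32 in the run above
print("best test log loss:", model.best_score)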

The code, along with the dataset, will be posted on GitHub later. (The dataset is public and can also be downloaded on its own.)

References:

https://www.cnblogs.com/xxtalhr/p/10859517.html

https://machinelearningmastery.com/avoid-overfitting-by-early-stopping-with-xgboost-in-python/

https://coolboygym.github.io/2018/12/15/early-stop-in-xgboost/

