
XGBoost: A Beginner's Hands-On Introduction


This post records what I learned while getting hands-on with XGBoost. Anaconda does not ship with XGBoost, so install it first: open an Anaconda Prompt and run

pip install xgboost
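
If you manage packages with conda instead, the package is also available from the conda-forge channel:

conda install -c conda-forge xgboost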

XGBoost is a boosting algorithm. The idea behind boosting is to combine many weak classifiers into a single strong classifier. Since XGBoost is a boosted-tree model, it ensembles many tree models into one strong classifier, and the base tree model it uses is the CART regression tree.
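
To make the additive idea concrete, here is a minimal gradient-boosting sketch (an illustration, not XGBoost's actual implementation): each round fits a small CART regression tree to the residuals of the current ensemble, using sklearn's DecisionTreeRegressor as the base learner. The data, learning rate, and round count are arbitrary illustrative values.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1      # shrink each tree's contribution
pred = np.zeros_like(y)  # start from a constant zero prediction
trees = []
for _ in range(50):      # 50 boosting rounds
    residual = y - pred                      # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)  # add the new tree's correction
    trees.append(tree)

print("final training MSE: %.4f" % np.mean((y - pred) ** 2))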

Now let's work through a dataset; the steps are the usual routine:

  1. Import the libraries
  2. Load the dataset
  3. Separate the features and the label
  4. Split into training and test sets
  5. Instantiate the model
  6. Fit it
  7. Predict
  8. Check the accuracy
# First XGBoost model for the Pima Indians diabetes dataset
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load the data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]  # features
Y = dataset[:,8]    # label
# split data into train and test sets
seed = 7          # fix the random seed so every run produces the same split
test_size = 0.33  # hold out 33% as the test set

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit the model on the training data
model = XGBClassifier()      # instantiate the model
model.fit(X_train, y_train)  # the usual fit

# make predictions for the test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate prediction accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Accuracy: 74.02%

XGBoost adds one tree per boosting round. To see the evaluation result after each round is added, add the line

eval_set = [(X_test, y_test)]

and change the fit call to

model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)

and that's all there is to it.

# Same as above, except the evaluation result is printed after every boosting round
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit the model on the training data
model = XGBClassifier()

eval_set = [(X_test, y_test)]  # evaluation set: reports the metric after each added tree

# early_stopping_rounds=10: stop if the metric has not improved for 10 consecutive rounds
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

[0] validation_0-logloss:0.60491
Will train until validation_0-logloss hasn't improved in 10 rounds.
[1] validation_0-logloss:0.55934
[2] validation_0-logloss:0.53068
[3] validation_0-logloss:0.51795
[4] validation_0-logloss:0.51153
[5] validation_0-logloss:0.50935
[6] validation_0-logloss:0.50818
[7] validation_0-logloss:0.51097
[8] validation_0-logloss:0.51760
[9] validation_0-logloss:0.51912
[10] validation_0-logloss:0.52503
[11] validation_0-logloss:0.52697
[12] validation_0-logloss:0.53335
[13] validation_0-logloss:0.53905
[14] validation_0-logloss:0.54546
[15] validation_0-logloss:0.54613
[16] validation_0-logloss:0.54982
Stopping. Best iteration:
[6] validation_0-logloss:0.50818

Accuracy: 74.41%
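
Note that the accuracy above still comes from predicting with the fully fitted model. After early stopping, the model records the best round ([6] here); here is a small sketch that predicts using only the trees up to that round, assuming xgboost >= 1.4, where the sklearn wrapper exposes best_iteration and predict accepts iteration_range:

# restrict prediction to the best boosting rounds found by early stopping
# (iteration_range requires xgboost >= 1.4; older versions used ntree_limit)
best = model.best_iteration
y_pred_best = model.predict(X_test, iteration_range=(0, best + 1))
print("Accuracy (best iteration): %.2f%%" % (accuracy_score(y_test, y_pred_best) * 100.0))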

Inspecting feature importance is also straightforward:

from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance  # for visualizing feature importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit the model on the full data
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()

[Figure: feature importance bar chart produced by plot_importance]
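
By default plot_importance ranks features by how often they are used in splits ("weight"). Other criteria can be pulled from the underlying booster; a small sketch (note that feature names default to f0..f7 when fitting on a plain NumPy array):

# rank features by average gain instead of split count
scores = model.get_booster().get_score(importance_type='gain')
for feature, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(feature, round(score, 3))

# plot_importance accepts the same choice directly
plot_importance(model, importance_type='gain')
pyplot.show()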

Machine learning always involves hyperparameter tuning, and the GridSearchCV utility handles it. Taking the learning rate as an example:

# Hyperparameter tuning: find which parameter value gives the best result
# e.g. tune learning_rate
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# grid search
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)  # 10-fold stratified cross-validation
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print("%f  with: %r" % (mean, param))

Best: -0.530152 using {'learning_rate': 0.01}
-0.689563 with: {'learning_rate': 0.0001}
-0.660868 with: {'learning_rate': 0.001}
-0.530152 with: {'learning_rate': 0.01}
-0.552723 with: {'learning_rate': 0.1}
-0.653341 with: {'learning_rate': 0.2}
-0.718789 with: {'learning_rate': 0.3}
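
The same pattern extends to several parameters at once, since GridSearchCV evaluates every combination in the grid. A sketch with illustrative values, reusing model and kfold from above:

# tune learning_rate and max_depth jointly: 3 x 3 = 9 candidate combinations
param_grid = dict(
    learning_rate=[0.01, 0.1, 0.3],
    max_depth=[3, 5, 7],
)
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))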

The parameters we usually tune:

1. learning rate
2. tree parameters:
   max_depth
   min_child_weight
   subsample, colsample_bytree
   gamma
3. regularization parameters:
   lambda
   alpha

Here is a rough summary of the parameters XGBoost exposes; a more detailed explanation of each one can be found at:

https://www.cnblogs.com/wj-1314/p/9402324.html

xgb1 = XGBClassifier(
    learning_rate=0.1,    # keep the learning rate low
    n_estimators=1000,    # number of trees
    max_depth=5,          # maximum tree depth
    min_child_weight=1,
    gamma=0,
    subsample=0.8,
    colsample_bytree=0.8,
    objective='binary:logistic',
    nthread=4,
    scale_pos_weight=1,
    seed=27)
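
As a quick sanity check, a hand-configured model like this can be cross-validated before any further tuning. A minimal sketch, assuming X, Y and the kfold splitter from the grid-search example are still in scope:

from sklearn.model_selection import cross_val_score

# cross-validated accuracy of the configured model
scores = cross_val_score(xgb1, X, Y, scoring="accuracy", cv=kfold)
print("Accuracy: %.2f%% (+/- %.2f%%)" % (scores.mean() * 100, scores.std() * 100))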

Original article: https://blog.csdn.net/qq_43653405/article/details/107569186