A Beginner's First Hands-On with XGBoost
These notes record the XGBoost hands-on process I learned. Anaconda does not ship with XGBoost, so install it first: open an Anaconda Prompt and run pip install xgboost (if you prefer conda, conda install -c conda-forge xgboost also works).
XGBoost is one of the boosting algorithms. The idea behind boosting is to combine many weak classifiers into a single strong classifier. Because XGBoost is a boosted-tree model, it ensembles many tree models into a strong classifier, and the tree model it uses is the CART regression tree.
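To make the additive idea concrete, here is a minimal sketch of gradient boosting with squared loss (my own illustration, not XGBoost's actual implementation): each new tree is fit to the residuals of the current ensemble, and the prediction is simply the sum of the trees' outputs.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_fit(X, y, n_trees=10, lr=0.1):
    trees, pred = [], np.zeros(len(y))
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, y - pred)          # each new tree fits the current residuals
        pred += lr * tree.predict(X)   # the ensemble prediction is a running sum
        trees.append(tree)
    return trees

def boosted_predict(trees, X, lr=0.1):
    return lr * sum(tree.predict(X) for tree in trees)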
Now for a hands-on run on a dataset. It is the usual routine:
- import the libraries
- load the dataset
- separate the features and the label
- split into training and test sets
- instantiate the model
- fit it
- predict
- check the accuracy
import xgboost  # import the xgboost package
# First XGBoost model for Pima Indians dataset
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load the data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]  # features
Y = dataset[:,8]  # label
# split data into train and test sets
seed = 7  # fix the random state so each run gives the same split
test_size = 0.33  # hold out 33% as the test set
# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()  # instantiate the model
model.fit(X_train, y_train)  # the usual fit
# finally, make predictions for the test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]  # predict() already returns class labels, so round() is effectively a no-op
# evaluate the predictions: check the accuracy
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Accuracy: 74.02%
XGBoost adds one model (tree) at a time. To see the result after each added model, construct eval_set = [(X_test, y_test)] and change the fit call to model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True). That is all it takes.
# unlike the run above, this prints the validation metric after every boosting round
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)
# fit model on training data
model = XGBClassifier()
eval_set = [(X_test, y_test)]  # validation set evaluated after each added model
# early_stopping_rounds=10: stop if the metric has not improved for 10 consecutive rounds
model.fit(X_train, y_train, early_stopping_rounds=10, eval_metric="logloss", eval_set=eval_set, verbose=True)
# make predictions for test data
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
[0] validation_0-logloss:0.60491
Will train until validation_0-logloss hasn't improved in 10 rounds.
[1] validation_0-logloss:0.55934
[2] validation_0-logloss:0.53068
[3] validation_0-logloss:0.51795
[4] validation_0-logloss:0.51153
[5] validation_0-logloss:0.50935
[6] validation_0-logloss:0.50818
[7] validation_0-logloss:0.51097
[8] validation_0-logloss:0.51760
[9] validation_0-logloss:0.51912
[10] validation_0-logloss:0.52503
[11] validation_0-logloss:0.52697
[12] validation_0-logloss:0.53335
[13] validation_0-logloss:0.53905
[14] validation_0-logloss:0.54546
[15] validation_0-logloss:0.54613
[16] validation_0-logloss:0.54982
Stopping. Best iteration:
[6] validation_0-logloss:0.50818
Accuracy: 74.41%
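After early stopping, the fitted model records the best round. A quick way to inspect it (these attribute names follow the classic sklearn wrapper and may differ across xgboost versions):
print("best iteration:", model.best_iteration)  # 6 in the run above
print("best logloss:", model.best_score)        # 0.50818 in the run above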
Now suppose we want to see how important each feature is; that is also easy:
from numpy import loadtxt
from xgboost import XGBClassifier
from xgboost import plot_importance  # for plotting feature importance
from matplotlib import pyplot
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
y = dataset[:,8]
# fit the model on the full dataset
model = XGBClassifier()
model.fit(X, y)
# plot feature importance
plot_importance(model)
pyplot.show()
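By default plot_importance ranks features by "weight", i.e. how many times a feature is used to split. The function also accepts other importance types, and "gain" is often more informative:
plot_importance(model, importance_type='gain')
pyplot.show()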
Machine learning always involves parameter tuning, and the GridSearchCV utility is all we need. Take the learning rate as an example:
# tune a parameter and see which value gives the best result
# e.g. tune learning_rate
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# grid search
model = XGBClassifier()
learning_rate = [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3]
param_grid = dict(learning_rate=learning_rate)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)  # 10-fold stratified cross-validation
grid_search = GridSearchCV(model, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
params = grid_result.cv_results_['params']
for mean, param in zip(means, params):
    print("%f with: %r" % (mean, param))
Best: -0.530152 using {'learning_rate': 0.01}
-0.689563 with: {'learning_rate': 0.0001}
-0.660868 with: {'learning_rate': 0.001}
-0.530152 with: {'learning_rate': 0.01}
-0.552723 with: {'learning_rate': 0.1}
-0.653341 with: {'learning_rate': 0.2}
-0.718789 with: {'learning_rate': 0.3}
The parameters we can usually tune:
1. learning_rate
2. tree parameters (a grid over these is sketched after this list)
max_depth
min_child_weight
subsample, colsample_bytree
gamma
3. regularization parameters
lambda
alpha
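A minimal sketch of extending the grid search above to the tree parameters; it reuses the kfold, X, and Y defined earlier, and the value ranges here are just illustrative assumptions:
param_grid = dict(
    max_depth=[3, 5, 7],          # tree depth
    min_child_weight=[1, 3, 5],   # minimum sum of instance weight in a child
)
grid_search = GridSearchCV(XGBClassifier(learning_rate=0.1), param_grid,
                           scoring="neg_log_loss", n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, Y)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))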
Here is a rough summary of the parameters XGBoost takes; a fairly detailed explanation of what each parameter means has been written up here:
https://www.cnblogs.com/wj-1314/p/9402324.html
xgb1 = XGBClassifier(
    learning_rate=0.1,            # keep it on the low side
    n_estimators=1000,            # number of trees
    max_depth=5,                  # tree depth
    min_child_weight=1,           # minimum sum of instance weight needed in a child
    gamma=0,                      # minimum loss reduction required to make a split
    subsample=0.8,                # fraction of rows sampled per tree
    colsample_bytree=0.8,         # fraction of columns sampled per tree
    objective='binary:logistic',  # binary classification objective
    nthread=4,                    # number of threads
    scale_pos_weight=1,           # balance positive/negative class weights
    seed=27)                      # random seed
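A hedged usage sketch: with n_estimators set this high, a common next step is to let early stopping pick the effective number of trees, reusing the train/test split from earlier:
xgb1.fit(X_train, y_train,
         early_stopping_rounds=50,
         eval_metric="logloss",
         eval_set=[(X_test, y_test)],
         verbose=False)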
Original post: https://blog.csdn.net/qq_43653405/article/details/107569186