Model Tuning
程序员文章站
2022-07-23 08:02:15
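K-fold cross-validation
Step 1: randomly split the original data into k parts, sampling without replacement.
Step 2: in each round, pick 1 part as the test set and use the remaining k-1 parts as the training set for model training.
Step 3: repeat step 2 k times, so that each subset serves as the test set exactly once and as part of the training set in the other k-1 rounds. In each round, train a model on the training set, test it on the corresponding test set, and compute and save the model's evaluation metric.
Step 4: take the average of the k test results as the estimate of the model's accuracy, and use it as the model's performance metric under the current k-fold cross-validation.
Here we use 5-fold cross-validation.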
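The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`, `fit_majority`, `accuracy`), the majority-class "model", and the toy data are all hypothetical and are not part of the original article, which relies on scikit-learn to do the splitting.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=2018):
    """Step 1: shuffle the indices 0..n_samples-1 and split them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validate(fit, score, X, y, k=5, seed=2018):
    """Steps 2-4: return the per-fold scores and their mean."""
    folds = k_fold_indices(len(X), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Every fold except fold i forms the training set.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        scores.append(score(model, [X[j] for j in test_idx], [y[j] for j in test_idx]))
    return scores, mean(scores)


# Hypothetical stand-in model: always predict the most common training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def accuracy(model, X_test, y_test):
    return sum(label == model for label in y_test) / len(y_test)


# Toy data (an assumption for illustration): 15 samples of class 0, 5 of class 1.
X = [[i] for i in range(20)]
y = [0] * 15 + [1] * 5
fold_scores, mean_score = cross_validate(fit_majority, accuracy, X, y, k=5)
print(fold_scores, mean_score)
```

The individual fold scores depend on how the shuffle distributes the two classes, which is why the averaged value, not any single fold, is reported as the performance estimate.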
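Grid search
GridSearchCV exists to automate hyperparameter tuning: given the candidate values for each parameter, it returns the best-scoring combination and its score. The method suits small datasets, though. It trains a model for every parameter combination in every fold, so once the data grows large it becomes hard to get a result in reasonable time.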
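To get a feel for why the cost grows so quickly, the number of model fits can be counted before running anything. The sketch below uses the GBDT parameter grid that appears in this article's code; the helper `count_fits` is a hypothetical name introduced here for illustration and is not a scikit-learn function.

```python
from math import prod


def count_fits(param_grid, cv):
    """Return (number of parameter combinations, number of cross-validation fits)."""
    n_candidates = prod(len(list(values)) for values in param_grid.values())
    return n_candidates, n_candidates * cv


gbdt_grid = {
    'max_features': ['sqrt', 'log2', None],
    'learning_rate': [0.01, 0.1, 0.5, 1],
    'n_estimators': range(20, 200, 20),
    'subsample': [0.2, 0.5, 0.7, 1.0],
}

n_candidates, n_fits = count_fits(gbdt_grid, cv=5)
print(n_candidates, n_fits)  # 432 2160
```

With the default `refit=True`, GridSearchCV also trains one more model on the whole training set using the best combination, so this grid costs 2161 fits in total. Adding one more parameter with three candidate values would triple that number.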
import pandas as pd
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm

data_all = pd.read_csv('d:\\data_all.csv', encoding='gbk')

x = data_all.drop(['status'], axis=1)
y = data_all['status']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=2018)

# Standardize the features (the scaler is fitted on the training set only)
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)

# LR
# solver='liblinear' is set explicitly because it supports both 'l1' and 'l2'
lr = LogisticRegression(random_state=2018, solver='liblinear')
param = {'C': [1e-3, 0.01, 0.1, 1, 10, 100, 1e3], 'penalty': ['l1', 'l2']}
grid = GridSearchCV(estimator=lr, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_)             # best parameter combination
print(grid.best_score_)              # mean cross-validated AUC of that combination
print(grid.score(x_test, y_test))    # AUC on the held-out test set

# DecisionTree
dt = DecisionTreeClassifier(random_state=2018)
param = {'criterion': ['gini', 'entropy'], 'splitter': ['best', 'random'],
         'max_depth': [2, 4, 6, 8], 'max_features': ['sqrt', 'log2', None]}
grid = GridSearchCV(estimator=dt, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(x_test, y_test))

# SVM
svc = svm.SVC(random_state=2018)
param = {'C': [1e-2, 1e-1, 1, 10], 'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
grid = GridSearchCV(estimator=svc, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(x_test, y_test))

# RandomForest
rft = RandomForestClassifier()
param = {'n_estimators': [10, 20, 50, 100], 'criterion': ['gini', 'entropy'],
         'max_depth': [2, 4, 6, 8, 10, None], 'max_features': ['sqrt', 'log2', None]}
grid = GridSearchCV(estimator=rft, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(x_test, y_test))

# GBDT
gb = GradientBoostingClassifier()
param = {'max_features': ['sqrt', 'log2', None], 'learning_rate': [0.01, 0.1, 0.5, 1],
         'n_estimators': range(20, 200, 20), 'subsample': [0.2, 0.5, 0.7, 1.0]}
grid = GridSearchCV(estimator=gb, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(x_test, y_test))

# XGBoost
xgb_c = XGBClassifier()
param = {'n_estimators': range(20, 200, 20), 'max_depth': [2, 6, 10], 'reg_lambda': [0.2, 0.5, 1]}
grid = GridSearchCV(estimator=xgb_c, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(x_test, y_test))

# LightGBM
lgbm_c = LGBMClassifier()
param = {'learning_rate': [0.2, 0.5, 0.7], 'max_depth': range(1, 10, 2), 'n_estimators': range(20, 100, 10)}
grid = GridSearchCV(estimator=lgbm_c, param_grid=param, scoring='roc_auc', cv=5)
grid.fit(x_train, y_train)
print(grid.best_params_)
print(grid.best_score_)
print(grid.score(x_test, y_test))