Machine Learning 4: Ensemble Algorithms and Random Forests
程序员文章站
2022-07-13 08:53:11
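Bagging: train several classifiers in parallel and average their predictions
Representative algorithm: random forest
Random: each tree sees a random sample of the data and a random selection of the features
Forest: many decision trees placed side by side and built in parallel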
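As a rough illustration of these two sources of randomness, the sketch below draws a bootstrap sample of row indices and a random subset of feature indices for each tree, then combines votes by majority. The helper names (`bootstrap_rows`, `random_features`, `majority_vote`) and the toy sizes are made up for this example. Real implementations such as scikit-learn's random forest typically re-draw the feature subset at every split rather than once per tree.

```python
import random
from collections import Counter

def bootstrap_rows(n_rows, rng):
    """Sample n_rows row indices with replacement (data randomness)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def random_features(n_features, k, rng):
    """Pick k distinct feature indices (feature randomness)."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(votes):
    """Combine the class labels predicted by the individual trees."""
    return Counter(votes).most_common(1)[0][0]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_rows, n_features, n_trees = 8, 7, 3

# One (rows, features) pair per tree; each tree would be fit on its own pair.
plans = [(bootstrap_rows(n_rows, rng), random_features(n_features, 3, rng))
         for _ in range(n_trees)]
for rows, feats in plans:
    print("rows:", rows, "features:", feats)

# Pretend three trees voted 1, 0, 1 for one passenger.
print(majority_vote([1, 0, 1]))  # 1
```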
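Measuring feature importance:
Corrupt the values of one feature column (for example by shuffling it) and compare the result with the original; the more the score degrades, the more important the feature
In theory more trees give better results, but in practice the score levels off and merely fluctuates once the number of trees passes a certain point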
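The "corrupt one column and compare" idea for feature importance can be sketched as follows. This is a minimal illustration, not the procedure of any particular library: the rule-based `toy_model`, the data, and the helper names are all hypothetical, and the column is corrupted by shuffling it.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model predicts the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if model(row) == y)
    return hits / len(rows)

def permutation_importance(model, rows, labels, col, rng):
    """Accuracy drop after shuffling the values of one feature column."""
    baseline = accuracy(model, rows, labels)
    shuffled = [row[col] for row in rows]
    rng.shuffle(shuffled)
    # Rebuild every row with the shuffled value in position `col`.
    corrupted = [row[:col] + [v] + row[col + 1:] for row, v in zip(rows, shuffled)]
    return baseline - accuracy(model, corrupted, labels)

def toy_model(row):
    # Hypothetical stand-in for a trained model: it only looks at feature 0.
    return 1 if row[0] > 0.5 else 0

rows = [[0.1, 5.0], [0.9, 3.0], [0.2, 4.0], [0.8, 1.0], [0.3, 2.0], [0.7, 6.0]]
labels = [0, 1, 0, 1, 0, 1]

rng = random.Random(0)
for col in range(2):
    drop = permutation_importance(toy_model, rows, labels, col, rng)
    print("feature", col, "importance:", drop)
```

Feature 1 is ignored by the toy model, so shuffling it cannot change the accuracy and its importance is exactly 0. Shuffling feature 0 can only leave the accuracy unchanged or lower it.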
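Boosting: start from a weak learner and strengthen it step by step, training on reweighted data
Representative algorithm: AdaBoost
AdaBoost adjusts the sample weights according to how well the previous round classified them
Explanation: if a sample is misclassified in this round, it is given a larger weight in the next round
Final result: each classifier gets its own weight based on its accuracy, and the weighted classifiers are then combined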
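One round of this reweighting can be written down concretely. The sketch below uses the classic discrete AdaBoost formulas for binary labels, where the classifier weight is alpha = 0.5 * ln((1 - err) / err). The notes do not say which variant is meant, so treat this as one common choice; the function name `adaboost_round` and the five-sample example are illustrative.

```python
import math

def adaboost_round(weights, correct):
    """One AdaBoost reweighting step.

    weights: current sample weights (summing to 1)
    correct: True where the current weak classifier was right
    Returns (alpha, new_weights).
    """
    # Weighted error of the current weak classifier.
    err = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < err < 1.0:
        raise ValueError("this sketch needs 0 < err < 1")
    # Classifier weight: more accurate classifiers get a larger alpha.
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified samples are scaled up, correct ones scaled down.
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [w / total for w in raw]

weights = [0.2] * 5                         # start with uniform weights
correct = [True, True, True, True, False]   # the last sample was misclassified
alpha, new_weights = adaboost_round(weights, correct)
print(round(alpha, 3))                      # 0.693
print([round(w, 3) for w in new_weights])   # [0.125, 0.125, 0.125, 0.125, 0.5]
```

After the update the single misclassified sample carries half of the total weight, so the next weak classifier is pushed to get it right.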
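Stacking: combine several classification or regression models (this can be done in stages)
All kinds of classifiers can be stacked
Stage one produces each model's own predictions; stage two trains a further model on those stage-one predictions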
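A two-stage stack might look like the sketch below, which relies on scikit-learn's `StackingClassifier` (available from scikit-learn 0.22). The choice of base models, the logistic-regression second stage, and the synthetic data from `make_classification` are assumptions made for the example, not something the notes prescribe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data stands in for a real dataset (assumption for this sketch).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Stage one: heterogeneous base classifiers.
base_models = [
    ("rf", RandomForestClassifier(n_estimators=20, random_state=1)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]
# Stage two: a logistic regression trained on the stage-one predictions,
# which StackingClassifier obtains through internal cross-validation (cv=3).
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=3)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))
```

Using cross-validated predictions for the second stage keeps the final model from simply learning how well the base models memorized their own training rows.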
Random forest case study:
Predicting who survived the Titanic
import pandas
titanic = pandas.read_csv("titanic_train.csv")
print( titanic.describe() )
Filling in missing data:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())  # replace missing ages with the median age
Mapping data: (map the Sex values male, female to 0, 1)
print(titanic["Sex"].unique())
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1
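The predictor lists used later also include Embarked, which is stored as text in the Kaggle-style Titanic CSV and has a few missing values, so it presumably needs the same treatment before any model is fit. The notes do not show this step. The sketch below is one way to do it, assuming the column holds the port codes S, C and Q, filling gaps with the most frequent value, and using an arbitrary 0/1/2 coding. A small hand-made frame stands in for the real data.

```python
import pandas as pd

# Tiny hypothetical stand-in for titanic_train.csv.
toy = pd.DataFrame({"Embarked": ["S", "C", None, "Q", "S"]})

# Fill missing ports with the most frequent value in the column.
most_common = toy["Embarked"].mode()[0]
toy["Embarked"] = toy["Embarked"].fillna(most_common)

# Map each port code to an integer (the particular numbers are arbitrary).
toy["Embarked"] = toy["Embarked"].map({"S": 0, "C": 1, "Q": 2})
print(toy["Embarked"].tolist())  # [0, 1, 0, 2, 0]
```

The same three statements can be applied to the `titanic` frame in place of `toy`.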
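Predicting with linear regression
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold  # sklearn.cross_validation was removed in scikit-learn 0.20
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]  # feature columns; all must be numeric with no missing values
alg = LinearRegression()
kf = KFold(n_splits=3)  # 3 folds without shuffling, so the predictions stay in row order
predictions = []
for train, test in kf.split(titanic):
    # fit on the training fold
    train_predictors = titanic[predictors].iloc[train, :]
    train_target = titanic["Survived"].iloc[train]
    alg.fit(train_predictors, train_target)
    # predict on the held-out fold
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)
predictions = np.concatenate(predictions, axis=0)
# threshold the regression output at 0.5 to get class labels
predictions[predictions > 0.5] = 1
predictions[predictions <= 0.5] = 0
# accuracy is the fraction of predictions that match the true label
accuracy = (predictions == titanic["Survived"].values).mean()
print(accuracy)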
Predicting with a random forest
from sklearn.model_selection import KFold, cross_val_score  # replaces the removed sklearn.cross_validation module
from sklearn.ensemble import RandomForestClassifier
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
kf = KFold(n_splits=3)  # 3 folds without shuffling
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())
After changing the parameters (more trees, and larger minimum sizes for splits and leaves):
alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())
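To look at the earlier remark that adding trees eventually stops helping, one can sweep `n_estimators` and compare the cross-validated scores. The sketch uses synthetic data from `make_classification` rather than the Titanic file (an assumption so that it runs on its own), so the numbers it prints say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data (not the Titanic set).
X, y = make_classification(n_samples=400, n_features=7, random_state=1)
kf = KFold(n_splits=3)

results = {}
for n_trees in (5, 10, 50, 100):
    alg = RandomForestClassifier(random_state=1, n_estimators=n_trees,
                                 min_samples_split=4, min_samples_leaf=2)
    # Mean accuracy over the 3 folds for this forest size.
    results[n_trees] = cross_val_score(alg, X, y, cv=kf).mean()
    print(n_trees, round(results[n_trees], 3))
```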