
Machine Learning 4: Ensemble Algorithms and Random Forests

程序员文章站 2022-07-13 08:53:11

In theory, more trees should give better results, but in practice performance plateaus (fluctuating slightly) once the number of trees passes a certain point.

Bagging: train multiple classifiers independently and average their outputs
Representative method: random forest
Random: the training data is sampled randomly (bootstrap), and the candidate features are selected randomly
Forest: many decision trees built in parallel
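The bagging idea above can be sketched with scikit-learn's random forest on a synthetic dataset (the dataset and parameter values here are illustrative assumptions, not from the article):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy data standing in for a real dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# A random forest = bagging of decision trees, each tree trained on a
# bootstrap sample and splitting on a random subset of features
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
scores = cross_val_score(rf, X, y, cv=3)
print(scores.mean())
```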

Measuring feature importance:
Corrupt one feature column (e.g. randomly shuffle it) and compare the model's performance with the original.
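A minimal sketch of this "corrupt one column and compare" idea (permutation importance), using synthetic data as an assumption:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)
baseline = rf.score(X, y)

rng = np.random.RandomState(1)
for col in range(X.shape[1]):
    X_perm = X.copy()
    rng.shuffle(X_perm[:, col])          # destroy one feature column
    drop = baseline - rf.score(X_perm, y)
    print(f"feature {col}: importance ~ {drop:.3f}")
```

The larger the accuracy drop, the more the model depended on that feature.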

Boosting: start from weak learners and strengthen them, training with sample weights
Representative method: AdaBoost
AdaBoost adjusts the sample weights according to the previous round's classification results
Intuition: if a sample is misclassified in this round, it is given a larger weight in the next round
Final result: each classifier is weighted by its own accuracy, and the weighted classifiers are combined
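The AdaBoost procedure above can be sketched as follows (synthetic data; by default scikit-learn's AdaBoost uses depth-1 trees, i.e. decision stumps, as the weak learners):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, random_state=1)

# Each boosting round reweights the misclassified samples, and the
# final ensemble weights each weak learner by its accuracy
ada = AdaBoostClassifier(n_estimators=50, random_state=1)
ada.fit(X, y)
print(ada.score(X, y))
```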

Stacking: aggregate multiple classification or regression models (can be done in stages)
All kinds of classifiers can be stacked together
Stage 1 produces each model's predictions; stage 2 trains a new model on those stage-1 outputs
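A sketch of the two-stage idea with scikit-learn's `StackingClassifier` (the choice of base models and synthetic data are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, random_state=1)

# Stage 1: heterogeneous base models; stage 2: a meta-model trained on
# their cross-validated predictions
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=1)),
                ("knn", KNeighborsClassifier())],
    final_estimator=LogisticRegression())
stack.fit(X, y)
print(stack.score(X, y))
```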

Random forest case study:
Predicting Titanic survival

import pandas
titanic = pandas.read_csv("titanic_train.csv")
print(titanic.describe())

Filling missing values:
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

Mapping categorical data (map the Sex values male/female to 0/1):
print(titanic["Sex"].unique())
titanic.loc[titanic["Sex"] == "male", "Sex"] = 0
titanic.loc[titanic["Sex"] == "female", "Sex"] = 1

Prediction with linear regression

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold   # sklearn.cross_validation was removed in 0.20
import numpy as np

# Feature columns (note: "Embarked" must also be mapped to numbers and
# have its missing values filled, the same way as "Sex" and "Age" above)
predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = LinearRegression()
kf = KFold(n_splits=3)   # no shuffle, so the concatenated folds keep the row order

predictions = []
for train, test in kf.split(titanic[predictors]):
    train_predictors = titanic[predictors].iloc[train, :]
    train_target = titanic["Survived"].iloc[train]
    alg.fit(train_predictors, train_target)
    test_predictions = alg.predict(titanic[predictors].iloc[test, :])
    predictions.append(test_predictions)

predictions = np.concatenate(predictions, axis=0)
predictions[predictions > 0.5] = 1
predictions[predictions <= 0.5] = 0
accuracy = sum(predictions == titanic["Survived"]) / len(predictions)

Prediction with a random forest

from sklearn.model_selection import KFold, cross_val_score   # replaces the removed cross_validation module
from sklearn.ensemble import RandomForestClassifier

predictors = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]
alg = RandomForestClassifier(random_state=1, n_estimators=10, min_samples_split=2, min_samples_leaf=1)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())

After changing the parameters (more trees, stricter split/leaf constraints):

alg = RandomForestClassifier(random_state=1, n_estimators=50, min_samples_split=4, min_samples_leaf=2)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
scores = cross_val_score(alg, titanic[predictors], titanic["Survived"], cv=kf)
print(scores.mean())
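Instead of trying parameter combinations by hand, a grid search can pick them automatically. A sketch on synthetic data (the parameter grid here is an illustrative assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=1)

param_grid = {
    "n_estimators": [10, 50],
    "min_samples_split": [2, 4],
    "min_samples_leaf": [1, 2],
}
# Cross-validates every combination and keeps the best one
search = GridSearchCV(RandomForestClassifier(random_state=1),
                      param_grid, cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```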