Preprocessing (1): Evaluating feature importance with a random forest in Python
A random forest measures feature importance as the average decrease in impurity, computed over all decision trees in the forest, without making any assumption about whether the data are linearly separable or not.
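As a quick illustration of that averaging, independent of the full script below, the forest-level attribute can be compared against the mean of the per-tree importances (a minimal sketch on synthetic data; scikit-learn's exact aggregation, e.g. its treatment of degenerate single-node trees, may differ slightly):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Average the (per-tree normalized) impurity decreases over all trees ...
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
print(np.round(per_tree.mean(axis=0), 4))
# ... and compare with the aggregated attribute exposed by the forest
print(np.round(rf.feature_importances_, 4))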
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
df_wine = pd.read_csv("xxx\\wine.data",
                      header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']
# print(df_wine['Class label'])
# print('Class labels', np.unique(df_wine['Class label']))
# print(df_wine.head())
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=0,
                                                     stratify=y)
# Both scalers are fitted here for completeness; the random forest below is
# trained on the unscaled features, since tree-based models are insensitive
# to feature scaling.
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
feat_labels = df_wine.columns[1:]
# Train a random forest with 500 trees and read off the impurity-based
# feature importances
forest = RandomForestClassifier(n_estimators=500,
                                random_state=1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
print(importances)
# Rank the features by importance in descending order
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 60,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
# Plot the importances as a bar chart, from most to least important feature
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        align='center')
plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
# plt.savefig('images/04_09.png', dpi=300)
plt.show()
# To round off the topic of feature importance and random forests, it is worth
# mentioning that scikit-learn also implements a SelectFromModel object, which
# selects features after model fitting based on a user-specified threshold
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)  # prefit: whether the estimator is expected to be passed to the constructor already fitted
X_selected = sfm.transform(X_train)
print('Number of features that meet this threshold criterion:',
      X_selected.shape[1])
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
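Besides a fixed numeric cutoff, SelectFromModel also accepts string thresholds such as 'mean', 'median', or scaled forms like '1.25*mean'. A short sketch reusing the fitted forest and feat_labels from the script above (the variable names below are illustrative):

sfm_median = SelectFromModel(forest, threshold='median', prefit=True)
X_selected_median = sfm_median.transform(X_train)
mask = sfm_median.get_support()   # boolean mask over the original 13 features
print(X_selected_median.shape[1], 'features kept:')
print(list(feat_labels[mask]))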
Output:
[0.11852942 0.02564836 0.01327854 0.02236594 0.03135708 0.05087243
0.17475098 0.01335393 0.02556988 0.1439199 0.058739 0.13616194
0.1854526 ]
 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
 6) Hue                            0.058739
 7) Total phenols                  0.050872
 8) Magnesium                      0.031357
 9) Malic acid                     0.025648
10) Proanthocyanins                0.025570
11) Alcalinity of ash              0.022366
12) Nonflavanoid phenols           0.013354
13) Ash                            0.013279
Number of features that meet this threshold criterion: 5
 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
Output plot:
The features of the Wine dataset ranked by their relative importance. Note that the feature importance values are normalized so that they sum to 1.
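That normalization can be verified directly on the fitted forest from the script above (a quick sanity check, not part of the original code):

print(importances.sum())                                   # ~1.0
print(np.isclose(forest.feature_importances_.sum(), 1.0))  # True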