
Preprocessing (1): Assessing Feature Importance with Random Forests in Python


A random forest measures a feature's importance as the average decrease in impurity that the feature produces, computed over all decision trees in the forest, without making any assumption about whether the data are linearly separable.
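Before the full script, here is a minimal sketch of what that averaging means in scikit-learn terms. This snippet is not part of the original book listing; it uses scikit-learn's bundled copy of the wine dataset purely for illustration, and the final check should typically print True:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

X, y = load_wine(return_X_y=True)
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# The forest-level importance is the per-tree impurity-based importance
# averaged over all trees (and normalized to sum to 1).
per_tree_mean = np.mean([t.feature_importances_ for t in rf.estimators_], axis=0)
print(np.allclose(per_tree_mean, rf.feature_importances_))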

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel


# "xxx" is a placeholder; point read_csv at a local copy of the UCI wine.data file
df_wine = pd.read_csv("xxx\\wine.data",
                      header=None)

df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']

# print(df_wine['Class label'])
# print('Class labels', np.unique(df_wine['Class label']))
# print(df_wine.head())

X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.3,
                                                    random_state=0,
                                                    stratify=y)

# Feature scaling is not required for tree-based models; the normalized and
# standardized copies below are not used by the random forest further down.
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)

stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)


feat_labels = df_wine.columns[1:]

forest = RandomForestClassifier(n_estimators=500,
                                random_state=1)

forest.fit(X_train, y_train)
importances = forest.feature_importances_
print(importances)

indices = np.argsort(importances)[::-1]

for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 60,
                            feat_labels[indices[f]],
                            importances[indices[f]]))

plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        align='center')

plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
#plt.savefig('images/04_09.png', dpi=300)
plt.show()

# To round off the discussion of feature importance and random forests, it is worth
# mentioning that scikit-learn also implements SelectFromModel, which selects features
# based on a user-specified threshold after the model has been fitted.
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)  # prefit=True: the forest has already been fitted
X_selected = sfm.transform(X_train)
print('Number of features that meet this threshold criterion:',
      X_selected.shape[1])

for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
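
As an aside that is not part of the original book listing, an equivalent way to read off the selected features is the selector's boolean support mask; a small sketch, assuming the sfm, feat_labels and importances objects defined above:

mask = sfm.get_support()  # boolean mask over the original 13 feature columns
for name, importance in zip(feat_labels[mask], importances[mask]):
    print(name, importance)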

Output:
[0.11852942 0.02564836 0.01327854 0.02236594 0.03135708 0.05087243
0.17475098 0.01335393 0.02556988 0.1439199 0.058739 0.13616194
0.1854526 ]

 1) Proline                          0.185453
 2) Flavanoids                       0.174751
 3) Color intensity                  0.143920
 4) OD280/OD315 of diluted wines     0.136162
 5) Alcohol                          0.118529
 6) Hue                              0.058739
 7) Total phenols                    0.050872
 8) Magnesium                        0.031357
 9) Malic acid                       0.025648
10) Proanthocyanins                  0.025570
11) Alcalinity of ash                0.022366
12) Nonflavanoid phenols             0.013354
13) Ash                              0.013279

Number of features that meet this threshold criterion: 5

 1) Proline                          0.185453
 2) Flavanoids                       0.174751
 3) Color intensity                  0.143920
 4) OD280/OD315 of diluted wines     0.136162
 5) Alcohol                          0.118529

Result figure: the features of the wine dataset ranked by their relative importance. Note that the importance values are normalized so that they sum to 1.
[Figure: bar chart of the 13 feature importances produced by the plotting code above]
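
A quick check of that normalization (illustrative only, reusing the importances and indices arrays from the script above):

print(importances.sum())                # should be very close to 1.0
print(np.cumsum(importances[indices]))  # cumulative share covered by the top-ranked features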

Note: this is the example code from Python Machine Learning (2nd Edition) (Chinese edition published by 机械工业出版社 / China Machine Press), organized while working through the book and shared here for learning and reference.