Preprocessing (1): Evaluating feature importance with a random forest in Python
A random forest measures feature importance as the average decrease in impurity, computed over all decision trees in the forest, without making any assumption about whether the data are linearly separable or not.
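As a quick illustration of that averaging, independent of the full script below, the forest-level attribute can be compared against the mean of the per-tree importances (a minimal sketch on synthetic data; scikit-learn's exact aggregation, e.g. its treatment of degenerate single-node trees, may differ slightly):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for this illustration
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=0)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Average the (per-tree normalized) impurity decreases over all trees ...
per_tree = np.array([tree.feature_importances_ for tree in rf.estimators_])
print(np.round(per_tree.mean(axis=0), 4))
# ... and compare with the aggregated attribute exposed by the forest
print(np.round(rf.feature_importances_, 4))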
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
df_wine = pd.read_csv("xxx\\wine.data",
                      header=None)
df_wine.columns = ['Class label', 'Alcohol', 'Malic acid', 'Ash',
                   'Alcalinity of ash', 'Magnesium', 'Total phenols',
                   'Flavanoids', 'Nonflavanoid phenols', 'Proanthocyanins',
                   'Color intensity', 'Hue', 'OD280/OD315 of diluted wines',
                   'Proline']
# print(df_wine['Class label'])
# print('Class labels', np.unique(df_wine['Class label']))
# print(df_wine.head())
X, y = df_wine.iloc[:, 1:].values, df_wine.iloc[:, 0].values
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.3,
                                                     random_state=0,
                                                     stratify=y)
# Both scalers are fitted here for completeness; the random forest below is
# trained on the unscaled features, since tree-based models are insensitive
# to feature scaling.
mms = MinMaxScaler()
X_train_norm = mms.fit_transform(X_train)
X_test_norm = mms.transform(X_test)
stdsc = StandardScaler()
X_train_std = stdsc.fit_transform(X_train)
X_test_std = stdsc.transform(X_test)
feat_labels = df_wine.columns[1:]
# Train a random forest with 500 trees and read off the impurity-based
# feature importances
forest = RandomForestClassifier(n_estimators=500,
                                random_state=1)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
print(importances)
# Rank the features by importance in descending order
indices = np.argsort(importances)[::-1]
for f in range(X_train.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 60,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
# Plot the importances as a bar chart, from most to least important feature
plt.title('Feature Importance')
plt.bar(range(X_train.shape[1]),
        importances[indices],
        align='center')
plt.xticks(range(X_train.shape[1]),
           feat_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
# plt.savefig('images/04_09.png', dpi=300)
plt.show()
# To round off the topic of feature importance and random forests, it is worth
# mentioning that scikit-learn also implements a SelectFromModel object, which
# selects features after model fitting based on a user-specified threshold
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)  # prefit: whether the estimator is expected to be passed to the constructor already fitted
X_selected = sfm.transform(X_train)
print('Number of features that meet this threshold criterion:',
      X_selected.shape[1])
for f in range(X_selected.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[indices[f]],
                            importances[indices[f]]))
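Besides a fixed numeric cutoff, SelectFromModel also accepts string thresholds such as 'mean', 'median', or scaled forms like '1.25*mean'. A short sketch reusing the fitted forest and feat_labels from the script above (the variable names below are illustrative):

sfm_median = SelectFromModel(forest, threshold='median', prefit=True)
X_selected_median = sfm_median.transform(X_train)
mask = sfm_median.get_support()   # boolean mask over the original 13 features
print(X_selected_median.shape[1], 'features kept:')
print(list(feat_labels[mask]))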
Output:
[0.11852942 0.02564836 0.01327854 0.02236594 0.03135708 0.05087243
0.17475098 0.01335393 0.02556988 0.1439199 0.058739 0.13616194
0.1854526 ]
 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
 6) Hue                            0.058739
 7) Total phenols                  0.050872
 8) Magnesium                      0.031357
 9) Malic acid                     0.025648
10) Proanthocyanins                0.025570
11) Alcalinity of ash              0.022366
12) Nonflavanoid phenols           0.013354
13) Ash                            0.013279
Number of features that meet this threshold criterion: 5
 1) Proline                        0.185453
 2) Flavanoids                     0.174751
 3) Color intensity                0.143920
 4) OD280/OD315 of diluted wines   0.136162
 5) Alcohol                        0.118529
Output plot:
The features of the Wine dataset ranked by their relative importance. Note that the feature importance values are normalized so that they sum to 1.
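That normalization can be verified directly on the fitted forest from the script above (a quick sanity check, not part of the original code):

print(importances.sum())                                   # ~1.0
print(np.isclose(forest.feature_importances_.sum(), 1.0))  # True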