Feature Importance Evaluation and Selection
...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Collect each feature's importance from the trained model
feature_results = pd.DataFrame({'feature': list(train_features.columns),
                                'importance': model.feature_importances_})

# Sort features from most to least important
feature_results = feature_results.sort_values('importance',
                                              ascending=False).reset_index(drop=True)
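One detail worth confirming before applying a percentage cutoff: scikit-learn normalizes the impurity-based feature_importances_ of tree ensembles so that they sum to 1, which is what lets the cumulative sum below be read directly as a fraction of total importance. A quick sanity check:

# Impurity-based importances from scikit-learn tree ensembles sum to 1,
# so the cumulative sum below can be treated as a percentage.
assert abs(feature_results['importance'].sum() - 1.0) < 1e-6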
from IPython.core.pylabtools import figsize  # notebook helper for figure size

figsize(12, 10)
plt.style.use('ggplot')

# Horizontal bar chart of the ten most important features
feature_results.loc[:9, :].plot(x='feature', y='importance', edgecolor='k',
                                kind='barh', color='blue')
plt.xlabel('Relative Importance', fontsize=18)
plt.ylabel('')
plt.title('Feature Importances from Random Forest', size=26)
# Cumulative importance: running sum over the sorted features
cumulative_importances = np.cumsum(feature_results['importance'])

plt.figure(figsize=(20, 6))
plt.plot(list(range(feature_results.shape[0])), cumulative_importances.values, 'b-')
# Dashed line marks the 95% cumulative-importance threshold
plt.hlines(y=0.95, xmin=0, xmax=feature_results.shape[0], color='r', linestyles='dashed')
# plt.xticks(list(range(feature_results.shape[0])), feature_results.feature, rotation=60)
plt.xlabel('Feature', fontsize=18)
plt.ylabel('Cumulative importance', fontsize=18)
plt.title('Cumulative Importances', fontsize=26)
# Index of the first position where cumulative importance exceeds 95%,
# plus one to convert from index to feature count
most_num_importances = np.where(cumulative_importances > 0.95)[0][0] + 1
print('Number of features for 95% importance:', most_num_importances)
Number of features for 95% importance: 13
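As an aside, scikit-learn's SelectFromModel can automate this kind of selection, though it thresholds each feature's individual importance rather than applying the cumulative 95% rule used here. A minimal sketch, assuming the same training array X and label vector y used to fit the model earlier, and a regression task as in the random forest above:

from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Keeps features whose importance is at least the mean importance;
# note this is a per-feature threshold, not the cumulative-95% rule.
selector = SelectFromModel(RandomForestRegressor(n_estimators=100, random_state=42),
                           threshold='mean')
selector.fit(X, y)  # assumes X, y from the earlier training step
X_selected = selector.transform(X)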
- Feature selection based on importance
# Keep the top features that together account for 95% of total importance
# (13 here, i.e. most_num_importances computed above)
most_important_features = feature_results['feature'][:most_num_importances]

# Map the selected feature names back to column indices in the original arrays
indices = [list(train_features.columns).index(x) for x in most_important_features]

X_reduced = X[:, indices]
X_test_reduced = X_test[:, indices]

print('Most important training features shape:', X_reduced.shape)
print('Most important testing features shape:', X_test_reduced.shape)
Most important training features shape: (6622, 13)
Most important testing features shape: (2839, 13)
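The reduced matrices can then be used to refit the model and check whether dropping the remaining features costs any accuracy. A minimal sketch, assuming the training labels y and test labels y_test from earlier in the article:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Refit on the selected features only
model_reduced = RandomForestRegressor(n_estimators=100, random_state=42)
model_reduced.fit(X_reduced, y)

# Compare test error against the model trained on the full feature set
predictions = model_reduced.predict(X_test_reduced)
print('Test MAE with reduced features:', mean_absolute_error(y_test, predictions))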