Machine Learning Feature Engineering: Feature Selection

程序员文章站 2024-03-08 09:14:51



$\color{DodgerBlue}{1. Feature Selection}$

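In a machine learning workflow, gathering "sufficient" training data early on is a crucial step. "Sufficient" has two aspects: the number of features and the number of training samples. Not every collected feature should be used for training, however, because unnecessary features slow down training, make the model harder to interpret, and can hurt its generalization on the test set.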

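We therefore need to find and keep the useful features among those collected; this process is feature selection. scikit-learn's feature_selection module provides several selection methods, such as VarianceThreshold, SelectKBest, and RFE. PCA and LDA are often mentioned in the same context, but they are dimensionality-reduction techniques that live in sklearn.decomposition and sklearn.discriminant_analysis rather than in feature_selection. Common feature selection approaches look for, and remove, the following kinds of features:
1. Features with a high percentage of missing values;
2. Collinear (highly correlated) features;
3. Features with zero importance in a tree-based model;
4. Features with low importance (typically only the Top-K most important features are kept);
5. Features with a single unique value.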
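
As a point of comparison, here is a minimal sketch of two selectors that do live in scikit-learn's feature_selection module. The synthetic data and the names `X_demo`, `y_demo`, and `best` are assumptions made for illustration; they are not part of the original walkthrough.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif

# Synthetic data (illustrative only): one informative column,
# one pure-noise column, and one constant column.
rng = np.random.default_rng(0)
n_samples = 200
informative = rng.normal(size=n_samples)
noise = rng.normal(size=n_samples)
constant = np.ones(n_samples)
X_demo = np.column_stack([informative, noise, constant])
y_demo = (informative > 0).astype(int)

# Step 1: drop features whose variance is zero (the constant column).
variance_filter = VarianceThreshold(threshold=0.0)
X_nonconstant = variance_filter.fit_transform(X_demo)
kept_after_variance = variance_filter.get_support(indices=True)

# Step 2: keep the single feature with the highest ANOVA F-score.
k_best = SelectKBest(score_func=f_classif, k=1)
k_best.fit(X_nonconstant, y_demo)
best = kept_after_variance[k_best.get_support(indices=True)]

print(kept_after_variance.tolist())  # column indices that survive step 1
print(best.tolist())                 # column index that survives step 2
```

Both selectors expose `get_support`, which makes it easy to map the surviving columns back to the original feature indices.
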
An enthusiastic developer on GitHub has built a package that covers these feature selection checks: feature_selector

GitHub: https://github.com/WillKoehrsen/feature-selector

For readers who cannot access GitHub, I have also uploaded it to Baidu Cloud:

Link: https://pan.baidu.com/s/1qaZYtaQRM2bNPmNf2RldkQ

Extraction code: vk1p

$\color{DodgerBlue}{2. Installation}$

pip install feature-selector

Installation was a bit bumpy for me, but after a few setbacks it eventually succeeded.

$\color{DodgerBlue}{3. Feature Processing Functions}$

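identify_missing: finds features whose fraction of missing values exceeds a given threshold;
identify_single_unique: finds features with a single unique value;
identify_collinear: finds highly correlated features;
identify_zero_importance: finds features with zero importance;
identify_low_importance: finds features with low importance.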

# Load the data
import numpy as np
import pandas as pd

# Air quality dataset (not used below):
# url = 'https://archive.ics.uci.edu/ml/datasets/Air+Quality#'
raw_df = pd.read_csv('data/test.csv', sep=',')

# Target column: total sales count
y = raw_df['sale_cnt_all_1']
# Drop the target, the detail-page UV/PV columns, and the ID/date columns
X = raw_df.drop(['sale_cnt_all_1',
                 'goods_detail_uv_all_1',
                 'goods_detail_pv_all_1',
                 'site_id', 'goods_id', 'sku_cate_id', 'dt'], axis=1)

# Treat zeros as missing values
X = X.replace(0, np.nan)
# Binarize the target: 1 if there were any sales, otherwise 0
y = y.apply(lambda x: 1 if x > 0 else 0)

# Import the package and create a FeatureSelector object
from feature_selector import FeatureSelector
fs = FeatureSelector(data=X, labels=y)

$\color{DodgerBlue}{4. Missing Values}$

Find every feature whose fraction of missing values is greater than the given threshold.

missing_threshold = 0.6
fs.identify_missing(missing_threshold=missing_threshold)
missing_features = fs.ops['missing']
print(missing_features)
fs.plot_missing()
print(fs.missing_stats)
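
To make the rule concrete, here is a minimal pandas sketch of the same idea: compute the fraction of missing values per column and flag those above the threshold. The helper `find_missing_features` and the toy frame `demo` are hypothetical names introduced for illustration, not part of feature_selector.

```python
import numpy as np
import pandas as pd


def find_missing_features(df, missing_threshold):
    """Return columns whose fraction of missing values exceeds the threshold."""
    # isnull().mean() gives the share of NaN entries per column.
    missing_fraction = df.isnull().mean()
    return missing_fraction[missing_fraction > missing_threshold].index.tolist()


# Toy data (illustrative only).
demo = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan, np.nan],  # 80% missing
    "b": [1.0, 2.0, np.nan, 4.0, 5.0],           # 20% missing
    "c": [1.0, 2.0, 3.0, 4.0, 5.0],              # no missing values
})

missing_demo = find_missing_features(demo, missing_threshold=0.6)
print(missing_demo)  # ['a']
```
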


$\color{DodgerBlue}{5. Single Unique Value}$

Find features that have only a single unique value.

fs.identify_single_unique()
single_unique = fs.ops['single_unique']
print(single_unique)
fs.plot_unique()
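
The same check can be sketched in plain pandas by counting distinct values per column. The helper `find_single_unique` and the toy frame are hypothetical; note that `DataFrame.nunique()` ignores NaN by default, so a column holding one value plus missing entries is also flagged in this sketch.

```python
import numpy as np
import pandas as pd


def find_single_unique(df):
    """Return columns that contain exactly one distinct non-missing value."""
    unique_counts = df.nunique()  # NaN is not counted by default
    return unique_counts[unique_counts == 1].index.tolist()


# Toy data (illustrative only).
unique_demo = pd.DataFrame({
    "const": [7, 7, 7, 7],
    "const_with_nan": [3.0, np.nan, 3.0, np.nan],
    "varied": [1, 2, 3, 4],
})

single_unique_demo = find_single_unique(unique_demo)
print(single_unique_demo)  # ['const', 'const_with_nan']
```
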


$\color{DodgerBlue}{6. Collinear (Highly Correlated) Features}$

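Find pairs of highly correlated features, that is, pairs whose correlation coefficient exceeds the given threshold.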

correlation_threshold = 0.975
fs.identify_collinear(correlation_threshold=correlation_threshold)
correlated_features = fs.ops['collinear']
print(correlated_features)
fs.plot_collinear()


fs.plot_collinear(plot_all=True)


fs.record_collinear.head()
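
One common way to implement this check, sketched here under the assumption that absolute Pearson correlation is the criterion, is to look only at the upper triangle of the correlation matrix so that each pair is considered once and only one feature of each correlated pair is flagged. The helper `find_collinear` and the toy data are hypothetical and are not taken from the package.

```python
import numpy as np
import pandas as pd


def find_collinear(df, correlation_threshold):
    """Flag one feature from every pair whose |correlation| exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears exactly once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns
            if (upper[col] > correlation_threshold).any()]


# Toy data (illustrative only).
x = np.arange(10, dtype=float)
collinear_demo = pd.DataFrame({
    "x": x,
    "x2": 2 * x + 1,                    # perfectly correlated with x
    "z": np.tile([1.0, -1.0], 5),       # only weakly correlated with x
})

collinear_to_drop = find_collinear(collinear_demo, correlation_threshold=0.975)
print(collinear_to_drop)  # ['x2']
```

Because only the later column of each pair is flagged, `x` is kept and `x2` is the candidate for removal.
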


$\color{DodgerBlue}{7. Zero Importance Features}$

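Find features with zero importance. The importances are obtained internally from a supervised model, which by default is a gradient boosting machine from LightGBM. To reduce the variance of the importances, the model is trained 10 times by default (n_iterations=10). To find a suitable number of base learners (n_estimators), the model applies early stopping on a validation set that holds out 15% of the data.

The following parameters can also be set:

task: regression or classification;
eval_metric: the evaluation metric used for early stopping; it should be a metric that LightGBM supports, for example auc for classification or l2 for regression;
n_iterations: the number of training runs; the final feature importances are the average over the n runs;
early_stopping: whether to use early stopping during training; when the metric on the validation set stops improving, training ends early to avoid overfitting.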

# The labels were binarized above, so this is a classification task
fs.identify_zero_importance(task='classification', eval_metric='auc',
                            n_iterations=10, early_stopping=True)

one_hot_features = fs.one_hot_features
base_features = fs.base_features
print('one hot features are: {}'.format(one_hot_features))
print('Original features are: {}'.format(base_features))


zero_importance_features = fs.ops['zero_importance']
print(zero_importance_features)
fs.plot_feature_importances(threshold=0.99, plot_n=12)


fs.feature_importances
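
The averaging idea can be sketched without LightGBM. The example below swaps in scikit-learn's GradientBoostingClassifier as a stand-in (it is not what feature_selector uses), relies on its built-in early stopping (`validation_fraction`, `n_iter_no_change`), and averages the importances over several seeds. The helper `average_importances` and the synthetic data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def average_importances(X, y, n_iterations=3):
    """Average tree-based feature importances over several training runs."""
    totals = np.zeros(X.shape[1])
    for seed in range(n_iterations):
        model = GradientBoostingClassifier(
            n_estimators=200,
            validation_fraction=0.15,  # held-out share used for early stopping
            n_iter_no_change=5,        # stop once the validation score stalls
            random_state=seed,
        )
        model.fit(X, y)
        totals += model.feature_importances_
    return pd.Series(totals / n_iterations, index=X.columns)


# Synthetic data (illustrative only).
rng = np.random.default_rng(42)
importance_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
    "constant": np.ones(300),  # can never be used for a split
})
target_demo = (importance_demo["signal"] > 0).astype(int)

mean_importances = average_importances(importance_demo, target_demo)
zero_importance_demo = mean_importances[mean_importances == 0.0].index.tolist()
print(mean_importances)
print(zero_importance_demo)
```

A constant column can never be chosen for a split, so its averaged importance is exactly zero; whether the noise column also ends up at zero depends on the run.
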


$\color{DodgerBlue}{8. Low Importance Features}$

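This method must be run after identify_zero_importance, because it reuses the feature importances produced by the gradient boosting model.

Low importance features are the least important features that are not needed to reach a given share of the total feature importance. Besides running identify_zero_importance first, this method requires the cumulative_importance parameter; for example, cumulative_importance=0.99 flags the features that are not needed to reach 99% of the total importance.

Recommendation: when using identify_zero_importance and identify_low_importance, try several parameter settings and compare the results with cross-validation.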

fs.identify_low_importance(cumulative_importance=0.99)


low_importance_features = fs.ops['low_importance']
print(low_importance_features)
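
The cumulative-importance rule can be sketched as follows: sort the importances in descending order, normalize them so they sum to 1, take the running total, and flag every feature whose running total lies beyond the threshold. The helper `find_low_importance` and the importance values are hypothetical; the package's exact tie-breaking and boundary handling may differ.

```python
import pandas as pd


def find_low_importance(importances, cumulative_importance):
    """Flag features whose cumulative normalized importance exceeds the threshold."""
    ordered = importances.sort_values(ascending=False)
    normalized = ordered / ordered.sum()
    cumulative = normalized.cumsum()
    return cumulative[cumulative > cumulative_importance].index.tolist()


# Hypothetical raw importances (illustrative only).
raw_importances = pd.Series({"a": 50.0, "b": 30.0, "c": 15.0, "d": 4.0, "e": 1.0})

# Running totals are roughly 0.50, 0.80, 0.95, 0.99, 1.00.
low_importance_demo = find_low_importance(raw_importances, cumulative_importance=0.96)
print(low_importance_demo)  # ['d', 'e']
```
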


$\color{DodgerBlue}{9. Removing Features}$

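Once we have identified the low-importance features, the single-value features, and so on, we can remove them. There are two ways to do this:

Take the flagged features from the ops dictionary and drop them yourself, one method at a time;
Call the remove method and pass in the methods whose flagged features should be removed.

Note: before removing any features, check exactly which features are about to be dropped.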

train_no_missing = fs.remove(methods=['missing'])
train_no_missing_zero = fs.remove(methods=['missing', 'zero_importance'])


all_to_remove = fs.check_removal()
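
The first, manual approach can be sketched as follows, assuming a dictionary that maps each method name to the list of features it flagged. The helper `remove_flagged`, the dictionary `ops_demo`, and the toy frame are hypothetical stand-ins rather than the package's own objects.

```python
import pandas as pd


def remove_flagged(df, ops, methods):
    """Drop every column flagged by any of the requested methods."""
    to_drop = set()
    for method in methods:
        to_drop.update(ops[method])
    # Only drop columns that are actually present in the frame.
    present = [col for col in df.columns if col in to_drop]
    print("Removing {} features: {}".format(len(present), present))
    return df.drop(columns=present)


# Toy data and flags (illustrative only).
removal_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})
ops_demo = {
    "missing": ["a"],
    "zero_importance": ["a", "c"],  # overlapping flags are dropped only once
}

reduced_demo = remove_flagged(removal_demo, ops_demo, ["missing", "zero_importance"])
print(reduced_demo.columns.tolist())  # ['b', 'd']
```

Printing the list before dropping mirrors the advice above: review what is being removed before committing to it.
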

$\color{DodgerBlue}{10. Running All Methods at Once}$

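If we do not want to run the methods above one by one, identify_all runs all of them in a single call, provided we supply the parameter values for each method (passed as a dictionary).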

fs = FeatureSelector(data=X, labels=y)
fs.identify_all(selection_params={
    'missing_threshold': 0.6,
    'correlation_threshold': 0.98,
    'task': 'classification',
    'eval_metric': 'auc',
    'cumulative_importance': 0.99
    })

train_removed_all_once = fs.remove(methods = 'all', keep_one_hot = True)
