Investigating auto-sklearn's class-imbalance handling strategies


auto-sklearn has two kinds of strategies for handling class imbalance; we discuss them in turn.

Adjusting sample_weights

First, here are the algorithms that use the sample_weights strategy:

clf_ = ['adaboost', 'random_forest', 'extra_trees', 'sgd', 'passive_aggressive']
pre_ = []
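
These estimators all accept per-sample weights in fit. Here is a minimal sketch of the mechanism with made-up data (my own illustration, not auto-sklearn's actual call site):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.RandomState(0)
X = rng.rand(7, 2)
y = np.array([1, 1, 2, 2, 2, 2, 2])
# minority-class samples get larger weights (these values are derived below)
sample_weights = np.array([1.75, 1.75, 0.7, 0.7, 0.7, 0.7, 0.7])

clf = AdaBoostClassifier(n_estimators=10, random_state=0)
clf.fit(X, y, sample_weight=sample_weights)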

Why these algorithms? 'adaboost', 'random_forest', and 'extra_trees' are all tree-based algorithms and do support class_weights, so why use sample weights instead?

The comment in the source code explains it clearly; let's read it together:

Classifiers which require sample weights:
We can have adaboost in here, because in the fit method,
the sample weights are normalized:
adaboost code
Have RF and ET in here because they emit a warning if class_weights
are used together with warmstarts

The reason given for adaboost is not entirely clear to me. For RF and ET, the point is that auto-sklearn trains them iteratively (iterative fit: keeping the hyperparameters fixed while growing the number of bagged trees), and using class_weights together with this training mode (warm_start = True) triggers a warning.
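
We can reproduce that warning with a minimal sketch (toy data, my own illustration):

import warnings
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0] * 15 + [1] * 5)

# Iterative fit as auto-sklearn does it: warm_start plus a growing
# number of trees; combined with a class_weight preset this warns
clf = RandomForestClassifier(n_estimators=4, warm_start=True,
                             class_weight='balanced', random_state=0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    clf.fit(X, y)
    clf.n_estimators += 4
    clf.fit(X, y)

print(any('warm_start' in str(w.message) for w in caught))  # True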

The concrete computation steps:

if len(Y.shape) > 1:
    # Multilabel case: encode each label combination as a unique
    # integer by weighting the label columns with powers of 2
    offsets = [2 ** i for i in range(Y.shape[1])]
    Y_ = np.sum(Y * offsets, axis=1)
else:
    Y_ = Y

unique, counts = np.unique(Y_, return_counts=True)
# This will result in an average weight of 1!
cw = 1 / (counts / np.sum(counts)) / 2
if len(Y.shape) == 2:
    cw /= Y.shape[1]

sample_weights = np.ones(Y_.shape)

# Give every sample the weight of its class
for i, ue in enumerate(unique):
    mask = Y_ == ue
    sample_weights[mask] *= cw[i]

This code is not very intuitive, so let's work through it interactively in IPython:

>>> import numpy as np
>>> Y_=np.array([1,1,2,2,2,2,2])
>>> unique, counts = np.unique(Y_, return_counts=True)
>>> cw = 1 / (counts / np.sum(counts)) / 2
>>> cw
Out[6]: array([1.75, 0.7 ])
>>> 1 / (counts / np.sum(counts)) 
Out[7]: array([3.5, 1.4])
>>> (counts / np.sum(counts)) 
Out[8]: array([0.28571429, 0.71428571])
>>> sample_weights = np.ones(Y_.shape)
>>> for i, ue in enumerate(unique):
    mask = Y_ == ue
    sample_weights[mask] *= cw[i]
    
>>> sample_weights
Out[10]: array([1.75, 1.75, 0.7 , 0.7 , 0.7 , 0.7 , 0.7 ])
>>> sample_weights.mean()
Out[11]: 1.0000000000000002

Mathematically, let the label vector $Y$ take $N$ distinct values $Q_1, Q_2, \dots, Q_N$. The proportion of each value is $\frac{|Q_i|}{|Y|}$, and the weight of each value is $1 / \frac{|Q_i|}{|Y|} / 2 = \frac{|Y|}{2\times |Q_i|}$.

After working it out on scratch paper, the 2 comes from auto-sklearn only considering binary classification here: with $N$ classes, the mean sample weight is $\sum_i^N \frac{|Q_i|}{|Y|} \times \frac{|Y|}{2|Q_i|} = \frac{N}{2}$, which is 1 only when $N = 2$.

To generalize to multiclass, the weight of each value should instead be $\frac{|Y|}{N\times |Q_i|}$; the mean sample weight is then:

$$\mu=\sum_i^N \sum_{j=1}^{|Q_i|} \frac{1}{|Y|} \times \frac{|Y|}{N |Q_i|} = \sum_i^N \frac{1}{N}=1$$
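
A quick numeric check of this generalization on a made-up 3-class example (my sketch, not auto-sklearn code):

import numpy as np

Y_ = np.array([0, 0, 0, 1, 1, 2])      # 3 imbalanced classes
unique, counts = np.unique(Y_, return_counts=True)
N = len(unique)
cw = len(Y_) / (N * counts)            # |Y| / (N * |Q_i|)

sample_weights = cw[np.searchsorted(unique, Y_)]
print(sample_weights.mean())           # -> 1.0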

Adjusting class_weights

Setting 'balanced'

Classifiers which can adjust sample weights themselves via the
argument class_weight

clf_ = ['decision_tree', 'liblinear_svc',
        'libsvm_svc']
pre_ = ['liblinear_svc_preprocessor',
        'extra_trees_preproc_for_classification']
if classifier in clf_:
    init_params['classifier:class_weight'] = 'balanced'
if preprocessor in pre_:
    init_params['preprocessor:class_weight'] = 'balanced'
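
Note that scikit-learn's 'balanced' preset computes exactly the multiclass generalization derived above, n_samples / (n_classes * count). We can verify this with the same toy labels as before:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1, 1, 2, 2, 2, 2, 2])
print(compute_class_weight('balanced', classes=np.unique(y), y=y))
# -> [1.75 0.7 ], matching the sample_weights demo above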

Setting a dict

clf_ = ['ridge']
if classifier in clf_:
    class_weights = {}

    unique, counts = np.unique(Y, return_counts=True)
    # Inverse-frequency weights, normalized so that the class
    # weights themselves average to 1
    cw = 1. / counts
    cw = cw / np.mean(cw)

    for i, ue in enumerate(unique):
        class_weights[ue] = cw[i]

    if classifier in clf_:  # redundant inner check, kept as in the source
        init_params['classifier:class_weight'] = class_weights
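
Checking these values with the same toy labels (my sketch): the weights are still proportional to 1/count, but they are normalized so the class weights average to 1 over classes, so the scale differs from the 'balanced' preset:

import numpy as np

Y = np.array([1, 1, 2, 2, 2, 2, 2])
unique, counts = np.unique(Y, return_counts=True)
cw = 1. / counts              # [0.5, 0.2]
cw = cw / np.mean(cw)         # [1.4286, 0.5714]

class_weights = {int(ue): float(w) for ue, w in zip(unique, cw)}
print(class_weights)          # {1: 1.4285..., 2: 0.5714...}
# 'balanced' would give {1: 1.75, 2: 0.7} -- same ratio, different scale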