Investigating auto-sklearn's class-imbalance handling strategies


auto-sklearn has two kinds of strategies for handling class imbalance; we discuss them in turn.

Adjusting sample_weights

First, here are the algorithms that use the sample_weights strategy:

clf_ = ['adaboost', 'random_forest', 'extra_trees', 'sgd', 'passive_aggressive']
pre_ = []
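
These estimators all accept per-sample weights in fit. Here is a minimal sketch of the mechanism with made-up data (my own illustration, not auto-sklearn's actual call site):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.RandomState(0)
X = rng.rand(7, 2)
y = np.array([1, 1, 2, 2, 2, 2, 2])
# minority-class samples get larger weights (these values are derived below)
sample_weights = np.array([1.75, 1.75, 0.7, 0.7, 0.7, 0.7, 0.7])

clf = AdaBoostClassifier(n_estimators=10, random_state=0)
clf.fit(X, y, sample_weight=sample_weights)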

Why these algorithms? 'adaboost', 'random_forest', and 'extra_trees' are all tree-based algorithms and do support class_weights, so why use sample weights instead?

The comment in the source code explains it clearly; let's read it together:

Classifiers which require sample weights:
We can have adaboost in here, because in the fit method,
the sample weights are normalized:
adaboost code
Have RF and ET in here because they emit a warning if class_weights
are used together with warmstarts

The reason given for adaboost is not entirely clear to me. For RF and ET, the point is that auto-sklearn trains them iteratively (iterative fit: keeping the hyperparameters fixed while growing the number of bagged trees), and using class_weights together with this training mode (warm_start = True) triggers a warning.
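
We can reproduce that warning with a minimal sketch (toy data, my own illustration):

import warnings
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.array([0] * 15 + [1] * 5)

# Iterative fit as auto-sklearn does it: warm_start plus a growing
# number of trees; combined with a class_weight preset this warns
clf = RandomForestClassifier(n_estimators=4, warm_start=True,
                             class_weight='balanced', random_state=0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    clf.fit(X, y)
    clf.n_estimators += 4
    clf.fit(X, y)

print(any('warm_start' in str(w.message) for w in caught))  # True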

The concrete computation steps:

if len(Y.shape) > 1:
    # Multilabel case: encode each label combination as a unique
    # integer by weighting the label columns with powers of 2
    offsets = [2 ** i for i in range(Y.shape[1])]
    Y_ = np.sum(Y * offsets, axis=1)
else:
    Y_ = Y

unique, counts = np.unique(Y_, return_counts=True)
# This will result in an average weight of 1!
cw = 1 / (counts / np.sum(counts)) / 2
if len(Y.shape) == 2:
    cw /= Y.shape[1]

sample_weights = np.ones(Y_.shape)

# Give every sample the weight of its class
for i, ue in enumerate(unique):
    mask = Y_ == ue
    sample_weights[mask] *= cw[i]

This code is not very intuitive, so let's work through it interactively in IPython:

>>> import numpy as np
>>> Y_=np.array([1,1,2,2,2,2,2])
>>> unique, counts = np.unique(Y_, return_counts=True)
>>> cw = 1 / (counts / np.sum(counts)) / 2
>>> cw
Out[6]: array([1.75, 0.7 ])
>>> 1 / (counts / np.sum(counts)) 
Out[7]: array([3.5, 1.4])
>>> (counts / np.sum(counts)) 
Out[8]: array([0.28571429, 0.71428571])
>>> sample_weights = np.ones(Y_.shape)
>>> for i, ue in enumerate(unique):
    mask = Y_ == ue
    sample_weights[mask] *= cw[i]
    
>>> sample_weights
Out[10]: array([1.75, 1.75, 0.7 , 0.7 , 0.7 , 0.7 , 0.7 ])
>>> sample_weights.mean()
Out[11]: 1.0000000000000002

Mathematically, let the label vector $Y$ take $N$ distinct values $Q_1, Q_2, \dots, Q_N$. The proportion of each value is $\frac{|Q_i|}{|Y|}$, and the weight of each value is $1 / \frac{|Q_i|}{|Y|} / 2 = \frac{|Y|}{2\times |Q_i|}$.

After working it out on scratch paper, the 2 comes from auto-sklearn only considering binary classification here: with $N$ classes, the mean sample weight is $\sum_i^N \frac{|Q_i|}{|Y|} \times \frac{|Y|}{2|Q_i|} = \frac{N}{2}$, which is 1 only when $N = 2$.

To generalize to multiclass, the weight of each value should instead be $\frac{|Y|}{N\times |Q_i|}$; the mean sample weight is then:

$$\mu=\sum_i^N \sum_{j=1}^{|Q_i|} \frac{1}{|Y|} \times \frac{|Y|}{N |Q_i|} = \sum_i^N \frac{1}{N}=1$$
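
A quick numeric check of this generalization on a made-up 3-class example (my sketch, not auto-sklearn code):

import numpy as np

Y_ = np.array([0, 0, 0, 1, 1, 2])      # 3 imbalanced classes
unique, counts = np.unique(Y_, return_counts=True)
N = len(unique)
cw = len(Y_) / (N * counts)            # |Y| / (N * |Q_i|)

sample_weights = cw[np.searchsorted(unique, Y_)]
print(sample_weights.mean())           # -> 1.0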

Adjusting class_weights

Setting 'balanced'

Classifiers which can adjust sample weights themselves via the
argument class_weight

clf_ = ['decision_tree', 'liblinear_svc',
        'libsvm_svc']
pre_ = ['liblinear_svc_preprocessor',
        'extra_trees_preproc_for_classification']
if classifier in clf_:
    init_params['classifier:class_weight'] = 'balanced'
if preprocessor in pre_:
    init_params['preprocessor:class_weight'] = 'balanced'
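
Note that scikit-learn's 'balanced' preset computes exactly the multiclass generalization derived above, n_samples / (n_classes * count). We can verify this with the same toy labels as before:

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([1, 1, 2, 2, 2, 2, 2])
print(compute_class_weight('balanced', classes=np.unique(y), y=y))
# -> [1.75 0.7 ], matching the sample_weights demo above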

Setting a dict

clf_ = ['ridge']
if classifier in clf_:
    class_weights = {}

    unique, counts = np.unique(Y, return_counts=True)
    # Inverse-frequency weights, normalized so that the class
    # weights themselves average to 1
    cw = 1. / counts
    cw = cw / np.mean(cw)

    for i, ue in enumerate(unique):
        class_weights[ue] = cw[i]

    if classifier in clf_:  # redundant inner check, kept as in the source
        init_params['classifier:class_weight'] = class_weights
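
Checking these values with the same toy labels (my sketch): the weights are still proportional to 1/count, but they are normalized so the class weights average to 1 over classes, so the scale differs from the 'balanced' preset:

import numpy as np

Y = np.array([1, 1, 2, 2, 2, 2, 2])
unique, counts = np.unique(Y, return_counts=True)
cw = 1. / counts              # [0.5, 0.2]
cw = cw / np.mean(cw)         # [1.4286, 0.5714]

class_weights = {int(ue): float(w) for ue, w in zip(unique, cw)}
print(class_weights)          # {1: 1.4285..., 2: 0.5714...}
# 'balanced' would give {1: 1.75, 2: 0.7} -- same ratio, different scale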