Investigating auto-sklearn's class-imbalance handling strategies
auto-sklearn has two kinds of strategies for handling class imbalance; we discuss each in turn.
Adjusting sample_weights
First, let's list which algorithms use sample_weights. The relevant lists in the source:

clf_ = ['adaboost', 'random_forest', 'extra_trees', 'sgd', 'passive_aggressive']
pre_ = []
Why these algorithms? 'adaboost', 'random_forest' and 'extra_trees', for instance, are all tree-based algorithms that do support class_weights, so why use sample weights instead?
The comment in the source explains it clearly; let's read it together:
Classifiers which require sample weights:
We can have adaboost in here, because in the fit method,
the sample weights are normalized:
adaboost code
Have RF and ET in here because they emit a warning if class_weights
are used together with warmstarts
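The adaboost point refers to scikit-learn normalizing sample_weight inside fit, so only the relative weights matter. A minimal sketch, assuming recent scikit-learn behavior (the dataset and weights here are made up), showing that scaling all sample weights by a constant leaves the fitted model unchanged:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, random_state=0)
w = np.random.RandomState(0).rand(len(y))

# fit once with weights w and once with 10 * w; since fit normalizes
# sample_weight, both runs should produce identical predictions
a = AdaBoostClassifier(random_state=0).fit(X, y, sample_weight=w)
b = AdaBoostClassifier(random_state=0).fit(X, y, sample_weight=10 * w)
print(np.array_equal(a.predict(X), b.predict(X)))  # expected: True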
I don't fully understand the reason given for adaboost. For RF and ET the meaning is this: auto-sklearn trains them iteratively (iterative fit, i.e. keeping the hyperparameters fixed while growing the number of bagging trees), and this training mode (warm_start=True) triggers a warning when combined with class_weights.
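That warning can be reproduced directly; a minimal sketch, assuming recent scikit-learn behavior (the dataset here is made up):

import warnings
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)

# warm_start lets the forest grow incrementally; combining it with a
# class_weight preset makes scikit-learn emit a UserWarning during fit
clf = RandomForestClassifier(n_estimators=10, warm_start=True,
                             class_weight='balanced', random_state=0)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    clf.fit(X, y)
print([str(w.message) for w in caught])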
The concrete computation steps:
# multilabel Y (2-D): pack each row into a single integer via binary offsets
if len(Y.shape) > 1:
    offsets = [2 ** i for i in range(Y.shape[1])]
    Y_ = np.sum(Y * offsets, axis=1)
else:
    Y_ = Y

unique, counts = np.unique(Y_, return_counts=True)
# This will result in an average weight of 1!
cw = 1 / (counts / np.sum(counts)) / 2
if len(Y.shape) == 2:
    cw /= Y.shape[1]

sample_weights = np.ones(Y_.shape)
# every sample receives the weight of its (encoded) class
for i, ue in enumerate(unique):
    mask = Y_ == ue
    sample_weights[mask] *= cw[i]
The code is not very intuitive, though, so let's step through it interactively in IPython:
>>> import numpy as np
>>> Y_ = np.array([1, 1, 2, 2, 2, 2, 2])
>>> unique, counts = np.unique(Y_, return_counts=True)
>>> cw = 1 / (counts / np.sum(counts)) / 2
>>> cw
array([1.75, 0.7 ])
>>> 1 / (counts / np.sum(counts))
array([3.5, 1.4])
>>> counts / np.sum(counts)
array([0.28571429, 0.71428571])
>>> sample_weights = np.ones(Y_.shape)
>>> for i, ue in enumerate(unique):
...     mask = Y_ == ue
...     sample_weights[mask] *= cw[i]
...
>>> sample_weights
array([1.75, 1.75, 0.7 , 0.7 , 0.7 , 0.7 , 0.7 ])
>>> sample_weights.mean()
1.0000000000000002
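The session above only exercises the 1-D branch. In the multilabel branch, each row of Y is packed into one integer by the binary offsets, so every distinct label combination is treated as its own class. A quick check with a made-up 2-label Y:

import numpy as np

Y = np.array([[0, 0],
              [1, 0],
              [0, 1],
              [1, 1]])
offsets = [2 ** i for i in range(Y.shape[1])]  # [1, 2]
Y_ = np.sum(Y * offsets, axis=1)
print(Y_)  # [0 1 2 3]: each label combination maps to a unique integer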
Mathematically, let the label vector be $Y$ with $K$ distinct values $c_1, \dots, c_K$, let the proportion of value $c_i$ be $p_i$, and let the weight of value $c_i$ be $w_i$, i.e. $w_i = \frac{1}{2p_i}$.
After working this out on scratch paper, the origin of this $2$ is that auto-sklearn only considered binary classification.
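Spelling out the binary case (my own derivation from the formula above):

$$\bar{w} \;=\; \sum_{i=1}^{2} p_i w_i \;=\; p_1 \cdot \frac{1}{2p_1} + p_2 \cdot \frac{1}{2p_2} \;=\; \frac{1}{2} + \frac{1}{2} \;=\; 1$$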
If we generalize to $K$-class classification, with each value's proportion again $p_i$, the mean of the sample weights becomes $\sum_{i=1}^{K} p_i \cdot \frac{1}{2p_i} = \frac{K}{2}$, which equals $1$ only when $K = 2$.
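A quick numerical check of the $\frac{K}{2}$ claim, using a hypothetical 3-class label vector:

import numpy as np

Y_ = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 2])  # 3 classes
unique, counts = np.unique(Y_, return_counts=True)
cw = 1 / (counts / np.sum(counts)) / 2   # same formula as auto-sklearn
sample_weights = np.ones(Y_.shape)
for i, ue in enumerate(unique):
    sample_weights[Y_ == ue] *= cw[i]
print(sample_weights.mean())  # 1.5 == K/2 for K = 3, not 1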
Adjusting class_weights
Setting balanced
Classifiers which can adjust sample weights themselves via the
argument class_weight
clf_ = ['decision_tree', 'liblinear_svc', 'libsvm_svc']
pre_ = ['liblinear_svc_preprocessor', 'extra_trees_preproc_for_classification']
if classifier in clf_:
    init_params['classifier:class_weight'] = 'balanced'
if preprocessor in pre_:
    init_params['preprocessor:class_weight'] = 'balanced'
Setting a dict
clf_ = ['ridge']
if classifier in clf_:
    class_weights = {}
    unique, counts = np.unique(Y, return_counts=True)
    cw = 1. / counts
    cw = cw / np.mean(cw)
    for i, ue in enumerate(unique):
        class_weights[ue] = cw[i]
    if classifier in clf_:
        init_params['classifier:class_weight'] = class_weights
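Plugging the same toy labels into this branch shows that the normalization differs from the sample_weights strategy: here the class weights themselves, not the per-sample weights, average to 1, regardless of class proportions. A quick check:

import numpy as np

Y = np.array([1, 1, 2, 2, 2, 2, 2])
unique, counts = np.unique(Y, return_counts=True)
cw = 1. / counts         # [0.5, 0.2]
cw = cw / np.mean(cw)    # normalize so the class weights average to 1
class_weights = {int(ue): w for ue, w in zip(unique, cw)}
print(class_weights)     # {1: ~1.4286, 2: ~0.5714}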