VoVNet:一种实时高效的目标检测Backbone网络【pytorch代码详解】
2.Factors of Efficient Network Design
在设计轻量级网络时,FLOPs和模型参数是主要考虑因素,但是减少模型大小和FLOPs不等同于减少推理时间和降低能耗。比如ShuffleNetv2与MobileNetv2在相同的FLOPs下,前者在GPU上速度更快。所以除了FLOPs和模型大小外,还需要考虑其他因素对能耗和模型推理速度的影响。这里考虑两个重要的因素:内存访问成本(Memory Access Cost,MAC)和GPU计算效率。
2.1. Memory Access Cost
对于CNN,能耗在内存访问而不是计算上。影响MAC的主要是是内存占用(intermediate activation memory footprint),它主要受卷积核和feature map大小的影响。c为输入输出通道。
2.2. GPUComputation
通过减少FLOP是来加速的前提是,每个flop point的计算效率是一致的。
GPU特性:
(1)擅长parallel computation,tensor越大,GPU使用效率越高。
(2)把大的卷积操作拆分成碎片的小操作将不利于GPU计算。
因此,设计layer数量少的网络是更好的选择。mobileNet使用1x1卷积来减少计算量,不过这不利于GPU计算。为了衡量GPU使用效率,我们使用Flops/s指标。
3. Proposed Method
3.1. Rethinking Dense Connection
当固定卷积层参数量B=k^2hwcico的时候,MAC可以表示如下,根据均值不等式,可以知道当输入和输出的channel数相同时MAC才取下界,此时的设计是最高效的。
使用Dense connect模块,输出的channel size不变,但是输入的channel size一直在线性增加,因此DenseNet有很高的MAC。bottleneck connection同样不利于GPU计算。在模型很大的时候,计算量随着深度指数二阶增长。bottleneck 把一个3x3的卷积分成了两个计算,相当于增加了一次序列运算。
3.2. One Shot Aggregation
对于DenseNet来说,其核心模块就是Dense
Block,如下图1a所示,这种密集连接会聚合前面所有的layer,这导致每个layer的输入channel数线性增长。受限于FLOPs和模型参数,每层layer的输出channel数是固定大小,这带来的问题就是输入和输出channel数不一致,如前面所述,此时的MAC不是最优的。另外,由于输入channel数较大,DenseNet采用了1x1卷积层先压缩特征,这个额外层的引入对GPU高效计算不利。所以,虽然DenseNet的FLOPs和模型参数都不大,但是推理却并不高效,当输入较大时往往需要更多的显存和推理时间。
DenseNet的一大问题就是密集连接太重了,而且每个layer都会聚合前面层的特征,其实造成的是特征冗余,而且从模型weights的L1范数会发现中间层对最后的分类层贡献较少,这不难理解,因为后面的特征其实已经学习到了这些中间层的核心信息。这种信息冗余反而是可以优化的方向,据此这里提出了OSA(One-Shot Aggregation)模块,如图1b所示,简单来说,就是只在最后一次性聚合前面所有的layer。这一改动将会解决DenseNet前面所述的问题,因为每个layer的输入channel数是固定的,这里可以让输出channel数和输入一致而取得最小的MAC,而且也不再需要1x1卷积层来压缩特征,所以OSA模块是GPU计算高效的。
那么OSA模块效果如何,论文中拿DenseNet-40来做对比,DenseBlock层数是12,OSA模块也设计为12层,但是保持和Dense
Block类似的参数大小和计算量,此时OSA模块的输出将更大。最终发现在CIFAR-10数据集上acc仅比DenseNet下降了1.2%。但是如果将OSA模块的层数降至5,而提升layer的通道数为43,会发现与DenseNet-40模型效果相当。这说明DenseNet中很多中间特征可能是冗余的。尽管OSA模块性能没有提升,但是MAC低且计算更高效,这对于目标检测非常重要,因为检测模型一般的输入都是较大的。
3.3. Configuration of VoVNet
VoVNet由OSA模块构成,主要有三种不同的配置,如下表所示。VoVNet首先是一个由3个3x3卷积层构成的stem block,然后4个阶段的OSA模块,每个stage的最后会采用一个stride为2的3x3 max pooling层进行降采样,模型最终的output stride是32。与其他网络类似,每次降采样后都会提升特征的channel数。VoVNet-27-slim是一个轻量级模型,而VoVNet-39/57在stage4和stage5包含更多的OSA模块,所以模型更大。
4. Experiments
相比于DenseNet-67,PeleeNet减少了Flops,但是推断速度没有提升,与之相反,VoVNet-27-slim稍微增加了Flops,而推断速度提升了一倍。同时,VoVNet-27-slim的精度比其他模型都高。VoVNet-27-slim的内存占用、能耗、GPU使用效率都是最好的。相比其他模型,VoVNet做到了准确率和效率的均衡,提升了目标检测的整体性能。
import torch
import torch.nn as nn
import torch.nn.functional as F
from collections import OrderedDict
__all__ = ['VoVNet', 'vovnet27_slim', 'vovnet39', 'vovnet57']
model_urls = {
'vovnet39': 'https://dl.dropbox.com/s/1lnzsgnixd8gjra/vovnet39_torchvision.pth?dl=1',
'vovnet57': 'https://dl.dropbox.com/s/6bfu9gstbwfw31m/vovnet57_torchvision.pth?dl=1'
}
def conv3x3(in_channels, out_channels, module_name, postfix,
stride=1, groups=1, kernel_size=3, padding=1):
"""3x3 convolution with padding"""
return [
('{}_{}/conv'.format(module_name, postfix),
nn.Conv2d(in_channels, out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
groups=groups,
bias=False)),
('{}_{}/norm'.format(module_name, postfix),
nn.BatchNorm2d(out_channels)),
('{}_{}/relu'.format(module_name, postfix),
nn.ReLU(inplace=True)),
]
def conv1x1(in_channels, out_channels, module_name, postfix,
stride=1, groups=1, kernel_size=1, padding=0):
"""1x1 convolution"""
return [
('{}_{}/conv'.format(module_name, postfix),
nn.Conv2d(in_channels, out_channels,
kernel_size=kernel_size,
stride=stride,
padding=padding,
groups=groups,
bias=False)),
('{}_{}/norm'.format(module_name, postfix),
nn.BatchNorm2d(out_channels)),
('{}_{}/relu'.format(module_name, postfix),
nn.ReLU(inplace=True)),
]
class _OSA_module(nn.Module):
def __init__(self,
in_ch,
stage_ch,
concat_ch,
layer_per_block,
module_name,
identity=False):
super(_OSA_module, self).__init__()
self.identity = identity
self.layers = nn.ModuleList()
in_channel = in_ch
for i in range(layer_per_block):
self.layers.append(nn.Sequential(
OrderedDict(conv3x3(in_channel, stage_ch, module_name, i))))
in_channel = stage_ch
# feature aggregation
in_channel = in_ch + layer_per_block * stage_ch
print("inch:",in_ch,"layer_per_block:",layer_per_block,"stage:",stage_ch)
self.concat = nn.Sequential(
OrderedDict(conv1x1(in_channel, concat_ch, module_name, 'concat')))
def forward(self, x):
identity_feat = x
output = []
output.append(x)
for layer in self.layers:
x = layer(x)
output.append(x)
print("output:",output)
x = torch.cat(output, dim=1) #按列拼
xt = self.concat(x)
if self.identity:
xt = xt + identity_feat
return xt
class _OSA_stage(nn.Sequential):
def __init__(self,
in_ch,
stage_ch,
concat_ch,
block_per_stage,
layer_per_block,
stage_num):
super(_OSA_stage, self).__init__()
if not stage_num == 2:
self.add_module('Pooling',
nn.MaxPool2d(kernel_size=3, stride=2, ceil_mode=True)) #向上取整,默认padding为0
module_name = f'OSA{stage_num}_1'
self.add_module(module_name,
_OSA_module(in_ch,
stage_ch,
concat_ch,
layer_per_block,
module_name))
for i in range(block_per_stage-1):
module_name = f'OSA{stage_num}_{i+2}'
self.add_module(module_name,
_OSA_module(concat_ch,
stage_ch,
concat_ch,
layer_per_block,
module_name,
identity=True))
class VoVNet(nn.Module):
def __init__(self,
config_stage_ch,
config_concat_ch,
block_per_stage,
layer_per_block,
num_classes=1000):
super(VoVNet, self).__init__()
# Stem module
stem = conv3x3(3, 64, 'stem', '1', 2)
stem += conv3x3(64, 64, 'stem', '2', 1)
stem += conv3x3(64, 128, 'stem', '3', 2)
self.add_module('stem', nn.Sequential(OrderedDict(stem)))
stem_out_ch = [128]
in_ch_list = stem_out_ch + config_concat_ch[:-1]
self.stage_names = []
for i in range(4): #num_stages
name = 'stage%d' % (i+2)
self.stage_names.append(name)
self.add_module(name,
_OSA_stage(in_ch_list[i],
config_stage_ch[i],
config_concat_ch[i],
block_per_stage[i],
layer_per_block,
i+2))
self.classifier = nn.Linear(config_concat_ch[-1], num_classes)
for m in self.modules():
if isinstance(m, nn.Conv2d):
nn.init.kaiming_normal_(m.weight)
elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
nn.init.constant_(m.weight, 1)
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.Linear):
nn.init.constant_(m.bias, 0)
def forward(self, x):
x = self.stem(x)
for name in self.stage_names:
x = getattr(self, name)(x)
x = F.adaptive_avg_pool2d(x, (1, 1)).view(x.size(0), -1)
x = self.classifier(x)
return x
def _vovnet(arch,
config_stage_ch,
config_concat_ch,
block_per_stage,
layer_per_block,
pretrained,
progress,
**kwargs):
model = VoVNet(config_stage_ch, config_concat_ch,
block_per_stage, layer_per_block,
**kwargs)
if pretrained:
state_dict = torch.hub.load_state_dict_from_url(model_urls[arch],
progress=progress)
model.load_state_dict(state_dict)
return model
def vovnet57(pretrained=False, progress=True, **kwargs):
r"""Constructs a VoVNet-57 model as described in
`"An Energy and GPU-Computation Efficient Backbone Networks"
<https://arxiv.org/abs/1904.09730>`_.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
progress (bool): If True, displays a progress bar of the download to stderr
"""
return _vovnet('vovnet57', [128, 160, 192, 224], [256, 512, 768, 1024],
[1,1,4,3], 5, pretrained, progress, **kwargs)
def vovnet39(pretrained=False, progress=True, **kwargs):
r"""Constructs a VoVNet-39 model as described in
`"An Energy and GPU-Computation Efficient Backbone Networks"
<https://arxiv.org/abs/1904.09730>`_.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
progress (bool): If True, displays a progress bar of the download to stderr
"""
return _vovnet('vovnet39', [128, 160, 192, 224], [256, 512, 768, 1024],
[1,1,2,2], 5, pretrained, progress, **kwargs)
def vovnet27_slim(pretrained=False, progress=True, **kwargs):
r"""Constructs a VoVNet-39 model as described in
`"An Energy and GPU-Computation Efficient Backbone Networks"
<https://arxiv.org/abs/1904.09730>`_.
Args:
pretrained (bool): If True, returns a model pre-trained on ImageNet
progress (bool): If True, displays a progress bar of the download to stderr
"""
return _vovnet('vovnet27_slim', [64, 80, 96, 112], [128, 256, 384, 512],
[1,1,1,1], 5, pretrained, progress, **kwargs)
if __name__ == '__main__':
model = vovnet39()
print(model)