Training Tricks: Split Training for Classification Models with Millions of Classes

程序员文章站 2022-06-02 14:53:07

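1. Background

Many face recognition algorithms are trained as classifiers. This approach has one big problem: the final fully connected layer of the model has far too many parameters. Taking a 512-dimensional feature as an example: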

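Class count, parameter matrix shape, and parameter matrix size (MB, float32):

  • 1,000,000 classes: 512 × 1,000,000, 1953 MB
  • 2,000,000 classes: 512 × 2,000,000, 3906 MB
  • 5,000,000 classes: 512 × 5,000,000, 9765 MB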
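The sizes in the list above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes 4-byte float32 weights and ignores the bias; the helper name `fc_weight_mib` is made up for illustration.

```python
def fc_weight_mib(num_classes, feature_dim=512, bytes_per_param=4):
    """Size in MiB of a (feature_dim x num_classes) weight matrix, bias ignored."""
    return num_classes * feature_dim * bytes_per_param / 2**20


for n in (1_000_000, 2_000_000, 5_000_000):
    # Truncate to whole MiB, as the list above does
    print(f"{n:>9,d} classes: {int(fc_weight_mib(n))} MiB")
```

The same function makes it easy to check other feature sizes, for example `fc_weight_mib(1_000_000, feature_dim=256)`.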

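With even more classes, a consumer GPU such as the 1080 Ti can no longer hold this layer, not to mention the extra memory taken up by the intermediate results of forward/backward.

Open-source datasets keep growing. Even without data of your own, open-source data alone can push the class count to 1 million. Under these conditions training on a single card is difficult, and the model has to be split.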

2. Splitting the model

The most obvious way to split is to divide the largest layer, the final fc layer.

import torch
from torchvision.models import resnet50

class facemodel(torch.nn.Module):
    def __init__(self, num_classes):
        super(facemodel, self).__init__()
        dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
        # The backbone lives on GPU 0
        self.backbone = resnet50().to(dev0)
        # Replace the ImageNet head with a 512-dimensional feature layer
        self.backbone.fc = torch.nn.Linear(2048, 512, bias=True).to(dev0)
        # GPU 0 keeps only 1/6 of the classifier, because it also has to hold
        # the backbone's forward/backward intermediates
        self.fc1 = torch.nn.Linear(512, num_classes // 6).to(dev0)
        # The remaining classes go to GPU 1
        self.fc2 = torch.nn.Linear(512, num_classes - num_classes // 6).to(dev1)

    def forward(self, x):
        x = self.backbone(x)
        x1 = self.fc1(x)
        x2 = self.fc2(x.to(torch.device("cuda:1")))
        # Move the GPU 1 logits back to GPU 0 so the loss can be computed there
        return torch.cat([x1, x2.to(torch.device("cuda:0"))], dim=1)
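To see how the 1/6 split in the class above distributes the classifier parameters, the shard sizes can be worked out by hand. The sketch below is only arithmetic: it assumes float32 parameters and counts weights plus bias, and the helper name `split_fc` is hypothetical.

```python
def split_fc(num_classes, divisor, feature_dim=512, bytes_per_param=4):
    """Return (classes, MiB) for the GPU 0 shard and the GPU 1 shard."""
    n0 = num_classes // divisor      # same split as fc1 in the model above
    n1 = num_classes - n0            # everything else goes to fc2

    def mib(n):
        # n * feature_dim weights plus n bias terms
        return (n * feature_dim + n) * bytes_per_param / 2**20

    return (n0, mib(n0)), (n1, mib(n1))


(n0, mib0), (n1, mib1) = split_fc(2_000_000, 6)
print(f"GPU 0 shard: {n0:,d} classes, {mib0:.0f} MiB")
print(f"GPU 1 shard: {n1:,d} classes, {mib1:.0f} MiB")
```

For 2,000,000 classes the two shards together hold 2,000,000 × 513 = 1,026,000,000 parameters, which is the bulk of the model.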

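Take a model with 2 million classes as an example:

# Assumption: summary() comes from the torchsummary package, whose report
# format matches the output shown below
from torchsummary import summary

net = facemodel(2000000)
summary(net, (3, 224, 224))

The model's parameter counts are as follows: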

================================================================
Total params: 1,050,557,120
Trainable params: 1,050,557,120
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.57
Forward/backward pass size (MB): 301.82
Params size (MB): 4007.56
Estimated Total Size (MB): 4309.95
----------------------------------------------------------------

In theory a single card can fit (11178 - 4007.56) / 301.82 = 23.76 samples per batch, and simply doubling that for two cards gives 47.52.
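The estimate above is easy to recompute. The sketch below uses only the figures from the summary and the 11178 MiB card capacity; it ignores parameter gradients, optimizer state, and the CUDA context, so it should be read as a rough upper bound. The `pooled` figure is an additional assumption of mine, not the author's: it treats the two cards as one pool in which the weights are stored only once.

```python
GPU_MEM_MB = 11178.0      # memory of one GTX 1080 Ti as reported by nvidia-smi
PARAMS_MB = 4007.56       # "Params size (MB)" from the summary
PER_SAMPLE_MB = 301.82    # "Forward/backward pass size (MB)" from the summary

# Single card: whatever is left after the weights, divided by the per-sample cost
single = (GPU_MEM_MB - PARAMS_MB) / PER_SAMPLE_MB

# The article's two-card figure: simply double the single-card estimate
doubled = 2 * single

# Alternative (assumption): two cards pooled, weights counted once
pooled = (2 * GPU_MEM_MB - PARAMS_MB) / PER_SAMPLE_MB

print(f"single: {single:.2f}  doubled: {doubled:.2f}  pooled: {pooled:.2f}")
```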

Now let's see how large a batch_size two cards can actually run.

At this point, memory allocation on the two GPUs is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   59C    P8    20W / 250W |   1531MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 29%   52C    P8    19W / 250W |   3841MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     19447      C   /home/dai/py36env/bin/python                1521MiB |
|    1     19447      C   /home/dai/py36env/bin/python                3831MiB |
+-----------------------------------------------------------------------------+

Try batch_size=64:

batch_size = 64
# Dummy input and labels, used only to measure memory
img = torch.ones(batch_size, 3, 224, 224).to(torch.device("cuda:0"))
out = net(img)
label = torch.ones(batch_size).long().to(torch.device("cuda:0"))
loss = torch.nn.CrossEntropyLoss()(out, label)
loss.backward()
loss.item()

After backpropagation with a batch_size of 64, GPU memory usage is as follows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   73C    P2    84W / 250W |   9855MiB / 11178MiB |     56%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
|  0%   61C    P2    79W / 250W |   7505MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     19963      C   /home/dai/py36env/bin/python                9845MiB |
|    1     19963      C   /home/dai/py36env/bin/python                7495MiB |
+-----------------------------------------------------------------------------+

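So after splitting the model, training can use a larger batch_size.

However, the memory usage above reveals a problem: the forward/backward memory growth differs between the two GPUs, and their utilization differs greatly as well. This wastes memory, and in my view having one GPU do the work while the other looks on for long periods may also wear one of them out.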
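The imbalance can be quantified directly from the two nvidia-smi snapshots above. The sketch below just subtracts the idle readings from the readings taken after the batch of 64; the numbers are copied from the tables, and the variable names are mine.

```python
BATCH_SIZE = 64

# Memory-Usage column (MiB) from the two nvidia-smi snapshots above
idle = {0: 1531, 1: 3841}   # model loaded, nothing running
busy = {0: 9855, 1: 7505}   # after forward/backward with batch_size=64

growth = {gpu: busy[gpu] - idle[gpu] for gpu in idle}

for gpu, mib in growth.items():
    print(f"GPU {gpu}: +{mib} MiB total, {mib / BATCH_SIZE:.1f} MiB per sample")
```

GPU 0 grows by 8324 MiB while GPU 1 grows by 3664 MiB, so GPU 0 is the card that limits the batch size even though it holds the smaller share of the weights.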

To address this, we can try a finer-grained split of the model.

3. Finer-grained splitting

We can also split the resnet50 backbone across the two GPUs:

import torch
from torchvision.models import resnet50

class face_model(torch.nn.Module):
    def __init__(self, num_classes):
        super(face_model, self).__init__()
        dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
        backbone = resnet50()
        # First half of the backbone on GPU 0
        self.bottom = torch.nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool
        ).to(dev0)
        self.layer1 = backbone.layer1.to(dev0)
        self.layer2 = backbone.layer2.to(dev0)
        # Second half of the backbone and the feature layer on GPU 1
        self.layer3 = backbone.layer3.to(dev1)
        self.layer4 = backbone.layer4.to(dev1)
        self.avgpool = backbone.avgpool.to(dev1)
        self.fc = torch.nn.Linear(2048, 512, bias=True).to(dev1)
        # The classifier is split evenly between the two GPUs
        self.fc1 = torch.nn.Linear(512, num_classes // 2, bias=True).to(dev0)
        self.fc2 = torch.nn.Linear(512, num_classes - num_classes // 2, bias=True).to(dev1)

    def forward(self, x):
        dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
        x = x.to(dev0)
        x = self.bottom(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = x.to(dev1)  # hand the activations over to GPU 1
        x = self.layer3(x)
        x = self.layer4(x)
        x = torch.flatten(self.avgpool(x), 1)  # (N, 2048, 1, 1) -> (N, 2048)
        x = self.fc(x)
        x2 = self.fc2(x)
        x1 = self.fc1(x.to(dev0))
        # Gather all logits on GPU 0
        return torch.cat([x1, x2.to(dev0)], dim=1)

net = face_model(2000000)

Note: move networks and tensors with to(device), not with cuda(GPUID).

When idle, memory usage is fairly balanced:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   64C    P2    76W / 250W |   2539MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 29%   62C    P2    80W / 250W |   2625MiB / 11178MiB |     62%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9574      C   /home/dai/py36env/bin/python                2529MiB |
|    1      9574      C   /home/dai/py36env/bin/python                2615MiB |
+-----------------------------------------------------------------------------+

But as soon as it runs with a batch size of 64, it turns into this:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   67C    P2    81W / 250W |  10945MiB / 11178MiB |     78%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 31%   62C    P2    81W / 250W |   6315MiB / 11178MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      9574      C   /home/dai/py36env/bin/python               10935MiB |
|    1      9574      C   /home/dai/py36env/bin/python                6305MiB |
+-----------------------------------------------------------------------------+

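Both memory and load look very unbalanced. I think this can be addressed in two ways:

  • move more of the fc layer's weights to GPU 1;
  • distribute the loss computation across the two GPUs.

4. Computing the loss on two GPUs

Loss computation in face recognition is often fairly complex, so this kind of load imbalance becomes even more pronounced. To mitigate it, the loss can be computed inside forward, with the first half of the batch handled on GPU 0 and the second half on GPU 1: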

import torch
from torchvision.models import resnet50

class face_model(torch.nn.Module):
    def __init__(self, num_classes):
        super(face_model, self).__init__()
        dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
        backbone = resnet50()
        self.bottom = torch.nn.Sequential(
            backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool
        ).to(dev0)
        self.layer1 = backbone.layer1.to(dev0)
        self.layer2 = backbone.layer2.to(dev0)
        self.layer3 = backbone.layer3.to(dev1)
        self.layer4 = backbone.layer4.to(dev1)
        self.avgpool = backbone.avgpool.to(dev1)
        self.fc = torch.nn.Linear(2048, 512, bias=True).to(dev1)
        self.fc1 = torch.nn.Linear(512, num_classes // 2, bias=True).to(dev0)
        self.fc2 = torch.nn.Linear(512, num_classes - num_classes // 2, bias=True).to(dev1)

    def forward(self, x, label):
        dev0, dev1 = torch.device("cuda:0"), torch.device("cuda:1")
        x = x.to(dev0)
        x = self.bottom(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = x.to(dev1)
        x = self.layer3(x)
        x = self.layer4(x)
        x = torch.flatten(self.avgpool(x), 1)  # (N, 2048, 1, 1) -> (N, 2048)
        x = self.fc(x)
        x2 = self.fc2(x)
        x1 = self.fc1(x.to(dev0))
        x = torch.cat([x1, x2.to(dev0)], dim=1)
        # First half of the batch: loss on GPU 0; second half: loss on GPU 1.
        # Summing per-sample losses and dividing by the batch size gives the same
        # value as (loss1 + loss2) / 2 for even batches and stays a true mean for odd ones.
        half = len(label) // 2
        loss1 = torch.nn.functional.cross_entropy(
            x[:half], label[:half].to(dev0), reduction="sum")
        loss2 = torch.nn.functional.cross_entropy(
            x[half:].to(dev1), label[half:].to(dev1), reduction="sum")
        return (loss1 + loss2.to(dev0)) / len(label)

net = face_model(2000000)

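As the GPU information below shows, spreading the loss across the GPUs improves the memory distribution slightly, and GPU utilization also looks somewhat more normal.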

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.78       Driver Version: 410.78       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   86C    P2   166W / 250W |  10701MiB / 11178MiB |     43%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:04:00.0 Off |                  N/A |
| 34%   62C    P2    81W / 250W |   7053MiB / 11178MiB |     74%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     11743      C   /home/dai/py36env/bin/python               10691MiB |
|    1     11743      C   /home/dai/py36env/bin/python                7043MiB |
+-----------------------------------------------------------------------------+
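When a batch is split into two parts for the loss, averaging the two per-part means equals the full-batch mean only if both parts have the same size. The following is a minimal pure-Python sketch of that point; it does not use PyTorch, and the function names and toy logits are invented for illustration.

```python
import math


def cross_entropy_sum(logits, labels):
    """Sum of per-sample softmax cross-entropy losses."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse - row[y]
    return total


def split_loss(logits, labels, weighted):
    """Loss computed in two parts, as if each part lived on its own GPU."""
    half = len(labels) // 2
    s1 = cross_entropy_sum(logits[:half], labels[:half])
    s2 = cross_entropy_sum(logits[half:], labels[half:])
    if weighted:
        # Sum everything, divide once by the batch size
        return (s1 + s2) / len(labels)
    # Average of the two per-part means
    return (s1 / half + s2 / (len(labels) - half)) / 2


# Odd batch of 3 samples: the parts have sizes 1 and 2
logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [-0.5, 1.5, 0.0]]
labels = [0, 2, 1]
full = cross_entropy_sum(logits, labels) / len(labels)

print("full-batch mean :", full)
print("weighted split  :", split_loss(logits, labels, weighted=True))
print("mean of means   :", split_loss(logits, labels, weighted=False))
```

With an even batch size both variants agree, so the distinction only matters for odd batches such as a short final batch of an epoch.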

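5. Model speed

After the model is split there are many more data-transfer operations, so training speed naturally drops considerably. PyTorch's asynchronous execution between its front end and back end can be used to optimize speed; for details, see: Model Parallel Best Practices (PyTorch)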
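One common way to exploit this asynchrony, assumed here rather than taken from this article, is to cut each batch into smaller pieces so that GPU 1 can work on one piece while GPU 0 already starts the next. The sketch below is an idealized timing model only: it assumes every stage takes the same time, that the time scales linearly with the piece size, and that transfers are free. The function name `ideal_step_time` is hypothetical.

```python
def ideal_step_time(num_stages, num_splits, stage_time=1.0):
    """Idealized time for one batch to pass through a pipeline.

    stage_time is the time one stage needs for the whole batch; each of the
    num_splits pieces therefore costs stage_time / num_splits per stage.
    """
    per_piece = stage_time / num_splits
    # The last piece leaves the last stage after (num_splits + num_stages - 1) slots
    return (num_splits + num_stages - 1) * per_piece


for splits in (1, 2, 4, 8):
    print(f"{splits} split(s): {ideal_step_time(2, splits):.3f} time units")
```

With a single split the two GPUs simply take turns, which is the situation in the nvidia-smi output above where one card shows 0% utilization. Real gains are smaller than this model suggests because small pieces use the GPU less efficiently.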

6. Finally
