
PyTorch Bug Notes (2)


A bug hit while using PyTorch's torch.nn.DataParallel for multi-GPU training

Problem description:

Besides its original parameters, my network model defines an extra set of custom parameters that also need to be trained. When I trained on multiple GPUs with torch.nn.DataParallel, I got the following error:

Traceback (most recent call last):
  File "train_search.py", line 741, in <module>
    architecture.main()
  File "train_search.py", line 358, in main
    train_acc,loss, error_loss, resource_loss, trainable_filter_number,model_performance = self.train(self.all_epochs, logging)
  File "train_search.py", line 527, in train
    logits, model_property, _ = self.model(input)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dhb/jupyter notebook/Project/Project_11_DCPS/model_search.py", line 377, in forward
    weight = self._get_weights(alpha)
  File "/home/dhb/jupyter notebook/Project/Project_11_DCPS/model_search.py", line 585, in _get_weights
    return self._get_categ_mask_TGF_coding2(log_alpha)
  File "/home/dhb/jupyter notebook/Project/Project_11_DCPS/model_search.py", line 563, in _get_categ_mask_TGF_coding2
    (1-TG_alpha[0])*(1-TG_alpha[1])*(1-TG_alpha[2])
RuntimeError: CUDA error: an illegal memory access was encountered

After some digging, I found the cause: the input is copied and scattered to the different GPUs by torch.nn.DataParallel, but the parameters I defined stay on GPU 0 the whole time.
I had defined those parameters as follows:

self.alpha_parameter = [cell.weight, cell2.weight, ..., celln.weight]

self.alpha_parameter is a plain Python list, and each of its elements is an nn.Parameter.

I use self.alpha_parameter inside the forward function, which means that when torch.nn.DataParallel scatters the input to the different GPUs, self.alpha_parameter must be copied to those GPUs as well. The problem is exactly that self.alpha_parameter is never copied to the other GPUs.
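For illustration, here is a minimal sketch of the problematic pattern. Everything below (BuggyModel, the cell sizes, the gate computation) is a hypothetical stand-in, not the actual model from the traceback, and it assumes at least two visible GPUs:

import torch
import torch.nn as nn

class BuggyModel(nn.Module):
    def __init__(self, n_cells=3):
        super().__init__()
        self.cells = nn.ModuleList(nn.Linear(8, 8) for _ in range(n_cells))
        # A plain Python list referencing the cells' weights. nn.Module does
        # not track this attribute, so every replica's list keeps pointing
        # at the GPU-0 tensors.
        self.alpha_parameter = [cell.weight for cell in self.cells]

    def forward(self, x):
        # On replica k, x and self.cells live on cuda:k, but the list
        # elements are still on cuda:0 -- a cross-device op that on older
        # PyTorch releases can surface as an illegal memory access.
        gate = torch.sigmoid(self.alpha_parameter[0]).mean()
        return self.cells[0](x) * gate

model = nn.DataParallel(BuggyModel().cuda())
out = model(torch.randn(16, 8).cuda())  # fails on replicas other than 0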

Solution:

To solve this you need to understand how torch.nn.DataParallel works; the blog post in reference [1] below covers it in detail. In short, DataParallel.forward() scatters the input batch across the devices, replicates the module onto each device, runs the replicas in parallel, and gathers the outputs back onto the source device. The replicate step only copies state that the nn.Module machinery knows about: registered parameters, buffers, and submodules. Ordinary Python attributes are shallow-copied, so a plain list keeps pointing at the tensors on the original GPU.
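This can be verified directly with torch.nn.parallel.replicate, the helper that DataParallel itself calls (used here purely for inspection; this continues the hypothetical BuggyModel above and assumes at least two GPUs):

from torch.nn.parallel import replicate

model = BuggyModel().cuda()
replicas = replicate(model, [0, 1])

# Registered parameters were copied to the target device...
print(replicas[1].cells[0].weight.device)     # cuda:1
# ...but the plain list still points at the originals on GPU 0.
print(replicas[1].alpha_parameter[0].device)  # cuda:0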

As described above, parameters stored in a plain list are not recognized by torch.nn.DataParallel and so are never copied to the other GPUs. The fix is to store them in an nn.ParameterList() instead: parameters registered that way are recognized by torch.nn.DataParallel and replicated to each GPU.
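Staying with the hypothetical example above, the fix is a one-line change: register the same tensors through nn.ParameterList so the replicate step sees them.

class FixedModel(nn.Module):
    def __init__(self, n_cells=3):
        super().__init__()
        self.cells = nn.ModuleList(nn.Linear(8, 8) for _ in range(n_cells))
        # nn.ParameterList registers every element with the module, so
        # DataParallel replicates them to each GPU along with the input.
        self.alpha_parameter = nn.ParameterList(cell.weight for cell in self.cells)

    def forward(self, x):
        gate = torch.sigmoid(self.alpha_parameter[0]).mean()
        return self.cells[0](x) * gate

model = nn.DataParallel(FixedModel().cuda())
out = model(torch.randn(16, 8).cuda())  # now runs on all visible GPUs

One caveat: some later PyTorch releases (roughly the 1.5 to 1.11 range) warn that nn.ParameterList is not supported under DataParallel and replicas may see an empty list, so it is worth checking the behavior on your version.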

Summary:

I am still not familiar enough with PyTorch's internals; more studying to do!

References:

  1. Fixing a bug when training on multiple GPUs with PyTorch's torch.nn.DataParallel: the model (parameters) and the data are not on the same device