PyTorch Bug Notes (2)
A bug encountered when using PyTorch's torch.nn.DataParallel for multi-GPU training
Problem description:
My network model defines, in addition to its original parameters, an extra set of custom parameters that also need to be trained. When I ran multi-GPU training with torch.nn.DataParallel, I hit the following error:
Traceback (most recent call last):
  File "train_search.py", line 741, in <module>
    architecture.main()
  File "train_search.py", line 358, in main
    train_acc,loss, error_loss, resource_loss, trainable_filter_number,model_performance = self.train(self.all_epochs, logging)
  File "train_search.py", line 527, in train
    logits, model_property, _ = self.model(input)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 152, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 162, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 85, in parallel_apply
    output.reraise()
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/_utils.py", line 385, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in replica 0 on device 0.
Original Traceback (most recent call last):
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
  File "/home/dhb/jupyter notebook/distiller/env/lib/python3.7/site-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/dhb/jupyter notebook/Project/Project_11_DCPS/model_search.py", line 377, in forward
    weight = self._get_weights(alpha)
  File "/home/dhb/jupyter notebook/Project/Project_11_DCPS/model_search.py", line 585, in _get_weights
    return self._get_categ_mask_TGF_coding2(log_alpha)
  File "/home/dhb/jupyter notebook/Project/Project_11_DCPS/model_search.py", line 563, in _get_categ_mask_TGF_coding2
    (1-TG_alpha[0])*(1-TG_alpha[1])*(1-TG_alpha[2])
RuntimeError: CUDA error: an illegal memory access was encountered
After some digging, I found the cause: once the input passes through torch.nn.DataParallel, it is split and copied onto different GPUs, but the extra parameters I defined always stayed on GPU 0.
I had defined those parameters in the following form:
self.alpha_parameter = [cell.weight,cell2.weight,...,celln.weight]
self.alpha_parameter is a plain Python list, and each of its elements is an nn.Parameter.
I access self.alpha_parameter inside the forward function, which means that when torch.nn.DataParallel scatters the input across the GPUs, self.alpha_parameter must be replicated onto those GPUs as well. The problem here is precisely that self.alpha_parameter was never copied to the other GPUs, as the sketch below illustrates.
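Here is a minimal sketch of the problematic pattern. SearchNet, cells, and the tensor shapes are hypothetical, chosen only to mirror the structure described above:

import torch
import torch.nn as nn

class SearchNet(nn.Module):  # hypothetical stand-in for the model in the traceback
    def __init__(self, num_cells=3):
        super().__init__()
        self.cells = nn.ModuleList(nn.Linear(8, 8) for _ in range(num_cells))
        # BUG: a plain Python list is invisible to nn.Module's parameter
        # registration, so DataParallel's replicate() never rebuilds it; in
        # every replica this list still points at the tensors on GPU 0.
        self.alpha_parameter = [cell.weight for cell in self.cells]

    def forward(self, x):
        # On replica k, x and self.cells live on cuda:k, but every element of
        # self.alpha_parameter is still on cuda:0, so this mixes devices
        # (which the PyTorch version above reported as an illegal memory access).
        alpha = sum(w.sum() for w in self.alpha_parameter)
        return self.cells[0](x) * alpha

# model = nn.DataParallel(SearchNet().cuda())
# model(torch.randn(16, 8).cuda())  # fails once replicas run on GPU 1 and beyond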
Solution:
To solve this, you need to understand how torch.nn.DataParallel replicates a model; an introduction can be found in this blog post. The key point is that parameters stored in an ordinary Python list are not recognized by torch.nn.DataParallel and are therefore never copied to the other GPUs. Storing them in an nn.ParameterList() instead registers them with the module, so torch.nn.DataParallel recognizes them and replicates them onto every GPU.
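Here is a minimal sketch of the fix, reusing the hypothetical names from the sketch above:

import torch
import torch.nn as nn

class SearchNet(nn.Module):  # same hypothetical model as above
    def __init__(self, num_cells=3):
        super().__init__()
        self.cells = nn.ModuleList(nn.Linear(8, 8) for _ in range(num_cells))
        # FIX: nn.ParameterList registers the parameters on the module, so
        # DataParallel's replicate() rebuilds the container on each device
        # together with the rest of the model.
        self.alpha_parameter = nn.ParameterList([cell.weight for cell in self.cells])

    def forward(self, x):
        alpha = sum(w.sum() for w in self.alpha_parameter)  # replica-local copies
        return self.cells[0](x) * alpha

model = nn.DataParallel(SearchNet().cuda())
out = model(torch.randn(16, 8).cuda())  # forward now succeeds on all replicas

The difference is that nn.ParameterList is itself a module, so each replica's forward sees device-local copies of the parameters. One caveat: this matches the PyTorch version in the traceback (roughly the 1.3 era); some newer releases warn that nn.ParameterList is not fully supported under DataParallel, in which case registering each parameter individually with register_parameter is an alternative.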
Summary:
I am still not familiar enough with PyTorch's internals; more studying to do!
References: