2021-02-05

pytorch-notices

PyTorch pitfalls I have run into

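How .cuda() differs for data and models

nn.Module.cuda() and Tensor.cuda() behave differently. For both models and data, cuda() moves memory from the CPU to the GPU, but the effect on the object you call it on is not the same.
For nn.Module:
model = model.cuda() and model.cuda() have the same effect: the model itself is moved, in place.
For Tensor:
Calling tensor.cuda() only returns a copy of the tensor in GPU memory; it does not change the tensor itself. You therefore have to reassign it: tensor = tensor.cuda().
Example: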

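import torch

model = create_a_model()            # placeholder for any nn.Module
tensor = torch.ones([3, 3, 3, 3])
model.cuda()                        # moves the model's parameters in place
tensor.cuda()                       # returns a GPU copy that is discarded here
model(tensor)                       # raises an error: tensor is still on the CPU
tensor = tensor.cuda()              # rebind the name to the GPU copy
model(tensor)                       # runs normally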
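
For a version that also runs on a machine without a GPU, here is a minimal sketch using .to(device), which follows the same rule: the module is moved in place, while the tensor has to be reassigned. The nn.Linear layer and the shapes are stand-ins I picked for illustration, not anything from a real project.

```python
import torch
import torch.nn as nn

# Fall back to the CPU so the sketch still runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(4, 2)        # hypothetical stand-in for a real model
tensor = torch.ones(3, 4)

model.to(device)               # nn.Module.to() moves the parameters in place
tensor = tensor.to(device)     # Tensor.to() returns a tensor, so rebind the name

param_device = next(model.parameters()).device
output = model(tensor)         # works: parameters and input share a device
print(param_device, tensor.device, tuple(output.shape))
```

Writing the reassignment for both (model = model.to(device), tensor = tensor.to(device)) is harmless and saves you from remembering which of the two needs it.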

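Using torch.Tensor.detach()

The official documentation describes detach() as follows:
Returns a new Tensor, detached from the current graph. The result will never require gradient.
Suppose there are two models, A and B, where A's output is fed to B as input, but only B should be trained. You can write:
input_B = output_A.detach()
This cuts gradient propagation between the two computation graphs, so backpropagation through B never reaches A.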
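
A small sketch to check this behaviour. model_A and model_B are two toy nn.Linear layers I made up for the demonstration; after backward(), only B's parameters should have gradients.

```python
import torch
import torch.nn as nn

model_A = nn.Linear(4, 3)      # hypothetical "model A"
model_B = nn.Linear(3, 1)      # hypothetical "model B"
x = torch.randn(8, 4)

output_A = model_A(x)
input_B = output_A.detach()    # same values, but cut off from A's graph

loss = model_B(input_B).sum()
loss.backward()                # gradients flow into B only

grads_A = [p.grad for p in model_A.parameters()]
grads_B = [p.grad for p in model_B.parameters()]
print("A has grads:", any(g is not None for g in grads_A))
print("B has grads:", all(g is not None for g in grads_B))
```

Note that detach() shares storage with the original tensor, so an in-place modification of input_B would also change output_A.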

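Using nn.DataParallel: data not on the same GPU

model = nn.DataParallel(model)
Problem: while training an optical-flow-based detection experiment, a colleague once hit a "data not in same cuda" error. During code review we printed the tensors at each node and found that the data was, surprisingly, not all on the same GPU.
Fix: the solution we settled on was to call .cuda() on all the data in one place, right after it is produced, so that everything ends up on the same CUDA device.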
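
One way to do that "move everything in one place" step is a small helper applied to each batch as soon as it comes out of the data pipeline. This is only a sketch: move_to_device and the batch keys ("frames", "flow", "name") are hypothetical names of mine, not the code from that experiment.

```python
import torch

def move_to_device(batch, device):
    """Recursively move every tensor in a nested batch to one device."""
    if torch.is_tensor(batch):
        return batch.to(device)
    if isinstance(batch, dict):
        return {k: move_to_device(v, device) for k, v in batch.items()}
    if isinstance(batch, (list, tuple)):
        moved = [move_to_device(v, device) for v in batch]
        return moved if isinstance(batch, list) else tuple(moved)
    return batch               # leave non-tensor values (str, int, ...) alone

# Fall back to the CPU so the sketch still runs without a GPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

batch = {
    "frames": torch.zeros(2, 3),
    "flow": [torch.ones(2), torch.ones(2)],
    "name": "sample-0",
}
batch = move_to_device(batch, device)
devices = {batch["frames"].device, batch["flow"][0].device, batch["flow"][1].device}
print(devices)
```

With nn.DataParallel the inputs are normally placed on the first device in device_ids, and the wrapper scatters them to the other GPUs itself.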

Pitfalls when loading a PyTorch model:

If the model was trained on multiple GPUs with nn.DataParallel, remember to go through .module when loading it. The code looks like this:

def get_model(self):
    # nn.DataParallel keeps the original model in its .module attribute
    if self.nGPU == 1:
        return self.model
    else:
        return self.model.module
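
The reason .module matters shows up in the state_dict keys: wrapping a model in nn.DataParallel prefixes every key with "module.". A minimal sketch with a toy nn.Linear layer (my own stand-in, not a real checkpoint):

```python
import torch
import torch.nn as nn

plain = nn.Linear(4, 2)                 # hypothetical stand-in for a real model
wrapped = nn.DataParallel(plain)        # also constructs fine on a CPU-only machine

wrapped_keys = list(wrapped.state_dict().keys())
unwrapped_keys = list(wrapped.module.state_dict().keys())
print(wrapped_keys)                     # keys carry the "module." prefix
print(unwrapped_keys)                   # keys of the underlying model

# Going through .module gives weights that a plain model can load directly.
fresh = nn.Linear(4, 2)
fresh.load_state_dict(wrapped.module.state_dict())
```

Saving wrapped.module.state_dict() instead of wrapped.state_dict() means the checkpoint can later be loaded with or without DataParallel.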

References

Title:pytorch-notices

Author:zzm

Created:2021-02-05, 17:44:44

Updated:2024-11-13, 16:12:24

Full URL:http://blog.zhengmingz.top/2021/02/05/AI-pytorch-notices/

License: "CC BY-NC-SA 4.0" Keep Link & Author if Distribute.