Skip to content

Multi GPU dataparallel crash #1779

@devroy73

Description

@devroy73

🐛 Bug

Model I am using (Bert, XLNet....): GPT2

Language I am using the model on (English, Chinese....): English

The problem arise when using:

  • the official example scripts: run_lm_finetuning
  • my own modified scripts: (give details)

The tasks I am working on is:

  • an official GLUE/SQUaD task: (give the name)
  • my own task or dataset: I am finetuning the GPT2 model with a dataset that I have used in the past to fine-tune the BERT model among others

To Reproduce

Steps to reproduce the behavior:

raceback (most recent call last):████████████████████████████████████████████████████████████████████████████████████████████████████████████▉| 7587/7588 [1:42:01<00:00, 1.26it/s]
File "run_lm_finetuning.py", line 551, in
main()
File "run_lm_finetuning.py", line 503, in main
global_step, tr_loss = train(args, train_dataset, model, tokenizer)
File "run_lm_finetuning.py", line 228, in train
outputs = model(inputs, masked_lm_labels=labels) if args.mlm else model(inputs, labels=labels)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in call
result = self.forward(*input, **kwargs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
return self.gather(outputs, self.output_device)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
return gather(outputs, output_device, dim=self.dim)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
res = gather_map(outputs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
return type(out)(map(gather_map, zip(*outputs)))
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
return Gather.apply(target_device, dim, *outputs)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 68, in forward
return comm.gather(inputs, ctx.dim, ctx.target_device)
File "/home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/cuda/comm.py", line 165, in gather
return torch._C._gather(tensors, dim, destination)
**RuntimeError: Gather got an input of invalid size: got [2, 2, 12, 1024, 64], but expected [2, 3, 12, 1024, 64] (gather at /opt/conda/conda-bld/pytorch_1565272279342/work/torch/csrc/cuda/comm.cpp:226)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x47 (0x7f47a13d5e37 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: torch::cuda::gather(c10::ArrayRefat::Tensor, long, c10::optional) + 0x3c7 (0x7f4720c61327 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch.so)
frame #2: + 0x5fa742 (0x7f47a420f742 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: + 0x1c8316 (0x7f47a3ddd316 in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

frame #14: THPFunction_apply(_object
, _object
) + 0x98f (0x7f47a40024bf in /home/dev/deeplearning/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/lib/libtorch_python.so)

Expected behavior

it crashes at the last batch all the time. I expect it to move to the next epoch

Environment

  • OS:
  • Python version: 3.6
  • PyTorch version: 1.2
  • PyTorch Transformers version (or branch):
  • Using GPU ? 4 Quadro 8000
  • Distributed of parallel setup ? parallel
  • Any other relevant information:

Additional context

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions