When training a model across multiple GPUs with parallel_mode="data_parallel", a "CUDA out of memory" error on one GPU triggers an exception in onmt.utils.distributed.all_reduce_and_rescale_tensors, at this line:
buffer_t = (
    tensors[0].new(math.ceil(buffer_size / tensors[0].element_size())).zero_()
)
The issue arises because, after an "OOM" error, no gradients are computed on that GPU, so the "tensors" argument ends up as an empty list and indexing tensors[0] fails.
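For illustration, here is a minimal standalone sketch of that failure mode (the buffer_size value is assumed, and this is not the actual OpenNMT-py call site): with an empty tensors list, the very first indexing operation raises an IndexError before any all-reduce can start.

import math

tensors = []           # no gradients were produced on the rank that hit the OOM
buffer_size = 8388608  # all-reduce buffer size in bytes (value assumed)

try:
    buffer_t = tensors[0].new(
        math.ceil(buffer_size / tensors[0].element_size())
    ).zero_()
except IndexError as err:
    print(f"all_reduce_and_rescale_tensors would fail here: {err}")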
The solution is to provide a list of zero-filled tensors instead of the empty list. This lets torch.distributed.all_reduce complete on every rank rather than hanging indefinitely while waiting for the rank that hit the OOM. The drawback, however, is that the accumulated gradients are still divided by the total number of GPUs, even though that rank contributed only zeros.
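A minimal sketch of that workaround, assuming a hypothetical helper (grads_or_zeros, oom_occurred, and world_size are illustrative names, not part of OpenNMT-py): when a rank catches an OOM, it contributes zero tensors with the same shapes as the model parameters, so the collective runs on every rank instead of blocking.

import torch

def grads_or_zeros(model, oom_occurred):
    """Return this rank's gradients, or zero tensors of the same shapes
    when an OOM prevented the backward pass from producing any."""
    if not oom_occurred:
        return [p.grad.data for p in model.parameters() if p.grad is not None]
    # Assumption: zeros with the parameters' shapes/dtypes/devices keep the
    # all-reduce consistent across ranks; this rank simply adds nothing to
    # the summed gradients.
    return [torch.zeros_like(p.data) for p in model.parameters()]

# Usage (illustrative): replace the empty list before the distributed step.
# grads = grads_or_zeros(model, oom_occurred)
# onmt.utils.distributed.all_reduce_and_rescale_tensors(grads, float(world_size))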