When training a model across multiple GPUs with parallel_mode="data_parallel", a "CUDA out of memory" error on one GPU triggers an exception in onmt.utils.distributed.all_reduce_and_rescale_tensors, at this line:
buffer_t = (
    tensors[0].new(math.ceil(buffer_size / tensors[0].element_size())).zero_()
)
The issue arises because, after an "OOM" error, no gradients are computed on that GPU, so the "tensors" argument ends up as an empty list and indexing tensors[0] fails.
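For illustration, here is a minimal standalone sketch of that failure mode (the buffer_size value is assumed, and this is not the actual OpenNMT-py call site): with an empty tensors list, the very first indexing operation raises an IndexError before any all-reduce can start.

import math

tensors = []           # no gradients were produced on the rank that hit the OOM
buffer_size = 8388608  # all-reduce buffer size in bytes (value assumed)

try:
    buffer_t = tensors[0].new(
        math.ceil(buffer_size / tensors[0].element_size())
    ).zero_()
except IndexError as err:
    print(f"all_reduce_and_rescale_tensors would fail here: {err}")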
The solution is to provide a list of zero-filled tensors instead of the empty list. This lets torch.distributed.all_reduce complete on every rank rather than hanging indefinitely while waiting for the rank that hit the OOM. The drawback, however, is that the accumulated gradients are still divided by the total number of GPUs, even though that rank contributed only zeros.
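A minimal sketch of that workaround, assuming a hypothetical helper (grads_or_zeros, oom_occurred, and world_size are illustrative names, not part of OpenNMT-py): when a rank catches an OOM, it contributes zero tensors with the same shapes as the model parameters, so the collective runs on every rank instead of blocking.

import torch

def grads_or_zeros(model, oom_occurred):
    """Return this rank's gradients, or zero tensors of the same shapes
    when an OOM prevented the backward pass from producing any."""
    if not oom_occurred:
        return [p.grad.data for p in model.parameters() if p.grad is not None]
    # Assumption: zeros with the parameters' shapes/dtypes/devices keep the
    # all-reduce consistent across ranks; this rank simply adds nothing to
    # the summed gradients.
    return [torch.zeros_like(p.data) for p in model.parameters()]

# Usage (illustrative): replace the empty list before the distributed step.
# grads = grads_or_zeros(model, oom_occurred)
# onmt.utils.distributed.all_reduce_and_rescale_tensors(grads, float(world_size))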