Hi,
When running torch-based model training, it is easy to hit a "No space left on device" error, caused by the small default shared memory size (64M) that Docker gives a container when it starts.
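For reference, a minimal sketch of the kind of loader that runs into this limit (the dataset and tensor sizes below are hypothetical, chosen only to make the shared-memory usage obvious):

import torch
from torch.utils.data import Dataset, DataLoader

class BigItemDataset(Dataset):
    # Hypothetical dataset: every item is a ~64 MB float32 tensor.
    def __len__(self):
        return 32

    def __getitem__(self, idx):
        return torch.zeros(16 * 1024 * 1024)  # 16M floats * 4 bytes ≈ 64 MB

if __name__ == "__main__":
    # Worker processes hand CPU tensors back to the parent through
    # shared-memory files under /dev/shm (see _share_fd_cpu_ in the traceback),
    # so a single batch here already exceeds Docker's default 64M limit.
    loader = DataLoader(BigItemDataset(), batch_size=4, num_workers=2)
    for batch in loader:
        pass

With the default 64M /dev/shm, a run like this fails with the error shown below.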
RuntimeError: DataLoader worker (pid 729) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
Traceback (most recent call last):
File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
obj = _ForkingPickler.dumps(obj)
File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 618, in reduce_storage
fd, size = storage._share_fd_cpu_()
File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/site-packages/torch/storage.py", line 451, in wrapper
return fn(self, *args, **kwargs)
File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/site-packages/torch/storage.py", line 526, in _share_fd_cpu_
return super()._share_fd_cpu_(*args, **kwargs)
RuntimeError: unable to write to file </torch_739_144252761_4>: No space left on device (28)

In the Kubernetes Deployments for Jupyter and Jobs, you can mount a larger emptyDir volume at /dev/shm. For example:
volumes:
- name: dshm
  emptyDir:
    medium: "Memory"
    sizeLimit: "32Gi"
.......
- name: dshm
  mountPath: /dev/shm
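For context, this is how the two pieces fit together in a generic Deployment spec; the names and image below are placeholders, not the actual Hopsworks manifests:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter                           # placeholder
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: notebook                    # placeholder container
          image: example/notebook:latest    # placeholder image
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm           # overrides the default 64M shm
      volumes:
        - name: dshm
          emptyDir:
            medium: "Memory"                # tmpfs backed by node RAM
            sizeLimit: "32Gi"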
However, it would be better to make the sizeLimit configurable for end users when they configure the Jupyter or Job execution environment. It would also be helpful to expose it as a configurable option in the project settings.
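One possible shape for such a setting, sketched only to illustrate the request (the key names are hypothetical, not an existing Hopsworks option):

jupyter:
  shmSizeLimit: "8Gi"   # hypothetical key, passed through to the emptyDir sizeLimit
jobs:
  shmSizeLimit: "8Gi"   # hypothetical key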
Thanks!