Make /dev/shm configurable #1557

@wenijinew

Hi,

When running torch-based model training, it's easy to hit a "No space left on device" error. It is caused by the small default shared memory size (64M) that Docker assigns to /dev/shm when it starts a container.

RuntimeError: DataLoader worker (pid 729) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
Traceback (most recent call last):
  File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 618, in reduce_storage
    fd, size = storage._share_fd_cpu_()
  File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/site-packages/torch/storage.py", line 451, in wrapper
    return fn(self, *args, **kwargs)
  File "/srv/hops/anaconda/envs/hopsworks_environment/lib/python3.10/site-packages/torch/storage.py", line 526, in _share_fd_cpu_
    return super()._share_fd_cpu_(*args, **kwargs)
RuntimeError: unable to write to file </torch_739_144252761_4>: No space left on device (28)
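
To confirm that the shared memory limit is what is being hit, the size of /dev/shm can be checked from inside the container before the DataLoader starts. This is a minimal sketch using only the Python standard library; the helper names `shm_usage` and `is_shm_too_small` and the 1 GiB threshold are illustrative assumptions, not part of Hopsworks or PyTorch.

```python
import os
import shutil


def shm_usage(path: str = "/dev/shm") -> dict:
    """Return total/used/free bytes for the filesystem mounted at `path`."""
    usage = shutil.disk_usage(path)
    return {"total": usage.total, "used": usage.used, "free": usage.free}


def is_shm_too_small(path: str = "/dev/shm", min_bytes: int = 1 << 30) -> bool:
    """True if the filesystem at `path` is smaller than `min_bytes`.

    The 1 GiB default is an arbitrary illustrative threshold.
    """
    return shm_usage(path)["total"] < min_bytes


if __name__ == "__main__":
    if os.path.isdir("/dev/shm"):
        info = shm_usage()
        print(
            f"/dev/shm total={info['total'] / 2**20:.0f} MiB, "
            f"free={info['free'] / 2**20:.0f} MiB"
        )
        if is_shm_too_small():
            print("Shared memory is small; DataLoader workers may hit a bus error.")
```

With the Docker default, the reported total should be about 64 MiB.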

In the k8s Deployment for Jupyter and Job, you can mount a larger memory-backed emptyDir volume at /dev/shm.
For example:

      # pod spec: memory-backed volume to use as shared memory
      volumes:
        - name: dshm
          emptyDir:
            medium: "Memory"
            sizeLimit: "32Gi"
      .......
            # container volume mount
            - name: dshm
              mountPath: /dev/shm

However, it would be better to make the sizeLimit configurable, so that end users can set it when they configure a Jupyter or Job execution environment. It would also be helpful to expose it as an option in the project settings.
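
One way to picture the requested option: the platform takes a user-supplied size and renders it into the volume and mount shown above. The sketch below is purely illustrative. `build_shm_volume`, its parameters, and the simplified quantity check are hypothetical and not part of Hopsworks; the dictionaries only mirror the Kubernetes `volumes` and `volumeMounts` entries from the example.

```python
import re

# Simplified check for quantities such as "512Mi" or "32Gi" (assumption:
# only binary suffixes are accepted here; Kubernetes itself allows more forms).
_QUANTITY = re.compile(r"^[1-9][0-9]*(Ki|Mi|Gi|Ti)$")


def build_shm_volume(size_limit: str = "64Mi", name: str = "dshm"):
    """Build the volume and volumeMount entries for a /dev/shm emptyDir.

    Hypothetical helper: `size_limit` stands in for the value a user would
    pick in the Jupyter/Job configuration or in the project settings.
    """
    if not _QUANTITY.match(size_limit):
        raise ValueError(f"invalid size limit: {size_limit!r}")
    volume = {
        "name": name,
        "emptyDir": {"medium": "Memory", "sizeLimit": size_limit},
    }
    mount = {"name": name, "mountPath": "/dev/shm"}
    return volume, mount


if __name__ == "__main__":
    volume, mount = build_shm_volume("32Gi")
    print(volume)
    print(mount)
```

Validating the value before it reaches the pod spec would keep a typo in the setting from producing a pod that fails to schedule.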

Thanks!
