
GPU provisioning respected by SkyPilot job, but not isolated in SSH #8116

@vim-hjk

Environment

Server infra: Minikube Kubernetes
SkyPilot component tested: API Server deployment + Job execution
GPU hardware on node: NVIDIA RTX 4090 × 2

Description

When testing with the following SkyPilot task config, the job behaves as expected at the scheduling level:

  • Job runs on exactly 1 GPU
  • Kubernetes Pod resource limit is also set to 1 GPU (nvidia.com/gpu: 1)

Task Config Used (Quickstart example):

resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  infra: aws
  accelerators: RTX4090:1

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "Running setup."

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "Hello, SkyPilot!"
  conda env list
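
For reference, this is roughly how the task is launched and how the Pod-level GPU limit can be confirmed. The cluster and pod names below are placeholders; substitute the actual ones reported by sky status and kubectl get pods:

# Launch the task on a named cluster (cluster name is a placeholder).
sky launch -c mycluster task.yaml

# Confirm the GPU limit on the provisioned Pod (pod name is a placeholder).
kubectl get pods
kubectl describe pod <skypilot-pod-name> | grep -A 3 'Limits'
# The limits should include nvidia.com/gpu: 1, matching the accelerators request.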

Problem

Although job scheduling respects the GPU provisioning limit, isolation is not enforced at the user session or container runtime level:

  • After connecting to the SkyPilot cluster via SSH, running nvidia-smi shows both GPUs (GPU 0 and GPU 1).
  • When running a Python process inside the provisioned Pod container over that SSH session, DDP (Distributed Data Parallel) and other multi-GPU workloads end up using both GPUs, not just the single GPU provisioned and limited by SkyPilot; see the checks sketched after this list. (Is this related to the NVIDIA GPU Operator?)
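
A minimal sketch of those checks, assuming the placeholder cluster name from the launch command above, the standard NVIDIA container-runtime environment variables, and that PyTorch is installed in the Pod's environment:

# SkyPilot sets up an SSH alias for the cluster; the name is a placeholder.
ssh mycluster

# Inside the Pod container:
nvidia-smi -L                     # lists both GPU 0 and GPU 1
echo "$NVIDIA_VISIBLE_DEVICES"    # device visibility granted by the container runtime
echo "$CUDA_VISIBLE_DEVICES"      # not pinned to a single GPU in this SSH session
python -c "import torch; print(torch.cuda.device_count())"   # reports 2, so DDP spans both GPUs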

Has anyone encountered the same or a similar issue during local Kubernetes testing (e.g., Minikube)?
