Environment
Server infra: Minikube Kubernetes
SkyPilot component tested: API Server deployment + Job execution
GPU hardware on node: NVIDIA RTX 4090 × 2
Description
When testing with the following SkyPilot task config, provisioning and scheduling work as expected:
- The job runs on exactly 1 GPU
- The Kubernetes Pod resource limit is also set to 1 GPU (nvidia.com/gpu: 1); see the sketch below for reading this back from the cluster
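One way to confirm the Pod-level limit is to read it back from the cluster. Below is a minimal sketch assuming the kubernetes Python client and a local kubeconfig pointing at the Minikube cluster; the Pod name and namespace are placeholders, not the actual SkyPilot-generated names.

from kubernetes import client, config

# Assumes a local kubeconfig that points at the Minikube cluster.
config.load_kube_config()
v1 = client.CoreV1Api()

# Placeholder identifiers; substitute the actual SkyPilot-provisioned Pod.
POD_NAME = "sky-example-head"
NAMESPACE = "default"

pod = v1.read_namespaced_pod(POD_NAME, NAMESPACE)
for container in pod.spec.containers:
    limits = container.resources.limits or {}
    print(container.name, "-> nvidia.com/gpu limit:", limits.get("nvidia.com/gpu"))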
Task Config Used (Quickstart example):
resources:
  # Optional; if left out, automatically pick the cheapest cloud.
  infra: aws
  accelerators: RTX4090:1

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: .

# Typical use: pip install -r requirements.txt
# Invoked under the workdir (i.e., can use its files).
setup: |
  echo "Running setup."

# Typical use: make use of resources, such as running training.
# Invoked under the workdir (i.e., can use its files).
run: |
  echo "Hello, SkyPilot!"
  conda env list

Problem
Although job scheduling respects the GPU provisioning limit, isolation is not enforced at the user session or container runtime level:
- After connecting to the SkyPilot cluster via SSH, running nvidia-smi shows both GPUs (GPU0, GPU1).
- When running a Python process inside the provisioned Pod container over that SSH session, DDP (Distributed Data Parallel) and other multi-GPU Python workloads end up using both GPUs, not the single GPU provisioned and limited by SkyPilot. (Is this related to the NVIDIA GPU Operator?) A small diagnostic sketch is included below.
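For reference, a minimal diagnostic that can be run inside the Pod over the same SSH session to show what the container actually exposes. It assumes PyTorch is available in the active environment, and the comment about visibility variables reflects the usual CUDA/Kubernetes convention rather than anything SkyPilot-specific.

import os

import torch

# CUDA_VISIBLE_DEVICES / NVIDIA_VISIBLE_DEVICES are the usual mechanisms for
# restricting which GPUs a process or container can see. If neither is set
# inside the Pod, multi-GPU frameworks such as DDP will enumerate and use
# every GPU the runtime exposes.
for var in ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES"):
    print(f"{var} = {os.environ.get(var, '<not set>')}")

# With the nvidia.com/gpu: 1 limit enforced at the runtime level, this should
# report 1; in the situation described above it reports 2.
count = torch.cuda.device_count()
print("torch.cuda.device_count() =", count)
for i in range(count):
    print(f"  GPU {i}: {torch.cuda.get_device_name(i)}")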
Has anyone encountered the same or a similar issue during local Kubernetes testing (like Minikube)?