Description
What happened + What you expected to happen
Hello, I'm trying to deploy a RayService on AKS with multiple models deployed through LLMConfig.
No matter what resource allocation I set in ray_actor_options, when I deploy the service Ray always asks for more than 1 GPU, so resources are never allocated to my deployments.
I tried different values in ray_actor_options (it was 0.9 for the LLM model the first time I hit this issue, then I reduced it to 0.5, but nothing changed). No matter what fraction of the GPU I give to the two deployments, there is always a placement group whose bundles sum to more than 1 GPU, so everything stays stuck.
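For context, this is the behavior I would expect; a minimal sketch of my own (not part of the actual reproduction), assuming plain Ray actors with fractional `num_gpus` and no placement groups:

```python
# Illustration only, not the actual reproduction: two actors whose fractional
# GPU requests sum to 0.6 should co-locate on a single 1-GPU node.
import ray

ray.init()  # assumes a cluster with at least 1 GPU available


@ray.remote(num_gpus=0.5, num_cpus=1)
class ChatActor:
    def ping(self):
        return "chat ok"


@ray.remote(num_gpus=0.1, num_cpus=1)
class EmbedActor:
    def ping(self):
        return "embed ok"


chat, embed = ChatActor.remote(), EmbedActor.remote()
print(ray.get([chat.ping.remote(), embed.ping.remote()]))  # both should schedule
```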
Versions / Dependencies
stock rayproject/ray-llm:2.46.0-py311-cu124
Ray 2.46.0
py311
cu124
Differences in libraries from the original image:
"vllm>=0.8.5" "transformers>=4.56.0"
Hardware:
A100 node pool on AKS (Azure)
Reproduction script
This is my Python script:
```python
from ray import serve
from ray.serve.llm import LLMConfig, build_openai_app

# =========================
# Qwen3 8B Chat Model
# =========================
chat_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-8B",
    },
    engine_kwargs={
        "max_model_len": 8000,  # full long context
        "dtype": "bfloat16",
        "gpu_memory_utilization": 0.5,  # use 50% of A100 GPU memory
        "trust_remote_code": True,
        "enable_auto_tool_choice": True,  # enables automatic tool usage
        "tool_call_parser": "hermes",  # for function/tool-call reasoning
    },
    deployment_config={
        "ray_actor_options": {
            "num_gpus": 0.5,
            "num_cpus": 12,
        },
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Qwen3 0.6B Embedding Model
# =========================
embedding_llm = LLMConfig(
    model_loading_config={
        "model_id": "Qwen/Qwen3-Embedding-0.6B",
    },
    engine_kwargs={
        "max_model_len": 1000,
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "task": "embed",
    },
    deployment_config={
        "ray_actor_options": {
            "num_gpus": 0.1,
            "num_cpus": 2,
        },
        "autoscaling_config": {
            "min_replicas": 1,
            "max_replicas": 1,
            "target_ongoing_requests": 64,
        },
        "max_ongoing_requests": 128,
    },
)

# =========================
# Build one OpenAI-compatible app
# =========================
llm_app = build_openai_app({
    "llm_configs": [chat_llm, embedding_llm]
})
```
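As a side note, I also look at the application and deployment states after `serve run`; a minimal sketch of how that can be queried, assuming the standard `serve.status()` API (not part of the reproduction itself):

```python
# Sketch: print application/deployment states; in my case the deployments
# never reach RUNNING because their placement groups stay pending.
from ray import serve

status = serve.status()
for app_name, app in status.applications.items():
    print(app_name, app.status)
    for dep_name, dep in app.deployments.items():
        print("  ", dep_name, dep.status, dep.message)
```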
This is the `ray status` output I get when I deploy the pod:
```
(base) ray@ray-qwen3-openai-llm-embed-6nq7t-head-thw9q:/serve_app$ ray status
======== Autoscaler status: 2025-10-30 15:53:43.266595 ========
Node status
---------------------------------------------------------------
Active:
 1 headgroup
Idle:
 1 gpu-group
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Total Usage:
 1.0/24.0 CPU
 0.0/1.0 GPU
 0B/37.31GiB memory
 0B/11.84GiB object_store_memory

From request_resources:
 (none)

Pending Demands:
 {'CPU': 2.0, 'GPU': 0.1}: 1+ pending tasks/actors (1+ using placement groups)
 {'CPU': 12.0, 'GPU': 0.5}: 1+ pending tasks/actors (1+ using placement groups)
 {'GPU': 1.1, 'CPU': 2.0} * 1 (PACK): 2+ pending placement groups
```
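To see which bundles those pending placement groups are requesting, something like the following can be run from inside the cluster; a rough sketch assuming `ray.util.placement_group_table()` (again, not part of the reproduction):

```python
# Sketch: list the placement groups and their per-bundle resource requests,
# to see which bundle sums exceed the single available GPU.
import ray
from ray.util import placement_group_table

ray.init(address="auto")  # attach to the running cluster

for pg_id, info in placement_group_table().items():
    print(pg_id, info["state"], info["strategy"], info["bundles"])
```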
My RayService YAML config file:
```yaml
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-qwen3-openai-llm-embed
spec:
  serveConfigV2: |
    applications:
      - name: qwen3
        import_path: serve_qwen3_openai_app:llm_app
        route_prefix: "/"
        deployments:
          # --- Deployment 1: The Chat/LLM Model ---
          - name: Qwen3-Chat
            # We explicitly define the resources needed for this deployment
            ray_actor_options:
              num_gpus: 0.6
              num_cpus: 12
          # --- Deployment 2: The Embedder Model ---
          - name: EmbeddingService
            num_replicas: 1
            ray_actor_options:
              num_gpus: 0.1
              num_cpus: 2
  rayClusterConfig:
    rayVersion: "2.46.0"
    headGroupSpec:
      rayStartParams:
        dashboard-host: '0.0.0.0'
      template:
        metadata:
          annotations:
            ray.io/disable-probes: "true"  # ✅ Prevent operator from overwriting probes
        spec:
          containers:
            - name: ray-head
              image: <container_registry>/ray-qwen3-llm-embed-openai:latest
              env:
                - name: PYTHONPATH
                  value: /serve_app
              command: ["/bin/bash", "-c"]
              args:
                - |
                  ray start --head --dashboard-host=0.0.0.0 --port=6379 && \
                  serve run serve_qwen3_openai_app:llm_app
              resources:
                limits:
                  cpu: 4
                  memory: 8Gi
              ports:
                - containerPort: 6379
                  name: gcs-server
                - containerPort: 8265  # Ray dashboard
                  name: dashboard
                - containerPort: 10001
                  name: client
                - containerPort: 8000
                  name: serve
              # Dummy probes (won’t be used if annotation disables them)
              livenessProbe:
                exec:
                  command: ["/bin/sh", "-c", "echo live"]
                initialDelaySeconds: 3600
                periodSeconds: 600
                timeoutSeconds: 5
                failureThreshold: 120
              readinessProbe:
                exec:
                  command: ["/bin/sh", "-c", "echo ready"]
                initialDelaySeconds: 3600
                periodSeconds: 600
                timeoutSeconds: 5
                failureThreshold: 120
    workerGroupSpecs:
      - groupName: gpu-group
        replicas: 1
        rayStartParams:
          num-gpus: "1"
          # resources: '{"accelerator_type:A100": 1}'
        template:
          metadata:
            annotations:
              ray.io/disable-probes: "true"  # ✅ Disable probes for worker too
          spec:
            tolerations:
              - key: "nvidia.com/gpu"
                operator: "Equal"
                value: "present"
                effect: "NoSchedule"
            containers:
              - name: ray-worker
                image: <container_registry>/ray-qwen3-llm-embed-openai:latest
                env:
                  - name: PYTHONPATH
                    value: /serve_app
                resources:
                  limits:
                    nvidia.com/gpu: "1"
                    cpu: 20
                    memory: 32Gi
                # Dummy probes (won’t be active due to annotation)
                livenessProbe:
                  exec:
                    command: ["/bin/sh", "-c", "echo live"]
                  initialDelaySeconds: 3600
                  periodSeconds: 600
                  timeoutSeconds: 5
                  failureThreshold: 120
                readinessProbe:
                  exec:
                    command: ["/bin/sh", "-c", "echo ready"]
                  initialDelaySeconds: 3600
                  periodSeconds: 600
                  timeoutSeconds: 5
                  failureThreshold: 120
```
Issue Severity
None