Support llm-d for high-performance distributed inference #8117

@ibpark-moreh

Description

Hi, SkyPilot team!

I was wondering if there are any plans to support llm-d for high-performance distributed inference.

llm-d is a high-performance distributed inference framework featuring Intelligent Inference Scheduling, Prefill/Decode (PD) Disaggregation, and Wide Expert-Parallelism (EP). I think combining SkyPilot’s multi-cloud capabilities with such a framework would be a huge win.

However, SkyPilot seems optimized for scaling inference frameworks like vLLM or SGLang. Adapting it to distributed inference frameworks like Dynamo or llm-d presents some structural challenges.

I’ve been thinking about how to tackle this and came up with two ideas. I'd love your thoughts on these.

1. Use sky serve
This approach involves deploying llm-d components (inference scheduler, prefill/decode workers) individually using sky serve; a rough sketch follows the list below.

  • Pros
    • Delegate resource management to SkyPilot and leverage its native autoscaling.
  • Concerns
    • Lack of Sidecar Support: Distributed inference often requires sidecars for traffic control. llm-d needs an Envoy proxy running alongside the worker. For example, a single decode pod consists of 2 containers (worker + proxy), but SkyPilot currently follows a “1 task = 1 container” rule.
    • Complex Networking: Components are interconnected in a distributed setup. For instance, the scheduler routes requests based on worker reports, and prefill workers need to transfer the KV cache to decode workers.
    • Manual Resources: Additional Kubernetes resources, such as the Gateway and InferencePool objects, would need to be created manually.
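
To make this concrete, here is a rough sketch of what option 1 could look like today, wrapping a single llm-d component (a decode worker) as its own SkyServe service. The setup/run commands, port, and readiness path are hypothetical placeholders rather than llm-d's real entrypoints, and the sketch deliberately sidesteps the sidecar problem: the Envoy proxy simply has no place to go under the current one-container model.

```python
# Sketch: one llm-d component (decode worker) as a standalone SkyServe service.
# The setup/run commands below are placeholders, not the real llm-d launch commands.
import subprocess
import tempfile

DECODE_WORKER_YAML = """\
service:
  readiness_probe: /health      # placeholder health endpoint
  replicas: 2                   # let SkyServe manage replicas for this component

resources:
  accelerators: H100:1
  ports: 8000

setup: |
  pip install vllm              # stand-in for installing the real llm-d worker

run: |
  # Hypothetical decode-worker entrypoint. The real llm-d worker, its
  # KV-transfer configuration, and the Envoy sidecar cannot be expressed
  # under SkyPilot's "1 task = 1 container" model.
  python -m vllm.entrypoints.openai.api_server --port 8000
"""


def serve_component(name: str, yaml_text: str) -> None:
    """Write the task YAML to a temp file and hand it to `sky serve up`."""
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        f.write(yaml_text)
        path = f.name
    subprocess.run(["sky", "serve", "up", path, "-n", name, "--yes"], check=True)


if __name__ == "__main__":
    # One service per component; the scheduler and prefill workers would need
    # their own YAMLs plus manual wiring (Gateway, InferencePool, KV transfer).
    serve_component("llmd-decode", DECODE_WORKER_YAML)
```

Even if each component comes up this way, the Envoy sidecar, the Gateway/InferencePool objects, and the scheduler-to-worker wiring would still have to be stitched together outside of SkyPilot, which is the heart of the concerns above.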

2. Helm Integration
This approach involves implementing a Helm interface within SkyPilot to deploy llm-d charts directly.

  • Pros
    • Allows deploying other Helm charts beyond just llm-d.
  • Concerns
    • No Python Client: Helm lacks an official Python client. We'd likely need to use LocalResourcesHandle to execute Helm CLI commands from the API server (see the sketch after this list).
    • Context Management: Major clouds (AWS, GCP) recommend running Helm locally. We'd need a secure way to switch between and manage kubeconfig contexts within the workflow.
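
On the "no Python client" and context-management points, here is a minimal sketch of the Helm path, assuming the API server simply shells out to the helm CLI and pins the target cluster explicitly on every call instead of inheriting the ambient kubeconfig context. The release name, chart reference, and values keys are placeholders for the actual llm-d chart.

```python
# Sketch: driving the Helm CLI from Python with an explicitly pinned context.
# Chart name, release name, and values keys are hypothetical placeholders.
import subprocess
from typing import Mapping


def helm_upgrade_install(
    release: str,
    chart: str,
    namespace: str,
    kube_context: str,
    kubeconfig: str,
    values: Mapping[str, str],
) -> None:
    """Run `helm upgrade --install`, pinning cluster and kubeconfig explicitly
    rather than relying on whatever context the API server happens to have."""
    cmd = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "--create-namespace",
        "--kube-context", kube_context,   # never inherit the ambient context
        "--kubeconfig", kubeconfig,       # isolate from the user's default kubeconfig
        "--wait",
    ]
    for key, value in values.items():
        cmd += ["--set", f"{key}={value}"]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    helm_upgrade_install(
        release="llm-d",
        chart="oci://example.registry/charts/llm-d",   # placeholder chart reference
        namespace="llm-d",
        kube_context="gke_my-project_us-central1_my-cluster",
        kubeconfig="/path/to/skypilot-managed/kubeconfig",
        values={"decode.replicas": "2", "prefill.replicas": "1"},
    )
```

Passing --kube-context and --kubeconfig on every invocation keeps the call side-effect free with respect to the user's local kubectl state, which seems like the safer default if this runs inside the API server.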

I'd appreciate your insights on which approach aligns better with the roadmap.
