Support llm-d for high-performance distributed inference #8117

@ibpark-moreh

Description

Hi, SkyPilot team!

I was wondering if there are any plans to support llm-d for high-performance distributed inference.

llm-d is a high-performance distributed inference framework featuring Intelligent Inference Scheduling, Prefill/Decode (PD) Disaggregation, and Wide Expert-Parallelism (EP). I think combining SkyPilot’s multi-cloud capabilities with such a framework would be a huge win.

However, SkyPilot seems optimized for scaling inference frameworks like vLLM or SGLang. Adapting it to distributed inference frameworks like Dynamo or llm-d presents some structural challenges.

I’ve been thinking about how to tackle this and came up with two ideas. I'd love your thoughts on these.

1. Use sky serve
This approach involves deploying llm-d components (inference scheduler, prefill/decode workers) individually using sky serve; a rough sketch follows the list below.

  • Pros
    • Delegate resource management to SkyPilot and leverage its native autoscaling.
  • Concerns
    • Lack of Sidecar Support: Distributed inference often requires sidecars for traffic control. llm-d needs an Envoy proxy running alongside the worker. For example, a single decode pod consists of 2 containers (worker + proxy), but SkyPilot currently follows a “1 task = 1 container” rule.
    • Complex Networking: Components are interconnected in a distributed setup. For instance, the scheduler routes requests based on worker reports, and prefill workers need to transfer the KV cache to decode workers.
    • Manual Resources: Additional Kubernetes resources, such as the Gateway and InferencePool objects, would need to be created manually.
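
To make this concrete, here is a rough sketch of what option 1 could look like today, wrapping a single llm-d component (a decode worker) as its own SkyServe service. The setup/run commands, port, and readiness path are hypothetical placeholders rather than llm-d's real entrypoints, and the sketch deliberately sidesteps the sidecar problem: the Envoy proxy simply has no place to go under the current one-container model.

```python
# Sketch: one llm-d component (decode worker) as a standalone SkyServe service.
# The setup/run commands below are placeholders, not the real llm-d launch commands.
import subprocess
import tempfile

DECODE_WORKER_YAML = """\
service:
  readiness_probe: /health      # placeholder health endpoint
  replicas: 2                   # let SkyServe manage replicas for this component

resources:
  accelerators: H100:1
  ports: 8000

setup: |
  pip install vllm              # stand-in for installing the real llm-d worker

run: |
  # Hypothetical decode-worker entrypoint. The real llm-d worker, its
  # KV-transfer configuration, and the Envoy sidecar cannot be expressed
  # under SkyPilot's "1 task = 1 container" model.
  python -m vllm.entrypoints.openai.api_server --port 8000
"""


def serve_component(name: str, yaml_text: str) -> None:
    """Write the task YAML to a temp file and hand it to `sky serve up`."""
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        f.write(yaml_text)
        path = f.name
    subprocess.run(["sky", "serve", "up", path, "-n", name, "--yes"], check=True)


if __name__ == "__main__":
    # One service per component; the scheduler and prefill workers would need
    # their own YAMLs plus manual wiring (Gateway, InferencePool, KV transfer).
    serve_component("llmd-decode", DECODE_WORKER_YAML)
```

Even if each component comes up this way, the Envoy sidecar, the Gateway/InferencePool objects, and the scheduler-to-worker wiring would still have to be stitched together outside of SkyPilot, which is the heart of the concerns above.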

2. Helm Integration
This approach involves implementing a Helm interface within SkyPilot to deploy llm-d charts directly.

  • Pros
    • Allows deploying other Helm charts beyond just llm-d.
  • Concerns
    • No Python Client: Helm lacks an official Python client. We'd likely need to use LocalResourcesHandle to execute Helm CLI commands from the API server (see the sketch after this list).
    • Context Management: Major clouds (AWS, GCP) recommend running Helm locally. We'd need a secure way to switch between and manage kubeconfig contexts within the workflow.
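
On the "no Python client" and context-management points, here is a minimal sketch of the Helm path, assuming the API server simply shells out to the helm CLI and pins the target cluster explicitly on every call instead of inheriting the ambient kubeconfig context. The release name, chart reference, and values keys are placeholders for the actual llm-d chart.

```python
# Sketch: driving the Helm CLI from Python with an explicitly pinned context.
# Chart name, release name, and values keys are hypothetical placeholders.
import subprocess
from typing import Mapping


def helm_upgrade_install(
    release: str,
    chart: str,
    namespace: str,
    kube_context: str,
    kubeconfig: str,
    values: Mapping[str, str],
) -> None:
    """Run `helm upgrade --install`, pinning cluster and kubeconfig explicitly
    rather than relying on whatever context the API server happens to have."""
    cmd = [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace, "--create-namespace",
        "--kube-context", kube_context,   # never inherit the ambient context
        "--kubeconfig", kubeconfig,       # isolate from the user's default kubeconfig
        "--wait",
    ]
    for key, value in values.items():
        cmd += ["--set", f"{key}={value}"]
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    helm_upgrade_install(
        release="llm-d",
        chart="oci://example.registry/charts/llm-d",   # placeholder chart reference
        namespace="llm-d",
        kube_context="gke_my-project_us-central1_my-cluster",
        kubeconfig="/path/to/skypilot-managed/kubeconfig",
        values={"decode.replicas": "2", "prefill.replicas": "1"},
    )
```

Passing --kube-context and --kubeconfig on every invocation keeps the call side-effect free with respect to the user's local kubectl state, which seems like the safer default if this runs inside the API server.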

I'd appreciate your insights on which approach aligns better with the roadmap.
