
[Feature]: Add per-model max_parallel_requests limit #13930

@Swipe4057


The Feature

Introduce a new setting, model_max_parallel_requests, that lets users define the number of parallel requests allowed per key/user for each model individually.

Motivation, pitch

Currently, it's possible to set individual rate limits for each model. Here's an example from the official documentation:

curl --location 'http://0.0.0.0:4000/key/generate' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{"model_rpm_limit": {"gpt-4": 2}, "model_tpm_limit": {"gpt-4": 100}}'

However, in scenarios involving local models, where a key grants access not only to LLMs but also to other model types (such as embeddings or rerankers), model_rpm_limit and model_tpm_limit can be confusing and difficult for users to manage. A much clearer and more intuitive limit would be max_parallel_requests; currently, however, this parameter cannot be set per model when generating a key. A sketch of what the proposed parameter could look like follows below.
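For illustration, the proposed model_max_parallel_requests could mirror the format of the existing model_rpm_limit map. This is only a sketch of the requested feature, not an existing API; the model names and values are placeholders:

curl --location 'http://0.0.0.0:4000/key/generate' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{"model_max_parallel_requests": {"gpt-4": 5, "text-embedding-ada-002": 20}}'

With a key generated this way, the proxy would cap the key at 5 concurrent requests to gpt-4 and 20 concurrent requests to the embedding model, independently of any RPM/TPM limits.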

LiteLLM is hiring a founding backend engineer. Are you interested in joining us and shipping to all our users?

No

Twitter / LinkedIn details

No response
