Closed as not planned
Description
The Feature
Introduce a new entity called model_max_parallel_requests, allowing users to define the number of parallel requests allowed per key or user, individually for each model.
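As a sketch of the proposed usage, a key could then be generated with per-model parallel-request caps. The field name and payload shape below are illustrative only, mirroring the existing model_rpm_limit syntax rather than any implemented API:

```shell
# Hypothetical example: model_max_parallel_requests is the proposed field,
# not an existing LiteLLM parameter.
curl --location 'http://0.0.0.0:4000/key/generate' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{"model_max_parallel_requests": {"gpt-4": 2, "bge-reranker-large": 8}}'
```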
Motivation, pitch
Currently, it's possible to set individual rate limits for each model. Here's an example from the official documentation:
```shell
curl --location 'http://0.0.0.0:4000/key/generate' \
--header 'Authorization: Bearer sk-1234' \
--header 'Content-Type: application/json' \
--data '{"model_rpm_limit": {"gpt-4": 2}, "model_tpm_limit": {"gpt-4": 100}}'
```

However, in scenarios involving local models, where access is provided not only to LLMs but also to other models such as embeddings or rerankers, using model_rpm_limit and model_tpm_limit can be confusing and hard for users to manage. A much clearer and more intuitive limit would be max_parallel_requests. Currently, though, it is not possible to set this parameter per model when generating a key.
LiteLLM is hiring a founding backend engineer, are you interested in joining us and shipping to all our users?
No
Twitter / LinkedIn details
No response