Description
What happened?
We recently turned on Redis caching to enable Postgres pooling of requests like so:
cache: true
cache_params:
  type: redis
  host: os.environ/REDIS-HOST
  password: os.environ/REDIS-PASSWORD
  port: ####
  ssl: true
  supported_call_types: []
Upon making this change, users of /v1/messages began reporting that their keys were hitting their max_parallel_requests limits. Further testing showed that requests to /v1/messages effectively treat max_parallel_requests as a lifetime cap on the total number of requests a key can make, rather than limiting only concurrent requests. Once a key hits its rate limit, it stays rate limited permanently: the key will never be functional again if it is only used to hit /v1/messages.
Hitting /chat/completions with that key, however, starts a one-minute timer after which the key's rate limit is reset and it can make another batch of requests up to its max_parallel_requests limit.
To reproduce, follow these steps on a LiteLLM instance with caching enabled as shown above:
- Create a new key and set its maximum parallel requests to 5
- Hit /v1/messages with your newly created key 5 times
- You should now observe rate limiting occurring forever on that key when you hit /v1/messages
- Hit /chat/completions with that key and then wait 60 seconds
- Hit /v1/messages again; it should work for 5 more requests, after which it will be rate limited permanently again
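The behavior above is consistent with the per-key in-flight counter being incremented when a /v1/messages request starts but never decremented when it completes. A toy sketch of that hypothesis (the class and method names here are illustrative, not LiteLLM's actual implementation):

```python
class ParallelLimiter:
    """In-memory stand-in for the Redis-backed parallel-request counter."""

    def __init__(self, max_parallel: int):
        self.max_parallel = max_parallel
        self.in_flight = 0

    def acquire(self) -> bool:
        """Called when a request starts; reject if the limit is reached."""
        if self.in_flight >= self.max_parallel:
            return False  # corresponds to the 429 response
        self.in_flight += 1
        return True

    def release(self) -> None:
        """Called when a request finishes. The reported bug behaves as if
        this step is skipped for /v1/messages, so the counter only ever
        grows and the key is rate limited forever."""
        self.in_flight -= 1


# Correct flow: acquire/release around each request keeps the key usable
# for any number of sequential requests.
ok = ParallelLimiter(max_parallel=5)
for _ in range(10):
    assert ok.acquire()
    ok.release()

# Buggy flow (no release): after 5 requests, every further request is
# rejected permanently, matching the observed behavior.
buggy = ParallelLimiter(max_parallel=5)
results = [buggy.acquire() for _ in range(6)]
print(results)  # [True, True, True, True, True, False]
```

With this model, /chat/completions presumably goes through a code path that does reset or expire the counter, which would explain why hitting it restores the key.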
Relevant log output
{
  "error": {
    "message": "429: Rate limit exceeded for api_key: xxxxxxxxx. Limit type: max_parallel_requests. Current limit: 5, Remaining: 0. Limit resets at: 2025-12-01 20:28:17 UTC",
    "type": "None",
    "param": "None",
    "code": "429"
  }
}
Are you a ML Ops Team?
No
What LiteLLM version are you on?
v1.79.1
Twitter / LinkedIn details
No response