fix(apscheduler): prevent memory leaks from jitter and frequent job intervals (#15846)
* fix(apscheduler): prevent memory leaks from jitter and frequent job intervals
Fixes critical memory leak in APScheduler that causes 35GB+ memory allocations
during proxy startup and operation. The leak was identified through Memray
analysis showing massive allocations in normalize() and _apply_jitter()
functions.
Key changes:
1. Remove jitter parameters from all scheduled jobs - jitter was causing
   expensive normalize() calculations leading to memory explosion
2. Configure AsyncIOScheduler with optimized job_defaults:
   - misfire_grace_time: 3600s (increased from 120s) to prevent backlog
     calculations that trigger memory leaks
   - coalesce: true to collapse missed runs
   - max_instances: 1 to prevent concurrent job execution
   - replace_existing: true to avoid duplicate jobs on restart
3. Increase minimum job intervals:
   - PROXY_BATCH_WRITE_AT: 30s (was 10s)
   - add_deployment/get_credentials jobs: 30s (was 10s)
4. Use fixed intervals with small random offsets instead of jitter for
   job distribution across workers
5. Explicitly configure jobstores and executors to minimize overhead
6. Disable timezone awareness to reduce computation
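A minimal sketch of how the scheduler settings in item 2 might be wired up, assuming APScheduler 3.x. The constant and function names are hypothetical and are not the identifiers used in litellm. In APScheduler 3.x, `coalesce`, `max_instances`, and `misfire_grace_time` are job defaults, while `replace_existing` is an argument to `add_job()`, so the sketch passes it there.

```python
# Hypothetical names; the values are the ones listed in the commit message.
SCHEDULER_JOB_DEFAULTS = {
    "coalesce": True,            # collapse a backlog of missed runs into one run
    "misfire_grace_time": 3600,  # seconds (raised from 120)
    "max_instances": 1,          # never run the same job concurrently
}
SCHEDULER_REPLACE_EXISTING = True  # reuse the job id instead of adding a duplicate


def build_scheduler():
    """Create an AsyncIOScheduler with the defaults above (needs apscheduler 3.x)."""
    # Imported lazily so the constants can be inspected without apscheduler installed.
    from apscheduler.schedulers.asyncio import AsyncIOScheduler

    return AsyncIOScheduler(job_defaults=SCHEDULER_JOB_DEFAULTS)


def add_interval_job(scheduler, func, job_id, seconds):
    """Register a fixed-interval job; no jitter argument is passed."""
    return scheduler.add_job(
        func,
        "interval",
        seconds=seconds,
        id=job_id,
        replace_existing=SCHEDULER_REPLACE_EXISTING,
    )
```

With a stable `id` and `replace_existing=True`, registering the same job again replaces the stored job rather than raising a conflict or creating a second copy.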
Memory impact:
- Before: 35GB with 483M allocations during startup
- After: <1GB with normal allocation patterns
Performance notes:
- Minimum job intervals increased from 10s to 30s (configurable via env vars)
- Jobs can still be distributed across workers using random start offsets
- No functional changes to job behavior, only timing and memory optimization
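The two timing notes above can be illustrated with a short sketch, under stated assumptions: the interval is read from the `PROXY_BATCH_WRITE_AT` environment variable with a 30-second fallback, and each worker delays its first run by a small random offset. The helper names and the 10-second offset ceiling are hypothetical; the commit message does not say how litellm computes the offset.

```python
import os
import random
from datetime import datetime, timedelta


def batch_write_interval(default_seconds=30):
    """Read PROXY_BATCH_WRITE_AT from the environment, falling back to 30s."""
    return int(os.getenv("PROXY_BATCH_WRITE_AT", str(default_seconds)))


def staggered_start(now, max_offset_seconds=10.0, rng=random):
    """Return a first-run time delayed by a random 0..max_offset_seconds offset."""
    return now + timedelta(seconds=rng.uniform(0.0, max_offset_seconds))


if __name__ == "__main__":
    start = staggered_start(datetime.now())
    print(f"interval={batch_write_interval()}s, first run at {start.isoformat()}")
```

The offset is drawn once at registration time, so the interval itself stays fixed. In APScheduler 3.x the resulting time could be supplied through the `next_run_time` argument of `add_job()`; whether the commit does exactly this is an assumption.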
Testing:
- Added comprehensive test suite for scheduler configuration
- Verified no job execution backlog on startup
- Tested duplicate job prevention with replace_existing
Related issue: Memory leak in production proxy servers with APScheduler
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <[email protected]>
* docs: update PROXY_BATCH_WRITE_AT default value from 10s to 30s
Update documentation to reflect the new default value for PROXY_BATCH_WRITE_AT
changed in PR #15846. The default was increased from 10 seconds to 30 seconds
to prevent memory leaks in APScheduler.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
* refactor: Move APScheduler config to constants.py
Address code review feedback from ishaan-jaff:
- Move scheduler configuration variables (coalesce, misfire_grace_time,
max_instances, replace_existing) to litellm/constants.py
- Update all references in proxy_server.py to use the constants
- Centralize configuration values to improve maintainability
Requested-by: @ishaan-jaff
Related: #15846
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <[email protected]>
---------
Co-authored-by: Claude <[email protected]>
docs/my-website/docs/proxy/config_settings.md
2 lines changed: 2 additions & 2 deletions
@@ -232,7 +232,7 @@ router_settings:
 | max_response_size_mb | int | The maximum size for responses in MB. LLM Responses above this size will not be sent. |
 | proxy_budget_rescheduler_min_time | int | The minimum time (in seconds) to wait before checking db for budget resets. **Default is 597 seconds**|
 | proxy_budget_rescheduler_max_time | int | The maximum time (in seconds) to wait before checking db for budget resets. **Default is 605 seconds**|
-| proxy_batch_write_at | int | Time (in seconds) to wait before batch writing spend logs to the db. **Default is 10 seconds**|
+| proxy_batch_write_at | int | Time (in seconds) to wait before batch writing spend logs to the db. **Default is 30 seconds**|
 | proxy_batch_polling_interval | int | Time (in seconds) to wait before polling a batch, to check if it's completed. **Default is 6000 seconds (1 hour)**|
 | alerting_args | dict | Args for Slack Alerting [Doc on Slack Alerting](./alerting.md)|
 | custom_key_generate | str | Custom function for key generation [Doc on custom key generation](./virtual_keys.md#custom--key-generate)|
@@ -726,7 +726,7 @@ router_settings:
 | PROMPTLAYER_API_KEY | API key for PromptLayer integration
 | PROXY_ADMIN_ID | Admin identifier for proxy server
 | PROXY_BASE_URL | Base URL for proxy service
-| PROXY_BATCH_WRITE_AT | Time in seconds to wait before batch writing spend logs to the database. Default is 10
+| PROXY_BATCH_WRITE_AT | Time in seconds to wait before batch writing spend logs to the database. Default is 30
 | PROXY_BATCH_POLLING_INTERVAL | Time in seconds to wait before polling a batch, to check if it's completed. Default is 6000s (1 hour)
 | PROXY_BUDGET_RESCHEDULER_MAX_TIME | Maximum time in seconds to wait before checking database for budget resets. Default is 605
 | PROXY_BUDGET_RESCHEDULER_MIN_TIME | Minimum time in seconds to wait before checking database for budget resets. Default is 597