Description
What happened?
Hi LiteLLM team, we’re seeing a severe performance and cost issue with spend reporting over Postgres/Aurora.
Summary
When querying spend logs over a date range (used by UI spend dashboards / team spend views), LiteLLM generates SQL that casts the startTime column to DATE. This makes the predicate non-sargable, so Postgres can’t use the startTime index and falls back to parallel sequential scans over the entire LiteLLM_SpendLogs table. At scale this causes huge read I/O, temp spill, and multi-second to multi-minute response times.
Culprit SQL (generated by LiteLLM)
We observed (via pg_stat_statements) LiteLLM emitting a query shaped like:
SELECT
t.team_alias,
SUM(s.spend) AS total_spend
FROM
"LiteLLM_SpendLogs" s
LEFT JOIN
"LiteLLM_TeamTable" t ON s.team_id = t.team_id
WHERE
s."startTime"::DATE >= $1::date
AND s."startTime"::DATE <= $2::date
GROUP BY
t.team_alias
ORDER BY
total_spend DESC;
The issue is the cast:
s."startTime"::DATEImpact in production
Impact in production
- LiteLLM_SpendLogs dominates physical reads by orders of magnitude.
- Spend dashboards regularly scan the entire log table.
- Large GROUP BY / ORDER BY steps spill to disk (temp files/bytes soar).
- On Aurora, this translated into billions of billed storage I/Os.
- User-visible latency spikes on spend endpoints.
EXPLAIN comparison
Old plan (current generated SQL)
Key properties:
- Parallel Seq Scan on "LiteLLM_SpendLogs"
- ~1.44M shared blocks read from storage per run
- temp spill (external merge) during aggregation/sort
- ~34s runtime for a 30-day window
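For reproduction: plans like the ones below can be captured with EXPLAIN (ANALYZE, BUFFERS). A sketch against the old query shape, using the date literals from the plan (the I/O Timings lines additionally require track_io_timing = on):

-- Old (non-sargable) shape; the ::DATE cast defeats the startTime index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.team_alias, SUM(s.spend) AS total_spend
FROM "LiteLLM_SpendLogs" s
LEFT JOIN "LiteLLM_TeamTable" t ON s.team_id = t.team_id
WHERE s."startTime"::DATE >= '2025-11-01'::date
  AND s."startTime"::DATE <= '2025-11-30'::date
GROUP BY t.team_alias
ORDER BY total_spend DESC;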
New plan (index-friendly rewrite)
WHERE s."startTime" >= $1::timestamptz
AND s."startTime" < ($2::timestamptz + interval '1 day')Key properties:
- Parallel Index Scan using LiteLLM_SpendLogs_startTime_idx
- shared reads = 0 blocks (all cache hits)
- no temp spills
- ~2s runtime for the same window (~17× faster, with physical reads essentially eliminated)
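Putting the rewrite together with the original query shape (only the WHERE clause changes; $1/$2 are the same date parameters):

-- Sargable rewrite: the half-open range [start, end + 1 day) keeps the end
-- date inclusive without casting "startTime", so the index remains usable.
SELECT
    t.team_alias,
    SUM(s.spend) AS total_spend
FROM "LiteLLM_SpendLogs" s
LEFT JOIN "LiteLLM_TeamTable" t ON s.team_id = t.team_id
WHERE s."startTime" >= $1::timestamptz
  AND s."startTime" < ($2::timestamptz + interval '1 day')
GROUP BY t.team_alias
ORDER BY total_spend DESC;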
Suggested fix
- Make spend time filtering sargable:
  - Do not cast startTime to DATE.
  - Convert input dates into timestamp boundaries and filter directly on startTime: [start_ts, end_ts + 1 day) using gte and lt (as in the rewritten query above).
- (Recommended) Avoid aggregating raw SpendLogs for dashboards:
  - Provide/enable a daily or hourly rollup (materialized view or table); a sketch follows after this list.
  - Have dashboards query rollups by default, and only fall back to raw SpendLogs for narrow windows.
- Add indexing:
CREATE INDEX CONCURRENTLY spendlogs_starttime_team_idx
ON "LiteLLM_SpendLogs" ("startTime", team_id);
Relevant log output
-- OLD plan (current LiteLLM-generated SQL with startTime::DATE)
Sort (cost=2110783.71..2110783.72 rows=1 width=13) (actual time=34431.204..34504.340 rows=1 loops=1)
Sort Key: (sum(s.spend)) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
-> Finalize GroupAggregate (cost=2110614.67..2110783.70 rows=1 width=13) (actual time=34431.197..34504.332 rows=1 loops=1)
Group Key: t.team_alias
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
-> Gather Merge (cost=2110614.67..2110783.68 rows=2 width=13) (actual time=34431.189..34504.324 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
-> Partial GroupAggregate (cost=2109614.65..2109783.43 rows=1 width=13) (actual time=34377.370..34377.375 rows=1 loops=3)
Group Key: t.team_alias
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
-> Sort (cost=2109614.65..2109670.91 rows=22502 width=13) (actual time=34092.017..34245.229 rows=679649 loops=3)
Sort Key: t.team_alias
Sort Method: external merge Disk: 11576kB
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
Worker 0: Sort Method: external merge Disk: 11352kB
Worker 1: Sort Method: external merge Disk: 13104kB
-> Hash Left Join (cost=1.02..2107988.01 rows=22502 width=13) (actual time=729.940..33887.591 rows=679649 loops=3)
Hash Cond: (s.team_id = t.team_id)
Buffers: shared hit=578134 read=1439685
I/O Timings: shared read=93815.903
-> Parallel Seq Scan on "LiteLLM_SpendLogs" s (cost=0.00..2107802.75 rows=22502 width=9) (actual time=725.385..33688.914 rows=679649 loops=3)
Filter: ((("startTime")::date >= '2025-11-01'::date) AND (("startTime")::date <= '2025-11-30'::date))
Rows Removed by Filter: 2931725
Buffers: shared hit=578109 read=1439685
I/O Timings: shared read=93802.327
-> Hash (cost=1.01..1.01 rows=1 width=42) (actual time=0.039..0.040 rows=1 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=3
-> Seq Scan on "LiteLLM_TeamTable" t (cost=0.00..1.01 rows=1 width=42) (actual time=0.021..0.022 rows=1 loops=3)
Buffers: shared hit=3
Planning Time: 0.219 ms
---
-- NEW plan (sargable timestamp range, uses startTime index)
Sort (cost=745451.67..745451.68 rows=1 width=13) (actual time=1915.971..1971.568 rows=1 loops=1)
Sort Key: (sum(s.spend)) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1696428
-> Finalize GroupAggregate (cost=745451.40..745451.66 rows=1 width=13) (actual time=1915.964..1971.560 rows=1 loops=1)
Group Key: t.team_alias
Buffers: shared hit=1696428
-> Gather Merge (cost=745451.40..745451.64 rows=2 width=13) (actual time=1915.951..1971.548 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1696428
-> Sort (cost=744451.38..744451.39 rows=1 width=13) (actual time=1893.499..1893.503 rows=1 loops=3)
Sort Key: t.team_alias
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1696428
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Partial HashAggregate (cost=744451.36..744451.37 rows=1 width=13) (actual time=1893.457..1893.460 rows=1 loops=3)
Group Key: t.team_alias
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=1696412
Worker 0: Batches: 1 Memory Usage: 24kB
Worker 1: Batches: 1 Memory Usage: 24kB
-> Hash Left Join (cost=1.46..740288.77 rows=832518 width=13) (actual time=0.355..1645.286 rows=679649 loops=3)
Hash Cond: (s.team_id = t.team_id)
Buffers: shared hit=1696412
-> Parallel Index Scan using "LiteLLM_SpendLogs_startTime_idx" on "LiteLLM_SpendLogs" s (cost=0.44..733471.51 rows=832518 width=9) (actual time=0.298..1406.776 rows=679649 loops=3)
Index Cond: (("startTime" >= '2025-11-01 00:00:00+00'::timestamp with time zone) AND ("startTime" < ('2025-11-30 00:00:00+00'::timestamp with time zone + '1 day'::interval)))
Buffers: shared hit=1696409
-> Hash (cost=1.01..1.01 rows=1 width=42) (actual time=0.044..0.045 rows=1 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=3
-> Seq Scan on "LiteLLM_TeamTable" t (cost=0.00..1.01 rows=1 width=42) (actual time=0.026..0.027 rows=1 loops=3)
Buffers: shared hit=3
Planning Time: 0.214 ms
Are you a ML Ops Team?
No
What LiteLLM version are you on ?
v1.80.0
Twitter / LinkedIn details
No response