[Bug]: SpendLogs aggregation generates non-sargable startTime::date filter → massive Postgres seq scans & Aurora I/O costs #17487

@kristianmitk

Description

What happened?

Hi LiteLLM team, we’re seeing a severe performance and cost issue with spend reporting over Postgres/Aurora.

Summary

When querying spend logs over a date range (used by UI spend dashboards / team spend views), LiteLLM generates SQL that casts the startTime column to DATE. This makes the predicate non-sargable, so Postgres can’t use the startTime index and falls back to parallel sequential scans over the entire LiteLLM_SpendLogs table. At scale this causes huge read I/O, temp spill, and multi-second to multi-minute response times.

Culprit SQL (generated by LiteLLM)

We observed (via pg_stat_statements) LiteLLM emitting a query shaped like:

SELECT
    t.team_alias,
    SUM(s.spend) AS total_spend
FROM
    "LiteLLM_SpendLogs" s
LEFT JOIN
    "LiteLLM_TeamTable" t ON s.team_id = t.team_id
WHERE
    s."startTime"::DATE >= $1::date
    AND s."startTime"::DATE <= $2::date
GROUP BY
    t.team_alias
ORDER BY
    total_spend DESC;

The issue is the cast:

s."startTime"::DATE

Impact in production

  • LiteLLM_SpendLogs dominates physical reads by orders of magnitude.
  • Spend dashboards regularly scan the entire log table.
  • Large GROUP BY / ORDER BY steps spill to disk (temp files/bytes soar).
  • On Aurora, this translated into billions of billed storage I/Os.
  • User-visible latency spikes on spend endpoints.

EXPLAIN comparison

Old plan (current generated SQL)

Key properties:

  • Parallel Seq Scan on "LiteLLM_SpendLogs"
  • ~1.44M shared blocks read from storage per run
  • temp spill (external merge) during aggregation/sort
  • ~34s runtime for a 30-day window

New plan (index-friendly rewrite)

WHERE s."startTime" >= $1::timestamptz
  AND s."startTime" < ($2::timestamptz + interval '1 day')

Key properties:

  • Parallel Index Scan using LiteLLM_SpendLogs_startTime_idx
  • shared reads = 0 blocks (all cache hits)
  • no temp spills
  • ~2s runtime for same window (~17× faster and essentially eliminates physical reads)
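The half-open window used by the new plan can be computed in application code before the query is issued. A minimal sketch (`spend_window_bounds` is a hypothetical helper for illustration, not existing LiteLLM code):

```python
from datetime import date, datetime, time, timedelta, timezone

def spend_window_bounds(start_date: date, end_date: date) -> tuple[datetime, datetime]:
    # An inclusive [start_date, end_date] range of days becomes a half-open
    # [start_ts, end_ts) range of timestamps, so the SQL can compare
    # "startTime" directly with >= and < and the predicate stays sargable.
    start_ts = datetime.combine(start_date, time.min, tzinfo=timezone.utc)
    end_ts = datetime.combine(end_date + timedelta(days=1), time.min, tzinfo=timezone.utc)
    return start_ts, end_ts

lo, hi = spend_window_bounds(date(2025, 11, 1), date(2025, 11, 30))
# Bind as parameters into: WHERE "startTime" >= $1 AND "startTime" < $2
```

Adding one day to the end date (rather than filtering `<= end_of_day`) avoids any ambiguity about sub-second precision at the end of the window.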

Suggested fix

  1. Make spend time filtering sargable
  • Do not cast startTime to DATE.
  • Convert input dates into timestamp boundaries and filter directly on startTime:
    • [start_ts, end_ts + 1 day) using gte and lt
  2. (Recommended) Avoid aggregating raw SpendLogs for dashboards
  • Provide/enable a daily or hourly rollup (materialized view or table).
  • Have dashboards query rollups by default, and only fall back to raw SpendLogs for narrow windows.
  3. Add an index covering the filter and join columns
CREATE INDEX CONCURRENTLY spendlogs_starttime_team_idx
ON "LiteLLM_SpendLogs" ("startTime", team_id);
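For the rollup suggestion, a daily aggregate could look roughly like the following. This is a sketch only: the view name, column set, and refresh policy are assumptions, not part of the existing LiteLLM schema.

```sql
-- Hypothetical daily rollup; name and columns are illustrative.
CREATE MATERIALIZED VIEW spendlogs_daily_rollup AS
SELECT
    date_trunc('day', "startTime") AS day,
    team_id,
    SUM(spend)                     AS total_spend,
    COUNT(*)                       AS request_count
FROM "LiteLLM_SpendLogs"
GROUP BY 1, 2;

-- REFRESH ... CONCURRENTLY requires a unique index on the view.
CREATE UNIQUE INDEX ON spendlogs_daily_rollup (day, team_id);

-- Refresh periodically (e.g. from a scheduled job) without blocking readers.
REFRESH MATERIALIZED VIEW CONCURRENTLY spendlogs_daily_rollup;
```

Dashboards summing `total_spend` over `day` would then touch a few hundred rollup rows instead of millions of raw log rows.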

Relevant log output

-- OLD plan (current LiteLLM-generated SQL with startTime::DATE)

Sort  (cost=2110783.71..2110783.72 rows=1 width=13) (actual time=34431.204..34504.340 rows=1 loops=1)
  Sort Key: (sum(s.spend)) DESC
  Sort Method: quicksort  Memory: 25kB
  Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
  I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
  ->  Finalize GroupAggregate  (cost=2110614.67..2110783.70 rows=1 width=13) (actual time=34431.197..34504.332 rows=1 loops=1)
        Group Key: t.team_alias
        Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
        I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
        ->  Gather Merge  (cost=2110614.67..2110783.68 rows=2 width=13) (actual time=34431.189..34504.324 rows=3 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
              I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
              ->  Partial GroupAggregate  (cost=2109614.65..2109783.43 rows=1 width=13) (actual time=34377.370..34377.375 rows=1 loops=3)
                    Group Key: t.team_alias
                    Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
                    I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
                    ->  Sort  (cost=2109614.65..2109670.91 rows=22502 width=13) (actual time=34092.017..34245.229 rows=679649 loops=3)
                          Sort Key: t.team_alias
                          Sort Method: external merge  Disk: 11576kB
                          Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
                          I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
                          Worker 0:  Sort Method: external merge  Disk: 11352kB
                          Worker 1:  Sort Method: external merge  Disk: 13104kB
                          ->  Hash Left Join  (cost=1.02..2107988.01 rows=22502 width=13) (actual time=729.940..33887.591 rows=679649 loops=3)
                                Hash Cond: (s.team_id = t.team_id)
                                Buffers: shared hit=578134 read=1439685
                                I/O Timings: shared read=93815.903
                                ->  Parallel Seq Scan on "LiteLLM_SpendLogs" s  (cost=0.00..2107802.75 rows=22502 width=9) (actual time=725.385..33688.914 rows=679649 loops=3)
                                      Filter: ((("startTime")::date >= '2025-11-01'::date) AND (("startTime")::date <= '2025-11-30'::date))
                                      Rows Removed by Filter: 2931725
                                      Buffers: shared hit=578109 read=1439685
                                      I/O Timings: shared read=93802.327
                                ->  Hash  (cost=1.01..1.01 rows=1 width=42) (actual time=0.039..0.040 rows=1 loops=3)
                                      Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                      Buffers: shared hit=3
                                      ->  Seq Scan on "LiteLLM_TeamTable" t  (cost=0.00..1.01 rows=1 width=42) (actual time=0.021..0.022 rows=1 loops=3)
                                            Buffers: shared hit=3
Planning Time: 0.219 ms


---


-- NEW plan (sargable timestamp range, uses startTime index)

Sort  (cost=745451.67..745451.68 rows=1 width=13) (actual time=1915.971..1971.568 rows=1 loops=1)
  Sort Key: (sum(s.spend)) DESC
  Sort Method: quicksort  Memory: 25kB
  Buffers: shared hit=1696428
  ->  Finalize GroupAggregate  (cost=745451.40..745451.66 rows=1 width=13) (actual time=1915.964..1971.560 rows=1 loops=1)
        Group Key: t.team_alias
        Buffers: shared hit=1696428
        ->  Gather Merge  (cost=745451.40..745451.64 rows=2 width=13) (actual time=1915.951..1971.548 rows=3 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              Buffers: shared hit=1696428
              ->  Sort  (cost=744451.38..744451.39 rows=1 width=13) (actual time=1893.499..1893.503 rows=1 loops=3)
                    Sort Key: t.team_alias
                    Sort Method: quicksort  Memory: 25kB
                    Buffers: shared hit=1696428
                    Worker 0:  Sort Method: quicksort  Memory: 25kB
                    Worker 1:  Sort Method: quicksort  Memory: 25kB
                    ->  Partial HashAggregate  (cost=744451.36..744451.37 rows=1 width=13) (actual time=1893.457..1893.460 rows=1 loops=3)
                          Group Key: t.team_alias
                          Batches: 1  Memory Usage: 24kB
                          Buffers: shared hit=1696412
                          Worker 0:  Batches: 1  Memory Usage: 24kB
                          Worker 1:  Batches: 1  Memory Usage: 24kB
                          ->  Hash Left Join  (cost=1.46..740288.77 rows=832518 width=13) (actual time=0.355..1645.286 rows=679649 loops=3)
                                Hash Cond: (s.team_id = t.team_id)
                                Buffers: shared hit=1696412
                                ->  Parallel Index Scan using "LiteLLM_SpendLogs_startTime_idx" on "LiteLLM_SpendLogs" s  (cost=0.44..733471.51 rows=832518 width=9) (actual time=0.298..1406.776 rows=679649 loops=3)
                                      Index Cond: (("startTime" >= '2025-11-01 00:00:00+00'::timestamp with time zone) AND ("startTime" < ('2025-11-30 00:00:00+00'::timestamp with time zone + '1 day'::interval)))
                                      Buffers: shared hit=1696409
                                ->  Hash  (cost=1.01..1.01 rows=1 width=42) (actual time=0.044..0.045 rows=1 loops=3)
                                      Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                      Buffers: shared hit=3
                                      ->  Seq Scan on "LiteLLM_TeamTable" t  (cost=0.00..1.01 rows=1 width=42) (actual time=0.026..0.027 rows=1 loops=3)
                                            Buffers: shared hit=3
Planning Time: 0.214 ms

Are you a ML Ops Team?

No

What LiteLLM version are you on ?

v1.80.0

Twitter / LinkedIn details

No response
