Description
What happened?
Hi LiteLLM team, we’re seeing a severe performance and cost issue with spend reporting over Postgres/Aurora.
Summary
When querying spend logs over a date range (used by UI spend dashboards / team spend views), LiteLLM generates SQL that casts the startTime column to DATE. This makes the predicate non-sargable, so Postgres can’t use the startTime index and falls back to parallel sequential scans over the entire LiteLLM_SpendLogs table. At scale this causes huge read I/O, temp spill, and multi-second to multi-minute response times.
Culprit SQL (generated by LiteLLM)
We observed (via pg_stat_statements) LiteLLM emitting a query shaped like:
SELECT
t.team_alias,
SUM(s.spend) AS total_spend
FROM
"LiteLLM_SpendLogs" s
LEFT JOIN
"LiteLLM_TeamTable" t ON s.team_id = t.team_id
WHERE
s."startTime"::DATE >= $1::date
AND s."startTime"::DATE <= $2::date
GROUP BY
t.team_alias
ORDER BY
total_spend DESC;
The issue is the cast:
s."startTime"::DATEImpact in production
Impact in production
- LiteLLM_SpendLogs dominates physical reads by orders of magnitude.
- Spend dashboards regularly scan the entire log table.
- Large GROUP BY / ORDER BY steps spill to disk (temp files/bytes soar).
- On Aurora, this translated into billions of billed storage I/Os.
- User-visible latency spikes on spend endpoints.
EXPLAIN comparison
Old plan (current generated SQL)
Key properties:
- Parallel Seq Scan on "LiteLLM_SpendLogs"
- ~1.44M shared blocks read from storage per run
- temp spill (external merge) during aggregation/sort
- ~34s runtime for a 30-day window
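For reproduction: plans like the ones below can be captured with EXPLAIN (ANALYZE, BUFFERS). A sketch against the old query shape, using the date literals from the plan (the I/O Timings lines additionally require track_io_timing = on):

-- Old (non-sargable) shape; the ::DATE cast defeats the startTime index.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.team_alias, SUM(s.spend) AS total_spend
FROM "LiteLLM_SpendLogs" s
LEFT JOIN "LiteLLM_TeamTable" t ON s.team_id = t.team_id
WHERE s."startTime"::DATE >= '2025-11-01'::date
  AND s."startTime"::DATE <= '2025-11-30'::date
GROUP BY t.team_alias
ORDER BY total_spend DESC;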
New plan (index-friendly rewrite)
WHERE s."startTime" >= $1::timestamptz
AND s."startTime" < ($2::timestamptz + interval '1 day')Key properties:
- Parallel Index Scan using LiteLLM_SpendLogs_startTime_idx
- shared reads = 0 blocks (all cache hits)
- no temp spills
- ~2s runtime for the same window (~17× faster, with physical reads essentially eliminated)
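Putting the rewrite together with the original query shape (only the WHERE clause changes; $1/$2 are the same date parameters):

-- Sargable rewrite: the half-open range [start, end + 1 day) keeps the end
-- date inclusive without casting "startTime", so the index remains usable.
SELECT
    t.team_alias,
    SUM(s.spend) AS total_spend
FROM "LiteLLM_SpendLogs" s
LEFT JOIN "LiteLLM_TeamTable" t ON s.team_id = t.team_id
WHERE s."startTime" >= $1::timestamptz
  AND s."startTime" < ($2::timestamptz + interval '1 day')
GROUP BY t.team_alias
ORDER BY total_spend DESC;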
Suggested fix
- Make spend time filtering sargable:
  - Do not cast startTime to DATE.
  - Convert input dates into timestamp boundaries and filter directly on startTime: [start_ts, end_ts + 1 day) using gte and lt (as in the rewritten query above).
- (Recommended) Avoid aggregating raw SpendLogs for dashboards:
  - Provide/enable a daily or hourly rollup (materialized view or table); a sketch follows after this list.
  - Have dashboards query rollups by default, and only fall back to raw SpendLogs for narrow windows.
- Add indexing:
CREATE INDEX CONCURRENTLY spendlogs_starttime_team_idx
ON "LiteLLM_SpendLogs" ("startTime", team_id);
Relevant log output
-- OLD plan (current LiteLLM-generated SQL with startTime::DATE)
Sort (cost=2110783.71..2110783.72 rows=1 width=13) (actual time=34431.204..34504.340 rows=1 loops=1)
Sort Key: (sum(s.spend)) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
-> Finalize GroupAggregate (cost=2110614.67..2110783.70 rows=1 width=13) (actual time=34431.197..34504.332 rows=1 loops=1)
Group Key: t.team_alias
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
-> Gather Merge (cost=2110614.67..2110783.68 rows=2 width=13) (actual time=34431.189..34504.324 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
-> Partial GroupAggregate (cost=2109614.65..2109783.43 rows=1 width=13) (actual time=34377.370..34377.375 rows=1 loops=3)
Group Key: t.team_alias
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
-> Sort (cost=2109614.65..2109670.91 rows=22502 width=13) (actual time=34092.017..34245.229 rows=679649 loops=3)
Sort Key: t.team_alias
Sort Method: external merge Disk: 11576kB
Buffers: shared hit=578208 read=1439685, temp read=4504 written=4531
I/O Timings: shared read=93815.903, temp read=18.249 write=8.678
Worker 0: Sort Method: external merge Disk: 11352kB
Worker 1: Sort Method: external merge Disk: 13104kB
-> Hash Left Join (cost=1.02..2107988.01 rows=22502 width=13) (actual time=729.940..33887.591 rows=679649 loops=3)
Hash Cond: (s.team_id = t.team_id)
Buffers: shared hit=578134 read=1439685
I/O Timings: shared read=93815.903
-> Parallel Seq Scan on "LiteLLM_SpendLogs" s (cost=0.00..2107802.75 rows=22502 width=9) (actual time=725.385..33688.914 rows=679649 loops=3)
Filter: ((("startTime")::date >= '2025-11-01'::date) AND (("startTime")::date <= '2025-11-30'::date))
Rows Removed by Filter: 2931725
Buffers: shared hit=578109 read=1439685
I/O Timings: shared read=93802.327
-> Hash (cost=1.01..1.01 rows=1 width=42) (actual time=0.039..0.040 rows=1 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=3
-> Seq Scan on "LiteLLM_TeamTable" t (cost=0.00..1.01 rows=1 width=42) (actual time=0.021..0.022 rows=1 loops=3)
Buffers: shared hit=3
Planning Time: 0.219 ms
---
-- NEW plan (sargable timestamp range, uses startTime index)
Sort (cost=745451.67..745451.68 rows=1 width=13) (actual time=1915.971..1971.568 rows=1 loops=1)
Sort Key: (sum(s.spend)) DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1696428
-> Finalize GroupAggregate (cost=745451.40..745451.66 rows=1 width=13) (actual time=1915.964..1971.560 rows=1 loops=1)
Group Key: t.team_alias
Buffers: shared hit=1696428
-> Gather Merge (cost=745451.40..745451.64 rows=2 width=13) (actual time=1915.951..1971.548 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=1696428
-> Sort (cost=744451.38..744451.39 rows=1 width=13) (actual time=1893.499..1893.503 rows=1 loops=3)
Sort Key: t.team_alias
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=1696428
Worker 0: Sort Method: quicksort Memory: 25kB
Worker 1: Sort Method: quicksort Memory: 25kB
-> Partial HashAggregate (cost=744451.36..744451.37 rows=1 width=13) (actual time=1893.457..1893.460 rows=1 loops=3)
Group Key: t.team_alias
Batches: 1 Memory Usage: 24kB
Buffers: shared hit=1696412
Worker 0: Batches: 1 Memory Usage: 24kB
Worker 1: Batches: 1 Memory Usage: 24kB
-> Hash Left Join (cost=1.46..740288.77 rows=832518 width=13) (actual time=0.355..1645.286 rows=679649 loops=3)
Hash Cond: (s.team_id = t.team_id)
Buffers: shared hit=1696412
-> Parallel Index Scan using "LiteLLM_SpendLogs_startTime_idx" on "LiteLLM_SpendLogs" s (cost=0.44..733471.51 rows=832518 width=9) (actual time=0.298..1406.776 rows=679649 loops=3)
Index Cond: (("startTime" >= '2025-11-01 00:00:00+00'::timestamp with time zone) AND ("startTime" < ('2025-11-30 00:00:00+00'::timestamp with time zone + '1 day'::interval)))
Buffers: shared hit=1696409
-> Hash (cost=1.01..1.01 rows=1 width=42) (actual time=0.044..0.045 rows=1 loops=3)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
Buffers: shared hit=3
-> Seq Scan on "LiteLLM_TeamTable" t (cost=0.00..1.01 rows=1 width=42) (actual time=0.026..0.027 rows=1 loops=3)
Buffers: shared hit=3
Planning Time: 0.214 ms
Are you a ML Ops Team?
No
What LiteLLM version are you on ?
v1.80.0
Twitter / LinkedIn details
No response