Conversation


@ihower ihower commented Nov 26, 2025

Resolved: #2122

When using xai/grok-4-1-fast-reasoning, LiteLLM's streaming output includes usage in a non-final chunk rather than in the last one:

  • The final chunk contains no usage data
  • A previous chunk contains valid usage data

However, the current SDK logic overwrites usage with None if later chunks do not include it. This causes valid usage information to be lost in the final response.
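A simplified illustration of the overwrite behavior described above (illustrative only, not the SDK's exact stream handler): if usage is read from every chunk unconditionally, a trailing chunk that carries no usage resets it to None.

usage = None
async for chunk in stream:
    # Taking usage from every chunk unconditionally means a final chunk
    # without usage data wipes out the value captured earlier.
    usage = getattr(chunk, "usage", None)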

Repro

from agents import Agent, Runner, ModelSettings
from agents.extensions.models.litellm_model import LitellmModel
import asyncio

async def main():
    agent = Agent(
        name="Assistant",
        instructions="You are a helpful assistant",
        model=LitellmModel(model="xai/grok-4-1-fast-reasoning"),
        model_settings=ModelSettings(include_usage=True)
    )

    result = Runner.run_streamed(agent, "just say hello")

    async for event in result.stream_events():
        pass

    print(result.context_wrapper.usage)


if __name__ == "__main__":
    asyncio.run(main())

Output:

Usage(requests=0, input_tokens=0, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=0, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=0, request_usage_entries=[])

As shown above, all usage values are reported as 0. This happens because the final streaming chunk does not include usage data, which causes valid usage from earlier chunks to be overwritten.

Root cause

Here is an example of streaming chunks from LiteLLM output (xai/grok-4-1-fast-reasoning):

...
ModelResponseStream(id='923d4779-8674-4a1e-a509-b9ecc58263a9', created=1764152777, model='grok-4-1-fast-non-reasoning', object='chat.completion.chunk', system_fingerprint=None, choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(provider_specific_fields=None, content='😊', role=None, function_call=None, tool_calls=None, audio=None), logprobs=None)], provider_specific_fields=None, citations=None)

ModelResponseStream(id='923d4779-8674-4a1e-a509-b9ecc58263a9', created=1764152777, model='grok-4-1-fast-non-reasoning', object='chat.completion.chunk', system_fingerprint=None, choices=[StreamingChoices(finish_reason='stop', index=0, delta=Delta(provider_specific_fields=None, content=None, role=None, function_call=None, tool_calls=None, audio=None), logprobs=None)], provider_specific_fields=None)

ModelResponseStream(id='923d4779-8674-4a1e-a509-b9ecc58263a9', created=1764152777, model='grok-4-1-fast-non-reasoning', object='chat.completion.chunk', system_fingerprint=None, choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(provider_specific_fields=None, content=None, role=None, 
function_call=None, tool_calls=None, audio=None), logprobs=None)], provider_specific_fields=None, usage=Usage(completion_tokens=11, prompt_tokens=19, total_tokens=30, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None, image_tokens=None), prompt_tokens_details=None))

ModelResponseStream(id='923d4779-8674-4a1e-a509-b9ecc58263a9', created=1764152777, model='grok-4-1-fast-non-reasoning', object='chat.completion.chunk', system_fingerprint=None, choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(provider_specific_fields=None, content=None, role=None, function_call=None, tool_calls=None, audio=None), logprobs=None)], provider_specific_fields=None)

As shown above, usage appears in the second-to-last chunk, not in the final chunk.
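For reference, a minimal way to inspect the raw chunks directly with LiteLLM and confirm which one carries usage (this calls litellm.acompletion directly; the stream_options flag is assumed to mirror what include_usage enables in the SDK):

import asyncio
import litellm

async def dump_chunks():
    response = await litellm.acompletion(
        model="xai/grok-4-1-fast-reasoning",
        messages=[{"role": "user", "content": "just say hello"}],
        stream=True,
        stream_options={"include_usage": True},
    )
    async for chunk in response:
        # Print each raw chunk; only one of the trailing chunks
        # includes a non-None `usage` field.
        print(chunk)

if __name__ == "__main__":
    asyncio.run(dump_chunks())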

Solution

This PR updates the stream handler to:

  • Only update usage when the current chunk actually includes usage data
  • Preserve the last valid usage instead of overwriting it with None

This ensures correct token accounting even when providers (e.g. LiteLLM) do not attach usage to the final chunk.
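A minimal sketch of the defensive pattern (illustrative names, not the SDK's actual stream handler):

usage = None  # last valid usage seen so far

async for chunk in stream:
    chunk_usage = getattr(chunk, "usage", None)
    if chunk_usage is not None:
        # Update only when the chunk actually carries usage data, so a
        # final chunk without usage cannot overwrite it with None.
        usage = chunk_usage
    # ... handle content deltas as usual ...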

Note

This behavior is likely a LiteLLM issue. I have reported it here: BerriAI/litellm#17136

That said, adding this defensive handling in the SDK is harmless and simple, and it lets us handle this case gracefully right away without waiting for an upstream fix.
