Conversation


@ihower ihower commented Nov 26, 2025

Resolved: #2122

When using xai/grok-4-1-fast-reasoning, LiteLLM's streaming output includes usage in a non-final chunk rather than in the last one:

  • The final chunk contains no usage data
  • A previous chunk contains valid usage data

However, the current SDK logic overwrites usage with None if later chunks do not include it. This causes valid usage information to be lost in the final response.
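A simplified illustration of the overwrite behavior described above (illustrative only, not the SDK's exact stream handler): if usage is read from every chunk unconditionally, a trailing chunk that carries no usage resets it to None.

usage = None
async for chunk in stream:
    # Taking usage from every chunk unconditionally means a final chunk
    # without usage data wipes out the value captured earlier.
    usage = getattr(chunk, "usage", None)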

Repro

from agents import Agent, Runner, ModelSettings
from agents.extensions.models.litellm_model import LitellmModel
import asyncio

async def main():
    agent = Agent(
        name="Assistant",
        instructions="You are a helpful assistant",
        model=LitellmModel(model="xai/grok-4-1-fast-reasoning"),
        model_settings=ModelSettings(include_usage=True)
    )

    result = Runner.run_streamed(agent, "just say hello")

    async for event in result.stream_events():
        pass

    print(result.context_wrapper.usage)


if __name__ == "__main__":
    asyncio.run(main())

Output:

Usage(requests=0, input_tokens=0, input_tokens_details=InputTokensDetails(cached_tokens=0), output_tokens=0, output_tokens_details=OutputTokensDetails(reasoning_tokens=0), total_tokens=0, request_usage_entries=[])

As shown above, all usage values are reported as 0. This happens because the final streaming chunk does not include usage data, which causes valid usage from earlier chunks to be overwritten.

Root cause

Here is an example of streaming chunks from LiteLLM output (xai/grok-4-1-fast-reasoning):

...
ModelResponseStream(id='923d4779-8674-4a1e-a509-b9ecc58263a9', created=1764152777, model='grok-4-1-fast-non-reasoning', object='chat.completion.chunk', system_fingerprint=None, choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(provider_specific_fields=None, content='😊', role=None, function_call=None, tool_calls=None, audio=None), logprobs=None)], provider_specific_fields=None, citations=None)

ModelResponseStream(id='923d4779-8674-4a1e-a509-b9ecc58263a9', created=1764152777, model='grok-4-1-fast-non-reasoning', object='chat.completion.chunk', system_fingerprint=None, choices=[StreamingChoices(finish_reason='stop', index=0, delta=Delta(provider_specific_fields=None, content=None, role=None, function_call=None, tool_calls=None, audio=None), logprobs=None)], provider_specific_fields=None)

ModelResponseStream(id='923d4779-8674-4a1e-a509-b9ecc58263a9', created=1764152777, model='grok-4-1-fast-non-reasoning', object='chat.completion.chunk', system_fingerprint=None, choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(provider_specific_fields=None, content=None, role=None, 
function_call=None, tool_calls=None, audio=None), logprobs=None)], provider_specific_fields=None, usage=Usage(completion_tokens=11, prompt_tokens=19, total_tokens=30, completion_tokens_details=CompletionTokensDetailsWrapper(accepted_prediction_tokens=None, audio_tokens=None, reasoning_tokens=0, rejected_prediction_tokens=None, text_tokens=None, image_tokens=None), prompt_tokens_details=None))

ModelResponseStream(id='923d4779-8674-4a1e-a509-b9ecc58263a9', created=1764152777, model='grok-4-1-fast-non-reasoning', object='chat.completion.chunk', system_fingerprint=None, choices=[StreamingChoices(finish_reason=None, index=0, delta=Delta(provider_specific_fields=None, content=None, role=None, function_call=None, tool_calls=None, audio=None), logprobs=None)], provider_specific_fields=None)

As shown above, usage appears in the second-to-last chunk, not in the final chunk.
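For reference, a minimal way to inspect the raw chunks directly with LiteLLM and confirm which one carries usage (this calls litellm.acompletion directly; the stream_options flag is assumed to mirror what include_usage enables in the SDK):

import asyncio
import litellm

async def dump_chunks():
    response = await litellm.acompletion(
        model="xai/grok-4-1-fast-reasoning",
        messages=[{"role": "user", "content": "just say hello"}],
        stream=True,
        stream_options={"include_usage": True},
    )
    async for chunk in response:
        # Print each raw chunk; only one of the trailing chunks
        # includes a non-None `usage` field.
        print(chunk)

if __name__ == "__main__":
    asyncio.run(dump_chunks())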

Solution

This PR updates the stream handler to:

  • Only update usage when the current chunk actually includes usage data
  • Preserve the last valid usage instead of overwriting it with None

This ensures correct token accounting even when providers (e.g. LiteLLM) do not attach usage to the final chunk.
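A minimal sketch of the defensive pattern (illustrative names, not the SDK's actual stream handler):

usage = None  # last valid usage seen so far

async for chunk in stream:
    chunk_usage = getattr(chunk, "usage", None)
    if chunk_usage is not None:
        # Update only when the chunk actually carries usage data, so a
        # final chunk without usage cannot overwrite it with None.
        usage = chunk_usage
    # ... handle content deltas as usual ...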

Note

This behavior is likely a LiteLLM issue. I have reported it here: BerriAI/litellm#17136

That said, adding this defensive handling in the SDK is harmless and simple, and it lets us handle this case gracefully right away without waiting for an upstream fix.
