
Conversation

@lintangsutawika (Collaborator)

No description provided.


# Colocated GRPO training+generation for Qwen3-8B on the SWE-Bench task.
# Uses 1 node with 8 GPUs.
# uv run --isolated examples/mini_swe_agent/preprocess_swegym.py --output_dir ~/data/swe_gym_subset


This file is in the SkyRL repo - is it intentional or just a copy-paste error?

Collaborator Author:

Sorry, that's a leftover from a SkyRL example.

Removed commented instructions and notes related to colocated GRPO training.
llm=LLM(
    service_id="agent",
    model=litellm_model_name,
    base_url=f"http://localhost:8080/v1/",


Suggested change:
-    base_url=f"http://localhost:8080/v1/",
+    base_url=litellm_base_url,

messages = []

agent = get_default_agent(
    llm=LLM(


We could also pass litellm_extra_body={"return_token_ids": True} to avoid retokenization drift in vLLM.

That way we could also avoid doing the tokenization ourselves.
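
For illustration, a minimal sketch of what that could look like, assuming the LLM config forwards litellm_extra_body to the vLLM OpenAI-compatible server (litellm_base_url as in the suggestion above):

llm = LLM(
    service_id="agent",
    model=litellm_model_name,
    base_url=litellm_base_url,
    # Assumed to be forwarded to vLLM so the response echoes back
    # prompt_token_ids / token_ids alongside the generated text.
    litellm_extra_body={"return_token_ids": True},
)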

initial_input_len = 0
past_trajectory_len = 0
for idx, message in enumerate(messages):
    current_prompt_ids = message["prompt_tokens_ids"]


I am a bit confused here: where does this field come from?

vLLM returns prompt_token_ids and token_ids if we pass return_token_ids: true via litellm_extra_body, but since you are not doing so, I suppose you need to actually run tokenization?
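
For reference, without return_token_ids the fallback would be to retokenize locally - a rough sketch, assuming messages holds OpenAI-style chat dicts and the tokenizer matches the served Qwen3-8B model; the risk is exactly the retokenization drift mentioned above:

from transformers import AutoTokenizer

# Retokenization fallback: re-apply the chat template locally to recover
# prompt token ids. This can drift from what the vLLM server actually saw.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
prompt_token_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True
)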

Collaborator Author:

I am hardcoding return_token_ids=True in the OpenHands submodule. prompt_token_ids and token_ids are the tokens that the vLLM inference engine sees and generates, so I didn't have to retokenize.

I had to make a custom event processor (also hardcoded in the agent-sdk submodule directory) so I can get the keys by doing this:

messages = [msg for msg in messages if msg["kind"] == "TokenEvent"]

I will prepare a PR for OpenHands so that this is cleaner. But for now, the code is doing what you suggested.
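
For illustration, a hedged sketch of how the per-step ids could then be pulled out of those events (the exact key names depend on the custom event processor and on the field-name discussion below):

# Assumes each TokenEvent carries the ids vLLM returned for that step;
# the prompt_token_ids / token_ids key names are an assumption about the
# custom event processor, not confirmed field names.
token_events = [msg for msg in messages if msg["kind"] == "TokenEvent"]
prompt_ids_list = [event["prompt_token_ids"] for event in token_events]
response_ids_list = [event["token_ids"] for event in token_events]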

@li-boxuan (Nov 5, 2025):

> I am hardcoding return_token_ids=True in the openhands submodule

Gotcha. You might want to try the litellm_extra_body={"return_token_ids": True} LLM config option - I think it should just work.

Also, shouldn't it be prompt_token_ids instead of prompt_tokens_ids, and token_ids instead of response_token_ids? Did you rename them in the agent-sdk submodule? (Sorry, I don't know how to view the diff there.)

Collaborator Author:

I see that this was a recent update (or at least new compared to the version of OpenHands I was using). Will make changes.


Yep, it was a recent update I made.


trajectory_ids = current_prompt_ids[initial_input_len:]
past_response_len = len(response_ids_list[idx-1]) if idx > 0 else 0
past_response_observation_ids = trajectory_ids[past_trajectory_len+past_response_len:]


Did you test whether this works? I was playing around with vLLM and it seems their token_ids field includes the EOS token at the end, which could make this offset off by one.

Collaborator Author:

It does not error out. Is the EOS supposed to be masked or not? I will need to adjust this accordingly.


The problem is not (just) about masking. trajectory_ids won't contain this EOS token in the middle, but response_ids does, which means you may need to change this into:

past_response_observation_ids = trajectory_ids[past_trajectory_len+past_response_len-1:]

if the finish reason was "stop".

That being said, I am not 100% sure on this.
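
For concreteness, a sketch of the conditional offset being described, assuming a hypothetical finish_reasons list recorded per step; per the follow-up below, the adjustment may turn out to be unnecessary:

# Hypothetical: finish_reasons[idx - 1] is the finish reason of the previous
# response. Drop one position only if that response ended with "stop" and its
# trailing EOS is therefore not echoed back inside trajectory_ids.
eos_adjust = 1 if idx > 0 and finish_reasons[idx - 1] == "stop" else 0
past_response_observation_ids = trajectory_ids[past_trajectory_len + past_response_len - eos_adjust:]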


I created an issue in vLLM for clarification: vllm-project/vllm#28185


Sorry, I might be totally wrong. I tested using the vLLM library directly and found EOS tokens appearing normally in prompt_token_ids.

Maybe LiteLLM was doing something weird, or maybe I did something stupid... let's ignore it for now.


My bad, I used the wrong chat template during testing. You should be good!

