
Conversation

@lintangsutawika (Collaborator)

No description provided.


# Colocated GRPO training+generation for Qwen3-8B on the SWE-Bench task.
# Uses 1 node with 8 GPUs.
# uv run --isolated examples/mini_swe_agent/preprocess_swegym.py --output_dir ~/data/swe_gym_subset


This file is in the SkyRL repo - is it intentional or just a copy-paste error?

Collaborator Author:

Sorry, that's a leftover from a SkyRL example.

Removed commented instructions and notes related to colocated GRPO training.
llm=LLM(
    service_id="agent",
    model=litellm_model_name,
    base_url=f"http://localhost:8080/v1/",


Suggested change:
-    base_url=f"http://localhost:8080/v1/",
+    base_url=litellm_base_url,

messages = []

agent = get_default_agent(
    llm=LLM(


We could also pass litellm_extra_body={"return_token_ids": True} to avoid retokenization drift in vLLM.

That way we could also avoid doing the tokenization ourselves.
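
For illustration, a minimal sketch of what that could look like, assuming the LLM config forwards litellm_extra_body to the vLLM OpenAI-compatible server (litellm_base_url as in the suggestion above):

llm = LLM(
    service_id="agent",
    model=litellm_model_name,
    base_url=litellm_base_url,
    # Assumed to be forwarded to vLLM so the response echoes back
    # prompt_token_ids / token_ids alongside the generated text.
    litellm_extra_body={"return_token_ids": True},
)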

initial_input_len = 0
past_trajectory_len = 0
for idx, message in enumerate(messages):
    current_prompt_ids = message["prompt_tokens_ids"]


I am a bit confused here: where does this field come from?

vLLM returns prompt_token_ids and token_ids if we pass return_token_ids: true via litellm_extra_body, but since you are not doing so, I suppose you need to actually run tokenization?
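
For reference, without return_token_ids the fallback would be to retokenize locally - a rough sketch, assuming messages holds OpenAI-style chat dicts and the tokenizer matches the served Qwen3-8B model; the risk is exactly the retokenization drift mentioned above:

from transformers import AutoTokenizer

# Retokenization fallback: re-apply the chat template locally to recover
# prompt token ids. This can drift from what the vLLM server actually saw.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
prompt_token_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True
)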

Collaborator Author:

I am hardcoding return_token_ids=True in the OpenHands submodule. prompt_token_ids and token_ids are the tokens that the vLLM inference engine sees and generates, so I didn't have to retokenize.

I had to make a custom event processor (also hardcoded in the agent-sdk submodule directory) so I can get the keys by doing this:

messages = [msg for msg in messages if msg["kind"] == "TokenEvent"]

I will prepare a PR for OpenHands so that this is cleaner. But for now, the code is doing what you suggested.
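
For illustration, a hedged sketch of how the per-step ids could then be pulled out of those events (the exact key names depend on the custom event processor and on the field-name discussion below):

# Assumes each TokenEvent carries the ids vLLM returned for that step;
# the prompt_token_ids / token_ids key names are an assumption about the
# custom event processor, not confirmed field names.
token_events = [msg for msg in messages if msg["kind"] == "TokenEvent"]
prompt_ids_list = [event["prompt_token_ids"] for event in token_events]
response_ids_list = [event["token_ids"] for event in token_events]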

@li-boxuan (Nov 5, 2025):

> I am hardcoding return_token_ids=True in the openhands submodule

Gotcha. You might want to try the litellm_extra_body={"return_token_ids": True} LLM config option - I think it should just work.

Also, shouldn't it be prompt_token_ids instead of prompt_tokens_ids, and token_ids instead of response_token_ids? Did you rename them in the agent-sdk submodule? (Sorry, I don't know how to view the diff there.)

Collaborator Author:

I see that this was a recent update (or at least new compared to the version of OpenHands I was using). Will make changes.


Yep, it was a recent update I made.


trajectory_ids = current_prompt_ids[initial_input_len:]
past_response_len = len(response_ids_list[idx-1]) if idx > 0 else 0
past_response_observation_ids = trajectory_ids[past_trajectory_len+past_response_len:]


Did you test whether this works? I was playing around with vLLM and it seems their token_ids field includes the EOS token at the end, which could make this offset off by one.

Collaborator Author:

It does not error out. Is the EOS supposed to be masked or not? I will need to adjust this accordingly.


The problem is not (just) about masking. trajectory_ids won't contain this EOS token in the middle, but response_ids does, which means you may need to change this into:

past_response_observation_ids = trajectory_ids[past_trajectory_len+past_response_len-1:]

if the finish reason was "stop".

That being said, I am not 100% sure on this.
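
For concreteness, a sketch of the conditional offset being described, assuming a hypothetical finish_reasons list recorded per step; per the follow-up below, the adjustment may turn out to be unnecessary:

# Hypothetical: finish_reasons[idx - 1] is the finish reason of the previous
# response. Drop one position only if that response ended with "stop" and its
# trailing EOS is therefore not echoed back inside trajectory_ids.
eos_adjust = 1 if idx > 0 and finish_reasons[idx - 1] == "stop" else 0
past_response_observation_ids = trajectory_ids[past_trajectory_len + past_response_len - eos_adjust:]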


I created an issue in vLLM for clarification: vllm-project/vllm#28185


Sorry, I might be totally wrong. I tested using the vLLM library directly and found EOS tokens appearing normally in prompt_token_ids.

Maybe LiteLLM was doing something weird, or maybe I did something stupid... let's ignore it for now.


My bad, I used the wrong chat template during testing. You should be good!

