I wanted to figure out whether OpenAI's smallest FOSS model (gpt-oss-20b) could understand GPS coordinates and reason about distances, so I vibe-coded a little eval framework that creates some simple scenarios and evaluates them deterministically by parsing the output string.
It's a very simple setup:
- the model receives a system string which is always identical -> "you are an agent that understands GPS and can reason about distances"
- it also receives a user string containing two coordinates and a distance query, deliberately phrased in a randomly awkward way.
- the model's output is parsed very tolerantly for anything that looks like a distance result in any unit, and this result is compared to the expected value (see the sketch right after this list).
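For reference, the deterministic scoring side looks roughly like this. This is a minimal sketch, not the repo's actual code: the regex, the unit table, and the 10% "good approximation" threshold are my assumptions here (only the <1% "correct" bucket matches the numbers reported below).

```python
import math
import re

# Ground truth: great-circle distance between two (lat, lon) points in km.
def haversine_km(lat1, lon1, lat2, lon2):
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Tolerant parsing: take the LAST number followed by a distance unit, normalise to km.
UNIT_TO_KM = {
    "km": 1.0, "kilometers": 1.0, "kilometres": 1.0,
    "meters": 0.001, "metres": 0.001, "m": 0.001,
    "miles": 1.609344, "mi": 1.609344,
    "nm": 1.852,  # treating nm as nautical miles (assumption)
}
UNIT_RE = re.compile(
    r"([\d][\d,]*\.?\d*)\s*(kilometers|kilometres|km|meters|metres|miles|mi|nm|m)\b",
    re.IGNORECASE,
)

def parse_distance_km(text):
    matches = UNIT_RE.findall(text)
    if not matches:
        return None  # counts as "no answer at all"
    value, unit = matches[-1]
    return float(value.replace(",", "")) * UNIT_TO_KM[unit.lower()]

def score(predicted_km, expected_km):
    if predicted_km is None:
        return "no_answer"
    rel_err = abs(predicted_km - expected_km) / expected_km
    if rel_err < 0.01:
        return "correct"      # the "<1% error" bucket
    if rel_err < 0.10:
        return "good_approx"  # assumed threshold for "good approximation"
    return "wrong"
```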
I made 2 versions of the eval script: one follows OpenAI's "Harmony" format as recommended in the model card and documentation, and the second just uses model.generate from Transformers, slinging raw tokens at the FOSS model to see what comes out 😄
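In Transformers terms, the two paths look roughly like this. A minimal sketch, not the repo's exact code: the example coordinates and token budget are placeholders, and the Harmony formatting here goes through the tokenizer's built-in chat template rather than the standalone harmony library.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

system = "you are an agent that understands GPS and can reason about distances"
query = "How far is 48.8566, 2.3522 from 51.5074, -0.1278?"  # made-up example query

# Path 1 (Harmony): the chat template renders the Harmony channel/message tokens.
messages = [{"role": "system", "content": system},
            {"role": "user", "content": query}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
harmony_out = model.generate(inputs, max_new_tokens=2048)

# Path 2 (raw token slinging): plain concatenated text, no chat formatting at all.
raw_inputs = tok(system + "\n\n" + query, return_tensors="pt").to(model.device)
raw_out = model.generate(**raw_inputs, max_new_tokens=2048)

print(tok.decode(raw_out[0], skip_special_tokens=True))
```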
Now the interesting thing is this: the "slinging tokens approach" works WILDLY better for the purpose of this eval 🤯
It gets a whopping 68% of distances to within 1% error, and produces a "good approximation" of the distance for another 17%! So 85% of measurements are pretty good, from a 20B model that has not been tuned on this task!
I was honestly shocked how well the model performed.
Conversely, with the Harmony format the model did very badly, rarely producing anything at all that could even be interpreted as a distance, and often getting lost in rambling.
It did not do better when I put a hard limit on analysis tokens to prevent it spiraling out of control, or when I gave it 10x the output-token limit (max 12K, so 6x the tokens vs the benchmark run).
As you can see, the difference is massive: using Harmony, only 18% of the test set was correct (vs 68%) at "high" reasoning effort, and "medium"/"low" scores are too low to even report (basically 0 for both). Basically unusable, which to be fair is what I was expecting for this task… but this bad performance seems to be uniquely linked to using the Harmony format!
Looking at the raw output, Harmony causes the model to ramble quite often... and when it rambles like this it just loses track of the question and never formulates an answer, even with a huge token budget (tested up to 12K output tokens).
But what's interesting is that forcing the model to switch from the "analysis" channel to "final" content after a set number of tokens doesn't make any difference: the Harmony-formatted inference never recovers once it goes off the rails (a sketch of that forcing trick is below).
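For anyone curious what I mean by forcing the switch, this is roughly the idea. A sketch under loud assumptions: it continues from the snippet above (`model`, `tok`, `inputs`), the 512-token analysis budget is a placeholder, and it assumes the tokenizer maps the Harmony marker strings (`<|end|>`, `<|start|>`, `<|channel|>`, `<|message|>`) back to their special token ids.

```python
import torch

ANALYSIS_BUDGET = 512  # placeholder cap, not a value from my runs

# Let the model think for a bounded number of tokens.
out = model.generate(inputs, max_new_tokens=ANALYSIS_BUDGET)
generated = tok.decode(out[0][inputs.shape[-1]:])

if "<|channel|>final" not in generated:
    # Cut the analysis short and splice in a final-channel header so the
    # next generate() call can only continue as the final answer.
    header = tok("<|end|><|start|>assistant<|channel|>final<|message|>",
                 return_tensors="pt").input_ids.to(model.device)
    out = model.generate(torch.cat([out, header], dim=-1), max_new_tokens=512)

print(tok.decode(out[0][inputs.shape[-1]:]))
```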
Anyway, no real conclusions here yet… it's just interesting to me how the Harmony format actually messes up reasoning for this specific model in this specific, narrow, use case.
Next step: training the model on this use case with and without Harmony format!
Input format:

Result without Harmony:

Result with Harmony:

Use the code from here if you want to reproduce this or test it yourself:
https://github.com/stephansturges/gpt-oss-20b_GPS_distance_understanding_finetune/