-
Notifications
You must be signed in to change notification settings - Fork 153
fix(benchmark): fix hle text only #87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR updates the HLE text-only benchmark data loading mechanism by switching from the HuggingFace dataset to the raw JSON data source from the WebThinker repository. The change removes the dependency on the HuggingFace datasets library for this specific benchmark and updates field names to match the new data source format.
- Replaced HuggingFace dataset loading with direct HTTP request to WebThinker GitHub repository
- Updated field name mappings (
question→Question,answer→answer, etc.) - Updated documentation to clarify the dataset source and size
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| utils/prepare_benchmark/gen_hle_text_only.py | Replaced dataset loading logic with direct JSON fetch from WebThinker repository and updated field name handling |
| docs/mkdocs/docs/hle_text_only.md | Updated dataset description to reference WebThinker source and specify 500 text-only subset |
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
| metadata.pop("rationale_image") | ||
| response = requests.get( | ||
| "https://raw.githubusercontent.com/RUC-NLPIR/WebThinker/refs/heads/main/data/HLE/test.json" | ||
| ) |
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing error handling for the HTTP request. If the request fails or returns non-200 status, the code will fail silently or raise an unclear exception. Add status code checking and provide a clear error message if the request fails.
| ) | |
| ) | |
| response.raise_for_status() |
| response = requests.get( | ||
| "https://raw.githubusercontent.com/RUC-NLPIR/WebThinker/refs/heads/main/data/HLE/test.json" | ||
| ) | ||
| dataset = json.loads(response.content) |
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using response.content (bytes) with json.loads() may cause issues. Use response.json() instead, which handles decoding automatically and is the standard approach for JSON responses.
| dataset = json.loads(response.content) | |
| dataset = response.json() |
| dataset = json.loads(response.content) | ||
| for row in dataset: | ||
| metadata: MutableMapping = row | ||
| task_id = str(metadata.pop("id", "")) |
Copilot
AI
Oct 16, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting the ID to string with an empty string default may create invalid task IDs if 'id' is missing. Consider raising an error or using a more meaningful default value to ensure task_id is always valid.
| task_id = str(metadata.pop("id", "")) | |
| try: | |
| task_id = str(metadata.pop("id")) | |
| except KeyError: | |
| raise ValueError(f"Missing 'id' field in row: {row}") |
Describe this PR
What changed?
Why?
Related issues
Checklist for PR
Write a descriptive PR title following the Angular commit message format:
<type>(<scope>): <subject>feat(agent): add pdf tool via mcp,perf: make llm client async,fix(utils): load custom config via importlibfeat,fix,docs,style,refactor,perf,test,build,ci,revertcheck-pr-titleCI job will validate your title formatUpdate README❌ Missing type and colonfeat add new feature❌ Missing colon after typeFeature: add new tool❌ Invalid type (should befeat)feat(Agent): add tool❌ Scope should be lowercasefeat(): add tool❌ Empty scope not allowedfeat(my_scope): add tool❌ Underscores not allowed in scopefeat(my space): add tool❌ Space not allowed in scopefeat(scope):add tool❌ Missing space after colonfeat(scope):❌ Empty subjectRun lint and format locally:
uv tool run [email protected] check --fix .uv tool run [email protected] format .lintenforces ruff default format/lint rules on all new codes.