fix(benchmark): fix hle text only #87

ntudy · 2025-10-16T07:36:19Z

Describe this PR

What changed?

Why?

Related issues

Checklist for PR

Write a descriptive PR title following the Angular commit message format: <type>(<scope>): <subject>
- Examples: feat(agent): add pdf tool via mcp, perf: make llm client async, fix(utils): load custom config via importlib
- Valid types: feat, fix, docs, style, refactor, perf, test, build, ci, revert
- The check-pr-title CI job will validate your title format
- Bad title examples and why they fail:
  - Update README ❌ Missing type and colon
  - feat add new feature ❌ Missing colon after type
  - Feature: add new tool ❌ Invalid type (should be feat)
  - feat(Agent): add tool ❌ Scope should be lowercase
  - feat(): add tool ❌ Empty scope not allowed
  - feat(my_scope): add tool ❌ Underscores not allowed in scope
  - feat(my space): add tool ❌ Space not allowed in scope
  - feat(scope):add tool ❌ Missing space after colon
  - feat(scope): ❌ Empty subject
Run lint and format locally:
- uv tool run ruff@<version> check --fix .
- uv tool run ruff@<version> format .
- The CI lint job enforces ruff's default format and lint rules on all new code.

Copilot

Pull Request Overview

This PR updates the HLE text-only benchmark data loading mechanism by switching from the HuggingFace dataset to the raw JSON data source from the WebThinker repository. The change removes the dependency on the HuggingFace datasets library for this specific benchmark and updates field names to match the new data source format.

- Replaced HuggingFace dataset loading with direct HTTP request to WebThinker GitHub repository
- Updated field name mappings (question → Question, answer → answer, etc.)
- Updated documentation to clarify the dataset source and size

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| utils/prepare_benchmark/gen_hle_text_only.py | Replaced dataset loading logic with direct JSON fetch from WebThinker repository and updated field name handling |
| docs/mkdocs/docs/hle_text_only.md | Updated dataset description to reference WebThinker source and specify 500 text-only subset |


Copilot · 2025-10-16T07:36:53Z

utils/prepare_benchmark/gen_hle_text_only.py

-        metadata.pop("rationale_image")
+    response = requests.get(
+        "https://raw.githubusercontent.com/RUC-NLPIR/WebThinker/refs/heads/main/data/HLE/test.json"
+    )


Missing error handling for the HTTP request. If the request fails or returns non-200 status, the code will fail silently or raise an unclear exception. Add status code checking and provide a clear error message if the request fails.

Suggested change

-    )
+    )
+    response.raise_for_status()
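A minimal sketch of how this check could look if the download were wrapped in a helper. The names `fetch_rows` and `http_get` and the 30-second timeout are illustrative assumptions, not part of this PR; `http_get` defaults to `requests.get` and exists only so the helper can be exercised without network access.

```python
from typing import Any, Callable, Optional

HLE_URL = (
    "https://raw.githubusercontent.com/RUC-NLPIR/WebThinker/"
    "refs/heads/main/data/HLE/test.json"
)


def fetch_rows(
    url: str = HLE_URL,
    http_get: Optional[Callable[..., Any]] = None,
) -> Any:
    """Download the benchmark JSON and fail loudly on HTTP errors."""
    if http_get is None:
        import requests  # imported lazily so the helper can be tested offline

        http_get = requests.get
    # The timeout is an assumed value; requests applies no timeout by default.
    response = http_get(url, timeout=30)
    # Raises on 4xx/5xx responses instead of handing an error page
    # to the JSON parser.
    response.raise_for_status()
    return response.json()
```

With this shape, a failed download should surface as an HTTP error that names the status code rather than as a JSON decoding error further down.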

Copilot · 2025-10-16T07:36:53Z

utils/prepare_benchmark/gen_hle_text_only.py

+    response = requests.get(
+        "https://raw.githubusercontent.com/RUC-NLPIR/WebThinker/refs/heads/main/data/HLE/test.json"
+    )
+    dataset = json.loads(response.content)


Using response.content (bytes) with json.loads() may cause issues. Use response.json() instead, which handles decoding automatically and is the standard approach for JSON responses.

Suggested change

-    dataset = json.loads(response.content)
+    dataset = response.json()
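For reference, `json.loads` has accepted `bytes` input since Python 3.6 (encoded as UTF-8, UTF-16, or UTF-32), so both spellings should parse a UTF-8 payload to the same result, and the suggestion is mostly about using the idiomatic `requests` call. A small offline sketch with a made-up payload (the sample row is hypothetical, not taken from the dataset):

```python
import json

# Hypothetical sample payload; the row content is illustrative only.
payload = '[{"id": 1, "Question": "2 + 2?", "answer": "4"}]'.encode("utf-8")

rows_from_bytes = json.loads(payload)                 # bytes input, as in the PR
rows_from_text = json.loads(payload.decode("utf-8"))  # explicit decode first

# Both paths yield the same Python objects for UTF-8 encoded JSON.
assert rows_from_bytes == rows_from_text
```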

Copilot · 2025-10-16T07:36:54Z

utils/prepare_benchmark/gen_hle_text_only.py

+    dataset = json.loads(response.content)
+    for row in dataset:
+        metadata: MutableMapping = row
+        task_id = str(metadata.pop("id", ""))


Converting the ID to string with an empty string default may create invalid task IDs if 'id' is missing. Consider raising an error or using a more meaningful default value to ensure task_id is always valid.

Suggested change

-        task_id = str(metadata.pop("id", ""))
+        try:
+            task_id = str(metadata.pop("id"))
+        except KeyError:
+            raise ValueError(f"Missing 'id' field in row: {row}")
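The same idea as a standalone helper, as a sketch. The name `pop_task_id` is hypothetical, and `from None` is an addition beyond the suggestion above: it suppresses the chained `KeyError` so the traceback shows a single error.

```python
from typing import Any, MutableMapping


def pop_task_id(row: MutableMapping[str, Any]) -> str:
    """Remove and return the row's id as a string, rejecting missing ids."""
    try:
        return str(row.pop("id"))
    except KeyError:
        # Fail early rather than writing a task with an empty id.
        raise ValueError(f"Missing 'id' field in row: {row}") from None
```

Calling `pop_task_id({"id": 7, "answer": "4"})` returns `"7"` and leaves the remaining keys in the row as metadata; a row without `id` raises `ValueError`.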

fix hle text only

e0926c9

ntudy requested review from BinWang28 and Copilot October 16, 2025 07:36

Copilot AI reviewed Oct 16, 2025


BinWang28 approved these changes Oct 16, 2025


BinWang28 merged commit e1cbf9a into miroflow-v0.3 Oct 16, 2025
3 checks passed

BinWang28 deleted the fix-hle-text branch October 16, 2025 10:51
