feat(benchmark): add hle-text-only #81

ntudy · 2025-10-14T08:46:30Z

Describe this PR

What changed?

Why?

Related issues

Checklist for PR

Write a descriptive PR title following the Angular commit message format: <type>(<scope>): <subject>
- Examples: feat(agent): add pdf tool via mcp, perf: make llm client async, fix(utils): load custom config via importlib
- Valid types: feat, fix, docs, style, refactor, perf, test, build, ci, revert
- The check-pr-title CI job will validate your title format
- Bad title examples and why they fail:
  - Update README ❌ Missing type and colon
  - feat add new feature ❌ Missing colon after type
  - Feature: add new tool ❌ Invalid type (should be feat)
  - feat(Agent): add tool ❌ Scope should be lowercase
  - feat(): add tool ❌ Empty scope not allowed
  - feat(my_scope): add tool ❌ Underscores not allowed in scope
  - feat(my space): add tool ❌ Space not allowed in scope
  - feat(scope):add tool ❌ Missing space after colon
  - feat(scope): ❌ Empty subject
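
The title rules above can be approximated with one regular expression. This is a minimal sketch under stated assumptions, not the actual check-pr-title implementation, which is not shown in this PR. The names `VALID_TYPES`, `TITLE_RE`, and `is_valid_pr_title` are hypothetical, and so is the choice to allow digits and hyphens in the scope.

```python
import re

# Hypothetical pattern reconstructed from the checklist rules; the real
# check-pr-title CI job may be implemented differently.
VALID_TYPES = (
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "revert",
)

TITLE_RE = re.compile(
    r"^(?P<type>" + "|".join(VALID_TYPES) + r")"  # one of the valid types
    r"(?:\((?P<scope>[a-z0-9-]+)\))?"             # optional, non-empty, lowercase scope
    r": (?P<subject>\S.*)$"                       # colon, one space, non-empty subject
)


def is_valid_pr_title(title: str) -> bool:
    """Return True if the title matches the <type>(<scope>): <subject> shape."""
    return TITLE_RE.match(title) is not None
```

Under these assumptions, `is_valid_pr_title("feat(benchmark): add hle-text-only")` is accepted, while each of the bad titles listed above is rejected.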
Run lint and format locally:
- uv tool run [email protected] check --fix .
- uv tool run [email protected] format .
- The CI job lint enforces ruff's default format and lint rules on all new code.

Copilot

Pull Request Overview

This PR adds support for the HLE-text-only benchmark dataset by implementing data preparation, configuration, and documentation for evaluating text-only reasoning tasks. The implementation follows the existing pattern of other benchmark integrations in the codebase.

- Added HLE-text-only dataset preparation generator that loads and processes the text-only subset of HLE data
- Created benchmark configuration and agent configuration files for running evaluations with Claude 3.7 Sonnet
- Added comprehensive documentation with setup and usage instructions
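
For illustration only, a preparation generator of the kind described above might look like the sketch below. The contents of gen_hle_text_only.py are not shown on this page, so everything here is an assumption: the function name `gen_text_only`, the input field names (`id`, `question`, `answer`, `image`), the output keys, and the rule that an empty `image` value marks a text-only item. The sketch works on plain dictionaries and makes no HuggingFace calls.

```python
from typing import Any, Dict, Iterable, Iterator


def gen_text_only(records: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield simplified task dicts for records that carry no image.

    Assumption: a non-empty "image" value marks a multimodal question.
    All field names are hypothetical.
    """
    for record in records:
        if record.get("image"):
            continue  # skip items that depend on an image
        yield {
            "task_id": record["id"],
            "task_question": record["question"],
            "ground_truth": record["answer"],
        }
```

A caller would iterate over the result, for example `list(gen_text_only(rows))`, where `rows` is any iterable of dictionaries with those fields.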

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

File	Description
utils/prepare_benchmark/main.py	Added import and case handling for hle-text-only dataset preparation
utils/prepare_benchmark/gen_hle_text_only.py	New generator module that processes HLE text-only dataset from HuggingFace
scripts/run_prepare_benchmark.sh	Added command to prepare hle-text-only dataset
docs/mkdocs/docs/hle-text-only.md	Complete documentation for HLE-text-only benchmark usage
config/benchmark/hle-text-only.yaml	Benchmark configuration for HLE-text-only dataset
config/agent_hle-text-only_claude37sonnet.yaml	Agent configuration for running HLE-text-only with Claude 3.7 Sonnet

Copilot · 2025-10-14T08:46:51Z

utils/prepare_benchmark/gen_hle_text_only.py

+
+    return


The explicit return statement at the end of a generator function is unnecessary. Generator functions automatically return when they reach the end.

Suggested change

-    return
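
The reviewer's point can be checked with a small, self-contained comparison. The two functions below are hypothetical stand-ins, not code from this PR; they differ only in the trailing bare `return`.

```python
def with_trailing_return(items):
    """Generator that ends with an explicit bare return."""
    for item in items:
        yield item
    return  # redundant: falling off the end has the same effect


def without_trailing_return(items):
    """Same generator, with the trailing return removed."""
    for item in items:
        yield item
```

Both generators yield the same values and signal exhaustion to the caller in the same way, so removing the line does not change behavior.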

add hle-text-only

c8462cd

ntudy requested review from BinWang28 and Copilot October 14, 2025 08:46

Copilot AI reviewed Oct 14, 2025

add doc

c71b845

BinWang28 approved these changes Oct 14, 2025

BinWang28 merged commit 10d9ab6 into miroflow-v0.3 Oct 14, 2025
3 checks passed

BinWang28 deleted the add-hle-text branch October 14, 2025 09:14

3 participants