feat(benchmark): add hle #79

ntudy · 2025-10-14T07:39:15Z

Describe this PR

What changed?

Why?

Related issues

Checklist for PR

Write a descriptive PR title following the Angular commit message format: <type>(<scope>): <subject>
- Examples: feat(agent): add pdf tool via mcp, perf: make llm client async, fix(utils): load custom config via importlib
- Valid types: feat, fix, docs, style, refactor, perf, test, build, ci, revert
- The check-pr-title CI job will validate your title format
- Bad title examples and why they fail:
  - Update README ❌ Missing type and colon
  - feat add new feature ❌ Missing colon after type
  - Feature: add new tool ❌ Invalid type (should be feat)
  - feat(Agent): add tool ❌ Scope should be lowercase
  - feat(): add tool ❌ Empty scope not allowed
  - feat(my_scope): add tool ❌ Underscores not allowed in scope
  - feat(my space): add tool ❌ Space not allowed in scope
  - feat(scope):add tool ❌ Missing space after colon
  - feat(scope): ❌ Empty subject
Run lint and format locally:
- uv tool run ruff@<version> check --fix .
- uv tool run ruff@<version> format .
- The CI job lint enforces ruff's default format and lint rules on all new code.
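
As a rough illustration of the title rules in the checklist, the sketch below checks a title against a regular expression. The helper name `is_valid_pr_title` and the exact character set allowed in a scope (lowercase letters, digits, hyphens) are assumptions made for this example; the actual check-pr-title CI job may apply different rules.

```python
import re

# Valid types, taken from the checklist.
VALID_TYPES = ("feat", "fix", "docs", "style", "refactor", "perf", "test", "build", "ci", "revert")

# <type>(<scope>): <subject>, with the scope optional.
# Assumption: a scope is lowercase letters, digits, and hyphens only.
TITLE_PATTERN = re.compile(
    r"(?:" + "|".join(VALID_TYPES) + r")"  # type
    r"(?:\([a-z0-9-]+\))?"  # optional, non-empty scope
    r": "  # colon followed by a space
    r"\S.*"  # non-empty subject
)


def is_valid_pr_title(title: str) -> bool:
    """Return True if the title matches the assumed pattern (hypothetical helper)."""
    return TITLE_PATTERN.fullmatch(title) is not None


if __name__ == "__main__":
    for title in ("feat(benchmark): add hle", "feat(Agent): add tool", "feat(scope):add tool"):
        print(title, "->", is_valid_pr_title(title))
```

With this pattern, `feat(benchmark): add hle` passes, while `feat(Agent): add tool` and `feat(scope):add tool` are rejected.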

Copilot

Pull Request Overview

This PR adds support for the HLE (Humanity's Last Exam) benchmark, a multimodal dataset of expert-level questions across many subject areas, including questions that combine text and images.

- Adds HLE benchmark configuration and agent setup using Claude 3.7 Sonnet
- Creates comprehensive documentation with setup instructions and usage examples
- Integrates HLE into the existing benchmark navigation structure

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| docs/mkdocs/mkdocs.yml | Adds HLE to the navigation menu |
| docs/mkdocs/docs/hle.md | Comprehensive documentation for HLE benchmark setup and usage |
| config/benchmark/hle.yaml | Benchmark configuration file defining HLE-specific parameters |
| config/agent_hle_claude37sonnet.yaml | Agent configuration using Claude 3.7 Sonnet for HLE evaluation |

config/benchmark/hle.yaml

BinWang28 · 2025-10-14T07:41:30Z

docs/mkdocs/docs/hle.md

+### Step 3: Run the Evaluation
+
+```bash title="Run HLE Evaluation"
+uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"


`benchmark=hle`: remove, as it is not necessary.

BinWang28 · 2025-10-14T07:41:35Z

docs/mkdocs/docs/hle.md

+    Specify the same output directory to continue from where you left off:
+
+    ```bash
+    uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle output_dir="logs/hle/20251014_1504"


benchmark=hle

BinWang28 · 2025-10-14T07:42:05Z

docs/mkdocs/docs/hle.md

+### Test with Limited Tasks
+
+```bash
+uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle benchmark.execution.max_tasks=10 output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"


benchmark=hle

BinWang28 · 2025-10-14T07:42:14Z

docs/mkdocs/docs/hle.md

+### Adjust Concurrency
+
+```bash
+uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle benchmark.execution.max_concurrent=5 output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"


benchmark=hle

Copilot

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Copilot · 2025-10-14T07:53:18Z

config/benchmark/hle.yaml

@@ -0,0 +1,20 @@
+# config/benchmark/browsecomp-en.yaml


The comment header is incorrect - it references 'browsecomp-en.yaml' instead of 'hle.yaml'. This should be updated to reflect the actual file being configured.

Suggested change

-# config/benchmark/browsecomp-en.yaml
+# config/benchmark/hle.yaml

Co-authored-by: Copilot <[email protected]>

add hle

9a9c2e5

ntudy requested review from BinWang28 and Copilot October 14, 2025 07:39

Copilot AI reviewed Oct 14, 2025

BinWang28 reviewed Oct 14, 2025

BinWang28 approved these changes Oct 14, 2025

remove redundant code

7a2fd76

ntudy requested a review from Copilot October 14, 2025 07:52

Copilot AI reviewed Oct 14, 2025

ntudy and others added 2 commits October 14, 2025 15:54

Update config/benchmark/hle.yaml

62566e3

Co-authored-by: Copilot <[email protected]>

Update config/benchmark/hle.yaml

134543d

Co-authored-by: Copilot <[email protected]>

ntudy merged commit 33aeecc into miroflow-v0.3 Oct 14, 2025
3 checks passed

ntudy deleted the add-hle branch October 14, 2025 07:56
