@ntudy (Contributor) commented Oct 14, 2025

Describe this PR

What changed?

Why?

Related issues

Checklist for PR

  • Write a descriptive PR title following the Angular commit message format: <type>(<scope>): <subject>

    • Examples: feat(agent): add pdf tool via mcp, perf: make llm client async, fix(utils): load custom config via importlib
    • Valid types: feat, fix, docs, style, refactor, perf, test, build, ci, revert
    • The check-pr-title CI job will validate your title format
    • Bad title examples and why they fail:
      • Update README ❌ Missing type and colon
      • feat add new feature ❌ Missing colon after type
      • Feature: add new tool ❌ Invalid type (should be feat)
      • feat(Agent): add tool ❌ Scope should be lowercase
      • feat(): add tool ❌ Empty scope not allowed
      • feat(my_scope): add tool ❌ Underscores not allowed in scope
      • feat(my space): add tool ❌ Space not allowed in scope
      • feat(scope):add tool ❌ Missing space after colon
      • feat(scope): ❌ Empty subject
  • Run lint and format locally:
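The title-format rules in the checklist above can be sketched as a single regular expression. This is only an illustration: the actual `check-pr-title` CI job is not shown in this thread and may be implemented differently. The function name `check_pr_title` and the assumption that a scope consists of lowercase letters, digits, and hyphens are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical validator sketch; not the project's real check-pr-title job.

# Valid types, taken from the checklist.
valid_types='feat|fix|docs|style|refactor|perf|test|build|ci|revert'

# type, optional non-empty scope, colon, one space, non-empty subject.
# Assumption: a scope is lowercase letters, digits, and hyphens only.
title_re="^(${valid_types})(\([a-z0-9-]+\))?: [^[:space:]].*$"

# Returns 0 when the title matches the format, 1 otherwise.
check_pr_title() {
  [[ "$1" =~ $title_re ]]
}

# Try a few titles from the checklist.
for title in \
  'feat(agent): add pdf tool via mcp' \
  'perf: make llm client async' \
  'Update README' \
  'feat(): add tool' \
  'feat(scope):add tool'
do
  if check_pr_title "$title"; then
    printf 'accepted: %s\n' "$title"
  else
    printf 'rejected: %s\n' "$title"
  fi
done
```

Anchoring the pattern at both ends is what rejects titles such as `Feature: add new tool`, where a valid type appears only as a substring.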

@ntudy ntudy requested review from BinWang28 and Copilot October 14, 2025 07:39
Copilot AI (Contributor) left a comment

Pull Request Overview

This PR adds support for the HLE (Humanity's Last Exam) benchmark, a multimodal dataset of expert-level questions that tests AI systems' reasoning across vision and language tasks.

  • Adds HLE benchmark configuration and agent setup using Claude 3.7 Sonnet
  • Creates comprehensive documentation with setup instructions and usage examples
  • Integrates HLE into the existing benchmark navigation structure

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| docs/mkdocs/mkdocs.yml | Adds HLE to the navigation menu |
| docs/mkdocs/docs/hle.md | Comprehensive documentation for HLE benchmark setup and usage |
| config/benchmark/hle.yaml | Benchmark configuration file defining HLE-specific parameters |
| config/agent_hle_claude37sonnet.yaml | Agent configuration using Claude 3.7 Sonnet for HLE evaluation |

### Step 3: Run the Evaluation

```bash title="Run HLE Evaluation"
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```

Member commented:

`benchmark=hle`: remove, as it is not necessary.

Specify the same output directory to continue from where you left off:

```bash
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle output_dir="logs/hle/20251014_1504"
```

Member commented:

benchmark=hle

### Test with Limited Tasks

```bash
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle benchmark.execution.max_tasks=10 output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```

Member commented:

benchmark=hle

### Adjust Concurrency

```bash
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle benchmark.execution.max_concurrent=5 output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```

Member commented:

benchmark=hle
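
The reviewer's suggestion across these four hunks can be sketched as a small helper that assembles the command without the `benchmark=hle` override. This assumes the `agent_hle_claude37sonnet` config already selects the HLE benchmark, which the thread implies but does not show. `hle_cmd` is a hypothetical helper name, and it only prints the command instead of running it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the evaluation command with optional overrides.
# Assumption: agent_hle_claude37sonnet already selects the HLE benchmark,
# so benchmark=hle is omitted as the reviewer suggests.
hle_cmd() {
  local output_dir="$1"; shift
  local cmd=(uv run main.py common-benchmark
             --config_file_name=agent_hle_claude37sonnet)
  cmd+=("$@")                         # extra overrides, if any
  cmd+=("output_dir=${output_dir}")
  printf '%s\n' "${cmd[*]}"
}

# Fresh run in a timestamped directory.
hle_cmd "logs/hle/$(date +"%Y%m%d_%H%M")"

# Resume: pass the same directory as the earlier run.
hle_cmd "logs/hle/20251014_1504"

# Limited test run and adjusted concurrency, using the overrides from the docs.
hle_cmd "logs/hle/20251014_1504" benchmark.execution.max_tasks=10
hle_cmd "logs/hle/20251014_1504" benchmark.execution.max_concurrent=5
```

Printing the command first makes it easy to confirm the overrides before starting a long evaluation run.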

@ntudy ntudy requested a review from Copilot October 14, 2025 07:52
Copilot AI (Contributor) left a comment

Pull Request Overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.


@@ -0,0 +1,20 @@
# config/benchmark/browsecomp-en.yaml

Copilot AI commented Oct 14, 2025

The comment header is incorrect: it references 'browsecomp-en.yaml' instead of 'hle.yaml'. It should be updated to name the file actually being configured.

Suggested change
- # config/benchmark/browsecomp-en.yaml
+ # config/benchmark/hle.yaml

@ntudy ntudy merged commit 33aeecc into miroflow-v0.3 Oct 14, 2025
3 checks passed
@ntudy ntudy deleted the add-hle branch October 14, 2025 07:56