feat(benchmark): add hle #79
Pull Request Overview
This PR adds support for the HLE (Humanity's Last Exam) benchmark, a multimodal benchmark of expert-level questions that span a broad range of subjects and include both text-only and image-based items.
- Adds HLE benchmark configuration and agent setup using Claude 3.7 Sonnet
- Creates comprehensive documentation with setup instructions and usage examples
- Integrates HLE into the existing benchmark navigation structure
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| docs/mkdocs/mkdocs.yml | Adds HLE to the navigation menu |
| docs/mkdocs/docs/hle.md | Comprehensive documentation for HLE benchmark setup and usage |
| config/benchmark/hle.yaml | Benchmark configuration file defining HLE-specific parameters |
| config/agent_hle_claude37sonnet.yaml | Agent configuration using Claude 3.7 Sonnet for HLE evaluation |
docs/mkdocs/docs/hle.md
Outdated
### Step 3: Run the Evaluation

```bash title="Run HLE Evaluation"
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```
Remove `benchmark=hle`; it is not necessary.
docs/mkdocs/docs/hle.md
Outdated
Specify the same output directory to continue from where you left off:

```bash
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle output_dir="logs/hle/20251014_1504"
```
`benchmark=hle` is not necessary here either; remove it.
docs/mkdocs/docs/hle.md
Outdated
### Test with Limited Tasks

```bash
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle benchmark.execution.max_tasks=10 output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```
`benchmark=hle` is not necessary here either; remove it.
docs/mkdocs/docs/hle.md
Outdated
### Adjust Concurrency

```bash
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle benchmark.execution.max_concurrent=5 output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```
`benchmark=hle` is not necessary here either; remove it.
Pull Request Overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
config/benchmark/hle.yaml
Outdated
@@ -0,0 +1,20 @@
+ # config/benchmark/browsecomp-en.yaml
Copilot AI, Oct 14, 2025
The comment header is incorrect - it references 'browsecomp-en.yaml' instead of 'hle.yaml'. This should be updated to reflect the actual file being configured.
Suggested change:
- # config/benchmark/browsecomp-en.yaml
+ # config/benchmark/hle.yaml
Co-authored-by: Copilot <[email protected]>
Describe this PR
What changed?
Why?
Related issues
Checklist for PR
- Write a descriptive PR title following the Angular commit message format: `<type>(<scope>): <subject>`
  - Examples: `feat(agent): add pdf tool via mcp`, `perf: make llm client async`, `fix(utils): load custom config via importlib`
  - Valid types: `feat`, `fix`, `docs`, `style`, `refactor`, `perf`, `test`, `build`, `ci`, `revert`
  - The `check-pr-title` CI job will validate your title format
  - Bad title examples and why they fail:

| Title | Problem |
|---|---|
| `Update README` | ❌ Missing type and colon |
| `feat add new feature` | ❌ Missing colon after type |
| `Feature: add new tool` | ❌ Invalid type (should be `feat`) |
| `feat(Agent): add tool` | ❌ Scope should be lowercase |
| `feat(): add tool` | ❌ Empty scope not allowed |
| `feat(my_scope): add tool` | ❌ Underscores not allowed in scope |
| `feat(my space): add tool` | ❌ Space not allowed in scope |
| `feat(scope):add tool` | ❌ Missing space after colon |
| `feat(scope):` | ❌ Empty subject |

- Run lint and format locally:
  - `uv tool run [email protected] check --fix .`
  - `uv tool run [email protected] format .`
  - `lint` enforces ruff's default format/lint rules on all new code.
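The title rules in the checklist can be approximated with a short validator, which may be handy for checking a title before pushing. This is only a sketch: the function name `is_valid_pr_title` and the regular expression are assumptions made for illustration, and the project's actual `check-pr-title` job may be implemented differently. In particular, the set of characters allowed in a scope (lowercase letters, digits, hyphens) is a guess; the checklist only states that a scope must be lowercase, non-empty, and free of underscores and spaces.

```python
import re

# Valid types, as listed in the PR checklist.
VALID_TYPES = (
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "revert",
)

# <type>(<scope>): <subject>, where "(<scope>)" is optional.
# Assumption: a scope consists of lowercase letters, digits, and hyphens.
TITLE_RE = re.compile(
    r"^(?P<type>" + "|".join(VALID_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9-]+)\))?"   # optional, non-empty, lowercase scope
    r": (?P<subject>\S.*)$"             # colon, one space, non-empty subject
)


def is_valid_pr_title(title: str) -> bool:
    """Return True if the title matches the (assumed) Angular-style pattern."""
    return TITLE_RE.match(title) is not None


if __name__ == "__main__":
    for candidate in ("feat(benchmark): add hle", "Update README"):
        print(candidate, "->", is_valid_pr_title(candidate))
```

Under these assumptions, the title of this PR, `feat(benchmark): add hle`, passes, while each of the bad examples in the checklist is rejected.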