Commit 9a9c2e5

Author: Yue Deng
Commit message: add hle
1 parent f922bfe commit 9a9c2e5

File tree: 4 files changed, +198 -0 lines changed

Lines changed: 78 additions & 0 deletions

```yaml
defaults:
  - benchmark: hle
  - override hydra/job_logging: none
  - _self_  # Allow defining variables at the top of this file


main_agent:
  prompt_class: MainAgentPrompt_GAIA
  llm:
    provider_class: "ClaudeOpenRouterClient"
    model_name: "anthropic/claude-3.7-sonnet"
    async_client: true
    temperature: 0.3
    top_p: 0.95
    min_p: 0.0
    top_k: -1
    max_tokens: 32000
    openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
    openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
    openrouter_provider: "anthropic"
    disable_cache_control: false
    keep_tool_result: -1
    oai_tool_thinking: false

  tool_config:
    - tool-reasoning

  max_turns: 50  # Maximum number of turns for main agent execution
  max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn

  input_process:
    hint_generation: true
    hint_llm_base_url: "${oc.env:HINT_LLM_BASE_URL,https://api.openai.com/v1}"
  output_process:
    final_answer_extraction: true
    final_answer_llm_base_url: "${oc.env:FINAL_ANSWER_LLM_BASE_URL,https://api.openai.com/v1}"

  openai_api_key: "${oc.env:OPENAI_API_KEY,???}"  # used for hint generation and final answer extraction
  add_message_id: true
  keep_tool_result: -1
  chinese_context: "${oc.env:CHINESE_CONTEXT,false}"


sub_agents:
  agent-worker:
    prompt_class: SubAgentWorkerPrompt
    llm:
      provider_class: "ClaudeOpenRouterClient"
      model_name: "anthropic/claude-3.7-sonnet"
      async_client: true
      temperature: 0.3
      top_p: 0.95
      min_p: 0.0
      top_k: -1
      max_tokens: 32000
      openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
      openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
      openrouter_provider: "anthropic"
      disable_cache_control: false
      keep_tool_result: -1
      oai_tool_thinking: false

    tool_config:
      - tool-searching
      - tool-image-video
      - tool-reading
      - tool-code
      - tool-audio

    max_turns: 50  # Maximum number of turns for sub-agent execution
    max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn


# Can define some top-level or default parameters here
output_dir: logs/
data_dir: "${oc.env:DATA_DIR,data}"  # Points to where data is stored
```
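The `${oc.env:NAME,default}` entries in this config are OmegaConf environment-variable resolvers: at load time each is replaced by the value of the named environment variable, or by the default after the comma. As a rough illustration of the substitution rule only (this is not OmegaConf's actual implementation, which also handles nesting and escaping), a stdlib sketch:

```python
import os
import re

# Illustrative only: mimics the ${oc.env:NAME,default} resolver syntax used
# in the config above. Real resolution is done by OmegaConf at load time.
_ENV_PATTERN = re.compile(r"\$\{oc\.env:([A-Z_][A-Z0-9_]*),([^}]*)\}")

def resolve_env(value: str) -> str:
    """Replace each ${oc.env:NAME,default} with env var NAME, or the default."""
    def _sub(match: re.Match) -> str:
        name, default = match.group(1), match.group(2)
        return os.environ.get(name, default)
    return _ENV_PATTERN.sub(_sub, value)

print(resolve_env("${oc.env:SOME_UNSET_KEY,fallback}"))  # -> fallback
```

Note that `???` as the default (as in `openrouter_api_key`) is Hydra's "mandatory value" marker, so a missing `OPENROUTER_API_KEY` fails loudly rather than silently defaulting.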

config/benchmark/hle.yaml

Lines changed: 20 additions & 0 deletions

```yaml
# config/benchmark/hle.yaml
defaults:
  - default
  - _self_

name: "hle"

data:
  data_dir: "${data_dir}/hle"  # Path to hle dataset
  metadata_file: "standardized_data.jsonl"  # Metadata filename
  whitelist: []  # Optional: List of specific task_ids to run

execution:
  max_tasks: null  # null = no limit, or specify a number
  max_concurrent: 10  # Number of parallel tasks
  pass_at_k: 1  # Number of attempts per task

# OpenAI API key for evaluation (required for HLE since it has ground truth)
openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```
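The `pass_at_k` setting controls how many attempts each task gets; under pass@k scoring a task counts as solved if any of its k attempts is correct. A small sketch of that scoring rule (illustrative, not the repository's evaluator):

```python
def pass_at_k_accuracy(attempts_per_task: list[list[bool]]) -> float:
    """Fraction of tasks with at least one correct attempt.

    attempts_per_task[i] holds per-attempt correctness flags for task i
    (k entries when pass_at_k = k; a single entry when pass_at_k = 1).
    """
    if not attempts_per_task:
        return 0.0
    solved = sum(any(attempts) for attempts in attempts_per_task)
    return solved / len(attempts_per_task)

# With pass_at_k = 1 each task has exactly one attempt:
print(pass_at_k_accuracy([[True], [False], [True], [True]]))  # -> 0.75
```

With `pass_at_k: 1`, as configured here, pass@1 accuracy is simply the fraction of tasks answered correctly on the single attempt.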

docs/mkdocs/docs/hle.md

Lines changed: 99 additions & 0 deletions

# HLE

MiroFlow's evaluation on the HLE benchmark demonstrates capabilities in multimodal reasoning and question answering tasks that require human-level understanding across vision and language.

More details: [HLE Dataset on HuggingFace](https://huggingface.co/datasets/cais/hle)

---

## Dataset Overview

!!! info "HLE Dataset"
    The HLE dataset consists of challenging multimodal tasks that test AI systems' ability to perform human-level reasoning with both visual and textual information.

!!! abstract "Key Dataset Characteristics"

    - **Total Tasks**: Test split from HuggingFace `cais/hle` dataset
    - **Task Type**: Multimodal question answering and reasoning
    - **Modalities**: Text + Images
    - **Ground Truth**: Available for evaluation

---

## Quick Start Guide

### Step 1: Prepare the HLE Dataset

```bash title="Download HLE Dataset"
uv run main.py prepare-benchmark get hle
```

This will download the dataset and save images to `data/hle/images/`.
### Step 2: Configure API Keys

```env title=".env Configuration"
# For searching and web scraping
SERPER_API_KEY="xxx"
JINA_API_KEY="xxx"

# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"

# Claude-3.7-Sonnet via OpenRouter
OPENROUTER_API_KEY="xxx"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Vision understanding
ANTHROPIC_API_KEY="xxx"
GEMINI_API_KEY="xxx"

# Hint generation and final answer extraction
OPENAI_API_KEY="xxx"
OPENAI_BASE_URL="https://api.openai.com/v1"
```
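A `.env` file like the one above is a plain `KEY="value"` file, conventionally loaded by tools such as python-dotenv. For a sense of the format, a minimal stdlib reader (illustrative sketch, not what MiroFlow itself uses):

```python
import os

def load_dotenv(path: str = ".env") -> dict[str, str]:
    """Parse a simple KEY="value" .env file and export it into os.environ."""
    loaded: dict[str, str] = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blank lines and comments
            key, _, value = line.partition("=")
            loaded[key.strip()] = value.strip().strip('"')
    os.environ.update(loaded)
    return loaded
```

Keys exported this way are then visible to the `${oc.env:...}` interpolations in the agent config.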
### Step 3: Run the Evaluation

```bash title="Run HLE Evaluation"
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```

!!! tip "Resume Interrupted Evaluation"
    Specify the same output directory to continue from where you left off:

    ```bash
    uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle output_dir="logs/hle/20251014_1504"
    ```
### Step 4: Review Results

```bash title="Check Results"
# View accuracy summary
cat logs/hle/*/benchmark_results_pass_at_1_accuracy.txt

# View detailed results
cat logs/hle/*/benchmark_results.jsonl
```
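For post-processing `benchmark_results.jsonl` beyond `cat`, a few lines of stdlib Python suffice. The field names below (`task_id`, `judge_result`) are assumptions for illustration; check the keys in your own run's JSONL before relying on them:

```python
import json
from pathlib import Path

def summarize_results(path: str) -> dict:
    """Count correct/total records in a benchmark_results.jsonl file.

    Assumes one JSON object per line with a `judge_result` field whose
    value is "CORRECT" for solved tasks; adjust to the real schema.
    """
    total = correct = 0
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        total += 1
        if record.get("judge_result") == "CORRECT":
            correct += 1
    return {"total": total, "correct": correct,
            "accuracy": correct / total if total else 0.0}
```

For example, `summarize_results("logs/hle/20251014_1504/benchmark_results.jsonl")` would return the task count and accuracy for that run.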
---

## Usage Examples

### Test with Limited Tasks

```bash
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle benchmark.execution.max_tasks=10 output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```
### Adjust Concurrency

```bash
uv run main.py common-benchmark --config_file_name=agent_hle_claude37sonnet benchmark=hle benchmark.execution.max_concurrent=5 output_dir="logs/hle/$(date +"%Y%m%d_%H%M")"
```
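`max_concurrent` caps how many tasks run in parallel. The usual asyncio pattern for such a cap is a semaphore; a minimal sketch of the idea (illustrative, not MiroFlow's actual scheduler):

```python
import asyncio

async def run_all(task_ids: list[str], max_concurrent: int = 5) -> list[str]:
    """Run tasks with at most `max_concurrent` in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(task_id: str) -> str:
        async with semaphore:  # waits while max_concurrent tasks are active
            await asyncio.sleep(0)  # stand-in for the real benchmark task
            return task_id

    # gather preserves input order regardless of completion order
    return await asyncio.gather(*(run_one(t) for t in task_ids))

print(asyncio.run(run_all([f"task-{i}" for i in range(3)])))
# -> ['task-0', 'task-1', 'task-2']
```

Lowering `max_concurrent` trades throughput for gentler API rate-limit pressure.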
---

!!! info "Documentation Info"
    **Last Updated:** October 2025 · **Doc Contributor:** Team @ MiroMind AI
docs/mkdocs/mkdocs.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -64,6 +64,7 @@ nav:
     - FutureX: futurex.md
     - xBench-DeepSearch: xbench_ds.md
     - FinSearchComp: finsearchcomp.md
+    - HLE: hle.md

 # - Benchmarks:
 #   - GAIA-Validation-Text-Only: gaia_validation_text_only.md
```
