Commit 10d9ab6

ntudy and Yue Deng authored

feat(benchmark): add hle-text-only (#81)

* add hle-text-only
* add doc

Co-authored-by: Yue Deng <[email protected]>

1 parent c7de7f8 · commit 10d9ab6

File tree: 7 files changed, +231 −0 lines changed
Lines changed: 78 additions & 0 deletions

```yaml
defaults:
  - benchmark: hle-text-only
  - override hydra/job_logging: none
  - _self_  # Allow defining variables at the top of this file


main_agent:
  prompt_class: MainAgentPrompt_GAIA
  llm:
    provider_class: "ClaudeOpenRouterClient"
    model_name: "anthropic/claude-3.7-sonnet"
    async_client: true
    temperature: 0.3
    top_p: 0.95
    min_p: 0.0
    top_k: -1
    max_tokens: 32000
    openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
    openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
    openrouter_provider: "anthropic"
    disable_cache_control: false
    keep_tool_result: -1
    oai_tool_thinking: false

  tool_config:
    - tool-reasoning

  max_turns: 50  # Maximum number of turns for main agent execution
  max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn

  input_process:
    hint_generation: true
    hint_llm_base_url: "${oc.env:HINT_LLM_BASE_URL,https://api.openai.com/v1}"
  output_process:
    final_answer_extraction: true
    final_answer_llm_base_url: "${oc.env:FINAL_ANSWER_LLM_BASE_URL,https://api.openai.com/v1}"

  openai_api_key: "${oc.env:OPENAI_API_KEY,???}"  # used for hint generation and final answer extraction
  add_message_id: true
  keep_tool_result: -1
  chinese_context: "${oc.env:CHINESE_CONTEXT,false}"


sub_agents:
  agent-worker:
    prompt_class: SubAgentWorkerPrompt
    llm:
      provider_class: "ClaudeOpenRouterClient"
      model_name: "anthropic/claude-3.7-sonnet"
      async_client: true
      temperature: 0.3
      top_p: 0.95
      min_p: 0.0
      top_k: -1
      max_tokens: 32000
      openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
      openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
      openrouter_provider: "anthropic"
      disable_cache_control: false
      keep_tool_result: -1
      oai_tool_thinking: false

    tool_config:
      - tool-searching
      - tool-image-video
      - tool-reading
      - tool-code
      - tool-audio

    max_turns: 50  # Maximum number of turns for sub-agent execution
    max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn


# Can define some top-level or default parameters here
output_dir: logs/
data_dir: "${oc.env:DATA_DIR,data}"  # Points to where data is stored
```
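The `${oc.env:VAR,default}` expressions in this config are OmegaConf environment-variable interpolations that Hydra resolves at load time. As a rough illustration of the lookup semantics only — not the real resolver, and note that OmegaConf treats a `???` default as a mandatory value rather than a literal — a minimal sketch:

```python
import os
import re

# Illustrative re-implementation of the "${oc.env:NAME,default}" lookups used
# in the config above. The real resolution is done by OmegaConf/Hydra; in
# OmegaConf a default of "???" marks the value as mandatory, whereas this toy
# version would just return it literally.
_ENV_PATTERN = re.compile(r"\$\{oc\.env:([A-Z_][A-Z0-9_]*),([^}]*)\}")

def resolve_env(value: str) -> str:
    """Replace each ${oc.env:NAME,default} with os.environ[NAME] or the default."""
    def _sub(m: re.Match) -> str:
        name, default = m.group(1), m.group(2)
        return os.environ.get(name, default)
    return _ENV_PATTERN.sub(_sub, value)

# Ensure the demo variable is unset so the fallback after the comma is used.
os.environ.pop("EXAMPLE_UNSET_VAR", None)
print(resolve_env("${oc.env:EXAMPLE_UNSET_VAR,https://openrouter.ai/api/v1}"))
# → https://openrouter.ai/api/v1
```

Setting the environment variable before loading the config overrides the fallback, which is how the `.env` values in the doc below take effect.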
Lines changed: 20 additions & 0 deletions

```yaml
# config/benchmark/hle-text-only.yaml
defaults:
  - default
  - _self_

name: "hle-text-only"

data:
  data_dir: "${data_dir}/hle-text-only"  # Path to hle-text-only dataset
  metadata_file: "standardized_data.jsonl"  # Metadata filename
  whitelist: []  # Optional: List of specific task_ids to run

execution:
  max_tasks: null  # null = no limit, or specify a number
  max_concurrent: 10  # Number of parallel tasks
  pass_at_k: 1  # Number of attempts per task

# OpenAI API key for evaluation (required for hle-text-only since it has ground truth)
openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```
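The `pass_at_k` setting above controls how many attempts each task gets. As a hedged sketch of how a pass@k accuracy could be aggregated from a results file — the field names `task_id` and `correct` are illustrative assumptions, not MiroFlow's actual schema:

```python
import json
from collections import defaultdict

# Hypothetical sketch: aggregating pass@k accuracy from a JSONL results file.
# Each line is one attempt; a task passes at k if any of its first k attempts
# is judged correct.
def pass_at_k_accuracy(lines: list[str], k: int = 1) -> float:
    attempts = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        attempts[record["task_id"]].append(bool(record["correct"]))
    passed = sum(1 for tries in attempts.values() if any(tries[:k]))
    return passed / len(attempts) if attempts else 0.0

rows = [
    '{"task_id": "t1", "correct": true}',
    '{"task_id": "t2", "correct": false}',
]
print(pass_at_k_accuracy(rows, k=1))  # 0.5
```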

docs/mkdocs/docs/hle-text-only.md

Lines changed: 92 additions & 0 deletions

# HLE (Text Only)

MiroFlow's evaluation on the HLE-text-only benchmark demonstrates its capabilities on text-only reasoning and question-answering tasks that require human-level understanding.

More details: [HLE text-only Dataset on HuggingFace](https://huggingface.co/datasets/macabdul9/hle_text_only)

---

## Dataset Overview

!!! info "HLE Dataset (text only)"
    The dataset is a text-only subset of HLE (Humanity's Last Exam).

---

## Quick Start Guide

### Step 1: Prepare the HLE (Text Only) Dataset

```bash title="Download HLE (Text Only) Dataset"
uv run main.py prepare-benchmark get hle-text-only
```

This downloads the dataset to `data/hle-text-only/`.

### Step 2: Configure API Keys

```env title=".env Configuration"
# For searching and web scraping
SERPER_API_KEY="xxx"
JINA_API_KEY="xxx"

# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"

# Claude-3.7-Sonnet via OpenRouter
OPENROUTER_API_KEY="xxx"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Vision understanding
ANTHROPIC_API_KEY="xxx"
GEMINI_API_KEY="xxx"

# Hint generation and final answer extraction
OPENAI_API_KEY="xxx"
OPENAI_BASE_URL="https://api.openai.com/v1"
```

### Step 3: Run the Evaluation

```bash title="Run HLE Evaluation"
uv run main.py common-benchmark --config_file_name=agent_hle-text-only_claude37sonnet output_dir="logs/hle-text-only/$(date +"%Y%m%d_%H%M")"
```

!!! tip "Resume an Interrupted Evaluation"
    Specify the same output directory to continue from where you left off:

    ```bash
    uv run main.py common-benchmark --config_file_name=agent_hle-text-only_claude37sonnet output_dir="logs/hle-text-only/20251014_1504"
    ```

### Step 4: Review Results

```bash title="Check Results"
# View accuracy summary
cat logs/hle-text-only/*/benchmark_results_pass_at_1_accuracy.txt

# View detailed results
cat logs/hle-text-only/*/benchmark_results.jsonl
```

---

## Usage Examples

### Test with Limited Tasks

```bash
uv run main.py common-benchmark --config_file_name=agent_hle-text-only_claude37sonnet benchmark.execution.max_tasks=10 output_dir="logs/hle-text-only/$(date +"%Y%m%d_%H%M")"
```

### Adjust Concurrency

```bash
uv run main.py common-benchmark --config_file_name=agent_hle-text-only_claude37sonnet benchmark.execution.max_concurrent=5 output_dir="logs/hle-text-only/$(date +"%Y%m%d_%H%M")"
```

---

!!! info "Documentation Info"
    **Last Updated:** October 2025 · **Doc Contributor:** Team @ MiroMind AI
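The resume behaviour described in Step 3 amounts to skipping tasks whose results already exist in the output directory. A minimal sketch of that idea, assuming a hypothetical one-JSON-file-per-task layout (the real per-task file layout may differ):

```python
import tempfile
from pathlib import Path

# Hypothetical sketch of resume-by-output-dir: when the same output_dir is
# reused, tasks with an existing result file are skipped. The "<task_id>.json"
# naming is an assumption for illustration.
def tasks_to_run(all_task_ids: list[str], output_dir: str) -> list[str]:
    done = {p.stem for p in Path(output_dir).glob("*.json")}
    return [t for t in all_task_ids if t not in done]

with tempfile.TemporaryDirectory() as d:
    Path(d, "t1.json").write_text("{}")     # pretend t1 already finished
    print(tasks_to_run(["t1", "t2"], d))    # ['t2']
```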

docs/mkdocs/mkdocs.yml

Lines changed: 1 addition & 0 deletions

```diff
@@ -65,6 +65,7 @@ nav:
     - xBench-DeepSearch: xbench_ds.md
     - FinSearchComp: finsearchcomp.md
     - HLE: hle.md
+    - HLE(text only): hle_text_only.md

 # - Benchmarks:
 #   - GAIA-Validation-Text-Only: gaia_validation_text_only.md
```

scripts/run_prepare_benchmark.sh

Lines changed: 1 addition & 0 deletions

```diff
@@ -20,6 +20,7 @@ uv run main.py prepare-benchmark get webwalkerqa
 uv run main.py prepare-benchmark get browsecomp-test
 uv run main.py prepare-benchmark get browsecomp-zh-test
 uv run main.py prepare-benchmark get hle
+uv run main.py prepare-benchmark get hle-text-only
 uv run main.py prepare-benchmark get xbench-ds
 uv run main.py prepare-benchmark get futurex
 uv run main.py prepare-benchmark get finsearchcomp
```
Lines changed: 30 additions & 0 deletions

```python
# SPDX-FileCopyrightText: 2025 MiromindAI
#
# SPDX-License-Identifier: Apache-2.0

from typing import Generator, MutableMapping

from datasets import load_dataset

from utils.prepare_benchmark.common import Task


def gen_hle_text_only(hf_token: str) -> Generator[Task, None, None]:
    dataset = load_dataset("macabdul9/hle_text_only", split="test", token=hf_token)
    for x in dataset:
        metadata: MutableMapping = x  # type: ignore
        task_id = metadata.pop("id")
        question = metadata.pop("question")
        gt = metadata.pop("answer")
        metadata.pop("image_preview")
        metadata.pop("rationale_image")
        task = Task(
            task_id=task_id,
            task_question=question,
            ground_truth=gt,
            file_path=None,
            metadata=metadata,
        )
        yield task
    return
```
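The generator above pops the known columns off each dataset row so that whatever remains becomes the task metadata. A self-contained sketch of that pattern with a stand-in `Task` class — the real one lives in `utils.prepare_benchmark.common` and may differ:

```python
from dataclasses import dataclass, field
from typing import Generator, Optional

# Minimal stand-in for utils.prepare_benchmark.common.Task, assumed to hold
# the fields passed in the commit above; the real class may differ.
@dataclass
class Task:
    task_id: str
    task_question: str
    ground_truth: str
    file_path: Optional[str] = None
    metadata: dict = field(default_factory=dict)

# Pop the known columns; everything left over becomes metadata. A toy row
# stands in for the HuggingFace dataset so no download is needed.
def gen_from_rows(rows: list[dict]) -> Generator[Task, None, None]:
    for x in rows:
        row = dict(x)  # copy so the original row is untouched
        yield Task(
            task_id=row.pop("id"),
            task_question=row.pop("question"),
            ground_truth=row.pop("answer"),
            metadata=row,
        )

tasks = list(gen_from_rows([{"id": "1", "question": "q", "answer": "a", "category": "math"}]))
print(tasks[0].metadata)  # {'category': 'math'}
```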

utils/prepare_benchmark/main.py

Lines changed: 9 additions & 0 deletions

```diff
@@ -16,6 +16,7 @@
 from utils.prepare_benchmark.gen_gaia import gen_gaia_validation
 from utils.prepare_benchmark.gen_gaia_text_only import gen_gaia_text_only
 from utils.prepare_benchmark.gen_hle import gen_hle_test
+from utils.prepare_benchmark.gen_hle_text_only import gen_hle_text_only
 from utils.prepare_benchmark.gen_webwalkerqa import gen_webwalkerqa
 from utils.prepare_benchmark.gen_xbench_ds import gen_xbench_ds
 from utils.prepare_benchmark.gen_futurex import gen_futurex
@@ -32,6 +33,7 @@ class _Env:
         "browsecomp-test",
         "browsecomp-zh-test",
         "hle",
+        "hle-text-only",
         "xbench-ds",
         "futurex",
         "finsearchcomp",
@@ -105,6 +107,13 @@ def gen():
             for x in gen_hle_test(env.hf_token, env.data_dir):
                 yield x

+            return gen
+        case "hle-text-only":
+
+            def gen():
+                for x in gen_hle_text_only(env.hf_token):
+                    yield x
+
             return gen
         case "xbench-ds":
```
