
Commit 60df257

feat(benchmark): add support for WebWalkerQA dataset (#84)
add support for webwalkerqa
Parent: 7652386 · Commit: 60df257

6 files changed · +267 −0 lines
config/agent_webwalkerqa_claude37sonnet.yaml

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
```yaml
defaults:
  - benchmark: webwalkerqa
  - override hydra/job_logging: none
  - _self_  # Allow defining variables at the top of this file


main_agent:
  prompt_class: MainAgentPrompt_GAIA
  llm:
    provider_class: "ClaudeOpenRouterClient"
    model_name: "anthropic/claude-3.7-sonnet"
    async_client: true
    temperature: 0.3
    top_p: 0.95
    min_p: 0.0
    top_k: -1
    max_tokens: 32000
    openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
    openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
    openrouter_provider: "anthropic"
    disable_cache_control: false
    keep_tool_result: -1
    oai_tool_thinking: false

  tool_config:
    - tool-searching
    - tool-image-video
    - tool-reading
    - tool-code
    - tool-audio
    - tool-reasoning

  max_turns: 50  # Maximum number of turns for main agent execution
  max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn

  input_process:
    hint_generation: true
    hint_llm_base_url: "${oc.env:HINT_LLM_BASE_URL,https://api.openai.com/v1}"
  output_process:
    final_answer_extraction: true
    final_answer_llm_base_url: "${oc.env:FINAL_ANSWER_LLM_BASE_URL,https://api.openai.com/v1}"

  openai_api_key: "${oc.env:OPENAI_API_KEY,???}"  # used for hint generation and final answer extraction
  add_message_id: true
  keep_tool_result: -1
  chinese_context: "${oc.env:CHINESE_CONTEXT,false}"


sub_agents: null


# Can define some top-level or default parameters here
output_dir: logs/
data_dir: "${oc.env:DATA_DIR,data}"  # Points to where data is stored
```
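Any value in this file can also be overridden per run with Hydra's dotted-key syntax instead of editing the YAML. A minimal sketch, assuming the file is loaded as `agent_webwalkerqa_claude37sonnet` and the key nesting shown above (e.g. `main_agent.llm.temperature`):

```bash
# Sketch: override sampling and turn limits at launch time (key paths follow the layout above)
uv run main.py common-benchmark \
  --config_file_name=agent_webwalkerqa_claude37sonnet \
  main_agent.llm.temperature=0.1 \
  main_agent.max_turns=30 \
  output_dir="logs/webwalkerqa/$(date +"%Y%m%d_%H%M")"
```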
config/agent_webwalkerqa_mirothinker.yaml

Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
```yaml
defaults:
  - benchmark: webwalkerqa
  - override hydra/job_logging: none
  - _self_  # Allow defining variables at the top of this file


main_agent:
  prompt_class: MainAgentPrompt_GAIA
  llm:
    provider_class: "MiroThinkerSGLangClient"
    model_name: "DUMMY_MODEL_NAME"
    async_client: true
    temperature: 0.3
    top_p: 0.95
    min_p: 0.0
    top_k: -1
    max_tokens: 4096
    oai_mirothinker_api_key: "${oc.env:OAI_MIROTHINKER_API_KEY,dummy_key}"
    oai_mirothinker_base_url: "${oc.env:OAI_MIROTHINKER_BASE_URL,http://localhost:61005/v1}"
    keep_tool_result: -1
    oai_tool_thinking: false

  tool_config:
    - tool-searching
    - tool-image-video
    - tool-reading
    - tool-code
    - tool-audio
    - tool-reasoning

  max_turns: 50  # Maximum number of turns for main agent execution
  max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn

  input_process:
    hint_generation: false
    hint_llm_base_url: "${oc.env:HINT_LLM_BASE_URL,https://api.openai.com/v1}"

  output_process:
    final_answer_extraction: true
    final_answer_llm_base_url: "${oc.env:FINAL_ANSWER_LLM_BASE_URL,https://api.openai.com/v1}"

  openai_api_key: "${oc.env:OPENAI_API_KEY,???}"  # used for hint generation and final answer extraction
  add_message_id: true
  keep_tool_result: -1
  chinese_context: "${oc.env:CHINESE_CONTEXT,false}"


sub_agents: null


# Can define some top-level or default parameters here
output_dir: logs/
data_dir: "${oc.env:DATA_DIR,data}"  # Points to where data is stored
```
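Because `model_name` is shipped as `DUMMY_MODEL_NAME`, it is expected to be overridden when launching against a locally served MiroThinker model. A hedged sketch, assuming this config is loaded as `agent_webwalkerqa_mirothinker` and the key nesting shown above; the model name is a placeholder:

```bash
# Sketch: run against a locally served MiroThinker model (model name is a placeholder)
export OAI_MIROTHINKER_BASE_URL="http://localhost:61005/v1"
uv run main.py common-benchmark \
  --config_file_name=agent_webwalkerqa_mirothinker \
  main_agent.llm.model_name="your-served-model-name" \
  output_dir="logs/webwalkerqa/$(date +"%Y%m%d_%H%M")"
```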

config/benchmark/webwalkerqa.yaml

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
```yaml
# config/benchmark/webwalkerqa.yaml
defaults:
  - default
  - _self_

name: "webwalkerqa"

data:
  data_dir: "${data_dir}/webwalkerqa"  # Path to webwalkerqa dataset
  metadata_file: "standardized_data.jsonl"  # Metadata filename
  whitelist: []  # Optional: List of specific task_ids to run

execution:
  max_tasks: null  # null = no limit, or specify a number
  max_concurrent: 5  # Number of parallel tasks
  pass_at_k: 1  # Number of attempts per task

# OpenAI API key for evaluation (required for webwalkerqa since it has ground truth)
openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```
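The `data` and `execution` settings above can likewise be overridden per run. A sketch using standard Hydra override syntax for the keys defined above; the task IDs in the whitelist are hypothetical placeholders:

```bash
# Sketch: run only two specific tasks with reduced concurrency (task IDs are placeholders)
uv run main.py common-benchmark \
  --config_file_name=agent_webwalkerqa_claude37sonnet \
  'benchmark.data.whitelist=[task_0001,task_0002]' \
  benchmark.execution.max_concurrent=2 \
  output_dir="logs/webwalkerqa/whitelist_test"
```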

docs/mkdocs/docs/mirothinker.md

Lines changed: 1 addition & 0 deletions
````diff
@@ -60,6 +60,7 @@ uv run main.py common-benchmark --config_file_name=agent_llm_mirothinker output_
 ```
 
 This command will:
+
 - Use the `agent_llm_mirothinker` configuration with the dedicated MiroThinkerSGLangClient
 - Run the example dataset benchmark (configured in the YAML file)
 - Test the model's question-answering capabilities
````

docs/mkdocs/docs/webwalkerqa.md

Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
# WebWalkerQA

MiroFlow's evaluation on the WebWalkerQA benchmark demonstrates web navigation and question-answering capabilities across diverse domains.

More details: [WebWalkerQA on HuggingFace](https://huggingface.co/datasets/MiromindAI/WebWalkerQA)

---

## Dataset Overview

!!! abstract "Key Dataset Characteristics"

    - **Total Tasks**: 680 tasks in the main split
    - **Language**: English
    - **Domains**: Conference, game, academic, business, and more
    - **Task Types**: Web navigation, information retrieval, multi-hop reasoning
    - **Difficulty Levels**: Easy, medium, hard
    - **Evaluation**: Automated comparison with ground truth answers

---

## Quick Start Guide

### Step 1: Prepare the WebWalkerQA Dataset

```bash title="Download WebWalkerQA Dataset"
uv run main.py prepare-benchmark get webwalkerqa
```

This will create the standardized dataset at `data/webwalkerqa/standardized_data.jsonl`.
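To sanity-check the download before a full run, you can count the tasks and pretty-print the first record; the commands below only read the file and make no assumptions about its schema.

```bash title="Inspect the Prepared Dataset"
# Count tasks and pretty-print the first record of the standardized file
wc -l data/webwalkerqa/standardized_data.jsonl
head -n 1 data/webwalkerqa/standardized_data.jsonl | python -m json.tool
```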
### Step 2: Configure API Keys

=== "Claude 3.7 Sonnet"

    ```env title=".env Configuration"
    # Search and web scraping
    SERPER_API_KEY="xxx"
    JINA_API_KEY="xxx"

    # Code execution
    E2B_API_KEY="xxx"

    # LLM (Claude 3.7 Sonnet via OpenRouter)
    OPENROUTER_API_KEY="xxx"
    OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

    # Evaluation and hint generation
    OPENAI_API_KEY="xxx"

    # Vision capabilities
    ANTHROPIC_API_KEY="xxx"
    GEMINI_API_KEY="xxx"
    ```

=== "MiroThinker"

    ```env title=".env Configuration"
    # Search and web scraping
    SERPER_API_KEY="xxx"
    JINA_API_KEY="xxx"

    # Code execution
    E2B_API_KEY="xxx"

    # LLM (MiroThinker via SGLang)
    OAI_MIROTHINKER_API_KEY="dummy_key"
    OAI_MIROTHINKER_BASE_URL="http://localhost:61005/v1"

    # Evaluation and final answer extraction
    OPENAI_API_KEY="xxx"

    # Vision capabilities
    ANTHROPIC_API_KEY="xxx"
    GEMINI_API_KEY="xxx"
    ```
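The MiroThinker tab assumes an OpenAI-compatible endpoint is already serving the model at `OAI_MIROTHINKER_BASE_URL`. As a rough sketch (not the project's official launch command), a local SGLang server could be started along these lines; the checkpoint path is a placeholder and flags may vary with your SGLang version:

```bash title="Start a Local SGLang Server (sketch)"
# Placeholder checkpoint path; adjust to your local MiroThinker weights
python -m sglang.launch_server \
  --model-path /path/to/mirothinker-checkpoint \
  --host 0.0.0.0 \
  --port 61005
```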
### Step 3: Run the Evaluation

```bash title="Run WebWalkerQA Evaluation"
uv run main.py common-benchmark --config_file_name=agent_webwalkerqa_claude37sonnet output_dir="logs/webwalkerqa/$(date +"%Y%m%d_%H%M")"
```

!!! tip "Progress Monitoring and Resume"
    To check the progress while running:

    ```bash title="Check Progress"
    ls -lh logs/webwalkerqa/YOUR_RUN_DIR/
    ```

    If you need to resume an interrupted evaluation, specify the same output directory:

    ```bash title="Resume Evaluation"
    uv run main.py common-benchmark --config_file_name=agent_webwalkerqa_claude37sonnet output_dir=${PATH_TO_LOG}
    ```

Results are automatically generated in the output directory (see the quick check below):

- `benchmark_results.jsonl` - Detailed results for each task
- `benchmark_results_pass_at_1_accuracy.txt` - Summary accuracy statistics
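Once the run finishes (or while it is still in progress), both files can be checked directly from the shell. A minimal sketch, assuming the default filenames listed above:

```bash title="Quick Results Check"
RUN_DIR="logs/webwalkerqa/YOUR_RUN_DIR"

# Summary accuracy written by the evaluator
cat "${RUN_DIR}/benchmark_results_pass_at_1_accuracy.txt"

# Number of task records produced so far
wc -l "${RUN_DIR}/benchmark_results.jsonl"
```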
---

## Usage Examples

```bash title="Limited Task Testing"
# Test with 10 tasks only
uv run main.py common-benchmark --config_file_name=agent_webwalkerqa_claude37sonnet benchmark.execution.max_tasks=10 output_dir="logs/webwalkerqa/test"
```

```bash title="Custom Concurrency"
# Run with 10 concurrent tasks
uv run main.py common-benchmark --config_file_name=agent_webwalkerqa_claude37sonnet benchmark.execution.max_concurrent=10 output_dir="logs/webwalkerqa/$(date +"%Y%m%d_%H%M")"
```

```bash title="Using MiroThinker Model"
uv run main.py common-benchmark --config_file_name=agent_webwalkerqa_mirothinker output_dir="logs/webwalkerqa/$(date +"%Y%m%d_%H%M")"
```

---

## Available Agent Configurations

| Agent Configuration | Model | Use Case |
|---------------------|-------|----------|
| `agent_webwalkerqa_claude37sonnet` | Claude 3.7 Sonnet | Recommended for best performance |
| `agent_webwalkerqa_mirothinker` | MiroThinker | For local deployment |

---

!!! info "Documentation Info"
    **Last Updated:** October 2025 · **Doc Contributor:** Team @ MiroMind AI

docs/mkdocs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -61,6 +61,7 @@ nav:
     - GAIA-Val-Text: gaia_validation_text_only.md
     - GAIA-Test: gaia_test.md
     - BrowseComp-EN: browsecomp_en.md
+    - WebWalkerQA: webwalkerqa.md
     - FutureX: futurex.md
     - xBench-DeepSearch: xbench_ds.md
     - FinSearchComp: finsearchcomp.md
```
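To confirm the new nav entry renders correctly, the docs site can be previewed locally. A sketch, assuming `mkdocs` with the Material theme is available in the project environment:

```bash
# Sketch: serve the docs locally and check the WebWalkerQA page in the nav
uv run mkdocs serve -f docs/mkdocs/mkdocs.yml
```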
