Commit a0f47df

ntudy and Yue Deng authored
feat(benchmark): add browsecomp_zh (#88)
add browsecomp_zh

Co-authored-by: Yue Deng <[email protected]>
1 parent e1cbf9a commit a0f47df

5 files changed: +251 -0 lines changed
Lines changed: 79 additions & 0 deletions
@@ -0,0 +1,79 @@
+defaults:
+  - benchmark: browsecomp-zh
+  - override hydra/job_logging: none
+  - _self_  # Allow defining variables at the top of this file
+
+
+main_agent:
+  prompt_class: MainAgentPrompt_GAIA
+  llm:
+    provider_class: "ClaudeOpenRouterClient"
+    model_name: "anthropic/claude-3.7-sonnet"
+    async_client: true
+    temperature: 0.3
+    top_p: 0.95
+    min_p: 0.0
+    top_k: -1
+    max_tokens: 32000
+    openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
+    openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
+    openrouter_provider: "anthropic"
+    disable_cache_control: false
+    keep_tool_result: -1
+    oai_tool_thinking: false
+
+  tool_config:
+    - tool-reasoning
+
+  max_turns: 50  # Maximum number of turns for main agent execution
+  max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn
+
+  input_process:
+    hint_generation: true
+    hint_llm_base_url: "${oc.env:HINT_LLM_BASE_URL,https://api.openai.com/v1}"
+  output_process:
+    final_answer_extraction: true
+    final_answer_llm_base_url: "${oc.env:FINAL_ANSWER_LLM_BASE_URL,https://api.openai.com/v1}"
+
+  openai_api_key: "${oc.env:OPENAI_API_KEY,???}"  # used for hint generation and final answer extraction
+  add_message_id: true
+  keep_tool_result: -1
+  chinese_context: "${oc.env:CHINESE_CONTEXT,true}"
+
+
+sub_agents:
+  agent-worker:
+    prompt_class: SubAgentWorkerPrompt
+    llm:
+      provider_class: "ClaudeOpenRouterClient"
+      model_name: "anthropic/claude-3.7-sonnet"
+      async_client: true
+      temperature: 0.3
+      top_p: 0.95
+      min_p: 0.0
+      top_k: -1
+      max_tokens: 32000
+      openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
+      openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
+      openrouter_provider: "anthropic"
+      disable_cache_control: false
+      keep_tool_result: -1
+      oai_tool_thinking: false
+
+    tool_config:
+      - tool-searching
+      - tool-image-video
+      - tool-reading
+      - tool-code
+      - tool-audio
+
+    max_turns: 50  # Maximum number of turns for main agent execution
+    max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn
+
+
+# Can define some top-level or default parameters here
+output_dir: logs/
+data_dir: "${oc.env:DATA_DIR,data}"  # Points to where data is stored
+
+
+
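
The `${oc.env:NAME,default}` values in this config are OmegaConf environment interpolations: the environment variable is used when it is set, otherwise the text after the comma is the fallback. The sketch below imitates that lookup in plain Python to make the behaviour concrete. `resolve_env_placeholder` is a hypothetical helper, not OmegaConf or MiroFlow code, and treating `???` as "mandatory, fail if unset" reflects my reading of OmegaConf's missing-value marker rather than anything stated in this commit.

```python
import os
import re

# Hypothetical helper: imitates the shape of "${oc.env:NAME,default}" lookups.
# It is NOT OmegaConf code and only understands this single pattern.
_ENV_PATTERN = re.compile(r"^\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)(?:,(.*))?\}$")


def resolve_env_placeholder(value, environ=None):
    """Return the environment value, the default, or raise if neither is usable."""
    environ = os.environ if environ is None else environ
    match = _ENV_PATTERN.match(value)
    if match is None:
        return value  # not a placeholder; pass through unchanged
    name, default = match.group(1), match.group(2)
    if name in environ:
        return environ[name]
    if default is None or default == "???":
        # Assumption: "???" means "mandatory, no usable default".
        raise KeyError(f"environment variable {name!r} must be set")
    return default


if __name__ == "__main__":
    # Falls back to "data" unless DATA_DIR is exported in the shell.
    print(resolve_env_placeholder("${oc.env:DATA_DIR,data}"))
```

Under this reading, `OPENROUTER_API_KEY` and `OPENAI_API_KEY` have to be exported before a run, while `OPENROUTER_BASE_URL`, `CHINESE_CONTEXT`, and `DATA_DIR` fall back to the defaults written in the file.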
Lines changed: 56 additions & 0 deletions
@@ -0,0 +1,56 @@
+defaults:
+  - benchmark: browsecomp-zh
+  - override hydra/job_logging: none
+  - _self_  # Allow defining variables at the top of this file
+
+
+main_agent:
+  prompt_class: MainAgentPrompt_GAIA
+  llm:
+    provider_class: "MiroThinkerSGLangClient"
+    model_name: "DUMMY_MODEL_NAME"
+    async_client: true
+    temperature: 0.3
+    top_p: 0.95
+    min_p: 0.0
+    top_k: -1
+    max_tokens: 4096
+    oai_mirothinker_api_key: "${oc.env:OAI_MIROTHINKER_API_KEY,dummy_key}"
+    oai_mirothinker_base_url: "${oc.env:OAI_MIROTHINKER_BASE_URL,http://localhost:61005/v1}"
+    keep_tool_result: -1
+    oai_tool_thinking: false
+
+  tool_config:
+    - tool-reasoning
+    - tool-searching
+    - tool-image-video
+    - tool-reading
+    - tool-code
+    - tool-audio
+
+  max_turns: 50  # Maximum number of turns for main agent execution
+  max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn
+
+  input_process:
+    hint_generation: false
+    hint_llm_base_url: "${oc.env:HINT_LLM_BASE_URL,https://api.openai.com/v1}"
+
+  output_process:
+    final_answer_extraction: true
+    final_answer_llm_base_url: "${oc.env:FINAL_ANSWER_LLM_BASE_URL,https://api.openai.com/v1}"
+
+  openai_api_key: "${oc.env:OPENAI_API_KEY,???}"  # used for hint generation and final answer extraction
+  add_message_id: true
+  keep_tool_result: -1
+  chinese_context: "${oc.env:CHINESE_CONTEXT,true}"
+
+
+sub_agents: null
+
+
+# Can define some top-level or default parameters here
+output_dir: logs/
+data_dir: "${oc.env:DATA_DIR,data}"  # Points to where data is stored
+
+
+
Lines changed: 21 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,21 @@
1+
# config/benchmark/browsecomp-zh.yaml
2+
defaults:
3+
- default
4+
- _self_
5+
6+
name: "browsecomp-zh"
7+
8+
data:
9+
data_dir: "${data_dir}/browsecomp-zh-test" # Path to browsecomp-zh-test (Chinese) dataset
10+
metadata_file: "standardized_data.jsonl" # Metadata filename
11+
whitelist: [] # Optional: List of specific task_ids to run
12+
13+
execution:
14+
max_tasks: null # null = no limit, or specify a number
15+
max_concurrent: 5 # Number of parallel tasks
16+
pass_at_k: 1 # Number of attempts per task
17+
18+
# OpenAI API key for evaluation (required for browsecomp-zh since it has ground truth)
19+
openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
20+
21+
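
The `whitelist` and `max_tasks` settings in this benchmark config suggest a simple selection step over `standardized_data.jsonl`. The following is a minimal sketch of one way such a step could work, not the runner's actual code. `select_tasks` is a hypothetical function; the `task_id` field name comes from the config comment, and applying the whitelist before the `max_tasks` cap is an assumption.

```python
import json


def select_tasks(lines, whitelist, max_tasks):
    """Parse JSONL lines, keep whitelisted task_ids (if any), cap at max_tasks."""
    tasks = [json.loads(line) for line in lines if line.strip()]
    if whitelist:  # an empty list is read as "run everything"
        allowed = set(whitelist)
        tasks = [task for task in tasks if task.get("task_id") in allowed]
    if max_tasks is not None:  # YAML null loads as None, meaning no limit
        tasks = tasks[:max_tasks]
    return tasks


if __name__ == "__main__":
    # Made-up records for illustration only; real entries carry more fields.
    sample = ['{"task_id": "a"}', '{"task_id": "b"}', '{"task_id": "c"}']
    print(select_tasks(sample, whitelist=["b", "c"], max_tasks=1))
```

With the values committed here (`whitelist: []`, `max_tasks: null`), this reading selects every task in the file.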

docs/mkdocs/docs/browsecomp_zh.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
+# BrowseComp-ZH (Chinese)
+
+MiroFlow's evaluation on the BrowseComp-ZH benchmark demonstrates advanced web browsing and information retrieval capabilities in the Chinese information ecosystem.
+
+More details: [BrowseComp-ZH: Benchmarking Web Browsing Ability of Large Language Models in Chinese](https://github.com/PALIN2018/BrowseComp-ZH)
+
+---
+
+## Dataset Overview
+
+!!! abstract "Key Dataset Characteristics"
+
+    - **Total Tasks**: 289 complex multi-hop retrieval questions in the test split
+    - **Language**: Chinese (Simplified)
+    - **Task Types**: Web browsing, search, and information retrieval with multi-hop reasoning
+    - **Domains**: 11 domains including Film & TV, Technology, Medicine, History, Sports, and Arts
+    - **Evaluation**: Automated comparison with ground truth answers
+    - **Difficulty**: High-difficulty benchmark designed to test real-world Chinese web browsing capabilities
+
+---
+
+## Quick Start Guide
+
+### Step 1: Prepare the BrowseComp-ZH Dataset
+
+```bash title="Download BrowseComp-ZH Dataset"
+uv run main.py prepare-benchmark get browsecomp-zh-test
+```
+
+This will create the standardized dataset at `data/browsecomp-zh-test/standardized_data.jsonl`.
+
+### Step 2: Configure API Keys
+
+```env title=".env Configuration"
+# Search and web scraping (recommended for Chinese web)
+SERPER_API_KEY="xxx"
+JINA_API_KEY="xxx"
+
+# Code execution
+E2B_API_KEY="xxx"
+
+# LLM (Claude 3.7 Sonnet via OpenRouter)
+OPENROUTER_API_KEY="xxx"
+OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"
+
+# Evaluation and hint generation
+OPENAI_API_KEY="xxx"
+
+# Vision capabilities
+ANTHROPIC_API_KEY="xxx"
+GEMINI_API_KEY="xxx"
+
+# Optional: Set Chinese context mode
+CHINESE_CONTEXT="true"
+```
+
+### Step 3: Run the Evaluation
+
+```bash title="Run BrowseComp-ZH Evaluation"
+uv run main.py common-benchmark --config_file_name=agent_browsecomp-zh_claude37sonnet output_dir="logs/browsecomp-zh/$(date +"%Y%m%d_%H%M")"
+```
+
+Results are automatically generated in the output directory:
+- `benchmark_results.jsonl` - Detailed results for each task
+- `benchmark_results_pass_at_1_accuracy.txt` - Summary accuracy statistics
+
+---
+
+## Usage Examples
+
+```bash title="Limited Task Testing"
+# Test with 10 tasks only
+uv run main.py common-benchmark --config_file_name=agent_browsecomp-zh_claude37sonnet benchmark.execution.max_tasks=10 output_dir="logs/browsecomp-zh/$(date +"%Y%m%d_%H%M")"
+```
+
+```bash title="Using MiroThinker Model"
+uv run main.py common-benchmark --config_file_name=agent_browsecomp-zh_mirothinker output_dir="logs/browsecomp-zh/$(date +"%Y%m%d_%H%M")"
+```
+
+---
+
+## Available Agent Configurations
+
+| Agent Configuration | Model | Use Case |
+|-------------------|-------|----------|
+| `agent_browsecomp-zh_claude37sonnet` | Claude 3.7 Sonnet | Recommended for better performance on Chinese tasks |
+| `agent_browsecomp-zh_mirothinker` | MiroThinker | For local deployment |
+
+---
+
+!!! info "Documentation Info"
+    **Last Updated:** October 2025 · **Doc Contributor:** Team @ MiroMind AI
+
+
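
The documentation page names a `benchmark_results_pass_at_1_accuracy.txt` summary, and the benchmark config sets `pass_at_k: 1` ("Number of attempts per task"). As a rough illustration of what such a summary could measure, the sketch below counts a task as solved when any of its attempts is judged correct. The function name and the input shape (task id mapped to a list of per-attempt booleans) are hypothetical, and this "any attempt correct" reading is an assumption, not a description of how MiroFlow computes the number.

```python
def pass_at_k_accuracy(attempts_by_task):
    """Fraction of tasks with at least one correct attempt.

    attempts_by_task maps a task id to a list of booleans, one per attempt.
    """
    if not attempts_by_task:
        return 0.0  # avoid dividing by zero on an empty run
    solved = sum(1 for attempts in attempts_by_task.values() if any(attempts))
    return solved / len(attempts_by_task)


if __name__ == "__main__":
    # Made-up judgments for illustration only; not real benchmark results.
    judged = {"t1": [True], "t2": [False], "t3": [False], "t4": [True]}
    print(f"pass@1 accuracy: {pass_at_k_accuracy(judged):.2%}")
```

With `pass_at_k: 1` each list holds a single attempt, so the value reduces to plain accuracy over the selected tasks.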

docs/mkdocs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -61,6 +61,7 @@ nav:
     - GAIA-Val-Text: gaia_validation_text_only.md
     - GAIA-Test: gaia_test.md
     - BrowseComp-EN: browsecomp_en.md
+    - BrowseComp-ZH: browsecomp_zh.md
     - WebWalkerQA: webwalkerqa.md
     - FutureX: futurex.md
     - xBench-DeepSearch: xbench_ds.md
