feat(benchmark): add browsecomp_zh #88

ntudy · 2025-10-16T08:06:51Z

Describe this PR

What changed?

Why?

Related issues

Checklist for PR

Write a descriptive PR title following the Angular commit message format: <type>(<scope>): <subject>
- Examples: feat(agent): add pdf tool via mcp, perf: make llm client async, fix(utils): load custom config via importlib
- Valid types: feat, fix, docs, style, refactor, perf, test, build, ci, revert
- The check-pr-title CI job will validate your title format
- Bad title examples and why they fail:
  - Update README ❌ Missing type and colon
  - feat add new feature ❌ Missing colon after type
  - Feature: add new tool ❌ Invalid type (should be feat)
  - feat(Agent): add tool ❌ Scope should be lowercase
  - feat(): add tool ❌ Empty scope not allowed
  - feat(my_scope): add tool ❌ Underscores not allowed in scope
  - feat(my space): add tool ❌ Space not allowed in scope
  - feat(scope):add tool ❌ Missing space after colon
  - feat(scope): ❌ Empty subject
Run lint and format locally:
- uv tool run ruff@<version> check --fix .
- uv tool run ruff@<version> format .
- The lint CI job enforces ruff's default format and lint rules on all new code.
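
The title rules in the checklist above can be approximated with a single regular expression. This is only a sketch for checking a title locally, not the actual check-pr-title implementation: the set of characters allowed in a scope (lowercase letters, digits, hyphens) is an assumption inferred from the bad-title examples, and the names `TITLE_RE` and `is_valid_pr_title` are hypothetical.

```python
import re

# Valid types, as listed in the PR checklist.
VALID_TYPES = (
    "feat", "fix", "docs", "style", "refactor",
    "perf", "test", "build", "ci", "revert",
)

# <type>(<scope>): <subject>, with the scope optional.
# Assumption: a scope contains only lowercase letters, digits, and hyphens.
TITLE_RE = re.compile(
    r"^(?P<type>" + "|".join(VALID_TYPES) + r")"
    r"(?:\((?P<scope>[a-z0-9-]+)\))?"   # optional, non-empty, lowercase scope
    r": (?P<subject>\S.*)$"             # colon, one space, non-empty subject
)


def is_valid_pr_title(title: str) -> bool:
    """Return True if the title matches the format described in the checklist."""
    return TITLE_RE.match(title) is not None


if __name__ == "__main__":
    for title in (
        "feat(benchmark): add browsecomp_zh",
        "perf: make llm client async",
        "feat(my_scope): add tool",
        "feat(scope):add tool",
    ):
        print(f"{title!r}: {is_valid_pr_title(title)}")
```

Under these assumptions the three good examples from the checklist pass and all nine bad examples are rejected. Underscores are only restricted in the scope, so a subject such as `add browsecomp_zh` is still accepted.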

Copilot

Pull Request Overview

Adds BrowseComp-ZH benchmark support, including docs, benchmark config, and two agent configurations (Claude via OpenRouter and MiroThinker).

- New mkdocs page and nav entry for BrowseComp-ZH
- New benchmark config browsecomp-zh.yaml
- New agent configs for Claude 3.7 Sonnet (OpenRouter) and MiroThinker

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
|------|-------------|
| docs/mkdocs/mkdocs.yml | Adds BrowseComp-ZH page to docs navigation |
| docs/mkdocs/docs/browsecomp_zh.md | New documentation page with setup and run instructions |
| config/benchmark/browsecomp-zh.yaml | New benchmark configuration (data paths, execution params, OpenAI key) |
| config/agent_browsecomp-zh_mirothinker.yaml | New agent config for MiroThinker with Chinese context |
| config/agent_browsecomp-zh_claude37sonnet.yaml | New agent config for Claude 3.7 Sonnet via OpenRouter with worker sub-agent |

Copilot · 2025-10-16T08:10:07Z

docs/mkdocs/docs/browsecomp_zh.md

+| Agent Configuration | Model | Use Case |
+|-------------------|-------|----------|
+| `agent_browsecomp-zh_claude37sonnet` | Claude 3.7 Sonnet | Recommended for better performance on Chinese tasks |
+| `agent_browsecomp-zh_mirothinker` | MiroThinker | For local deployment |


The table header currently starts with a double pipe on each line (||), which breaks Markdown table rendering. Use a single leading pipe (|) as shown to render correctly.

Copilot · 2025-10-16T08:10:07Z

config/agent_browsecomp-zh_claude37sonnet.yaml

+    openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
+    openrouter_provider: "anthropic"
+    disable_cache_control: false
+    keep_tool_result: -1


[nitpick] keep_tool_result is defined both under main_agent.llm and main_agent; this duplication can cause confusion about which value is authoritative. Define it in a single place (preferably at main_agent level if the agent layer consumes it) and remove the duplicate.

Suggested change (remove the duplicate line):

keep_tool_result: -1

Copilot · 2025-10-16T08:10:08Z

config/agent_browsecomp-zh_claude37sonnet.yaml

+
+  openai_api_key: "${oc.env:OPENAI_API_KEY,???}" # used for hint generation and final answer extraction
+  add_message_id: true
+  keep_tool_result: -1


[nitpick] keep_tool_result is defined both under main_agent.llm and main_agent; this duplication can cause confusion about which value is authoritative. Define it in a single place (preferably at main_agent level if the agent layer consumes it) and remove the duplicate.

Copilot · 2025-10-16T08:10:08Z

config/agent_browsecomp-zh_mirothinker.yaml

+    max_tokens: 4096
+    oai_mirothinker_api_key: "${oc.env:OAI_MIROTHINKER_API_KEY,dummy_key}"
+    oai_mirothinker_base_url: "${oc.env:OAI_MIROTHINKER_BASE_URL,http://localhost:61005/v1}"
+    keep_tool_result: -1


[nitpick] keep_tool_result is duplicated at both llm and main_agent levels. Consolidate to a single definition to avoid ambiguity; keep it where your code reads it (commonly the agent level) and remove the other.

Suggested change (remove the duplicate line):

keep_tool_result: -1
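
The duplication flagged in the comments above can also be checked mechanically once a config is loaded into nested dictionaries. The following is a minimal sketch, not part of this PR: `find_key_paths` is a hypothetical helper, and the `config` dictionary is a hand-written excerpt that mirrors the quoted hunks of the Claude config rather than the full file.

```python
from typing import Any, Iterator


def find_key_paths(node: Any, key: str, prefix: str = "") -> Iterator[str]:
    """Yield the dotted path of every place in a nested config where `key` is defined."""
    if isinstance(node, dict):
        for name, value in node.items():
            path = f"{prefix}.{name}" if prefix else name
            if name == key:
                yield path
            # Keep descending so nested definitions are found as well.
            yield from find_key_paths(value, key, path)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_key_paths(item, key, f"{prefix}[{index}]")


# Hypothetical excerpt: keep_tool_result appears under both
# main_agent.llm and main_agent, as described in the review comments.
config = {
    "main_agent": {
        "llm": {
            "openrouter_provider": "anthropic",
            "disable_cache_control": False,
            "keep_tool_result": -1,
        },
        "add_message_id": True,
        "keep_tool_result": -1,
    }
}

paths = list(find_key_paths(config, "keep_tool_result"))
if len(paths) > 1:
    print("keep_tool_result is defined more than once:", ", ".join(paths))
```

A check like this only reports where the key occurs. Which definition should remain depends on which level the agent code actually reads, as the comments note.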

config/agent_browsecomp-zh_mirothinker.yaml

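- system prompt: format for embedding prompt files
- simplify the prompt
- update the system prompt and standardize the format used when parsing fails
- refine the system prompt so the model does not assume empty content was caused by an OCR failure
- support injecting an extra prompt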

add browsecomp_zh

05f6c21

ntudy requested review from BinWang28 and Copilot and removed request for Copilot October 16, 2025 08:07

Copilot AI reviewed Oct 16, 2025

BinWang28 approved these changes Oct 16, 2025

BinWang28 merged commit a0f47df into miroflow-v0.3 Oct 16, 2025
3 checks passed

BinWang28 deleted the bc-zh branch October 16, 2025 10:52

3 participants
