feat(benchmark): add hle-text-only #81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
| defaults: | ||
| - benchmark: hle-text-only | ||
| - override hydra/job_logging: none | ||
| - _self_ # Allow defining variables at the top of this file | ||
|
|
||
|
|
||
| main_agent: | ||
| prompt_class: MainAgentPrompt_GAIA | ||
| llm: | ||
| provider_class: "ClaudeOpenRouterClient" | ||
| model_name: "anthropic/claude-3.7-sonnet" | ||
| async_client: true | ||
| temperature: 0.3 | ||
| top_p: 0.95 | ||
| min_p: 0.0 | ||
| top_k: -1 | ||
| max_tokens: 32000 | ||
| openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}" | ||
| openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}" | ||
| openrouter_provider: "anthropic" | ||
| disable_cache_control: false | ||
| keep_tool_result: -1 | ||
| oai_tool_thinking: false | ||
|
|
||
| tool_config: | ||
| - tool-reasoning | ||
|
|
||
| max_turns: 50 # Maximum number of turns for main agent execution | ||
| max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn | ||
|
|
||
| input_process: | ||
| hint_generation: true | ||
| hint_llm_base_url: "${oc.env:HINT_LLM_BASE_URL,https://api.openai.com/v1}" | ||
| output_process: | ||
| final_answer_extraction: true | ||
| final_answer_llm_base_url: "${oc.env:FINAL_ANSWER_LLM_BASE_URL,https://api.openai.com/v1}" | ||
|
|
||
| openai_api_key: "${oc.env:OPENAI_API_KEY,???}" # used for hint generation and final answer extraction | ||
| add_message_id: true | ||
| keep_tool_result: -1 | ||
| chinese_context: "${oc.env:CHINESE_CONTEXT,false}" | ||
|
|
||
|
|
||
| sub_agents: | ||
| agent-worker: | ||
| prompt_class: SubAgentWorkerPrompt | ||
| llm: | ||
| provider_class: "ClaudeOpenRouterClient" | ||
| model_name: "anthropic/claude-3.7-sonnet" | ||
| async_client: true | ||
| temperature: 0.3 | ||
| top_p: 0.95 | ||
| min_p: 0.0 | ||
| top_k: -1 | ||
| max_tokens: 32000 | ||
| openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}" | ||
| openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}" | ||
| openrouter_provider: "anthropic" | ||
| disable_cache_control: false | ||
| keep_tool_result: -1 | ||
| oai_tool_thinking: false | ||
|
|
||
| tool_config: | ||
| - tool-searching | ||
| - tool-image-video | ||
| - tool-reading | ||
| - tool-code | ||
| - tool-audio | ||
|
|
||
| max_turns: 50 # Maximum number of turns for sub-agent execution | ||
| max_tool_calls_per_turn: 10 # Maximum number of tool calls per turn | ||
|
|
||
|
|
||
| # Can define some top-level or default parameters here | ||
| output_dir: logs/ | ||
| data_dir: "${oc.env:DATA_DIR,data}" # Points to where data is stored | ||
|
|
||
|
|
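The agent config above leans on `${oc.env:NAME,default}` interpolations, so a small sketch of how those strings appear to resolve may help when checking a `.env` file. This is a simplified stand-in written with the standard library only, not OmegaConf's implementation; the function name `resolve_env` is hypothetical, and treating `???` as "must be set in the environment" is an assumption based on how the config uses it.

```python
import os
import re

# Matches "${oc.env:NAME}" or "${oc.env:NAME,default}" when it is the whole value.
_ENV_PATTERN = re.compile(r"^\$\{oc\.env:([A-Za-z_][A-Za-z0-9_]*)(?:,(.*))?\}$")


def resolve_env(value: str, environ=None) -> str:
    """Resolve a single ${oc.env:...} string (simplified, hypothetical helper)."""
    environ = os.environ if environ is None else environ
    match = _ENV_PATTERN.match(value)
    if match is None:
        return value  # plain string, nothing to resolve
    name, default = match.group(1), match.group(2)
    if name in environ:
        return environ[name]
    if default is None or default == "???":
        # Assumption: "???" marks a value the environment must supply.
        raise KeyError(f"required environment variable {name} is not set")
    return default
```

Under these assumptions, `openrouter_base_url` falls back to `https://openrouter.ai/api/v1` when `OPENROUTER_BASE_URL` is unset, while a missing `OPENROUTER_API_KEY` fails early instead of producing an empty key.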
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| # config/benchmark/hle-text-only.yaml | ||
| defaults: | ||
| - default | ||
| - _self_ | ||
|
|
||
| name: "hle-text-only" | ||
|
|
||
| data: | ||
| data_dir: "${data_dir}/hle-text-only" # Path to hle-text-only dataset | ||
| metadata_file: "standardized_data.jsonl" # Metadata filename | ||
| whitelist: [] # Optional: List of specific task_ids to run | ||
|
|
||
| execution: | ||
| max_tasks: null # null = no limit, or specify a number | ||
| max_concurrent: 10 # Number of parallel tasks | ||
| pass_at_k: 1 # Number of attempts per task | ||
|
|
||
| # OpenAI API key for evaluation (required for hle-text-only since it has ground truth) | ||
| openai_api_key: "${oc.env:OPENAI_API_KEY,???}" | ||
|
|
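The `whitelist`, `max_tasks`, and `pass_at_k` settings in the benchmark config above can be read as a small task-selection step. The sketch below illustrates one plausible reading: an empty whitelist means "no filter", `null` means "no limit", and each selected task is attempted `pass_at_k` times. The helper names and the order in which the whitelist and the limit are applied are assumptions, not MiroFlow's actual runner logic.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def select_tasks(
    task_ids: Iterable[str],
    whitelist: Sequence[str] = (),
    max_tasks: Optional[int] = None,
) -> List[str]:
    """Filter by whitelist (if non-empty), then cap at max_tasks (if not None)."""
    allowed = set(whitelist)
    selected = [t for t in task_ids if not allowed or t in allowed]
    if max_tasks is not None:
        selected = selected[:max_tasks]
    return selected


def expand_attempts(task_ids: Iterable[str], pass_at_k: int = 1) -> List[Tuple[str, int]]:
    """Pair every task with attempt numbers 1..pass_at_k."""
    return [(t, attempt) for t in task_ids for attempt in range(1, pass_at_k + 1)]
```

With the defaults from the diff (`whitelist: []`, `max_tasks: null`, `pass_at_k: 1`), every task in the metadata file would be run exactly once.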
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| # HLE | ||
|
|
||
| MiroFlow's evaluation on the HLE-text-only benchmark demonstrates its capabilities in reasoning and question answering on the text-only subset of HLE, where questions carry no image input. | ||
|
|
||
| More details: [HLE text only Dataset on HuggingFace](https://huggingface.co/datasets/macabdul9/hle_text_only) | ||
|
|
||
| --- | ||
|
|
||
| ## Dataset Overview | ||
|
|
||
| !!! info "HLE Dataset (text only)" | ||
| The dataset is a text-only subset of HLE. | ||
|
|
||
| --- | ||
|
|
||
| ## Quick Start Guide | ||
|
|
||
| ### Step 1: Prepare the HLE (text only) Dataset | ||
|
|
||
| ```bash title="Download HLE (text only) Dataset" | ||
| uv run main.py prepare-benchmark get hle-text-only | ||
| ``` | ||
|
|
||
| This will download the dataset to `data/hle-text-only/`. | ||
|
|
||
| ### Step 2: Configure API Keys | ||
|
|
||
| ```env title=".env Configuration" | ||
| # For searching and web scraping | ||
| SERPER_API_KEY="xxx" | ||
| JINA_API_KEY="xxx" | ||
|
|
||
| # For Linux sandbox (code execution environment) | ||
| E2B_API_KEY="xxx" | ||
|
|
||
| # Claude-3.7-Sonnet via OpenRouter | ||
| OPENROUTER_API_KEY="xxx" | ||
| OPENROUTER_BASE_URL="https://openrouter.ai/api/v1" | ||
|
|
||
| # Vision understanding | ||
| ANTHROPIC_API_KEY="xxx" | ||
| GEMINI_API_KEY="xxx" | ||
|
|
||
| # Hint generation and final answer extraction | ||
| OPENAI_API_KEY="xxx" | ||
| OPENAI_BASE_URL="https://api.openai.com/v1" | ||
| ``` | ||
|
|
||
| ### Step 3: Run the Evaluation | ||
|
|
||
| ```bash title="Run HLE Evaluation" | ||
| uv run main.py common-benchmark --config_file_name=agent_hle-text-only_claude37sonnet output_dir="logs/hle-text-only/$(date +"%Y%m%d_%H%M")" | ||
| ``` | ||
|
|
||
| !!! tip "Resume Interrupted Evaluation" | ||
| Specify the same output directory to continue from where you left off: | ||
|
|
||
| ```bash | ||
| uv run main.py common-benchmark --config_file_name=agent_hle-text-only_claude37sonnet output_dir="logs/hle-text-only/20251014_1504" | ||
| ``` | ||
|
|
||
| ### Step 4: Review Results | ||
|
|
||
| ```bash title="Check Results" | ||
| # View accuracy summary | ||
| cat logs/hle-text-only/*/benchmark_results_pass_at_1_accuracy.txt | ||
|
|
||
| # View detailed results | ||
| cat logs/hle-text-only/*/benchmark_results.jsonl | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| ## Usage Examples | ||
|
|
||
| ### Test with Limited Tasks | ||
|
|
||
| ```bash | ||
| uv run main.py common-benchmark --config_file_name=agent_hle-text-only_claude37sonnet benchmark.execution.max_tasks=10 output_dir="logs/hle-text-only/$(date +"%Y%m%d_%H%M")" | ||
| ``` | ||
|
|
||
| ### Adjust Concurrency | ||
|
|
||
| ```bash | ||
| uv run main.py common-benchmark --config_file_name=agent_hle-text-only_claude37sonnet benchmark.execution.max_concurrent=5 output_dir="logs/hle-text-only/$(date +"%Y%m%d_%H%M")" | ||
| ``` | ||
|
|
||
| --- | ||
|
|
||
| !!! info "Documentation Info" | ||
| **Last Updated:** October 2025 · **Doc Contributor:** Team @ MiroMind AI | ||
|
|
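The run and resume commands in the documentation above differ only in the `output_dir` value: a fresh run stamps the directory with `date +"%Y%m%d_%H%M"`, and a resumed run reuses an earlier stamp. A minimal Python sketch of the same naming scheme is given here for anyone scripting runs; the helper names `output_dir_for` and `build_command` are hypothetical, and only the CLI arguments shown in the docs are used.

```python
from datetime import datetime
from typing import List, Optional

CONFIG_NAME = "agent_hle-text-only_claude37sonnet"


def output_dir_for(now: datetime, root: str = "logs/hle-text-only") -> str:
    """Mirror the shell's $(date +"%Y%m%d_%H%M") directory stamp."""
    return f"{root}/{now.strftime('%Y%m%d_%H%M')}"


def build_command(
    output_dir: str,
    max_tasks: Optional[int] = None,
    max_concurrent: Optional[int] = None,
) -> List[str]:
    """Assemble the argv list for a benchmark run, with optional overrides."""
    cmd = ["uv", "run", "main.py", "common-benchmark",
           f"--config_file_name={CONFIG_NAME}"]
    if max_tasks is not None:
        cmd.append(f"benchmark.execution.max_tasks={max_tasks}")
    if max_concurrent is not None:
        cmd.append(f"benchmark.execution.max_concurrent={max_concurrent}")
    cmd.append(f"output_dir={output_dir}")
    return cmd
```

Passing the result to `subprocess.run(cmd, check=True)` would launch the run; reusing a previously generated `output_dir` string corresponds to the resume case described in the tip.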
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,30 @@ | ||
| # SPDX-FileCopyrightText: 2025 MiromindAI | ||
| # | ||
| # SPDX-License-Identifier: Apache-2.0 | ||
|
|
||
| from typing import Generator, MutableMapping | ||
|
|
||
| from datasets import load_dataset | ||
|
|
||
| from utils.prepare_benchmark.common import Task | ||
|
|
||
|
|
||
| def gen_hle_text_only(hf_token: str) -> Generator[Task, None, None]: | ||
| dataset = load_dataset("macabdul9/hle_text_only", split="test", token=hf_token) | ||
| for x in dataset: | ||
| metadata: MutableMapping = x # type: ignore | ||
| task_id = metadata.pop("id") | ||
| question = metadata.pop("question") | ||
| gt = metadata.pop("answer") | ||
| metadata.pop("image_preview") | ||
| metadata.pop("rationale_image") | ||
| task = Task( | ||
| task_id=task_id, | ||
| task_question=question, | ||
| ground_truth=gt, | ||
| file_path=None, | ||
| metadata=metadata, | ||
| ) | ||
| yield task | ||
|
|
||
| return | ||
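The loader above maps each dataset row onto a `Task` by popping known columns and keeping the remainder as metadata. The sketch below illustrates that mapping on in-memory rows so it can be tried without a Hugging Face token. The `Task` dataclass is a stand-in that only mirrors the fields used in the diff, the `category` column in the usage example is made up, and the two changes (copying the row, and popping the image columns with a default) are suggestions rather than what the PR does.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, Optional


@dataclass
class Task:
    """Stand-in for utils.prepare_benchmark.common.Task (fields taken from the diff)."""
    task_id: str
    task_question: str
    ground_truth: str
    file_path: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def rows_to_tasks(rows: Iterable[Dict[str, Any]]) -> Iterator[Task]:
    """Convert raw rows into Task objects without mutating the input rows."""
    for row in rows:
        metadata = dict(row)  # copy so the caller's row stays intact
        task_id = metadata.pop("id")
        question = metadata.pop("question")
        ground_truth = metadata.pop("answer")
        # Tolerate rows that lack the image columns instead of raising KeyError.
        metadata.pop("image_preview", None)
        metadata.pop("rationale_image", None)
        yield Task(
            task_id=task_id,
            task_question=question,
            ground_truth=ground_truth,
            file_path=None,
            metadata=metadata,
        )
```

For example, `list(rows_to_tasks([{"id": "t1", "question": "q", "answer": "a", "category": "Math"}]))` yields one `Task` whose `metadata` holds only the leftover `category` column.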
The explicit `return` statement at the end of a generator function is unnecessary. Generator functions automatically return when they reach the end.
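The point in the review comment can be checked with two tiny generators, one ending in a bare `return` and one without. This is only an illustration of standard Python generator behavior; the function names are made up for the example.

```python
from typing import Iterator


def with_return(n: int) -> Iterator[int]:
    for i in range(n):
        yield i
    return  # redundant: falling off the end has the same effect


def without_return(n: int) -> Iterator[int]:
    for i in range(n):
        yield i


def is_exhausted(gen: Iterator[int]) -> bool:
    """Return True if the generator raises StopIteration on the next step."""
    try:
        next(gen)
    except StopIteration:
        return True
    return False
```

Both versions yield the same items and both signal exhaustion with `StopIteration`, so dropping the trailing `return` from `gen_hle_text_only` would not change its behavior.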