
Commit e276581

feat(benchmark): add evaluation support for finsearchcomp (#51)
- upd: add futurex evaluation support
- upd: support multiple eval runs for futurex and add relevant doc
- upd: fix bugs with doc for futurex
- debug: fix wrong calling path
- add preparation for finsearchcomp
- update a premature version of finsearchcomp benchmark
- clean redundant code in merging
- upd: modify yaml to use MiroThinker as the main agent, add check-progress file to exclude T1
- upd: check_progress function for finsearchcomp now considers the Global and Greater China regions separately
- upd: add docs and shell script for multiple runs
- fix: check_finsearchcomp_progress not displaying results from the Greater China region
1 parent 3d49bd8 commit e276581

15 files changed (+871, −3 lines)

common_benchmark.py

Lines changed: 2 additions & 0 deletions
```diff
@@ -210,6 +210,7 @@ async def run_single_task(self, task: BenchmarkTask) -> BenchmarkResult:
                 sub_agent_tool_managers=self.sub_agent_tool_managers,
                 output_formatter=self.output_formatter,
                 ground_truth=task.ground_truth,
+                metadata=task.metadata,
                 log_path=self.output_dir
                 / f"task_{task.task_id}_attempt_{attempt}.json",
             )
@@ -242,6 +243,7 @@ async def run_single_task(self, task: BenchmarkTask) -> BenchmarkResult:
                 question=task.task_question,
                 target=task.ground_truth,
                 predicted_answer=attempt_result["model_boxed_answer"],
+                metadata=task.metadata,
             )
             attempt_result["judge_result"] = evaluation_result
             attempt_result["is_correct"] = evaluation_result == "CORRECT"
```
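Threading `task.metadata` through to the judge lets evaluation vary by task category. A minimal illustrative sketch of why that matters for FinSearchComp, where T1 ground truth goes stale; the function name `judge_answer`, the `task_type` key, and the `NOT_EVALUATED` label are assumptions for illustration, not the repository's actual judge:

```python
def judge_answer(question, target, predicted_answer, metadata=None):
    """Hypothetical judge: branch on task metadata before grading."""
    metadata = metadata or {}
    if metadata.get("task_type") == "T1":
        # T1 (Time-Sensitive Data Fetching) answers drift over time,
        # so correctness is not graded for them.
        return "NOT_EVALUATED"
    return "CORRECT" if predicted_answer.strip() == target.strip() else "INCORRECT"

print(judge_answer("Q", "42", "42", {"task_type": "T2"}))  # CORRECT
print(judge_answer("Q", "42", "41", {"task_type": "T1"}))  # NOT_EVALUATED
```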

config/agent_finsearchcomp.yaml

Lines changed: 69 additions & 0 deletions
```yaml
defaults:
  - benchmark: finsearchcomp
  - override hydra/job_logging: none
  - _self_  # Allow defining variables at the top of this file


main_agent:
  prompt_class: MainAgentPrompt_GAIA
  llm:
    provider_class: "MiroThinkerSGLangClient"
    model_name: "MODEL_NAME"
    async_client: true
    temperature: 0.6
    top_p: 0.95
    min_p: 0.0
    top_k: -1
    max_tokens: 8192
    oai_mirothinker_api_key: "${oc.env:OAI_MIROTHINKER_API_KEY,dummy_key}"
    oai_mirothinker_base_url: "${oc.env:OAI_MIROTHINKER_BASE_URL,http://localhost:61005/v1}"
    keep_tool_result: -1
    oai_tool_thinking: false

  tool_config:
    - tool-reasoning

  max_turns: 20  # Maximum number of turns for main agent execution
  max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn

input_process:
  o3_hint: true
output_process:
  o3_final_answer: true

openai_api_key: "${oc.env:OPENAI_API_KEY,???}"  # used for o3 hints and final answer extraction
add_message_id: true
keep_tool_result: -1
chinese_context: "${oc.env:CHINESE_CONTEXT,false}"


sub_agents:
  agent-worker:
    prompt_class: SubAgentWorkerPrompt
    llm:
      provider_class: "MiroThinkerSGLangClient"
      model_name: "MODEL_NAME"
      async_client: true
      temperature: 0.6
      top_p: 0.95
      min_p: 0.0
      top_k: -1
      max_tokens: 8192
      oai_mirothinker_api_key: "${oc.env:OAI_MIROTHINKER_API_KEY,dummy_key}"
      oai_mirothinker_base_url: "${oc.env:OAI_MIROTHINKER_BASE_URL,http://localhost:61005/v1}"
      keep_tool_result: -1
      oai_tool_thinking: false

    tool_config:
      - tool-searching
      - tool-image-video
      - tool-reading
      - tool-code
      - tool-audio

    max_turns: 20  # Maximum number of turns for sub-agent execution
    max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn

# Can define some top-level or default parameters here
output_dir: logs/
data_dir: "${oc.env:DATA_DIR,data}"  # Points to where data is stored
```
config/benchmark/finsearchcomp.yaml

Lines changed: 19 additions & 0 deletions

```yaml
# config/benchmark/finsearchcomp.yaml
defaults:
  - default
  - _self_

name: "finsearchcomp"

data:
  data_dir: "${data_dir}/finsearchcomp"  # Path to finsearchcomp dataset
  metadata_file: "standardized_data.jsonl"  # Metadata filename
  whitelist: []  # Optional: List of specific task_ids to run

execution:
  max_tasks: null  # null = no limit, or specify a number
  max_concurrent: 5  # Number of parallel tasks
  pass_at_k: 1  # Number of attempts per task

# OpenAI API key for evaluation (required for finsearchcomp since it has ground truth)
openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```
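The `${oc.env:VAR,default}` entries above are OmegaConf-style environment interpolations, resolved when Hydra loads the config: the environment variable wins if set, otherwise the default applies. A toy stand-in for that fallback semantics (not the real OmegaConf resolver):

```python
import re

def resolve_env(value: str, env: dict) -> str:
    """Toy stand-in for OmegaConf's ${oc.env:VAR,default} interpolation:
    substitute the environment variable if present, else the default."""
    def repl(match):
        var, default = match.group(1), match.group(2)
        return env.get(var, default)
    return re.sub(r"\$\{oc\.env:([A-Za-z_]+),([^}]*)\}", repl, value)

print(resolve_env("${oc.env:DATA_DIR,data}", env={}))                      # data
print(resolve_env("${oc.env:DATA_DIR,data}", env={"DATA_DIR": "/mnt/ds"}))  # /mnt/ds
```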

docs/mkdocs/docs/finsearchcomp.md

Lines changed: 178 additions & 0 deletions
# FinSearchComp

MiroFlow's evaluation on the FinSearchComp benchmark demonstrates its capabilities in financial information search and analysis, exercising multi-step reasoning in complex financial research scenarios.

More details: [FinSearchComp Dataset](https://huggingface.co/datasets/ByteSeedXpert/FinSearchComp)

---

## Dataset Overview

!!! info "FinSearchComp Dataset"
    The FinSearchComp dataset consists of financial search and analysis tasks that require comprehensive research capabilities, including:

    - Financial data retrieval and analysis
    - Market research and company analysis
    - Investment decision support
    - Financial news and report interpretation
    - Time-sensitive financial information gathering

!!! abstract "Key Dataset Characteristics"

    - **Total Tasks**: 635 (across T1, T2, T3 categories)
    - **Task Types**:
        - **T1**: Time-Sensitive Data Fetching
        - **T2**: Financial Analysis and Research
        - **T3**: Complex Historical Investigation
    - **Answer Format**: Detailed financial analysis and research reports
    - **Ground Truth**: Available for T2 and T3 tasks; changes dynamically for T1 tasks
    - **Evaluation**: Judge-based evaluation with correctness assessment

---
## Quick Start Guide

!!! note "Quick Start Instructions"
    This section provides step-by-step instructions to run the FinSearchComp benchmark and prepare submission results. **Note**: This is a quick start guide for running the benchmark, not for reproducing exact submitted results.

### Step 1: Prepare the FinSearchComp Dataset

!!! tip "Dataset Setup"
    Use the integrated prepare-benchmark command to download and process the dataset:

    ```bash title="Download FinSearchComp Dataset"
    uv run main.py prepare-benchmark get finsearchcomp
    ```

    This will create the standardized dataset at `data/finsearchcomp/standardized_data.jsonl`.
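To sanity-check the prepared dataset, you can parse a line of the JSONL file. A minimal sketch; the record below is hypothetical, and the field names (`task_id`, `task_question`, `ground_truth`, `metadata`) are assumptions based on the `BenchmarkTask` attributes used in `common_benchmark.py`:

```python
import json

# One hypothetical line from standardized_data.jsonl.
line = json.dumps({
    "task_id": "(T2)Financial_Analysis_001",
    "task_question": "What was company X's FY2023 revenue?",
    "ground_truth": "$1.2B",
    "metadata": {"task_type": "T2", "region": "Global"},
})

# Each line is an independent JSON object.
task = json.loads(line)
print(task["metadata"]["task_type"])  # T2
```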
### Step 2: Configure API Keys

!!! warning "API Key Configuration"
    Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

    ```env title=".env Configuration"
    # For searching and web scraping
    SERPER_API_KEY="xxx"
    JINA_API_KEY="xxx"

    # For Linux sandbox (code execution environment)
    E2B_API_KEY="xxx"

    # We use the MiroThinker model for financial analysis
    OAI_MIROTHINKER_API_KEY="xxx"
    OAI_MIROTHINKER_BASE_URL="http://localhost:61005/v1"

    # Used for o3 hints and final answer extraction
    OPENAI_API_KEY="xxx"
    OPENAI_BASE_URL="https://api.openai.com/v1"

    # Used for Claude vision understanding
    ANTHROPIC_API_KEY="xxx"

    # Used for Gemini vision
    GEMINI_API_KEY="xxx"
    ```
### Step 3: Run the Evaluation

!!! example "Evaluation Execution"
    Execute the following command to run evaluation on the FinSearchComp dataset:

    ```bash title="Run FinSearchComp Evaluation"
    uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
    ```

!!! tip "Progress Monitoring and Resume"
    To check progress while the evaluation is running:

    ```bash title="Check Progress"
    uv run utils/progress_check/check_finsearchcomp_progress.py $PATH_TO_LOG
    ```

    If you need to resume an interrupted evaluation, specify the same output directory to continue from where you left off:

    ```bash title="Resume Evaluation"
    uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp output_dir=${PATH_TO_LOG}
    ```
### Step 4: Extract Results

!!! example "Result Extraction"
    After the evaluation completes, results are automatically generated in the output directory:

    - `benchmark_results.jsonl`: Detailed results for each task
    - `benchmark_results_pass_at_1_accuracy.txt`: Summary accuracy statistics
    - `task_*_attempt_1.json`: Individual task execution traces

---
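As a rough illustration of what the accuracy summary aggregates, the sketch below computes pass@1 over `benchmark_results.jsonl`-style records, counting only evaluable (non-T1) tasks. The records are hypothetical; the `judge_result` and `is_correct` field names follow the keys set in `common_benchmark.py`, while the `NOT_EVALUATED` label for T1 is an assumption:

```python
records = [
    {"task_id": "(T1)Time_Sensitive_Data_Fetching_001", "judge_result": "NOT_EVALUATED", "is_correct": False},
    {"task_id": "(T2)Financial_Analysis_001", "judge_result": "CORRECT", "is_correct": True},
    {"task_id": "(T3)Complex_Historical_Investigation_001", "judge_result": "INCORRECT", "is_correct": False},
]

# T1 ground truth is stale, so only T2/T3 records count toward pass@1.
evaluable = [r for r in records if not r["task_id"].startswith("(T1)")]
pass_at_1 = sum(r["is_correct"] for r in evaluable) / len(evaluable)
print(f"pass@1 = {pass_at_1:.2f}")  # pass@1 = 0.50
```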
## Evaluation Notes

!!! warning "Task Type Considerations"
    The FinSearchComp dataset includes different task types with varying evaluation criteria:

    - **T1 Tasks**: Time-Sensitive Data Fetching tasks are excluded from correctness evaluation due to outdated ground truth, but completion is still tracked
    - **T2 Tasks**: Financial Analysis tasks are evaluated for correctness and quality
    - **T3 Tasks**: Complex Historical Investigation tasks require comprehensive research and analysis

!!! info "Output Analysis"
    The evaluation generates detailed execution traces showing:

    - The research process for each financial task
    - Information gathering from multiple sources
    - Financial calculations and analysis
    - Comprehensive reports with insights and recommendations
### Directory Structure

After running evaluations, you'll find the following structure:

```
logs/finsearchcomp/agent_finsearchcomp_YYYYMMDD_HHMM/
├── benchmark_results.jsonl                           # Task results summary
├── benchmark_results_pass_at_1_accuracy.txt          # Accuracy statistics
├── task_(T1)Time_Sensitive_Data_Fetching_*.json      # T1 task traces
├── task_(T2)Financial_Analysis_*.json                # T2 task traces
├── task_(T3)Complex_Historical_Investigation_*.json  # T3 task traces
└── output.log                                        # Execution log
```
### Task Categories Breakdown

The progress checker provides detailed statistics:

- **Total Tasks**: Complete count across all categories
- **Completed Tasks**: Successfully finished tasks
- **Correct Tasks**: Tasks with `judge_result` "CORRECT" (T2 and T3 only)
- **Category Breakdown**: Separate counts for T1, T2, and T3 tasks
- **Accuracy Metrics**: Pass@1 accuracy for evaluable tasks

---
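The category breakdown can be derived from the trace filenames alone. A minimal sketch, assuming the `task_(T*)...` naming convention shown in the directory structure (the filenames below are made up for illustration):

```python
import re
from collections import Counter

# Hypothetical trace filenames following the task_(T*)... pattern.
filenames = [
    "task_(T1)Time_Sensitive_Data_Fetching_003_attempt_1.json",
    "task_(T2)Financial_Analysis_017_attempt_1.json",
    "task_(T2)Financial_Analysis_018_attempt_1.json",
    "task_(T3)Complex_Historical_Investigation_005_attempt_1.json",
]

# Pull the T1/T2/T3 tag out of each filename and tally per category.
counts = Counter(
    m.group(1)
    for name in filenames
    if (m := re.search(r"task_\((T[123])\)", name))
)
print(dict(counts))  # {'T1': 1, 'T2': 2, 'T3': 1}
```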
## Usage Examples

### Single Run Evaluation
```bash title="Basic Evaluation"
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
```

### Limited Task Testing
```bash title="Test with Limited Tasks"
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp benchmark.execution.max_tasks=5 output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
```

### Custom Agent Configuration
```bash title="Different Agent Setup"
uv run main.py common-benchmark --config_file_name=agent_gaia-validation benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
```

### Multiple Runs for Reliability
```bash title="Multiple Runs"
NUM_RUNS=5 ./scripts/run_evaluate_multiple_runs_finsearchcomp.sh
```

---

!!! info "Documentation Info"
    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI

docs/mkdocs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -54,6 +54,7 @@ nav:
     - GAIA-Test: gaia_test.md
     - FutureX: futurex.md
     - xBench-DeepSearch: xbench_ds.md
+    - FinSearchComp: finsearchcomp.md
     - Download Datasets: download_datasets.md
     - Add New Benchmarks: contribute_benchmarks.md
```

scripts/run_evaluate_multiple_runs_finsearchcomp.sh

Lines changed: 104 additions & 0 deletions

```bash
#!/bin/bash

# SPDX-FileCopyrightText: 2025 MiromindAI
#
# SPDX-License-Identifier: Apache-2.0

# Multiple-runs FinSearchComp evaluation script
# Based on the working command: uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp output_dir=logs/finsearchcomp/$(date +"%Y%m%d_%H%M")

# Configuration parameters
NUM_RUNS=${NUM_RUNS:-3}
MAX_TASKS=${MAX_TASKS:-1}
MAX_CONCURRENT=${MAX_CONCURRENT:-5}
BENCHMARK_NAME="finsearchcomp"
AGENT_SET=${AGENT_SET:-"agent_finsearchcomp"}

# Set results directory with timestamp
TIMESTAMP=$(date +%Y%m%d_%H%M)
RESULTS_DIR="logs/${BENCHMARK_NAME}/${AGENT_SET}_${TIMESTAMP}"

export LOGGER_LEVEL="INFO"

echo "🚀 Starting $NUM_RUNS runs of FinSearchComp evaluation..."
echo "📊 Using max_tasks: $MAX_TASKS (set MAX_TASKS=null for full dataset)"
echo "📊 Using max_concurrent: $MAX_CONCURRENT"
echo "📁 Results will be saved in: $RESULTS_DIR"

# Create results directory
mkdir -p "$RESULTS_DIR"

# Launch all parallel tasks
for i in $(seq 1 $NUM_RUNS); do
    echo "=========================================="
    echo "🚀 Launching experiment $i/$NUM_RUNS"
    echo "📝 Output log: $RESULTS_DIR/run_${i}_output.log"
    echo "=========================================="

    # Set specific identifier for this run
    RUN_ID="run_$i"

    # Run experiment (background execution)
    (
        echo "Starting run $i at $(date)"
        uv run main.py common-benchmark \
            --config_file_name=$AGENT_SET \
            benchmark=$BENCHMARK_NAME \
            benchmark.execution.max_tasks=$MAX_TASKS \
            benchmark.execution.max_concurrent=$MAX_CONCURRENT \
            benchmark.execution.pass_at_k=1 \
            output_dir=${RESULTS_DIR}/$RUN_ID \
            hydra.run.dir=${RESULTS_DIR}/$RUN_ID \
            > "$RESULTS_DIR/${RUN_ID}_output.log" 2>&1

        # Check if run was successful
        if [ $? -eq 0 ]; then
            echo "✅ Run $i completed successfully at $(date)"
            RESULT_FILE=$(find "${RESULTS_DIR}/$RUN_ID" -name "*accuracy.txt" 2>/dev/null | head -1)
            if [ -f "$RESULT_FILE" ]; then
                echo "📊 Results saved to $RESULT_FILE"
            else
                echo "⚠️ Warning: Result file not found for run $i"
            fi
        else
            echo "❌ Run $i failed at $(date)!"
        fi
    ) &

    # Small delay between launches
    sleep 2
done

echo "🎯 All $NUM_RUNS runs have been launched in parallel"
echo "⏳ Waiting for all runs to complete..."

# Wait for all background tasks to complete
wait

echo "=========================================="
echo "🎉 All $NUM_RUNS runs completed!"
echo "=========================================="

# Show progress summary
echo "=========================================="
echo "📊 Progress Summary:"
echo "=========================================="

echo "=========================================="
echo "🎯 Multiple runs FinSearchComp evaluation completed!"
echo "📁 Check results in: $RESULTS_DIR"
echo "📝 Check individual run logs: $RESULTS_DIR/run_*_output.log"
echo "=========================================="
echo ""
echo "💡 Usage examples:"
echo "  # Default: 3 runs, 1 task each (set MAX_TASKS=null for the full dataset)"
echo "  ./scripts/run_evaluate_multiple_runs_finsearchcomp.sh"
echo ""
echo "  # Custom parameters"
echo "  NUM_RUNS=5 MAX_TASKS=10 MAX_CONCURRENT=3 ./scripts/run_evaluate_multiple_runs_finsearchcomp.sh"
echo ""
echo "  # Different agent configuration"
echo "  AGENT_SET=agent_gaia-validation ./scripts/run_evaluate_multiple_runs_finsearchcomp.sh"
echo ""
echo "  # Limited tasks for testing"
echo "  MAX_TASKS=5 ./scripts/run_evaluate_multiple_runs_finsearchcomp.sh"
```
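After several runs finish, the per-run `*accuracy.txt` files can be averaged for a more stable estimate. A minimal sketch; the report strings below are hypothetical, and the assumed line format `pass@1 accuracy: <float>` may differ from the real file contents:

```python
import re
import statistics

# Hypothetical contents of three per-run *accuracy.txt files.
reports = [
    "pass@1 accuracy: 0.62",
    "pass@1 accuracy: 0.58",
    "pass@1 accuracy: 0.65",
]

# Pull the first float out of each report and average across runs.
scores = [float(re.search(r"(\d+\.\d+)", text).group(1)) for text in reports]
mean = statistics.mean(scores)
print(f"mean pass@1 over {len(scores)} runs: {mean:.3f}")
```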
