
Commit d9a29ba

feat(benchmark): add support for evaluation on futurex (#40)

* upd: add futurex evaluation support.
* upd: support multiple eval for futurex and add relevant doc.
* upd: fix bugs with doc for futurex.
* debug: fix wrong calling path.

1 parent 88c0528 · commit d9a29ba

File tree

11 files changed, +1030 -1 lines changed


.gitignore

Lines changed: 1 addition & 0 deletions
@@ -208,6 +208,7 @@ marimo/_lsp/
 __marimo__/

 logs/
+tmp/

 data/*
 !data/README.md

config/benchmark/futurex.yaml

Lines changed: 20 additions & 0 deletions
# config/benchmark/futurex.yaml
defaults:
  - default
  - _self_

name: "futurex"

data:
  data_dir: "${data_dir}/futurex"  # Path to your dataset
  metadata_file: "standardized_data.jsonl"  # Metadata filename
  whitelist: []  # Optional: List of specific task_ids to run

execution:
  max_tasks: null  # null = no limit, or specify a number
  max_concurrent: 5  # Number of parallel tasks
  pass_at_k: 1  # Number of attempts per task

# Set to skip evaluation since we don't have ground truth
openai_api_key: "skip_evaluation"

docs/mkdocs/docs/download_datasets.md

Lines changed: 2 additions & 0 deletions
@@ -79,6 +79,7 @@ uv run main.py prepare-benchmark get browsecomp-test
 uv run main.py prepare-benchmark get browsecomp-zh-test
 uv run main.py prepare-benchmark get hle
 uv run main.py prepare-benchmark get xbench-ds
+uv run main.py prepare-benchmark get futurex
 ```

 ### What This Script Does
@@ -94,6 +95,7 @@ uv run main.py prepare-benchmark get xbench-ds
 - `browsecomp-zh-test` - Chinese BrowseComp test set
 - `hle` - HLE dataset
 - `xbench-ds` - xbench-DeepSearch dataset
+- `futurex` - Futurex-Online dataset

 ### Customizing Dataset Selection

docs/mkdocs/docs/futurex.md

Lines changed: 267 additions & 0 deletions
# Futurex-Online

MiroFlow's evaluation on the Futurex-Online benchmark demonstrates capabilities in future event prediction tasks.

---

## Dataset Overview

!!! info "Futurex-Online Dataset"
    The Futurex-Online dataset consists of 61 prediction tasks covering various future events including:

    - Political events (referendums, elections)
    - Sports outcomes (football matches)
    - Legal proceedings
    - Economic indicators

!!! abstract "Key Dataset Characteristics"

    - **Total Tasks**: 61
    - **Task Type**: Future event prediction
    - **Answer Format**: Boxed answers (`\boxed{Yes/No}` or `\boxed{A/B/C}`)
    - **Ground Truth**: Not available (prediction tasks)
    - **Resolution Date**: Around 2025-09-21 (GMT+8)

---

## Quick Start Guide

!!! note "Quick Start Instructions"
    This section provides step-by-step instructions to run the Futurex-Online benchmark and prepare submission results. Since this is a prediction dataset without ground truth, we focus on execution traces and response generation. **Note**: This is a quick start guide for running the benchmark, not for reproducing exact submitted results.

### Step 1: Prepare the Futurex-Online Dataset

!!! tip "Dataset Setup"
    Use the integrated prepare-benchmark command to download and process the dataset:

    ```bash title="Download Futurex-Online Dataset"
    uv run main.py prepare-benchmark get futurex
    ```

    This will create the standardized dataset at `data/futurex/standardized_data.jsonl`.
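
If you want to sanity-check the download, a quick way is to print the first record of the standardized file. This is an optional, minimal sketch; the field names are whatever the prepare step emitted, so nothing below assumes a particular schema:

```python title="Peek at the Standardized Dataset (optional sketch)"
import json

# Print the first record of the standardized dataset to confirm the download worked.
# No particular schema is assumed; this just dumps whatever keys are present.
with open("data/futurex/standardized_data.jsonl", encoding="utf-8") as f:
    first = json.loads(next(f))
print(json.dumps(first, indent=2, ensure_ascii=False))
```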

### Step 2: Configure API Keys

!!! warning "API Key Configuration"
    Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

    ```env title=".env Configuration"
    # For searching and web scraping
    SERPER_API_KEY="xxx"
    JINA_API_KEY="xxx"

    # For Linux sandbox (code execution environment)
    E2B_API_KEY="xxx"

    # We use Claude-3.7-Sonnet with OpenRouter backend to initialize the LLM
    OPENROUTER_API_KEY="xxx"
    OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

    # Used for Claude vision understanding
    ANTHROPIC_API_KEY="xxx"

    # Used for Gemini vision
    GEMINI_API_KEY="xxx"

    # Used for LLM judge, reasoning, o3 hints, etc.
    OPENAI_API_KEY="xxx"
    OPENAI_BASE_URL="https://api.openai.com/v1"
    ```

### Step 3: Run the Evaluation

!!! example "Evaluation Execution"
    Execute the following command to run evaluation on the Futurex-Online dataset. This uses the basic `agent_quickstart_1` configuration for quick start purposes.

    ```bash title="Run Futurex-Online Evaluation"
    uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex/$(date +"%Y%m%d_%H%M")"
    ```

!!! tip "Progress Monitoring and Resume"
    To check the progress while running:

    ```bash title="Check Progress"
    uv run utils/progress_check/check_futurex_progress.py $PATH_TO_LOG
    ```

    If you need to resume an interrupted evaluation, specify the same output directory to continue from where you left off:

    ```bash title="Resume Evaluation, e.g."
    uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex/20250918_1010"
    ```
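
If you prefer a rough count of finished attempts without the progress script, the trace files themselves can be counted. This is only a convenience sketch that assumes the `task_*_attempt_*.json` naming shown in the Directory Structure section below; `check_futurex_progress.py` remains the authoritative tool:

```python title="Rough Progress Count (optional sketch)"
import sys
from pathlib import Path

# Count task trace files under the output directory (assumes the
# task_*_attempt_*.json layout described in the Directory Structure section).
log_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("logs/futurex")
done = list(log_dir.glob("**/task_*_attempt_*.json"))
print(f"{len(done)} task attempts recorded under {log_dir} (dataset has 61 tasks)")
```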

### Step 4: Extract Results

!!! example "Result Extraction"
    After evaluation completion, extract the results using the provided utility:

    ```bash title="Extract Results"
    uv run utils/extract_futurex_results.py logs/futurex/$(date +"%Y%m%d_%H%M")
    ```

    This will generate:

    - `futurex_results.json`: Detailed results for each task
    - `futurex_summary.json`: Summary statistics
    - `futurex_predictions.csv`: Predictions in CSV format
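
To get a quick look at what was extracted, the CSV can be inspected without assuming its exact columns. A minimal sketch (adjust the path to wherever the extraction utility wrote the file):

```python title="Inspect Extracted Predictions (optional sketch)"
import csv

# Print the header row and the number of prediction rows; column names come from
# extract_futurex_results.py, so nothing specific is assumed about them here.
with open("futurex_predictions.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print("columns:", rows[0])
print("predictions:", len(rows) - 1)
```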

---

## Sample Task Examples

### Political Prediction
```
Task: "Will the 2025 Guinea referendum pass? (resolved around 2025-09-21 (GMT+8))"
Expected Format: \boxed{Yes} or \boxed{No}
```

### Sports Prediction
```
Task: "Brighton vs. Tottenham (resolved around 2025-09-21 (GMT+8))
A. Brighton win on 2025-09-20
B. Brighton vs. Tottenham end in a draw
C. Tottenham win on 2025-09-20"
Expected Format: \boxed{A}, \boxed{B}, or \boxed{C}
```
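
Both examples end with an answer inside a `\boxed{...}` wrapper, which is what downstream tooling looks for when pulling out the final prediction. The snippet below is only an illustrative sketch of that extraction, not the project's actual parsing code:

```python title="Extract a Boxed Answer (illustrative sketch)"
import re

def extract_boxed_answer(response: str):
    """Return the content of the last \\boxed{...} in a model response, or None."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

print(extract_boxed_answer(r"Based on current polling, the answer is \boxed{Yes}"))  # -> Yes
```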

---

## Multiple Runs and Voting

!!! tip "Improving Prediction Accuracy"
    For better prediction accuracy, you can run multiple evaluations and use voting mechanisms to aggregate results. This approach helps reduce randomness and improve the reliability of predictions. **Note**: This is a quick start approach; production submissions may use more sophisticated configurations.

### Step 1: Run Multiple Evaluations

Use the multiple runs script to execute several independent evaluations:

```bash title="Run Multiple Evaluations"
./scripts/run_evaluate_multiple_runs_futurex.sh
```

This script will:

- Run 3 independent evaluations by default (configurable with `NUM_RUNS`)
- Execute all tasks in parallel for efficiency
- Generate separate result files for each run in `run_1/`, `run_2/`, etc.
- Create a consolidated `futurex_submission.jsonl` file with voting results

### Step 2: Customize Multiple Runs

You can customize the evaluation parameters:

```bash title="Custom Multiple Runs"
# Run 5 evaluations with limited tasks for testing
NUM_RUNS=5 MAX_TASKS=10 ./scripts/run_evaluate_multiple_runs_futurex.sh

# Use a different agent configuration
AGENT_SET=agent_gaia-validation ./scripts/run_evaluate_multiple_runs_futurex.sh

# Adjust concurrency for resource management
MAX_CONCURRENT=3 ./scripts/run_evaluate_multiple_runs_futurex.sh
```

### Step 3: Voting and Aggregation

After multiple runs, the system automatically:

1. **Extracts predictions** from all runs using `utils/extract_futurex_results.py`
2. **Applies majority voting** to aggregate predictions across runs
3. **Generates the submission file** in the format required by the FutureX platform
4. **Provides voting statistics** showing prediction distribution across runs

The voting process works as follows (an illustrative sketch of the logic appears after this list):

- **Majority Vote**: The most common prediction across all runs wins
- **Tie-breaking**: If tied, the prediction that appeared earliest across all runs is chosen
- **Vote Counts**: Tracks how many runs predicted each option
- **Confidence Indicators**: High agreement indicates more reliable predictions
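
The aggregation itself is handled by `utils/extract_futurex_results.py`; the snippet below is only a minimal sketch of the voting rule described above (most common prediction wins, ties broken by earliest appearance), not the script's actual implementation:

```python title="Majority Voting Sketch (illustrative)"
from collections import Counter

def majority_vote(predictions: list[str]) -> tuple[str, dict[str, int]]:
    """Pick the most common prediction; break ties by earliest appearance."""
    counts = Counter(predictions)
    best = max(counts, key=lambda p: (counts[p], -predictions.index(p)))
    return best, dict(counts)

# Example: three runs, two of which predicted "No"
prediction, vote_counts = majority_vote(["No", "Yes", "No"])
print(prediction, vote_counts)  # No {'No': 2, 'Yes': 1}
```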

### Step 4: Analyze Voting Results

Check the generated files for voting analysis:

```bash title="Check Voting Results"
# View submission file with voting results
cat logs/futurex/agent_quickstart_1_*/futurex_submission.jsonl

# Check individual run results
ls logs/futurex/agent_quickstart_1_*/run_*/

# Check progress and voting statistics
uv run python utils/progress_check/check_futurex_progress.py logs/futurex/agent_quickstart_1_*
```

### Manual Voting Aggregation

You can also manually run the voting aggregation:

```bash title="Manual Voting Aggregation"
# Aggregate multiple runs with majority voting
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_* --aggregate

# Force single run mode (if needed)
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_*/run_1 --single

# Specify custom output file
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_* -o my_voted_predictions.jsonl
```

### Voting Output Format

The voting aggregation generates a submission file with the following format:

```json
{"id": "687104310a994c0060ef87a9", "prediction": "No", "vote_counts": {"No": 2}}
{"id": "68a9b46e961bd3003c8f006b", "prediction": "Yes", "vote_counts": {"Yes": 2}}
```

The output includes:

- **`id`**: Task identifier
- **`prediction`**: Final voted prediction (without the `\boxed{}` wrapper)
- **`vote_counts`**: Dictionary showing how many runs predicted each option

For example, `"vote_counts": {"No": 2}` means 2 out of 2 runs predicted "No", indicating high confidence.
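
Because `vote_counts` is included for every task, it is easy to flag low-agreement predictions before submitting. A minimal sketch (point the path at the `futurex_submission.jsonl` in your run directory):

```python title="Flag Low-Agreement Predictions (optional sketch)"
import json

# Report tasks where fewer runs agreed on the winning prediction than the threshold.
THRESHOLD = 2
with open("futurex_submission.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        top_votes = record["vote_counts"][record["prediction"]]
        if top_votes < THRESHOLD:
            print(f"low agreement: {record['id']} -> {record['prediction']} ({top_votes} vote(s))")
```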

---

## Evaluation Notes

!!! warning "No Ground Truth Available"
    Since Futurex-Online is a prediction dataset, there are no ground truth answers available for evaluation. The focus is on:

    - Response generation quality
    - Reasoning process documentation
    - Prediction confidence and methodology

!!! info "Output Analysis"
    The evaluation generates detailed execution traces showing:

    - Research process for each prediction
    - Information gathering from web sources
    - Reasoning chains leading to predictions
    - Final boxed answers in the required format

### Directory Structure

After running multiple evaluations, you'll find the following structure:

```
logs/futurex/agent_quickstart_1_YYYYMMDD_HHMM/
├── futurex_submission.jsonl        # Final voted predictions
├── run_1/                          # First run results
│   ├── benchmark_results.jsonl     # Individual task results
│   ├── benchmark_results_pass_at_1_accuracy.txt
│   └── task_*_attempt_1.json       # Detailed execution traces
├── run_2/                          # Second run results
│   └── ... (same structure as run_1)
├── run_1_output.log                # Run 1 execution log
└── run_2_output.log                # Run 2 execution log
```

---

!!! info "Documentation Info"
    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI

docs/mkdocs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -52,6 +52,7 @@ nav:
 - Benchmarks:
   - GAIA-Validation: gaia_validation.md
   - GAIA-Test: gaia_test.md
+  - FutureX: futurex.md
   - Add New Benchmarks: contribute_benchmarks.md

 - Tools:

0 commit comments
