# Futurex-Online

MiroFlow's evaluation on the Futurex-Online benchmark demonstrates its capabilities on future event prediction tasks.

---

## Dataset Overview

!!! info "Futurex-Online Dataset"
    The Futurex-Online dataset consists of 61 prediction tasks covering various future events, including:

    - Political events (referendums, elections)
    - Sports outcomes (football matches)
    - Legal proceedings
    - Economic indicators


!!! abstract "Key Dataset Characteristics"

    - **Total Tasks**: 61
    - **Task Type**: Future event prediction
    - **Answer Format**: Boxed answers (`\boxed{Yes/No}` or `\boxed{A/B/C}`)
    - **Ground Truth**: Not available (prediction tasks)
    - **Resolution Date**: Around 2025-09-21 (GMT+8)

---

## Quick Start Guide

!!! note "Quick Start Instructions"
    This section provides step-by-step instructions to run the Futurex-Online benchmark and prepare submission results. Since this is a prediction dataset without ground truth, we focus on execution traces and response generation. **Note**: This is a quick start guide for running the benchmark, not for reproducing exact submitted results.

### Step 1: Prepare the Futurex-Online Dataset

!!! tip "Dataset Setup"
    Use the integrated `prepare-benchmark` command to download and process the dataset:

```bash title="Download Futurex-Online Dataset"
uv run main.py prepare-benchmark get futurex
```

This will create the standardized dataset at `data/futurex/standardized_data.jsonl`.
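To sanity-check the download, a minimal Python sketch like the following can count the tasks and list the fields of the first record (the field names inside each record are not assumed here; only the file path above comes from the docs):

```python title="Inspect the Standardized Dataset (sketch)"
# Minimal sketch: count tasks and preview the first record.
import json

path = "data/futurex/standardized_data.jsonl"
with open(path, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(records)} tasks from {path}")  # expected: 61
print("Fields in the first record:", sorted(records[0].keys()))
```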

### Step 2: Configure API Keys

!!! warning "API Key Configuration"
    Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

```env title=".env Configuration"
# For searching and web scraping
SERPER_API_KEY="xxx"
JINA_API_KEY="xxx"

# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"

# We use Claude-3.7-Sonnet with the OpenRouter backend to initialize the LLM
OPENROUTER_API_KEY="xxx"
OPENROUTER_BASE_URL="https://openrouter.ai/api/v1"

# Used for Claude vision understanding
ANTHROPIC_API_KEY="xxx"

# Used for Gemini vision
GEMINI_API_KEY="xxx"

# Used for LLM judge, reasoning, o3 hints, etc.
OPENAI_API_KEY="xxx"
OPENAI_BASE_URL="https://api.openai.com/v1"
```
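Before launching a run, a quick sanity check like the sketch below can flag keys that are missing or still set to the placeholder value. It simply parses `.env` as `KEY="value"` lines and is not part of the MiroFlow CLI:

```python title="Check .env Keys (sketch)"
# Minimal sketch: warn about keys that are missing or left as the "xxx" placeholder.
from pathlib import Path

required = [
    "SERPER_API_KEY", "JINA_API_KEY", "E2B_API_KEY", "OPENROUTER_API_KEY",
    "ANTHROPIC_API_KEY", "GEMINI_API_KEY", "OPENAI_API_KEY",
]

env = {}
for line in Path(".env").read_text(encoding="utf-8").splitlines():
    line = line.strip()
    if line and not line.startswith("#") and "=" in line:
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip().strip('"')

for key in required:
    if not env.get(key) or env[key] == "xxx":
        print(f"Missing or placeholder value for {key}")
```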

### Step 3: Run the Evaluation

!!! example "Evaluation Execution"
    Execute the following command to run the evaluation on the Futurex-Online dataset. It uses the basic `agent_quickstart_1` configuration for quick start purposes.

```bash title="Run Futurex-Online Evaluation"
uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex/$(date +"%Y%m%d_%H%M")"
```

!!! tip "Progress Monitoring and Resume"
    To check the progress while running:

    ```bash title="Check Progress"
    uv run utils/progress_check/check_futurex_progress.py $PATH_TO_LOG
    ```

    If you need to resume an interrupted evaluation, specify the same output directory to continue from where you left off.

    ```bash title="Resume Evaluation (example)"
    uv run main.py common-benchmark --config_file_name=agent_quickstart_1 benchmark=futurex output_dir="logs/futurex/20250918_1010"
    ```

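The bundled progress checker is the recommended tool. If you just want a rough count without it, a sketch like the following gives an approximation, assuming each completed task leaves a `task_*_attempt_*.json` trace in the output directory (see the Directory Structure section below):

```python title="Rough Progress Count (sketch)"
# Minimal sketch: approximate progress by counting task trace files.
import sys
from pathlib import Path

TOTAL_TASKS = 61  # size of the Futurex-Online dataset

log_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path(".")
traces = list(log_dir.rglob("task_*_attempt_*.json"))
print(f"{len(traces)} trace files found under {log_dir} (dataset has {TOTAL_TASKS} tasks)")
```
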

### Step 4: Extract Results

!!! example "Result Extraction"
    After the evaluation completes, extract the results using the provided utility. Point it at the same `output_dir` you used in Step 3:

```bash title="Extract Results"
uv run utils/extract_futurex_results.py logs/futurex/$(date +"%Y%m%d_%H%M")
```

This will generate the following files (a quick way to inspect them is sketched below):

- `futurex_results.json`: Detailed results for each task
- `futurex_summary.json`: Summary statistics
- `futurex_predictions.csv`: Predictions in CSV format

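For example, a short sketch like this can peek at the generated files. The output directory name is hypothetical, and the keys inside the summary JSON and the CSV columns are not assumed here:

```python title="Inspect Extracted Results (sketch)"
# Minimal sketch: load the generated files and print their shape.
import csv
import json
from pathlib import Path

out_dir = Path("logs/futurex/20250918_1010")  # hypothetical output directory

summary = json.loads((out_dir / "futurex_summary.json").read_text(encoding="utf-8"))
print("Summary keys:", sorted(summary.keys()) if isinstance(summary, dict) else type(summary))

with (out_dir / "futurex_predictions.csv").open(newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(f"CSV: {len(rows)} rows, header: {rows[0] if rows else 'empty'}")
```
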

---

## Sample Task Examples

### Political Prediction
```
Task: "Will the 2025 Guinea referendum pass? (resolved around 2025-09-21 (GMT+8))"
Expected Format: \boxed{Yes} or \boxed{No}
```

### Sports Prediction
```
Task: "Brighton vs. Tottenham (resolved around 2025-09-21 (GMT+8))
A. Brighton win on 2025-09-20
B. Brighton vs. Tottenham end in a draw
C. Tottenham win on 2025-09-20"
Expected Format: \boxed{A}, \boxed{B}, or \boxed{C}
```

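The final answer is read out of the agent's response by looking for the boxed pattern shown above. A minimal sketch of that extraction follows; the repository ships its own extractor in `utils/extract_futurex_results.py`, so this regex is illustrative only:

```python title="Extract a Boxed Answer (sketch)"
# Minimal sketch: pull the last \boxed{...} answer out of a response string.
import re

def extract_boxed_answer(response: str) -> str | None:
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    return matches[-1].strip() if matches else None

print(extract_boxed_answer("Weighing the latest polls, the answer is \\boxed{Yes}"))  # -> "Yes"
print(extract_boxed_answer("Given the head-to-head record: \\boxed{B}"))              # -> "B"
```
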

---

## Multiple Runs and Voting

!!! tip "Improving Prediction Accuracy"
    For better prediction accuracy, you can run multiple evaluations and use voting mechanisms to aggregate results. This approach helps reduce randomness and improve the reliability of predictions. **Note**: This is a quick start approach; production submissions may use more sophisticated configurations.

### Step 1: Run Multiple Evaluations

Use the multiple-runs script to execute several independent evaluations:

```bash title="Run Multiple Evaluations"
./scripts/run_evaluate_multiple_runs_futurex.sh
```

This script will:

- Run 3 independent evaluations by default (configurable with `NUM_RUNS`)
- Execute all tasks in parallel for efficiency
- Generate separate result files for each run in `run_1/`, `run_2/`, etc.
- Create a consolidated `futurex_submission.jsonl` file with voting results

### Step 2: Customize Multiple Runs

You can customize the evaluation parameters:

```bash title="Custom Multiple Runs"
# Run 5 evaluations with limited tasks for testing
NUM_RUNS=5 MAX_TASKS=10 ./scripts/run_evaluate_multiple_runs_futurex.sh

# Use a different agent configuration
AGENT_SET=agent_gaia-validation ./scripts/run_evaluate_multiple_runs_futurex.sh

# Adjust concurrency for resource management
MAX_CONCURRENT=3 ./scripts/run_evaluate_multiple_runs_futurex.sh
```

### Step 3: Voting and Aggregation

After multiple runs, the system automatically:

1. **Extracts predictions** from all runs using `utils/extract_futurex_results.py`
2. **Applies majority voting** to aggregate predictions across runs
3. **Generates a submission file** in the format required by the FutureX platform
4. **Provides voting statistics** showing the prediction distribution across runs

The voting process works as follows (a minimal sketch is shown after the list):

- **Majority Vote**: The most common prediction across all runs wins
- **Tie-breaking**: If tied, the prediction that appeared earliest across all runs is chosen
- **Vote Counts**: Tracks how many runs predicted each option
- **Confidence Indicators**: High agreement indicates more reliable predictions

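The sketch below illustrates that voting rule, assuming each run contributes one prediction string per task in run order. The actual implementation lives in `utils/extract_futurex_results.py` and may differ in detail:

```python title="Majority Voting with Tie-breaking (sketch)"
# Minimal sketch of the voting rule described above.
from collections import Counter

def vote(predictions: list[str]) -> tuple[str, dict[str, int]]:
    if not predictions:
        raise ValueError("no predictions given")
    counts = Counter(predictions)
    best = max(counts.values())
    # Tie-break: among options with the top count, keep the one seen earliest.
    winner = next(p for p in predictions if counts[p] == best)
    return winner, dict(counts)

winner, vote_counts = vote(["No", "Yes", "No"])
print(winner, vote_counts)  # -> No {'No': 2, 'Yes': 1}
```
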

### Step 4: Analyze Voting Results

Check the generated files for voting analysis:

```bash title="Check Voting Results"
# View the submission file with voting results
cat logs/futurex/agent_quickstart_1_*/futurex_submission.jsonl

# Check individual run results
ls logs/futurex/agent_quickstart_1_*/run_*/

# Check progress and voting statistics
uv run python utils/progress_check/check_futurex_progress.py logs/futurex/agent_quickstart_1_*
```

### Manual Voting Aggregation

You can also run the voting aggregation manually:

```bash title="Manual Voting Aggregation"
# Aggregate multiple runs with majority voting
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_* --aggregate

# Force single-run mode (if needed)
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_*/run_1 --single

# Specify a custom output file
uv run python utils/extract_futurex_results.py logs/futurex/agent_quickstart_1_* -o my_voted_predictions.jsonl
```

### Voting Output Format

The voting aggregation generates a submission file with the following format:

```json
{"id": "687104310a994c0060ef87a9", "prediction": "No", "vote_counts": {"No": 2}}
{"id": "68a9b46e961bd3003c8f006b", "prediction": "Yes", "vote_counts": {"Yes": 2}}
```

The output includes:

- **`id`**: Task identifier
- **`prediction`**: Final voted prediction (without the `\boxed{}` wrapper)
- **`vote_counts`**: Dictionary showing how many runs predicted each option

For example, `"vote_counts": {"No": 2}` means 2 out of 2 runs predicted "No", indicating high confidence.

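To spot low-agreement predictions before submitting, a small sketch like this can list tasks where the runs disagreed. It relies only on the `id`, `prediction`, and `vote_counts` fields shown above, and the submission path is hypothetical:

```python title="Flag Low-Agreement Predictions (sketch)"
# Minimal sketch: list tasks where the runs did not vote unanimously.
import json
from pathlib import Path

submission = Path("logs/futurex/agent_quickstart_1_20250918_1010/futurex_submission.jsonl")  # hypothetical path

for line in submission.read_text(encoding="utf-8").splitlines():
    if not line.strip():
        continue
    entry = json.loads(line)
    counts = entry.get("vote_counts", {})
    total = sum(counts.values())
    top = max(counts.values()) if counts else 0
    if total and top < total:
        print(f"{entry['id']}: split vote {counts}, submitted '{entry['prediction']}'")
```
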

---

## Evaluation Notes

!!! warning "No Ground Truth Available"
    Since Futurex-Online is a prediction dataset, there are no ground truth answers available for evaluation. The focus is on:

    - Response generation quality
    - Reasoning process documentation
    - Prediction confidence and methodology

!!! info "Output Analysis"
    The evaluation generates detailed execution traces showing:

    - Research process for each prediction
    - Information gathering from web sources
    - Reasoning chains leading to predictions
    - Final boxed answers in the required format

### Directory Structure

After running multiple evaluations, you'll find the following structure:

```
logs/futurex/agent_quickstart_1_YYYYMMDD_HHMM/
├── futurex_submission.jsonl              # Final voted predictions
├── run_1/                                # First run results
│   ├── benchmark_results.jsonl           # Individual task results
│   ├── benchmark_results_pass_at_1_accuracy.txt
│   └── task_*_attempt_1.json             # Detailed execution traces
├── run_2/                                # Second run results
│   └── ... (same structure as run_1)
├── run_1_output.log                      # Run 1 execution log
└── run_2_output.log                      # Run 2 execution log
```

---

!!! info "Documentation Info"
    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI