# FinSearchComp

MiroFlow's evaluation on the FinSearchComp benchmark demonstrates its capabilities in financial information search and analysis, from time-sensitive data fetching to complex historical investigation.

More details: [FinSearchComp Dataset](https://huggingface.co/datasets/ByteSeedXpert/FinSearchComp)

---

## Dataset Overview

!!! info "FinSearchComp Dataset"
    The FinSearchComp dataset consists of financial search and analysis tasks that require comprehensive research capabilities, including:

    - Financial data retrieval and analysis
    - Market research and company analysis
    - Investment decision support
    - Financial news and report interpretation
    - Time-sensitive financial information gathering

!!! abstract "Key Dataset Characteristics"

    - **Total Tasks**: 635 (across T1, T2, and T3 categories)
    - **Task Types**:
        - **T1**: Time-Sensitive Data Fetching
        - **T2**: Financial Analysis and Research
        - **T3**: Complex Historical Investigation
    - **Answer Format**: Detailed financial analysis and research reports
    - **Ground Truth**: Available for T2 and T3 tasks; changes dynamically for T1 tasks
    - **Evaluation**: Judge-based evaluation with correctness assessment

---

## Quick Start Guide

!!! note "Quick Start Instructions"
    This section provides step-by-step instructions for running the FinSearchComp benchmark and preparing submission results. **Note**: This is a quick start guide for running the benchmark, not for reproducing exact submitted results.

### Step 1: Prepare the FinSearchComp Dataset

!!! tip "Dataset Setup"
    Use the integrated prepare-benchmark command to download and process the dataset:

```bash title="Download FinSearchComp Dataset"
uv run main.py prepare-benchmark get finsearchcomp
```

This will create the standardized dataset at `data/finsearchcomp/standardized_data.jsonl`.
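
A quick way to sanity-check the download is to load the JSONL file line by line. A minimal sketch (the record schema is not documented here, so print the keys of the first record to inspect it yourself):

```python
import json

def load_jsonl(path):
    """Load a JSONL file (one JSON object per line) into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

For example, `records = load_jsonl("data/finsearchcomp/standardized_data.jsonl")` followed by `print(len(records), sorted(records[0].keys()))` confirms the task count and the available record fields.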

### Step 2: Configure API Keys

!!! warning "API Key Configuration"
    Set up the required API keys for model access and tool functionality. Update the `.env` file to include the following keys:

```env title=".env Configuration"
# For searching and web scraping
SERPER_API_KEY="xxx"
JINA_API_KEY="xxx"

# For Linux sandbox (code execution environment)
E2B_API_KEY="xxx"

# We use the MiroThinker model for financial analysis
OAI_MIROTHINKER_API_KEY="xxx"
OAI_MIROTHINKER_BASE_URL="http://localhost:61005/v1"

# Used for o3 hints and final answer extraction
OPENAI_API_KEY="xxx"
OPENAI_BASE_URL="https://api.openai.com/v1"

# Used for Claude vision understanding
ANTHROPIC_API_KEY="xxx"

# Used for Gemini vision
GEMINI_API_KEY="xxx"
```
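
Before launching a long run, it can save time to verify that the keys are actually set in the environment. A small sketch (the list mirrors the template above; trim it if your setup omits a provider):

```python
import os

# Keys expected by the FinSearchComp run, taken from the .env template above.
REQUIRED_KEYS = [
    "SERPER_API_KEY", "JINA_API_KEY", "E2B_API_KEY",
    "OAI_MIROTHINKER_API_KEY", "OAI_MIROTHINKER_BASE_URL",
    "OPENAI_API_KEY", "OPENAI_BASE_URL",
    "ANTHROPIC_API_KEY", "GEMINI_API_KEY",
]

def missing_keys(env=os.environ):
    """Return the names of required keys that are unset or empty."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]
```

Run `print(missing_keys())` after loading `.env`; an empty list means all keys are present.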

### Step 3: Run the Evaluation

!!! example "Evaluation Execution"
    Execute the following command to run the evaluation on the FinSearchComp dataset:

```bash title="Run FinSearchComp Evaluation"
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
```

!!! tip "Progress Monitoring and Resume"
    To check progress while the evaluation is running:

    ```bash title="Check Progress"
    uv run utils/progress_check/check_finsearchcomp_progress.py $PATH_TO_LOG
    ```

    To resume an interrupted evaluation, specify the same output directory to continue from where you left off:

    ```bash title="Resume Evaluation"
    uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp output_dir=$PATH_TO_LOG
    ```

### Step 4: Extract Results

!!! example "Result Extraction"
    After the evaluation completes, results are generated automatically in the output directory:

- `benchmark_results.jsonl`: Detailed results for each task
- `benchmark_results_pass_at_1_accuracy.txt`: Summary accuracy statistics
- `task_*_attempt_1.json`: Individual task execution traces
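
The summary accuracy can also be recomputed from `benchmark_results.jsonl` directly. A sketch under the assumption that each record carries a `judge_result` verdict and a `task_name` that embeds the category marker; field names other than `judge_result` are guesses, so check them against your actual records:

```python
def pass_at_1(results):
    """Compute pass@1 over evaluable (T2/T3) tasks.

    `results` is a list of result dicts. T1 tasks are skipped because
    their ground truth is time-sensitive; the category is assumed to be
    encoded in the task name, e.g. "(T2)Financial_Analysis_...".
    """
    evaluable = [r for r in results
                 if not r.get("task_name", "").startswith("(T1)")]
    correct = sum(r.get("judge_result") == "CORRECT" for r in evaluable)
    return correct / len(evaluable) if evaluable else 0.0
```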

---

## Evaluation Notes

!!! warning "Task Type Considerations"
    The FinSearchComp dataset includes different task types with varying evaluation criteria:

    - **T1 Tasks**: Time-Sensitive Data Fetching tasks are excluded from correctness evaluation due to outdated ground truth, but completion is still tracked
    - **T2 Tasks**: Financial Analysis tasks are evaluated for correctness and quality
    - **T3 Tasks**: Complex Historical Investigation tasks require comprehensive research and analysis

!!! info "Output Analysis"
    The evaluation generates detailed execution traces showing:

    - The research process for each financial task
    - Information gathering from multiple sources
    - Financial calculations and analysis
    - Comprehensive reports with insights and recommendations

### Directory Structure

After running evaluations, you'll find the following structure:

```
logs/finsearchcomp/agent_finsearchcomp_YYYYMMDD_HHMM/
├── benchmark_results.jsonl                          # Task results summary
├── benchmark_results_pass_at_1_accuracy.txt         # Accuracy statistics
├── task_(T1)Time_Sensitive_Data_Fetching_*.json     # T1 task traces
├── task_(T2)Financial_Analysis_*.json               # T2 task traces
├── task_(T3)Complex_Historical_Investigation_*.json # T3 task traces
└── output.log                                       # Execution log
```

### Task Categories Breakdown

The progress checker provides detailed statistics:

- **Total Tasks**: Complete count across all categories
- **Completed Tasks**: Successfully finished tasks
- **Correct Tasks**: Tasks with `judge_result` "CORRECT" (T2 and T3 only)
- **Category Breakdown**: Separate counts for T1, T2, and T3 tasks
- **Accuracy Metrics**: Pass@1 accuracy for evaluable tasks
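
The per-category count can also be derived from the trace filenames, which embed the category as `(T1)`, `(T2)`, or `(T3)`. A minimal sketch:

```python
import re
from collections import Counter

def category_breakdown(filenames):
    """Count task traces per category (T1/T2/T3), assuming each trace
    filename embeds its category marker, e.g. "task_(T2)..._attempt_1.json".
    Files without a marker are ignored."""
    counts = Counter()
    for name in filenames:
        m = re.search(r"\((T[123])\)", name)
        if m:
            counts[m.group(1)] += 1
    return dict(counts)
```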

---

## Usage Examples

### Single Run Evaluation
```bash title="Basic Evaluation"
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
```

### Limited Task Testing
```bash title="Test with Limited Tasks"
uv run main.py common-benchmark --config_file_name=agent_finsearchcomp benchmark=finsearchcomp benchmark.execution.max_tasks=5 output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
```

### Custom Agent Configuration
```bash title="Different Agent Setup"
uv run main.py common-benchmark --config_file_name=agent_gaia-validation benchmark=finsearchcomp output_dir="logs/finsearchcomp/$(date +"%Y%m%d_%H%M")"
```

### Multiple Runs for Reliability
```bash title="Multiple Runs"
NUM_RUNS=5 ./scripts/run_evaluate_multiple_runs_finsearchcomp.sh
```
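
To summarize multiple runs, the per-run pass@1 values can be aggregated into a mean and a spread. A sketch (how you parse each run's `benchmark_results_pass_at_1_accuracy.txt` depends on its exact format, so that step is left to the reader):

```python
from statistics import mean, stdev

def aggregate_accuracies(accuracies):
    """Summarize pass@1 accuracy across independent runs.
    `accuracies` is a list of per-run pass@1 values in [0, 1]."""
    return {
        "runs": len(accuracies),
        "mean": mean(accuracies),
        "stdev": stdev(accuracies) if len(accuracies) > 1 else 0.0,
    }
```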

---

!!! info "Documentation Info"
    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI