# Contributing New Benchmarks to MiroFlow

This comprehensive guide walks you through adding new evaluation benchmarks to the MiroFlow framework. MiroFlow's modular architecture makes it easy to integrate diverse evaluation datasets while maintaining consistency and reproducibility.

## Overview

!!! info "Why Add New Benchmarks?"
    Adding new benchmarks serves multiple purposes:

    - **Internal Testing**: Validate your agent's performance on custom tasks and domains specific to your use case
    - **Development Iteration**: Create targeted test sets to debug and improve specific agent capabilities
    - **Domain-Specific Evaluation**: Test agents on proprietary or specialized datasets relevant to your application
    - **Research Contributions**: Expand MiroFlow's benchmark coverage to advance the field with new evaluation paradigms
    - **Comparative Analysis**: Benchmark your agent against custom baselines or competitors

## Step-by-Step Implementation Guide

### Step 1: Prepare Your Dataset

Your benchmark dataset must follow MiroFlow's standardized structure for seamless integration.

#### Required Directory Structure

```
your-benchmark/
├── standardized_data.jsonl    # Metadata file (required)
├── file1.pdf                  # Optional: Binary files referenced by tasks
├── file2.png                  # Optional: Images, documents, etc.
├── data.csv                   # Optional: Additional data files
└── ...                        # Any other supporting files
```

#### Metadata Format Specification

Each line in `standardized_data.jsonl` must be a valid JSON object with the following schema:

!!! example "Required Fields"
    ```json
    {
      "task_id": "unique_task_identifier",
      "task_question": "The question or instruction for the task",
      "ground_truth": "The expected answer or solution",
      "file_path": "path/to/file.pdf",  // Optional, can be null
      "metadata": {                     // Optional, can be empty object or other structure
        "difficulty": "hard",
        "category": "reasoning",
        "source": "original_dataset_name"
      }
    }
    ```

#### Example Tasks

**Simple Text-Only Task:**
```json
{
  "task_id": "math_001",
  "task_question": "What is the derivative of f(x) = x^2 + 3x?",
  "ground_truth": "2x + 3",
4561 "file_path" : null ,
4662 "metadata" : {
4763 "difficulty" : " medium" ,
48- "category" : " calculus"
64+ "category" : " calculus" ,
65+ "source" : " custom_math_problems"
66+ }
67+ }
68+ ```
69+
70+ ** File-Based Task:**
71+ ``` json
72+ {
73+ "task_id" : " doc_analysis_001" ,
74+ "task_question" : " Based on the provided financial report, what was the company's revenue growth rate?" ,
75+ "ground_truth" : " 12.5%" ,
76+ "file_path" : " reports/financial_q3_2023.pdf" ,
77+ "metadata" : {
78+ "difficulty" : " hard" ,
79+ "category" : " document_analysis" ,
80+ "file_type" : " pdf"
4981 }
5082}
5183```
5284
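If you are converting an existing dataset, a few lines of Python are enough to emit this format. The following is a minimal, illustrative sketch (not part of MiroFlow itself); the `records` list and its values are placeholders for your own data loading logic:

```python title="build_metadata.py (illustrative sketch)"
import json
from pathlib import Path

# Placeholder records; replace with your own dataset loading logic.
records = [
    {
        "task_id": "math_001",
        "task_question": "What is the derivative of f(x) = x^2 + 3x?",
        "ground_truth": "2x + 3",
        "file_path": None,
        "metadata": {"difficulty": "medium", "category": "calculus"},
    },
]

out_dir = Path("data/your-benchmark")
out_dir.mkdir(parents=True, exist_ok=True)

# JSONL: one JSON object per line.
with open(out_dir / "standardized_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
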
### Step 2: Create Benchmark Configuration

Create a configuration file to define how MiroFlow should handle your benchmark.

#### Configuration File Location

Create: `config/benchmark/your-benchmark.yaml`

#### Configuration Template

```yaml title="config/benchmark/your-benchmark.yaml"
# Benchmark configuration for your custom dataset
defaults:
  - default    # Use default benchmark settings
  - _self_     # Allow overrides in this file

name: "your-benchmark"

data:
  data_dir: "${data_dir}/your-benchmark"      # Dataset location
  metadata_file: "standardized_data.jsonl"    # Metadata filename

execution:
  max_tasks: null       # null = no limit, number = max tasks to run
  max_concurrent: 5     # Number of parallel task executions
  pass_at_k: 1          # Number of attempts per task for pass@k evaluation

# LLM judge configuration for evaluation
openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```

#### Configuration Options

!!! tip "Execution Parameters"
    - **max_tasks**: Control dataset size during development (use small numbers for testing)
    - **max_concurrent**: Balance speed vs. resource usage
    - **pass_at_k**: Enable multiple attempts per task; with `pass_at_k: 3`, for example, a task counts as solved if any of the three attempts succeeds

    A quick structural check of your new config file is sketched below.

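Before wiring the config into a run, you can sanity-check its structure with plain PyYAML. This is a minimal sketch under the assumption that the file matches the template above; note that Hydra interpolations such as `${data_dir}` are still unresolved literal strings at this point:

```python title="check_config.py (illustrative sketch)"
import yaml

with open("config/benchmark/your-benchmark.yaml") as f:
    cfg = yaml.safe_load(f)

# Verify the top-level keys this guide relies on.
for key in ("defaults", "name", "data", "execution"):
    assert key in cfg, f"missing top-level key: {key}"

assert cfg["data"]["metadata_file"] == "standardized_data.jsonl"
assert cfg["execution"]["pass_at_k"] >= 1

print(f"Config for benchmark '{cfg['name']}' looks structurally sound.")
```
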
### Step 3: Set Up Data Directory

Organize your dataset files in the MiroFlow data structure.

```bash title="Data Directory Setup"
# Create the benchmark data directory
mkdir -p data/your-benchmark

# Copy your dataset files
cp your-dataset/* data/your-benchmark/

# Verify the structure
ls -la data/your-benchmark/
```

!!! warning "File Path Consistency"
    Ensure that all `file_path` entries in your JSONL metadata correctly reference files in your data directory. The sketch below automates this check.

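A minimal sketch of such a check, assuming the data layout described in Step 1:

```python title="check_file_paths.py (illustrative sketch)"
import json
from pathlib import Path

data_dir = Path("data/your-benchmark")
missing = []

# Every non-null file_path must resolve to a real file under the data directory.
with open(data_dir / "standardized_data.jsonl", encoding="utf-8") as f:
    for line in f:
        task = json.loads(line)
        file_path = task.get("file_path")
        if file_path and not (data_dir / file_path).is_file():
            missing.append((task["task_id"], file_path))

if missing:
    for task_id, file_path in missing:
        print(f"[{task_id}] missing file: {file_path}")
else:
    print("All file_path entries resolve correctly.")
```
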
### Step 4: Test Your Benchmark

Validate your benchmark integration with comprehensive testing.

#### Initial Testing

Start with a small subset to verify everything works correctly:

```bash title="Test Benchmark Integration"
uv run main.py common-benchmark \
    --config_file_name=agent_quickstart_1 \
    benchmark=your-benchmark \
    benchmark.execution.max_tasks=3 \
    output_dir="logs/test-your-benchmark/$(date +"%Y%m%d_%H%M")"
```

#### Full Evaluation

Once testing passes, run the complete benchmark:

```bash title="Run Full Benchmark"
uv run main.py common-benchmark \
    --config_file_name=agent_quickstart_1 \
    benchmark=your-benchmark \
    output_dir="logs/your-benchmark/$(date +"%Y%m%d_%H%M")"
```

### Step 5: Validate Results

Review the evaluation outputs to ensure proper integration:

#### Check Output Files

```bash title="Verify Results"
# List generated files
ls -la logs/your-benchmark/

# Review a sample task log
cat logs/your-benchmark/task_*_attempt_1.json | head -50
```

#### Expected Output Structure

Your benchmark run should generate the following artifacts; a quick way to summarize them is sketched after this list:

- Individual task execution logs
- Aggregate benchmark results (`benchmark_results.jsonl`)
- Accuracy summary files
- Hydra configuration logs

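For a quick aggregate view, you can tally the results file directly. The sketch below is illustrative only: the exact schema of `benchmark_results.jsonl` depends on your MiroFlow version, so the `judge_result` field and `"CORRECT"` value used here are assumptions to verify against a real output line:

```python title="summarize_results.py (illustrative sketch)"
import json

correct = 0
total = 0

# NOTE: "judge_result" and "CORRECT" are assumed names; inspect one line of
# your own benchmark_results.jsonl and adjust the field and value accordingly.
with open("logs/your-benchmark/benchmark_results.jsonl", encoding="utf-8") as f:
    for line in f:
        result = json.loads(line)
        total += 1
        if result.get("judge_result") == "CORRECT":
            correct += 1

print(f"{correct}/{total} tasks correct ({correct / max(total, 1):.1%})")
```
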
!!! info "Documentation Info"
    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI