# 🧪 Adding New Benchmarks to MiroFlow

This guide provides a comprehensive walkthrough for adding new benchmarks to the MiroFlow framework. MiroFlow uses a modular benchmark architecture that allows for easy integration of new evaluation datasets.

---

## 🚀 Step-by-Step Implementation Guide

### Step 1: Prepare Your Dataset

Your benchmark dataset should follow this structure:

```
your-benchmark/
├── standardized_data.jsonl    # Metadata file (required)
├── file1.pdf                  # Optional: Binary files referenced by tasks
├── file2.png
└── ...
```

#### Metadata Format (JSONL)

Each line in `standardized_data.jsonl` should be a JSON object with these fields:

```json
{
  "task_id": "unique_task_identifier",
  "task_question": "The question or instruction for the task",
  "ground_truth": "The expected answer or solution",
  "file_path": "path/to/file.pdf",   // Optional, can be null
  "metadata": {                      // Optional, can be empty
    "difficulty": "hard",
    "category": "reasoning",
    "source": "original_dataset_name"
  }
}
```

**Example:**
```json
{
  "task_id": "math_001",
  "task_question": "What is the integral of x^2 from 0 to 2?",
  "ground_truth": "8/3",
  "file_path": null,
  "metadata": {
    "difficulty": "medium",
    "category": "calculus"
  }
}
```

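If you are converting an existing dataset, a few lines of Python are enough to emit `standardized_data.jsonl` in this shape. The sketch below is illustrative only: the `records` list, the script name, and the output path are placeholders, not MiroFlow APIs.

```python
# build_metadata.py -- illustrative sketch for producing standardized_data.jsonl
import json
from pathlib import Path

# Placeholder records; in practice, build these from your source dataset.
records = [
    {
        "task_id": "math_001",
        "task_question": "What is the integral of x^2 from 0 to 2?",
        "ground_truth": "8/3",
        "file_path": None,  # serialized as JSON null
        "metadata": {"difficulty": "medium", "category": "calculus"},
    },
]

# task_id values are expected to be unique identifiers.
assert len({r["task_id"] for r in records}) == len(records), "duplicate task_id"

out_dir = Path("data/your-benchmark")
out_dir.mkdir(parents=True, exist_ok=True)

with open(out_dir / "standardized_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # One JSON object per line, as required by the JSONL format.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
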
### Step 2: Create Configuration File

Create a new configuration file at `config/benchmark/your-benchmark.yaml`:

```yaml
# config/benchmark/your-benchmark.yaml
defaults:
  - default
  - _self_

name: "your-benchmark"

data:
  data_dir: "${data_dir}/your-benchmark"    # Path to your dataset
  metadata_file: "standardized_data.jsonl"  # Metadata filename
  whitelist: []                             # Optional: List of specific task_ids to run

execution:
  max_tasks: null     # null = no limit, or specify a number
  max_concurrent: 5   # Number of parallel tasks
  pass_at_k: 1        # Number of attempts per task

openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```

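Before moving on, it can help to confirm the YAML parses and carries the expected keys. The snippet below is just a convenience check, not a MiroFlow API; interpolations such as `${data_dir}` and `${oc.env:OPENAI_API_KEY,???}` are resolved by MiroFlow when it loads the config, so here they remain plain strings.

```python
# check_benchmark_config.py -- illustrative sanity check of the benchmark YAML
import yaml  # PyYAML

with open("config/benchmark/your-benchmark.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Only the raw structure is checked; ${...} interpolations stay unresolved here.
assert cfg["name"] == "your-benchmark"
assert cfg["data"]["metadata_file"] == "standardized_data.jsonl"
assert "max_concurrent" in cfg["execution"] and "pass_at_k" in cfg["execution"]
print("benchmark config structure looks OK")
```
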
### Step 3: Set Up Data Directory

Place your dataset in the appropriate data directory:

```bash
# Create the benchmark data directory
mkdir -p data/your-benchmark

# Copy your dataset files
cp your-dataset/* data/your-benchmark/
```

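Because tasks can reference attachments through `file_path`, it is worth confirming that every referenced file actually exists before launching a run. Below is a minimal check, assuming `file_path` is given relative to the benchmark directory; it is a convenience script, not part of MiroFlow.

```python
# verify_files.py -- illustrative check that every referenced file_path exists
import json
from pathlib import Path

data_dir = Path("data/your-benchmark")
missing = []

with open(data_dir / "standardized_data.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        task = json.loads(line)
        file_path = task.get("file_path")
        # file_path is optional and may be null; skip tasks without attachments.
        if file_path and not (data_dir / file_path).exists():
            missing.append((task["task_id"], file_path))

if missing:
    for task_id, file_path in missing:
        print(f"missing file for {task_id}: {file_path}")
else:
    print("all referenced files are present")
```
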
### Step 4: Test Your Benchmark

Run your benchmark using the MiroFlow CLI:

```bash
# Test with a small subset
uv run main.py common-benchmark \
  --config_file_name=agent_quickstart_1 \
  benchmark=your-benchmark \
  benchmark.execution.max_tasks=5 \
  output_dir=logs/test-your-benchmark
```

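Once the subset run looks correct, drop the `benchmark.execution.max_tasks=5` override (the config default of `null` means no limit) to evaluate the full dataset. To quickly confirm that the test run produced output, you can list whatever was written under `output_dir`; the exact file layout depends on your MiroFlow version, so the snippet below is just a generic listing.

```python
# list_outputs.py -- generic listing of whatever the run wrote under output_dir
from pathlib import Path

output_dir = Path("logs/test-your-benchmark")

for path in sorted(output_dir.rglob("*")):
    if path.is_file():
        # Print each file with its size so empty outputs are easy to spot.
        print(f"{path}  ({path.stat().st_size} bytes)")
```
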
---

**Last Updated:** Sep 2025
**Doc Contributor:** Team @ MiroMind AI