
Commit 1fc9678

update docs style

1 parent 55a27bd

22 files changed, +1008 −1023 lines
Lines changed: 75 additions & 0 deletions (new file)

```yaml
defaults:
  - benchmark: gaia-validation-text-only
  - override hydra/job_logging: none
  - _self_  # Allow defining variables at the top of this file


main_agent:
  prompt_class: MainAgentPrompt_GAIA
  llm:
    provider_class: "ClaudeOpenRouterClient"
    model_name: "anthropic/claude-3.7-sonnet"
    async_client: true
    temperature: 0.3
    top_p: 0.95
    min_p: 0.0
    top_k: -1
    max_tokens: 32000
    openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
    openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
    openrouter_provider: "anthropic"
    disable_cache_control: false
    keep_tool_result: -1
    oai_tool_thinking: false

  tool_config:
    - tool-reasoning

  max_turns: -1  # Maximum number of turns for main agent execution
  max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn

  input_process:
    o3_hint: true
  output_process:
    o3_final_answer: true

  openai_api_key: "${oc.env:OPENAI_API_KEY,???}"  # used for o3 hints and final answer extraction
  add_message_id: true
  keep_tool_result: -1
  chinese_context: "${oc.env:CHINESE_CONTEXT,false}"


sub_agents:
  agent-worker:
    prompt_class: SubAgentWorkerPrompt
    llm:
      provider_class: "ClaudeOpenRouterClient"
      model_name: "anthropic/claude-3.7-sonnet"
      async_client: true
      temperature: 0.3
      top_p: 0.95
      min_p: 0.0
      top_k: -1
      max_tokens: 32000
      openrouter_api_key: "${oc.env:OPENROUTER_API_KEY,???}"
      openrouter_base_url: "${oc.env:OPENROUTER_BASE_URL,https://openrouter.ai/api/v1}"
      openrouter_provider: "anthropic"
      disable_cache_control: false
      keep_tool_result: -1
      oai_tool_thinking: false

    tool_config:
      - tool-searching
      - tool-image-video
      - tool-reading
      - tool-code
      - tool-audio

    max_turns: -1  # Maximum number of turns for main agent execution
    max_tool_calls_per_turn: 10  # Maximum number of tool calls per turn


# Can define some top-level or default parameters here
output_dir: logs/
data_dir: "${oc.env:DATA_DIR,data}"  # Points to where data is stored
```
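Nothing in this config hard-codes credentials: every `${oc.env:VAR,default}` entry is resolved by OmegaConf at load time, and the `???` defaults leave `OPENROUTER_API_KEY` and `OPENAI_API_KEY` mandatory, so a run should fail fast if they are unset. A minimal launch sketch under assumptions — the `agent_gaia_validation` config name is a hypothetical placeholder, and `common-benchmark` is the entry point shown in the contributing guide further down:

```bash
# Sketch only: variable names match the ${oc.env:...} lookups above;
# the config file name "agent_gaia_validation" is hypothetical.
export OPENROUTER_API_KEY="sk-or-..."   # required; the ??? default keeps it mandatory
export OPENAI_API_KEY="sk-..."          # required for o3 hints / final answer extraction
export DATA_DIR="data"                  # optional; defaults to "data"
export CHINESE_CONTEXT="false"          # optional; defaults to "false"

uv run main.py common-benchmark --config_file_name=agent_gaia_validation
```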
Lines changed: 16 additions & 0 deletions (new file)

```yaml
# config/benchmark/gaia-validation.yaml
defaults:
  - default
  - _self_

name: "gaia-validation-text-only"

data:
  data_dir: "${data_dir}/gaia-val-text-only"

execution:
  max_tasks: null  # null means no limit
  max_concurrent: 3
  pass_at_k: 1

openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```
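The `execution` block above can be overridden per run from Hydra's command line, using the same `benchmark=...` selection and dotted-override syntax the contributing guide below demonstrates; a quick sketch:

```bash
# Sketch only: overrides the execution settings above for a single run.
uv run main.py common-benchmark \
  --config_file_name=agent_quickstart_1 \
  benchmark=gaia-validation-text-only \
  benchmark.execution.max_tasks=5 \
  benchmark.execution.max_concurrent=1
```

Dropping `max_concurrent` to 1 is handy while debugging, since task logs then arrive serially.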
Lines changed: 12 additions & 11 deletions

````diff
@@ -1,37 +1,38 @@
 # Claude 3.7 Sonnet
 
-## What This Is
 Anthropic's Claude 3.7 Sonnet model with 200K context, strong reasoning, and tool use capabilities.
 
 ## Available Clients
 
 ### ClaudeAnthropicClient (Direct API)
-**Environment:**
-```bash
+
+**Environment Setup:**
+
+```bash title="Environment Variables"
 export ANTHROPIC_API_KEY="your-key"
 export ANTHROPIC_BASE_URL="https://api.anthropic.com"  # optional
 ```
 
-**Config:**
-```yaml
+**Configuration:**
+
+```yaml title="Agent Configuration"
 main_agent:
   llm:
     provider_class: "ClaudeAnthropicClient"
     model_name: "claude-3-7-sonnet-20250219"  # Use actual model name from Anthropic API
     anthropic_api_key: "${oc.env:ANTHROPIC_API_KEY,???}"
     anthropic_base_url: "${oc.env:ANTHROPIC_BASE_URL,https://api.anthropic.com}"
-    ...
 ```
 
 ## Usage
-```bash
+
+```bash title="Example Command"
 # Use existing config
 uv run main.py trace --config_file_name=your_config_file \
   --task="Your task" --task_file_name="data/file.txt"
 ```
 
-
-
 ---
-**Last Updated:** Sep 2025
-**Doc Contributor:** Team @ MiroMind AI
+
+!!! info "Documentation Info"
+    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI
````
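The page assumes a working key; when in doubt, the credential can be sanity-checked against Anthropic's Messages API directly before involving MiroFlow. A minimal `curl` sketch (not part of the MiroFlow docs):

```bash
# Sanity-check the key outside MiroFlow; a short JSON reply means it works.
curl https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{
    "model": "claude-3-7-sonnet-20250219",
    "max_tokens": 32,
    "messages": [{"role": "user", "content": "ping"}]
  }'
```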
Lines changed: 133 additions & 46 deletions

````diff
@@ -1,42 +1,58 @@
-# 🧪 Adding New Benchmarks to MiroFlow
+# Contributing New Benchmarks to MiroFlow
 
-This guide provides a comprehensive walkthrough for adding new benchmarks to the MiroFlow framework. MiroFlow uses a modular benchmark architecture that allows for easy integration of new evaluation datasets.
+This comprehensive guide walks you through adding new evaluation benchmarks to the MiroFlow framework. MiroFlow's modular architecture makes it easy to integrate diverse evaluation datasets while maintaining consistency and reproducibility.
 
----
+## Overview
 
-## 🚀 Step-by-Step Implementation Guide
+!!! info "Why Add New Benchmarks?"
+    Adding new benchmarks serves multiple purposes:
+
+    - **Internal Testing**: Validate your agent's performance on custom tasks and domains specific to your use case
+    - **Development Iteration**: Create targeted test sets to debug and improve specific agent capabilities
+    - **Domain-Specific Evaluation**: Test agents on proprietary or specialized datasets relevant to your application
+    - **Research Contributions**: Expand MiroFlow's benchmark coverage to advance the field with new evaluation paradigms
+    - **Comparative Analysis**: Benchmark your agent against custom baselines or competitors
+
+## Step-by-Step Implementation Guide
 
 ### Step 1: Prepare Your Dataset
 
-Your benchmark dataset should follow this structure:
+Your benchmark dataset must follow MiroFlow's standardized structure for seamless integration.
+
+#### Required Directory Structure
 
 ```
 your-benchmark/
 ├── standardized_data.jsonl    # Metadata file (required)
 ├── file1.pdf                  # Optional: Binary files referenced by tasks
-├── file2.png
-└── ...
+├── file2.png                  # Optional: Images, documents, etc.
+├── data.csv                   # Optional: Additional data files
+└── ...                        # Any other supporting files
 ```
 
-#### Metadata Format (JSONL)
+#### Metadata Format Specification
 
-Each line in `standardized_data.jsonl` should be a JSON object with these fields:
+Each line in `standardized_data.jsonl` must be a valid JSON object with the following schema:
+
+!!! example "Required Fields"
+    ```json
+    {
+      "task_id": "unique_task_identifier",
+      "task_question": "The question or instruction for the task",
+      "ground_truth": "The expected answer or solution",
+      "file_path": "path/to/file.pdf",   // Optional, can be null
+      "metadata": {                      // Optional, can be empty object or other structure
+        "difficulty": "hard",
+        "category": "reasoning",
+        "source": "original_dataset_name"
+      }
+    }
+    ```
 
-```json
-{
-  "task_id": "unique_task_identifier",
-  "task_question": "The question or instruction for the task",
-  "ground_truth": "The expected answer or solution",
-  "file_path": "path/to/file.pdf",   // Optional, can be null
-  "metadata": {                      // Optional, can be empty
-    "difficulty": "hard",
-    "category": "reasoning",
-    "source": "original_dataset_name"
-  }
-}
-```
 
-**Example:**
+#### Example Tasks
+
+**Simple Text-Only Task:**
 ```json
 {
   "task_id": "math_001",
@@ -45,62 +61,133 @@ Each line in standardized_data.jsonl should be a JSON object with these fields
   "file_path": null,
   "metadata": {
     "difficulty": "medium",
-    "category": "calculus"
+    "category": "calculus",
+    "source": "custom_math_problems"
+  }
+}
+```
+
+**File-Based Task:**
+```json
+{
+  "task_id": "doc_analysis_001",
+  "task_question": "Based on the provided financial report, what was the company's revenue growth rate?",
+  "ground_truth": "12.5%",
+  "file_path": "reports/financial_q3_2023.pdf",
+  "metadata": {
+    "difficulty": "hard",
+    "category": "document_analysis",
+    "file_type": "pdf"
   }
 }
 ```
 
-### Step 2: Create Configuration File
+### Step 2: Create Benchmark Configuration
+
+Create a configuration file to define how MiroFlow should handle your benchmark.
+
+#### Configuration File Location
 
-Create a new configuration file in `config/benchmark/your-benchmark.yaml`:
+Create: `config/benchmark/your-benchmark.yaml`
 
-```yaml
-# config/benchmark/your-benchmark.yaml
+#### Configuration Template
+
+```yaml title="config/benchmark/your-benchmark.yaml"
+# Benchmark configuration for your custom dataset
 defaults:
-  - default
-  - _self_
+  - default  # Use default benchmark settings
+  - _self_   # Allow overrides in this file
 
 name: "your-benchmark"
 
 data:
-  data_dir: "${data_dir}/your-benchmark"     # Path to your dataset
-  metadata_file: "standardized_data.jsonl"   # Metadata filename
-  whitelist: []                              # Optional: List of specific task_ids to run
+  data_dir: "${data_dir}/your-benchmark"     # Dataset location
+  metadata_file: "standardized_data.jsonl"   # Metadata filename
 
 execution:
-  max_tasks: null      # null = no limit, or specify a number
-  max_concurrent: 5    # Number of parallel tasks
-  pass_at_k: 1         # Number of attempts per task
+  max_tasks: null      # null = no limit, number = max tasks to run
+  max_concurrent: 5    # Number of parallel task executions
+  pass_at_k: 1         # Number of attempts per task for pass@k evaluation
 
+# LLM judge configuration for evaluation
 openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
 ```
 
+#### Configuration Options
+
+!!! tip "Execution Parameters"
+    - **max_tasks**: Control dataset size during development (use small numbers for testing)
+    - **max_concurrent**: Balance speed vs. resource usage
+    - **pass_at_k**: Enable multiple attempts for better success measurement
+
 ### Step 3: Set Up Data Directory
 
-Place your dataset in the appropriate data directory:
+Organize your dataset files in the MiroFlow data structure.
 
-```bash
+```bash title="Data Directory Setup"
 # Create the benchmark data directory
 mkdir -p data/your-benchmark
 
 # Copy your dataset files
 cp your-dataset/* data/your-benchmark/
+
+# Verify the structure
+ls -la data/your-benchmark/
 ```
 
+!!! warning "File Path Consistency"
+    Ensure that all `file_path` entries in your JSONL metadata correctly reference files in your data directory.
+
 ### Step 4: Test Your Benchmark
 
-Run your benchmark using the MiroFlow CLI:
+Validate your benchmark integration with comprehensive testing.
 
-```bash
-# Test with a small subset
+#### Initial Testing
+
+Start with a small subset to verify everything works correctly:
+
+```bash title="Test Benchmark Integration"
 uv run main.py common-benchmark \
   --config_file_name=agent_quickstart_1 \
   benchmark=your-benchmark \
-  benchmark.execution.max_tasks=5 \
-  output_dir=logs/test-your-benchmark
+  benchmark.execution.max_tasks=3 \
+  output_dir="logs/test-your-benchmark/$(date +"%Y%m%d_%H%M")"
```
 
----
+#### Full Evaluation
+
+Once testing passes, run the complete benchmark:
+
+```bash title="Run Full Benchmark"
+uv run main.py common-benchmark \
+  --config_file_name=agent_quickstart_1 \
+  benchmark=your-benchmark \
+  output_dir="logs/your-benchmark/$(date +"%Y%m%d_%H%M")"
+```
+
+### Step 5: Validate Results
+
+Review the evaluation outputs to ensure proper integration:
+
+#### Check Output Files
+
+```bash title="Verify Results"
+# List generated files
+ls -la logs/your-benchmark/
+
+# Review a sample task log
+cat logs/your-benchmark/task_*_attempt_1.json | head -50
+```
+
+#### Expected Output Structure
+
+Your benchmark should generate:
+
+- Individual task execution logs
+- Aggregate benchmark results (`benchmark_results.jsonl`)
+- Accuracy summary files
+- Hydra configuration logs
+
 
-**Last Updated:** Sep 2025
-**Doc Contributor:** Team @ MiroMind AI
+!!! info "Documentation Info"
+    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI
````
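One addition worth considering for Step 4 of this guide: most integration failures trace back to malformed metadata rather than the agent itself, so it can pay to lint `standardized_data.jsonl` before the first test run. A small pre-flight sketch, assuming `jq` is installed (this helper is not part of MiroFlow):

```bash
# Hypothetical pre-flight check: every line must parse as JSON, carry the
# required fields, and any non-null file_path must exist on disk.
cd data/your-benchmark
n=0
while IFS= read -r line; do
  n=$((n + 1))
  echo "$line" | jq -e 'has("task_id") and has("task_question") and has("ground_truth")' >/dev/null \
    || { echo "line $n: invalid JSON or missing required field"; continue; }
  fp=$(echo "$line" | jq -r '.file_path // empty')
  [ -n "$fp" ] && [ ! -f "$fp" ] && echo "line $n: missing file $fp"
done < standardized_data.jsonl
echo "checked $n tasks"
```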
