# Contributing New Benchmarks to MiroFlow

This comprehensive guide walks you through adding new evaluation benchmarks to the MiroFlow framework. MiroFlow's modular architecture makes it easy to integrate diverse evaluation datasets while maintaining consistency and reproducibility.

## Overview

!!! info "Why Add New Benchmarks?"
    Adding new benchmarks serves multiple purposes:

    - **Internal Testing**: Validate your agent's performance on custom tasks and domains specific to your use case
    - **Development Iteration**: Create targeted test sets to debug and improve specific agent capabilities
    - **Domain-Specific Evaluation**: Test agents on proprietary or specialized datasets relevant to your application
    - **Research Contributions**: Expand MiroFlow's benchmark coverage to advance the field with new evaluation paradigms
    - **Comparative Analysis**: Benchmark your agent against custom baselines or competitors

## Step-by-Step Implementation Guide

### Step 1: Prepare Your Dataset

Your benchmark dataset must follow MiroFlow's standardized structure for seamless integration.

#### Required Directory Structure

```
your-benchmark/
├── standardized_data.jsonl    # Metadata file (required)
├── file1.pdf                  # Optional: Binary files referenced by tasks
├── file2.png                  # Optional: Images, documents, etc.
├── data.csv                   # Optional: Additional data files
└── ...                        # Any other supporting files
```

#### Metadata Format Specification

Each line in `standardized_data.jsonl` must be a valid JSON object with the following schema:

!!! example "Required Fields"
    ```json
    {
      "task_id": "unique_task_identifier",
      "task_question": "The question or instruction for the task",
      "ground_truth": "The expected answer or solution",
      "file_path": "path/to/file.pdf",  // Optional, can be null
      "metadata": {                     // Optional, can be empty object or other structure
        "difficulty": "hard",
        "category": "reasoning",
        "source": "original_dataset_name"
      }
    }
    ```

#### Example Tasks

**Simple Text-Only Task:**
```json
{
  "task_id": "math_001",
  "task_question": "What is the derivative of f(x) = x^2 + 3x?",
  "ground_truth": "2x + 3",
4561 "file_path" : null ,
4662 "metadata" : {
4763 "difficulty" : " medium" ,
48- "category" : " calculus"
64+ "category" : " calculus" ,
65+ "source" : " custom_math_problems"
66+ }
67+ }
68+ ```
69+
70+ ** File-Based Task:**
71+ ``` json
72+ {
73+ "task_id" : " doc_analysis_001" ,
74+ "task_question" : " Based on the provided financial report, what was the company's revenue growth rate?" ,
75+ "ground_truth" : " 12.5%" ,
76+ "file_path" : " reports/financial_q3_2023.pdf" ,
77+ "metadata" : {
78+ "difficulty" : " hard" ,
79+ "category" : " document_analysis" ,
80+ "file_type" : " pdf"
4981 }
5082}
5183```
5284
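If you are converting an existing dataset, a few lines of Python are enough to emit this format. The following is a minimal, illustrative sketch (not part of MiroFlow itself); the `records` list and its values are placeholders for your own data loading logic:

```python title="build_metadata.py (illustrative sketch)"
import json
from pathlib import Path

# Placeholder records; replace with your own dataset loading logic.
records = [
    {
        "task_id": "math_001",
        "task_question": "What is the derivative of f(x) = x^2 + 3x?",
        "ground_truth": "2x + 3",
        "file_path": None,
        "metadata": {"difficulty": "medium", "category": "calculus"},
    },
]

out_dir = Path("data/your-benchmark")
out_dir.mkdir(parents=True, exist_ok=True)

# JSONL: one JSON object per line.
with open(out_dir / "standardized_data.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```
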
### Step 2: Create Benchmark Configuration

Create a configuration file to define how MiroFlow should handle your benchmark.

#### Configuration File Location

Create: `config/benchmark/your-benchmark.yaml`

#### Configuration Template

```yaml title="config/benchmark/your-benchmark.yaml"
# Benchmark configuration for your custom dataset
defaults:
  - default    # Use default benchmark settings
  - _self_     # Allow overrides in this file

name: "your-benchmark"

data:
  data_dir: "${data_dir}/your-benchmark"      # Dataset location
  metadata_file: "standardized_data.jsonl"    # Metadata filename

execution:
  max_tasks: null       # null = no limit, number = max tasks to run
  max_concurrent: 5     # Number of parallel task executions
  pass_at_k: 1          # Number of attempts per task for pass@k evaluation

# LLM judge configuration for evaluation
openai_api_key: "${oc.env:OPENAI_API_KEY,???}"
```

#### Configuration Options

!!! tip "Execution Parameters"
    - **max_tasks**: Control dataset size during development (use small numbers for testing)
    - **max_concurrent**: Balance speed vs. resource usage
    - **pass_at_k**: Enable multiple attempts per task; with `pass_at_k: 3`, for example, a task counts as solved if any of the three attempts succeeds

    A quick structural check of your new config file is sketched below.

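Before wiring the config into a run, you can sanity-check its structure with plain PyYAML. This is a minimal sketch under the assumption that the file matches the template above; note that Hydra interpolations such as `${data_dir}` are still unresolved literal strings at this point:

```python title="check_config.py (illustrative sketch)"
import yaml

with open("config/benchmark/your-benchmark.yaml") as f:
    cfg = yaml.safe_load(f)

# Verify the top-level keys this guide relies on.
for key in ("defaults", "name", "data", "execution"):
    assert key in cfg, f"missing top-level key: {key}"

assert cfg["data"]["metadata_file"] == "standardized_data.jsonl"
assert cfg["execution"]["pass_at_k"] >= 1

print(f"Config for benchmark '{cfg['name']}' looks structurally sound.")
```
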
### Step 3: Set Up Data Directory

Organize your dataset files in the MiroFlow data structure.

```bash title="Data Directory Setup"
# Create the benchmark data directory
mkdir -p data/your-benchmark

# Copy your dataset files
cp your-dataset/* data/your-benchmark/

# Verify the structure
ls -la data/your-benchmark/
```

!!! warning "File Path Consistency"
    Ensure that all `file_path` entries in your JSONL metadata correctly reference files in your data directory. The sketch below automates this check.

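A minimal sketch of such a check, assuming the data layout described in Step 1:

```python title="check_file_paths.py (illustrative sketch)"
import json
from pathlib import Path

data_dir = Path("data/your-benchmark")
missing = []

# Every non-null file_path must resolve to a real file under the data directory.
with open(data_dir / "standardized_data.jsonl", encoding="utf-8") as f:
    for line in f:
        task = json.loads(line)
        file_path = task.get("file_path")
        if file_path and not (data_dir / file_path).is_file():
            missing.append((task["task_id"], file_path))

if missing:
    for task_id, file_path in missing:
        print(f"[{task_id}] missing file: {file_path}")
else:
    print("All file_path entries resolve correctly.")
```
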
### Step 4: Test Your Benchmark

Validate your benchmark integration with comprehensive testing.

#### Initial Testing

Start with a small subset to verify everything works correctly:

```bash title="Test Benchmark Integration"
uv run main.py common-benchmark \
    --config_file_name=agent_quickstart_1 \
    benchmark=your-benchmark \
    benchmark.execution.max_tasks=3 \
    output_dir="logs/test-your-benchmark/$(date +"%Y%m%d_%H%M")"
```

#### Full Evaluation

Once testing passes, run the complete benchmark:

```bash title="Run Full Benchmark"
uv run main.py common-benchmark \
    --config_file_name=agent_quickstart_1 \
    benchmark=your-benchmark \
    output_dir="logs/your-benchmark/$(date +"%Y%m%d_%H%M")"
```

### Step 5: Validate Results

Review the evaluation outputs to ensure proper integration:

#### Check Output Files

```bash title="Verify Results"
# List generated files
ls -la logs/your-benchmark/

# Review a sample task log
cat logs/your-benchmark/task_*_attempt_1.json | head -50
```

#### Expected Output Structure

Your benchmark run should generate the following artifacts; a quick way to summarize them is sketched after this list:

- Individual task execution logs
- Aggregate benchmark results (`benchmark_results.jsonl`)
- Accuracy summary files
- Hydra configuration logs

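For a quick aggregate view, you can tally the results file directly. The sketch below is illustrative only: the exact schema of `benchmark_results.jsonl` depends on your MiroFlow version, so the `judge_result` field and `"CORRECT"` value used here are assumptions to verify against a real output line:

```python title="summarize_results.py (illustrative sketch)"
import json

correct = 0
total = 0

# NOTE: "judge_result" and "CORRECT" are assumed names; inspect one line of
# your own benchmark_results.jsonl and adjust the field and value accordingly.
with open("logs/your-benchmark/benchmark_results.jsonl", encoding="utf-8") as f:
    for line in f:
        result = json.loads(line)
        total += 1
        if result.get("judge_result") == "CORRECT":
            correct += 1

print(f"{correct}/{total} tasks correct ({correct / max(total, 1):.1%})")
```
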
!!! info "Documentation Info"
    **Last Updated:** September 2025 · **Doc Contributor:** Team @ MiroMind AI