# Vision Tools (`vision_mcp_server.py`)

The Vision MCP Server enables OCR and Visual Question Answering (VQA) over images, plus multimodal understanding of YouTube videos, with pluggable backends (Anthropic, OpenAI, Google Gemini).

---

## Environment Variables

!!! warning "Where to Modify"
    `vision_mcp_server.py` reads environment variables passed through the `tool-image-video.yaml` configuration file, not directly from the `.env` file.

- Vision Backend Control (see the selection sketch after this list):
    - `ENABLE_CLAUDE_VISION`: `"true"` to enable the Anthropic vision backend.
    - `ENABLE_OPENAI_VISION`: `"true"` to enable the OpenAI vision backend.
- Anthropic Configuration:
    - `ANTHROPIC_API_KEY`
    - `ANTHROPIC_BASE_URL`: default = `https://api.anthropic.com`
    - `ANTHROPIC_MODEL_NAME`: default = `claude-3-7-sonnet-20250219`
- OpenAI Configuration:
    - `OPENAI_API_KEY`
    - `OPENAI_BASE_URL`: default = `https://api.openai.com/v1`
    - `OPENAI_MODEL_NAME`: default = `gpt-4o`
- Gemini Configuration:
    - `GEMINI_API_KEY`
    - `GEMINI_MODEL_NAME`: default = `gemini-2.5-pro`
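
For illustration, here is a minimal sketch of how a server could choose a backend from these flags; the function name and precedence order are assumptions, not the actual implementation:

```python
import os

def select_vision_backend() -> str:
    # Hypothetical helper; the real precedence in vision_mcp_server.py may differ.
    if os.getenv("ENABLE_CLAUDE_VISION", "").lower() == "true":
        return "anthropic"
    if os.getenv("ENABLE_OPENAI_VISION", "").lower() == "true":
        return "openai"
    raise RuntimeError(
        'No vision backend enabled: set ENABLE_CLAUDE_VISION or ENABLE_OPENAI_VISION to "true".'
    )
```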

---

## `visual_question_answering(image_path_or_url: str, question: str)`

Ask questions about an image. The tool runs **two passes** (sketched after this list):

1. **OCR pass**: the selected vision backend extracts all visible text using a meticulous extraction prompt.

2. **VQA pass**: the backend analyzes the image to answer the question, cross-checking against the OCR text.
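
A minimal sketch of this two-pass flow, assuming the OpenAI backend; the prompts and the `_ask_about_image` helper are illustrative, not the server's actual code:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def _ask_about_image(image_url: str, prompt: str) -> str:
    # One vision call: a text prompt plus the image.
    resp = client.chat.completions.create(
        model=os.getenv("OPENAI_MODEL_NAME", "gpt-4o"),
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

def visual_question_answering(image_url: str, question: str) -> str:
    # Pass 1: meticulous text extraction.
    ocr_text = _ask_about_image(image_url, "Meticulously extract ALL text visible in this image.")
    # Pass 2: answer the question, cross-checking against the OCR output.
    answer = _ask_about_image(
        image_url,
        f"Answer this question about the image: {question}\n"
        f"Cross-check your answer against this extracted text:\n{ocr_text}",
    )
    return f"OCR results: {ocr_text}\nVQA result: {answer}"
```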

**Parameters**

- `image_path_or_url`: Local path (accessible to the server) or web URL. HTTP URLs are auto-upgraded/validated to HTTPS for some backends.
- `question`: The user's question about the image.

**Returns**

- `str`: Concatenated text with:
    - `OCR results: ...`
    - `VQA result: ...`

**Features**

- Automatic MIME detection: reads magic bytes first, falls back to the file extension, and finally defaults to `image/jpeg` (see the sketch below).
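
A sketch of that detection order; the magic-byte table covers only a few common formats and is an assumption, not the server's exact list:

```python
import mimetypes

_MAGIC_BYTES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}

def detect_image_mime(data: bytes, filename: str) -> str:
    # 1. Check magic bytes at the start of the file.
    for magic, mime in _MAGIC_BYTES.items():
        if data.startswith(magic):
            return mime
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    # 2. Fall back to the file extension.
    guessed, _ = mimetypes.guess_type(filename)
    if guessed:
        return guessed
    # 3. Final default.
    return "image/jpeg"
```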

---

## `visual_audio_youtube_analyzing(url: str, question: str = "", provide_transcribe: bool = False)`

Analyze **public YouTube videos** (audio + visual). Supports watch pages, Shorts, and Live VODs.

- Accepted URL patterns: `youtube.com/watch`, `youtube.com/shorts`, `youtube.com/live` (see the validation sketch below).
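
A sketch of a validator for those patterns; the regex is an assumption, not the server's actual check:

```python
import re

# Matches the three accepted URL shapes: watch pages, Shorts, and Live VODs.
_YOUTUBE_URL = re.compile(
    r"^https?://(?:www\.)?youtube\.com/(?:watch\?v=|shorts/|live/)[\w-]+"
)

def is_supported_youtube_url(url: str) -> bool:
    return _YOUTUBE_URL.match(url) is not None
```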

**Parameters**

- `url`: YouTube video URL (publicly accessible).
- `question` (optional): A specific question about the video. You can scope it by time using `MM:SS` or `MM:SS-MM:SS` (e.g., `01:45`, `03:20-03:45`).
- `provide_transcribe` (optional, default `False`): If `True`, returns a **timestamped transcription** including salient events and brief visual descriptions.

**Returns**

- `str`: The transcription of the video (if requested) and/or the answer to the question.

**Features**

- **Gemini-powered** video analysis (requires `GEMINI_API_KEY`); see the sketch below.
- Dual mode: full transcription, targeted Q&A, or both in one call.
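
A minimal sketch of the Gemini call, using the `google-genai` SDK's support for passing public YouTube URLs as file data; the model fallback, prompt, and placeholder URL are illustrative:

```python
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model=os.getenv("GEMINI_MODEL_NAME", "gemini-2.5-pro"),
    contents=types.Content(parts=[
        # Gemini accepts a public YouTube URL directly as file data.
        types.Part(file_data=types.FileData(file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
        types.Part(text="What happens at 01:45? Answer concisely."),
    ]),
)
print(response.text)
```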

---

**Last Updated:** Sep 2025
**Doc Contributor:** Team @ MiroMind AI