
Commit e5c92d0

feat(doc): add tool-vqa and tool-reasoning doc (#42)
* add new tools doc, support xbench-ds benchmark preparation
* docs(prepare-benchmark): add xbench-ds
* make doc clearer
1 parent d041bbc commit e5c92d0


5 files changed: +108 / -12 lines


docs/mkdocs/docs/contribute_tools.md

Lines changed: 1 addition & 7 deletions
@@ -62,13 +62,7 @@ sub_agents:
 - tool-audio
 - new-tool-name # 👈 Add your new tool here
 ...
-```
-
-
-## Examples
-- `tool-reasoning` – reasoning utilities
-- `tool-image-video` – visual understanding
-- `new-tool-name` – your custom tool
+```
 
 ---
 

docs/mkdocs/docs/tool_reasoning.md

Lines changed: 35 additions & 1 deletion
@@ -1,7 +1,41 @@
+# Reasoning Tools (`reasoning_mcp_server.py`)
 
-# - Coming Soon -
+The Reasoning MCP Server provides a **pure text-based reasoning engine**. It supports logical analysis, problem solving, and planning, using LLM backends (OpenAI or Anthropic) with retries and exponential backoff for robustness.
 
+## Environment Variables
+!!! warning "Where to Modify"
+    `reasoning_mcp_server.py` reads environment variables passed through the `tool-reasoning.yaml` configuration file, not directly from the `.env` file.
+- OpenAI Configuration:
+    - `OPENAI_API_KEY`
+    - `OPENAI_BASE_URL`: default = `https://api.openai.com/v1`
+    - `OPENAI_MODEL_NAME`: default = `o3`
+
+- Anthropic Configuration:
+    - `ANTHROPIC_API_KEY`
+    - `ANTHROPIC_BASE_URL`: default = `https://api.anthropic.com`
+    - `ANTHROPIC_MODEL_NAME`: default = `claude-3-7-sonnet-20250219`
 
 ---
+
+## `reasoning(question: str)`
+Perform step-by-step reasoning, analysis, and planning over a **text-only input**. This tool is specialized for **complex thinking tasks**.
+
+**Parameters**
+
+- `question`: A detailed, complex question or problem statement that includes all necessary information. The tool does not fetch external data or context.
+
+**Returns**
+
+- `str`: A structured, step-by-step reasoned answer.
+
+**Features**
+
+- Runs on OpenAI or Anthropic models, depending on which API keys are available.
+- Retries with exponential backoff (up to 5 attempts).
+- For Anthropic, uses **Thinking mode** with a token budget (21k max tokens, 19k thinking budget).
+- Ensures non-empty responses, with fallback error reporting.
+
+---
+
 **Last Updated:** Sep 2025
 **Doc Contributor:** Team @ MiroMind AI
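The configuration lookup and retry behavior described in the new `tool_reasoning.md` can be sketched in Python, the language of `reasoning_mcp_server.py`. This is a minimal illustration, not the server's code. `ProviderSpec`, `resolve_backends`, and `call_with_backoff` are hypothetical names. The 1-second base delay and the wording of the fallback error message are assumptions. The sketch also deliberately does not pick a provider when both keys are set, because the doc does not say which one wins.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Optional

# Hypothetical helpers; not taken from reasoning_mcp_server.py.


@dataclass(frozen=True)
class ProviderSpec:
    key_var: str
    base_url_var: str
    base_url_default: str
    model_var: str
    model_default: str


# Variable names and defaults as listed in the tool-reasoning doc.
PROVIDERS = {
    "openai": ProviderSpec("OPENAI_API_KEY", "OPENAI_BASE_URL",
                           "https://api.openai.com/v1", "OPENAI_MODEL_NAME", "o3"),
    "anthropic": ProviderSpec("ANTHROPIC_API_KEY", "ANTHROPIC_BASE_URL",
                              "https://api.anthropic.com", "ANTHROPIC_MODEL_NAME",
                              "claude-3-7-sonnet-20250219"),
}


def resolve_backends(env: Mapping[str, str]) -> Dict[str, Dict[str, str]]:
    """Settings for each provider whose API key is set; unset values use the defaults."""
    resolved: Dict[str, Dict[str, str]] = {}
    for name, spec in PROVIDERS.items():
        api_key = env.get(spec.key_var, "")
        if not api_key:
            continue  # no key, so this backend is unavailable
        resolved[name] = {
            "api_key": api_key,
            "base_url": env.get(spec.base_url_var) or spec.base_url_default,
            "model": env.get(spec.model_var) or spec.model_default,
        }
    return resolved


def call_with_backoff(call: Callable[[], str], max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry `call` with exponentially growing delays until it returns non-empty text."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            text = call()
            if text and text.strip():
                return text
            last_error = ValueError("empty response")
        except Exception as exc:  # a real server would narrow this to provider errors
            last_error = exc
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, 8s with the defaults
    # Fallback error report instead of returning an empty answer.
    return f"Reasoning failed after {max_attempts} attempts: {last_error}"
```

In use, `resolve_backends(os.environ)` works directly, because `os.environ` is a mapping. `call` would wrap a single provider request, for example `call_with_backoff(lambda: ask(question))`, where `ask` stands in for whatever client call the server makes (hypothetical). For the Anthropic backend, the doc's "21k max, 19k thinking" budget corresponds to the Messages API `max_tokens` and `thinking` (`budget_tokens`) parameters. This commit does not show the exact values the server passes.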

docs/mkdocs/docs/tool_vqa.md

Lines changed: 70 additions & 2 deletions
@@ -1,7 +1,75 @@
+# Vision Tools (`vision_mcp_server.py`)
 
-# - Coming Soon -
+The Vision MCP Server provides OCR and Visual Question Answering (VQA) over images, plus multimodal understanding of YouTube videos, with pluggable backends (Anthropic, OpenAI, Google Gemini).
+
+---
+
+## Environment Variables
+!!! warning "Where to Modify"
+    `vision_mcp_server.py` reads environment variables passed through the `tool-image-video.yaml` configuration file, not directly from the `.env` file.
+- Vision Backend Control:
+    - `ENABLE_CLAUDE_VISION`: set to `"true"` to enable the Anthropic vision backend.
+    - `ENABLE_OPENAI_VISION`: set to `"true"` to enable the OpenAI vision backend.
+- Anthropic Configuration:
+    - `ANTHROPIC_API_KEY`
+    - `ANTHROPIC_BASE_URL`: default = `https://api.anthropic.com`
+    - `ANTHROPIC_MODEL_NAME`: default = `claude-3-7-sonnet-20250219`
+- OpenAI Configuration:
+    - `OPENAI_API_KEY`
+    - `OPENAI_BASE_URL`: default = `https://api.openai.com/v1`
+    - `OPENAI_MODEL_NAME`: default = `gpt-4o`
+- Gemini Configuration:
+    - `GEMINI_API_KEY`
+    - `GEMINI_MODEL_NAME`: default = `gemini-2.5-pro`
 
 
 ---
+
+## `visual_question_answering(image_path_or_url: str, question: str)`
+Ask questions about an image. The tool runs **two passes**:
+
+1. **OCR pass** using the selected vision backend with a meticulous extraction prompt.
+
+2. **VQA pass** that analyzes the image and cross-checks it against the OCR text.
+
+**Parameters**
+
+- `image_path_or_url`: Local path (accessible to the server) or web URL. For some backends, HTTP URLs are validated and upgraded to HTTPS.
+- `question`: The user’s question about the image.
+
+**Returns**
+
+- `str`: Concatenated text with:
+    - `OCR results: ...`
+    - `VQA result: ...`
+
+**Features**
+
+- Automatic MIME detection: reads magic bytes, falls back to the file extension, and defaults to `image/jpeg` as a last resort.
+
+---
+
+## `visual_audio_youtube_analyzing(url: str, question: str = "", provide_transcribe: bool = False)`
+Analyze **public YouTube videos** (audio + visual). Supports watch pages, Shorts, and Live VODs.
+
+- Accepted URL patterns: `youtube.com/watch`, `youtube.com/shorts`, `youtube.com/live`.
+
+**Parameters**
+
+- `url`: YouTube video URL (publicly accessible).
+- `question` (optional): A specific question about the video. You can scope it by time using `MM:SS` or `MM:SS-MM:SS` (e.g., `01:45`, `03:20-03:45`).
+- `provide_transcribe` (optional, default `False`): If `True`, returns a **timestamped transcription** including salient events and brief visual descriptions.
+
+**Returns**
+
+- `str`: The transcription of the video (if requested) and the answer to the question.
+
+**Features**
+
+- **Gemini-powered** video analysis (requires `GEMINI_API_KEY`).
+- Two modes, full transcript and targeted Q&A, usable separately or together.
+
+---
+
 **Last Updated:** Sep 2025
-**Doc Contributor:** Team @ MiroMind AI
+**Doc Contributor:** Team @ MiroMind AI
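Several behaviors described in the new `tool_vqa.md` can be illustrated in a few lines of Python:

- the `ENABLE_*_VISION` switches;
- the MIME fallback order (magic bytes, then the file extension, then `image/jpeg`);
- the three accepted YouTube URL patterns;
- `MM:SS` / `MM:SS-MM:SS` time scoping.

The sketch is hypothetical and not taken from `vision_mcp_server.py`. The following are all assumptions:

- the helper names;
- the case-insensitive flag check;
- the rule that each flag also needs its API key;
- the particular magic-byte signatures;
- the exact URL and timestamp matching rules.

```python
import mimetypes
import re
from typing import List, Mapping, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical helpers; not taken from vision_mcp_server.py.


# --- Backend switches: ENABLE_*_VISION ---
def enabled_vision_backends(env: Mapping[str, str]) -> List[str]:
    """Image backends whose flag is "true" and whose API key is set."""
    switches = [("anthropic", "ENABLE_CLAUDE_VISION", "ANTHROPIC_API_KEY"),
                ("openai", "ENABLE_OPENAI_VISION", "OPENAI_API_KEY")]
    return [name for name, flag, key in switches
            if env.get(flag, "").strip().lower() == "true" and env.get(key)]


# --- MIME detection: magic bytes, then extension, then image/jpeg ---
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def detect_image_mime(data: bytes, filename: Optional[str] = None) -> str:
    for signature, mime in _SIGNATURES:
        if data.startswith(signature):
            return mime
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":  # RIFF container holding WebP
        return "image/webp"
    if filename:
        guessed, _ = mimetypes.guess_type(filename)
        if guessed and guessed.startswith("image/"):
            return guessed
    return "image/jpeg"  # final default


# --- YouTube URL patterns and MM:SS scoping ---
def is_supported_youtube_url(url: str) -> bool:
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if parsed.scheme not in ("http", "https") or host != "youtube.com":
        return False
    return parsed.path == "/watch" or parsed.path.startswith(("/shorts/", "/live/"))


_TIME_SCOPE = re.compile(r"\b(\d{1,2}):([0-5]\d)(?:\s*-\s*(\d{1,2}):([0-5]\d))?\b")


def parse_time_scope(question: str) -> Optional[Tuple[int, Optional[int]]]:
    """(start_s, end_s) for the first MM:SS or MM:SS-MM:SS in `question`, else None."""
    match = _TIME_SCOPE.search(question)
    if match is None:
        return None
    start = int(match.group(1)) * 60 + int(match.group(2))
    end = (int(match.group(3)) * 60 + int(match.group(4))) if match.group(3) else None
    return start, end
```

Under these assumptions, a `youtu.be` short link is rejected because it is not among the three listed patterns. The doc does not say whether the server itself accepts such links.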

docs/mkdocs/mkdocs.yml

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ nav:
 - Overview: tool_overview.md
 - Tools:
 - tool-reasoning: tool_reasoning.md
-- tool-vqa: tool_vqa.md
+- tool-image-video: tool_vqa.md
 - tool-searching: tool_searching.md
 - tool-python: tool_python.md
 - Advanced Features:

scripts/run_prepare_benchmark.sh

Lines changed: 1 addition & 1 deletion
@@ -21,4 +21,4 @@ uv run main.py prepare-benchmark get browsecomp-test
 uv run main.py prepare-benchmark get browsecomp-zh-test
 uv run main.py prepare-benchmark get hle
 uv run main.py prepare-benchmark get xbench-ds
-uv run main.py prepare-benchmark get futurex
+uv run main.py prepare-benchmark get futurex
