# Vision Tools (`vision_mcp_server.py`)

The Vision MCP Server enables OCR and Visual Question Answering (VQA) over images, plus multimodal understanding of YouTube videos, with pluggable backends (Anthropic, OpenAI, Google Gemini).

---

## Environment Variables

!!! warning "Where to Modify"
    `vision_mcp_server.py` reads environment variables passed through the `tool-image-video.yaml` configuration file, not directly from the `.env` file.

- Vision Backend Control (see the selection sketch after this list):
    - `ENABLE_CLAUDE_VISION`: `"true"` to enable the Anthropic vision backend.
    - `ENABLE_OPENAI_VISION`: `"true"` to enable the OpenAI vision backend.
- Anthropic Configuration:
    - `ANTHROPIC_API_KEY`
    - `ANTHROPIC_BASE_URL`: default = `https://api.anthropic.com`
    - `ANTHROPIC_MODEL_NAME`: default = `claude-3-7-sonnet-20250219`
- OpenAI Configuration:
    - `OPENAI_API_KEY`
    - `OPENAI_BASE_URL`: default = `https://api.openai.com/v1`
    - `OPENAI_MODEL_NAME`: default = `gpt-4o`
- Gemini Configuration:
    - `GEMINI_API_KEY`
    - `GEMINI_MODEL_NAME`: default = `gemini-2.5-pro`
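
For illustration, here is a minimal sketch of how a server could choose a backend from these flags; the function name and precedence order are assumptions, not the actual implementation:

```python
import os

def select_vision_backend() -> str:
    # Hypothetical helper; the real precedence in vision_mcp_server.py may differ.
    if os.getenv("ENABLE_CLAUDE_VISION", "").lower() == "true":
        return "anthropic"
    if os.getenv("ENABLE_OPENAI_VISION", "").lower() == "true":
        return "openai"
    raise RuntimeError(
        'No vision backend enabled: set ENABLE_CLAUDE_VISION or ENABLE_OPENAI_VISION to "true".'
    )
```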

---

## `visual_question_answering(image_path_or_url: str, question: str)`

Ask questions about an image. The tool runs **two passes** (sketched after this list):

1. **OCR pass**: the selected vision backend extracts all visible text using a meticulous extraction prompt.

2. **VQA pass**: the backend analyzes the image to answer the question, cross-checking against the OCR text.
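
A minimal sketch of this two-pass flow, assuming the OpenAI backend; the prompts and the `_ask_about_image` helper are illustrative, not the server's actual code:

```python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def _ask_about_image(image_url: str, prompt: str) -> str:
    # One vision call: a text prompt plus the image.
    resp = client.chat.completions.create(
        model=os.getenv("OPENAI_MODEL_NAME", "gpt-4o"),
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return resp.choices[0].message.content

def visual_question_answering(image_url: str, question: str) -> str:
    # Pass 1: meticulous text extraction.
    ocr_text = _ask_about_image(image_url, "Meticulously extract ALL text visible in this image.")
    # Pass 2: answer the question, cross-checking against the OCR output.
    answer = _ask_about_image(
        image_url,
        f"Answer this question about the image: {question}\n"
        f"Cross-check your answer against this extracted text:\n{ocr_text}",
    )
    return f"OCR results: {ocr_text}\nVQA result: {answer}"
```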

**Parameters**

- `image_path_or_url`: Local path (accessible to the server) or web URL. HTTP URLs are auto-upgraded/validated to HTTPS for some backends.
- `question`: The user's question about the image.

**Returns**

- `str`: Concatenated text with:
    - `OCR results: ...`
    - `VQA result: ...`

**Features**

- Automatic MIME detection: reads magic bytes first, falls back to the file extension, and finally defaults to `image/jpeg` (see the sketch below).
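
A sketch of that detection order; the magic-byte table covers only a few common formats and is an assumption, not the server's exact list:

```python
import mimetypes

_MAGIC_BYTES = {
    b"\x89PNG\r\n\x1a\n": "image/png",
    b"\xff\xd8\xff": "image/jpeg",
    b"GIF87a": "image/gif",
    b"GIF89a": "image/gif",
}

def detect_image_mime(data: bytes, filename: str) -> str:
    # 1. Check magic bytes at the start of the file.
    for magic, mime in _MAGIC_BYTES.items():
        if data.startswith(magic):
            return mime
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    # 2. Fall back to the file extension.
    guessed, _ = mimetypes.guess_type(filename)
    if guessed:
        return guessed
    # 3. Final default.
    return "image/jpeg"
```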

---

## `visual_audio_youtube_analyzing(url: str, question: str = "", provide_transcribe: bool = False)`

Analyze **public YouTube videos** (audio + visual). Supports watch pages, Shorts, and Live VODs.

- Accepted URL patterns: `youtube.com/watch`, `youtube.com/shorts`, `youtube.com/live` (see the validation sketch below).
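
A sketch of a validator for those patterns; the regex is an assumption, not the server's actual check:

```python
import re

# Matches the three accepted URL shapes: watch pages, Shorts, and Live VODs.
_YOUTUBE_URL = re.compile(
    r"^https?://(?:www\.)?youtube\.com/(?:watch\?v=|shorts/|live/)[\w-]+"
)

def is_supported_youtube_url(url: str) -> bool:
    return _YOUTUBE_URL.match(url) is not None
```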

**Parameters**

- `url`: YouTube video URL (publicly accessible).
- `question` (optional): A specific question about the video. You can scope it by time using `MM:SS` or `MM:SS-MM:SS` (e.g., `01:45`, `03:20-03:45`).
- `provide_transcribe` (optional, default `False`): If `True`, returns a **timestamped transcription** including salient events and brief visual descriptions.

**Returns**

- `str`: The transcription of the video (if requested) and/or the answer to the question.

**Features**

- **Gemini-powered** video analysis (requires `GEMINI_API_KEY`); see the sketch below.
- Dual mode: full transcription, targeted Q&A, or both in one call.
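
A minimal sketch of the Gemini call, using the `google-genai` SDK's support for passing public YouTube URLs as file data; the model fallback, prompt, and placeholder URL are illustrative:

```python
import os
from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

response = client.models.generate_content(
    model=os.getenv("GEMINI_MODEL_NAME", "gemini-2.5-pro"),
    contents=types.Content(parts=[
        # Gemini accepts a public YouTube URL directly as file data.
        types.Part(file_data=types.FileData(file_uri="https://www.youtube.com/watch?v=VIDEO_ID")),
        types.Part(text="What happens at 01:45? Answer concisely."),
    ]),
)
print(response.text)
```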

---

**Last Updated:** Sep 2025
**Doc Contributor:** Team @ MiroMind AI