🎯 Goal
Multimodal Attachments for Vision-Enabled Chat (Files & Images → LLM Vision)
📖 Context
Users want to ask questions about screenshots, diagrams, PDFs, and photos inside chat. We’ll let them attach files and images, have Jan-Server preprocess the attachments, and feed the results to vision-capable LLMs (served via vLLM) so the model can “see” and reason over the content alongside the text.
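For reference, vLLM exposes an OpenAI-compatible chat API in which images ride along as `image_url` content parts next to text. Below is a minimal sketch of the kind of request Jan-Server would ultimately emit; the base URL, model name, and file path are placeholders, not decided config:

```python
# Sketch: send one image plus a text question to a vision model behind
# vLLM's OpenAI-compatible endpoint. Endpoint, model, and path are
# illustrative assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def image_to_data_url(path: str) -> str:
    """Inline a local image as a base64 data URL for an image_url part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{b64}"

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder vision model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What does this diagram show?"},
            {"type": "image_url",
             "image_url": {"url": image_to_data_url("diagram.png")}},
        ],
    }],
)
print(response.choices[0].message.content)
```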
✅ Scope
- Upload & reference: TBD
- Vision routing: TBD
- Preprocessing pipeline: TBD
- Context assembly: TBD
- Storage & access: TBD
❌ Out of Scope
- Image generation/editing (diffusion/paint)
- Video or audio modalities (transcription, VQA on video)
- Full document management features (versioning, sharing UI)
- Complex table extraction or layout-aware PDF QA beyond basic OCR + page images
❓ Open questions
- Model targeting: which default vision model(s) do we ship, and should requests auto-fall back to a text-only model with OCR text substituted for images? (fallback sketch below)
- Limits: max image resolution (e.g., 2048px longest side), max pages per PDF, max total bytes per request? (downscaling sketch below)
- OCR: which engine by default; which language packs; accuracy vs. speed tradeoffs; opt-out per org?
- Security: default denylist/allowlist of domains for remote fetches; redaction of sensitive OCR text? (allowlist sketch below)
- Budgeting: how do we prioritize pages/regions when the context budget is tight (first N pages, or a heuristic favoring text-dense images)?
- Caching: cache preprocessed artifacts (thumbnails, OCR text) by content hash to cut latency; default TTLs? (keying sketch below)
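On the fallback question: one possible shape is to swap each image part for its OCR text when the selected model lacks vision. The capability flag and the `run_ocr` helper here are hypothetical, shown only to make the option concrete:

```python
# Sketch of the proposed fallback: keep image parts for vision models,
# replace them with OCR text otherwise. model_supports_vision and
# run_ocr are assumed, not existing Jan-Server APIs.
def build_content(parts: list[dict], model_supports_vision: bool,
                  run_ocr) -> list[dict]:
    """Rewrite message content parts for the target model's capabilities."""
    if model_supports_vision:
        return parts
    out = []
    for part in parts:
        if part["type"] == "image_url":
            text = run_ocr(part["image_url"]["url"])
            out.append({"type": "text",
                        "text": f"[OCR of attached image]\n{text}"})
        else:
            out.append(part)
    return out
```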
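On limits: whatever cap we pick, enforcing a longest-side maximum at preprocessing time is straightforward. A sketch using Pillow; the 2048px figure is the example from the question above, not a decided limit:

```python
# Sketch of a resolution cap applied during preprocessing.
from PIL import Image

MAX_SIDE = 2048  # assumed cap; see the "Limits" open question

def downscale(img: Image.Image, max_side: int = MAX_SIDE) -> Image.Image:
    """Shrink so the longest side is at most max_side, preserving aspect ratio."""
    longest = max(img.size)
    if longest <= max_side:
        return img
    scale = max_side / longest
    new_size = (round(img.width * scale), round(img.height * scale))
    return img.resize(new_size, Image.LANCZOS)
```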
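On security: if we go with an allowlist for remote fetches, the gate can be a simple host check before any network call. The policy and domain list below are assumptions for illustration:

```python
# Sketch of an allowlist gate for remote attachment URLs. The hosts and
# the allowlist-vs-denylist choice are open; this shows the allowlist case.
from urllib.parse import urlparse

ALLOWED_HOSTS = {"example.com", "cdn.example.com"}  # hypothetical policy

def fetch_allowed(url: str) -> bool:
    """Permit only http(s) URLs whose exact host is allowlisted."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and parsed.hostname in ALLOWED_HOSTS
```

Subdomain wildcards, IP-literal blocking, and redirect handling would need separate decisions.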
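On caching: keying artifacts by content hash means identical uploads hit the same cache entry regardless of filename or uploader. A sketch of the keying scheme; the TTL values and the backing store (Redis, disk, object storage) are all undecided:

```python
# Sketch of content-hash keys for preprocessed artifacts (thumbnails,
# OCR text). TTLs are placeholders.
import hashlib

THUMBNAIL_TTL_S = 7 * 24 * 3600   # assumed one-week TTL
OCR_TEXT_TTL_S = 30 * 24 * 3600   # assumed 30-day TTL

def artifact_key(file_bytes: bytes, kind: str) -> str:
    """Derive a cache key from content, e.g. attachment:<sha256>:thumbnail."""
    digest = hashlib.sha256(file_bytes).hexdigest()
    return f"attachment:{digest}:{kind}"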