
goal: Multimodal Attachments for Vision-Enabled Chat (Files & Images → LLM Vision) #217

@locnguyen1986

Description


🎯 Goal

Multimodal Attachments for Vision-Enabled Chat (Files & Images → LLM Vision)

📖 Context

Users want to ask questions about screenshots, diagrams, PDFs, and photos inside chat. We’ll let them attach files/images and have Jan-Server preprocess and feed them to vision-capable LLMs (via vLLM) so the model can “see” and reason over content alongside text.
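A minimal sketch of the kind of request this implies, sent to a vision-capable model behind vLLM's OpenAI-compatible chat completions API. The endpoint URL, API key, and model name are placeholders for illustration, not Jan-Server's actual configuration:

```python
# Sketch: pass an attached image plus the user's question to a vision model via
# vLLM's OpenAI-compatible API. Endpoint and model below are assumed, not decided.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed vLLM server

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",  # placeholder vision model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this screenshot show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```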

✅ Scope

  • Upload & reference: TBD
  • Vision routing: TBD
  • Preprocessing pipeline: TBD
  • Context assembly: TBD
  • Storage & access: TBD

❌ Out of Scope

  • Image generation/editing (diffusion/paint)
  • Video or audio modalities (transcription, VQA on video)
  • Full document management features (versioning, sharing UI)
  • Complex table extraction or layout-aware PDF QA beyond basic OCR + page images

❓ Open questions

  • Model targeting: which default vision model(s) should we ship? Should requests auto-fall back to text-only models, passing only the extracted OCR text, when no vision model is available?
  • Limits: max image resolution (e.g., 2048px longest side), max pages per PDF, max total bytes per request?
  • OCR: which engine by default; language packs; accuracy vs speed tradeoffs; opt-out per org?
  • Security: default denylist/allowlist domains for remote fetches; redaction for sensitive OCR text?
  • Budgeting: how to prioritize pages/regions when context budget is tight (first N pages, heuristic on images with text density)?
  • Caching: should preprocessed artifacts (thumbnails, OCR text) be cached by content hash to cut latency, and with what default TTLs? See the sketch after this list.
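
One way the limits and caching questions above could fit together, sketched under stated assumptions: downscale images to a maximum longest side and cache derived artifacts keyed by a hash of the original bytes. The 2048px limit, the cache directory layout, and the injected `ocr` callable are illustrative assumptions, not decided defaults:

```python
# Sketch: downscale an attached image and cache its thumbnail + OCR text by content hash.
import hashlib
import io
from pathlib import Path
from typing import Callable

from PIL import Image

CACHE_DIR = Path("/tmp/attachment-cache")   # assumed cache location
MAX_LONGEST_SIDE = 2048                     # example limit from the bullet above


def preprocess_image(data: bytes, ocr: Callable[[Image.Image], str]) -> dict:
    """Downscale and OCR an image, reusing cached artifacts when the bytes match."""
    digest = hashlib.sha256(data).hexdigest()
    entry = CACHE_DIR / digest
    thumb_path, text_path = entry / "thumb.png", entry / "ocr.txt"

    if thumb_path.exists() and text_path.exists():
        # Cache hit: identical content hash means the preprocessed artifacts still apply.
        return {"thumbnail": thumb_path, "ocr_text": text_path.read_text()}

    img = Image.open(io.BytesIO(data)).convert("RGB")
    scale = MAX_LONGEST_SIDE / max(img.size)
    if scale < 1:  # only shrink, never upscale
        img = img.resize((round(img.width * scale), round(img.height * scale)))

    entry.mkdir(parents=True, exist_ok=True)
    img.save(thumb_path)
    text = ocr(img)
    text_path.write_text(text)
    return {"thumbnail": thumb_path, "ocr_text": text}
```

TTL enforcement is left out here; an eviction pass over `CACHE_DIR` (or a real cache backend) would answer the open question about default lifetimes.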
