🎯 Goal
Self-Hosted, Authenticated, OpenAI-Compatible Chat (vLLM-backed)
📖 Context
We need a drop-in, OpenAI-style API that teams can self-host. vLLM provides fast inference; Keycloak and Kong provide authentication and gateway controls. This service is the entry point for all other features.
✅ Scope
- OpenAI-compatible endpoints: `GET /v1/models`, `POST /v1/chat/completions` (JSON + SSE streaming)
- vLLM runner integration (single- and multi-GPU, prompt caching)
- Auth via Keycloak (OIDC) and Kong (key-auth); guest mode with quotas
- Token usage counters in responses; a consistent error envelope across endpoints
- Minimal “hello world” examples (curl, Python, TypeScript)
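As a sketch of the SSE half of `POST /v1/chat/completions`, the helper below parses an OpenAI-style event stream into content deltas. The chunk shape (`choices[0].delta.content`, terminated by `data: [DONE]`) follows the OpenAI streaming format; the sample payloads are illustrative, not output from any specific model.

```python
import json


def parse_sse_chunks(lines):
    """Yield content deltas from an OpenAI-style chat-completions SSE stream.

    Each event line looks like 'data: {...}'; the stream ends with
    'data: [DONE]'. Lines without a 'data:' prefix (comments,
    keep-alives) are skipped.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        content = delta.get("content")
        if content is not None:
            yield content


# Illustrative chunks in the OpenAI streaming shape:
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":", world"}}]}',
    'data: [DONE]',
]
print("".join(parse_sse_chunks(sample)))  # -> Hello, world
```

A real client would iterate the response body line-by-line instead of a list, but the framing logic is the same.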
🛠 Deliverables
- OpenAPI spec & conformance tests
- vLLM runner + health probes; model registry config
- Gateway policies (rate limiting, CORS/CSRF)
- Docker Compose for local development; Helm values for production
- Example clients + Postman collection
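One way the conformance tests could pin down the "consistent error envelope" item is a shape check. The field set here (`message`, `type`, `code`) mirrors the OpenAI error format and is an assumption to be fixed in the OpenAPI spec:

```python
def is_valid_error_envelope(body: dict) -> bool:
    """Check an error response against an OpenAI-style envelope:
    {"error": {"message": str, "type": str, "code": str|None}}.
    The exact field set is an assumption to confirm against the spec."""
    err = body.get("error")
    if not isinstance(err, dict):
        return False
    if not isinstance(err.get("message"), str):
        return False
    if not isinstance(err.get("type"), str):
        return False
    return "code" in err  # value may be None


# Passes: matches the assumed envelope shape.
ok = {"error": {"message": "Invalid API key",
                "type": "authentication_error",
                "code": "invalid_api_key"}}
# Fails: FastAPI/other default shape, not the agreed envelope.
bad = {"detail": "Invalid API key"}
print(is_valid_error_envelope(ok), is_valid_error_envelope(bad))  # True False
```

A check like this can run in the conformance suite against every deliberately-triggered error (bad key, rate limit, unknown model).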