docs: explain control plane services #4185
---
title: Overview
description: System architecture and deployment model
---
import { Cards, Card } from 'fumadocs-ui/components/card';
Unkey runs on AWS across multiple regions, using Kubernetes for container orchestration. The architecture is split between the control plane, which manages customer deployments, and the data plane, which serves traffic.

## Core Services

<Cards>
  <Card
    title="Control Plane (Ctrl)"
    description="Orchestrates deployments, builds containers via Depot, provisions TLS certificates, and configures routing using durable Restate workflows"
    href="./services/ctrl"
  />
  <Card
    title="Krane"
    description="Kubernetes deployment abstraction that manages StatefulSets across multiple clusters and regions without replicating control plane logic"
    href="./services/krane"
  />
  <Card
    title="API"
    description="Handles key verification, analytics queries, and management operations in Go. Deployed to multiple AWS regions behind Global Accelerator"
    href="./services/api/config"
  />
  <Card
    title="Gateway (GW)"
    description="Routes traffic to customer deployments by querying the partition database, terminating TLS, and proxying requests to Kubernetes pods"
    href="./services/gateway"
  />
  <Card
    title="ClickHouse"
    description="Stores analytics events for key verification logs, API usage metrics, and audit trails with automatic scaling and replication"
    href="./services/clickhouse"
  />
  <Card
    title="Vault"
    description="Encrypts sensitive data using envelope encryption with AWS KMS, decrypting on demand without storing plaintext secrets"
    href="./services/vault"
  />
</Cards>
---
title: Build System
description: Container image building for customer deployments
---

import { Mermaid } from "@/app/components/mermaid"
When a customer deploys their application, the following process occurs:
The CLI first requests a deployment from the control plane, which returns a presigned S3 URL. The CLI packages the source code into a tarball and uploads it directly to S3, bypassing the control plane for efficient transfer. Once uploaded, the CLI triggers the build by sending the S3 path to the control plane.
The control plane retrieves or creates a dedicated Depot project for the customer, then initiates a build with Depot. Depot provisions an isolated BuildKit machine, downloads the build context from S3, executes the Docker build, and pushes the resulting image to its registry. The image name is returned to the control plane.
With the built image ready, the control plane instructs Krane to create a deployment with specified resources (replicas, CPU, memory). Krane creates the necessary Kubernetes resources (StatefulSet and Service) and Kubernetes begins scheduling pods.
The control plane polls Krane every second (for up to 5 minutes) to check instance status. As instances become ready, their details are registered in the partition database. Once all instances are running, the control plane attempts to scrape an OpenAPI specification from the deployed service.
Finally, the control plane calls the RoutingService to atomically assign domains and create gateway configurations, and marks the deployment as ready in the database. Meanwhile, the CLI polls the control plane every 2 seconds until the deployment status becomes ready.
<Mermaid chart={`sequenceDiagram
    autonumber
    participant CLI
    participant Ctrl as Ctrl Plane
    participant S3
    participant Depot
    participant Krane
    participant K8s as Kubernetes
    participant DB as Partition DB
    CLI->>Ctrl: Create Deployment
    Ctrl->>CLI: Presigned S3 upload URL
    CLI->>S3: PUT tar file directly
    S3->>CLI: Upload complete
    CLI->>Ctrl: CreateBuild(s3_path)
    Ctrl->>Depot: Get/Create Depot Project
    Depot->>Ctrl: Project ID
    Ctrl->>Depot: Create Build
    Depot->>Ctrl: Build ID
    Depot->>S3: Download build context
    Depot->>Depot: Execute Docker build & push to registry
    Depot->>Ctrl: Image name & build ID
    Ctrl->>Krane: CreateDeployment(image, replicas, resources)
    Krane->>K8s: Create StatefulSet & Service
    K8s->>K8s: Schedule & start pods
    loop Poll until ready (max 5 min)
        Ctrl->>Krane: GetDeployment()
        Krane->>K8s: AppsV1.StatefulSets.Get
        K8s->>Krane: Instances: [{id, addr, status}]
        Krane->>Ctrl: Instances: [{id, addr, status}]
        Ctrl->>DB: Upsert VM records
    end
    K8s->>K8s: Pods running
    Ctrl->>K8s: HTTP GET /openapi.yaml
    K8s->>Ctrl: OpenAPI spec
    Ctrl->>Ctrl: AssignDomains (RoutingService)<br/>- Create gateway configs<br/>- Assign domains
    Ctrl->>DB: Update deployment status: READY
    loop CLI polls every 2s
        CLI->>Ctrl: GetDeployment()
        Ctrl->>CLI: Deployment status
    end
    CLI->>CLI: Status = READY, deployment complete
`} />
## Build Backends

We support two build backends, configurable via the `BUILD_BACKEND` environment variable.
### Depot (Production)

Depot.dev provides isolated, cached, high-performance container builds. Builds are fast thanks to persistent layer caching across builds, and each customer project gets an isolated build environment with its own cache. No local Docker daemon is required since builds run on remote BuildKit machines. Multi-architecture support allows building for both amd64 and arm64, and registry integration is built in, pushing images directly to Depot's registry after the build completes.

**Location:** `go/apps/ctrl/services/build/backend/depot/`
### Docker (Local Development)

The Docker backend uses standard Docker builds for local testing. It connects to the local Docker daemon and builds images on the host machine. This backend is simpler to set up for development but lacks the caching and isolation benefits of Depot.

**Location:** `go/apps/ctrl/services/build/backend/docker/`
## Storage

Build contexts are stored in S3-compatible storage. The upload process gives customers presigned URLs to upload their build context directly, bypassing the control plane for efficient transfer. During the build, Depot receives presigned download URLs to fetch the context from S3. Build contexts are retained for the lifecycle of the deployment, allowing rebuilds and rollbacks when needed.

**Location:** `go/apps/ctrl/services/build/storage/s3.go`
---
title: Control Plane (Ctrl)
description: The control plane service for managing deployments and infrastructure
---

import { Mermaid } from "@/app/components/mermaid";

**Location:** `go/apps/ctrl/`
**CLI Command:** [`unkey run ctrl`](/cli/run/ctrl)
**Protocol:** Connect RPC (HTTP/2)
## What It Does

The ctrl service provides a deployment platform similar to Vercel, Railway, or Fly.io. When a customer deploys their application, ctrl:

1. **Builds** container images from source code using Depot.dev
2. **Deploys** containers to Kubernetes via Krane
3. **Assigns** domains to route traffic and configure gateways
4. **Secures** applications with automatic TLS certificate provisioning

All multi-step operations are durable, using Restate workflows to ensure consistency even during failures, network partitions, or process crashes.
## Architecture

### Service Composition

The ctrl service is composed of several specialized services and workflows. The RPC services handle synchronous operations: container image building through `BuildService`, deployment creation and management through `DeploymentService`, ACME challenge coordination through `AcmeService`, OpenAPI spec management through `OpenApiService`, and health checks through `CtrlService`.

Running alongside these are the Restate workflows that provide durable orchestration. The `DeploymentService` workflow orchestrates the full deployment lifecycle, the `RoutingService` workflow manages domain and gateway configuration, and the `CertificateService` workflow handles TLS certificate provisioning through the ACME protocol.
### Technology Stack

The ctrl service is built on Connect RPC for service-to-service communication over HTTP/2. Restate provides durable workflow orchestration with exactly-once semantics, ensuring operations complete reliably even during failures. Two MySQL databases store persistent state: the main database for projects, deployments, and domains, and the partition database for VM instances and gateway configurations. S3 stores build contexts and encrypted vault data. Krane provides a Kubernetes deployment abstraction, and Depot.dev handles remote container image building with persistent layer caching.

## Services
### Build Service

The build service manages container image building for customer deployments. It supports two backends: Depot for production deployments, which provides remote BuildKit with persistent layer caching for fast rebuilds, and Docker for local development, which uses standard Docker builds on the local machine.

The service provides two key operations. `GenerateUploadURL` returns a presigned S3 URL where the CLI can upload a tarball of the build context. `CreateBuild` then builds a Docker image from that uploaded source, coordinating with either Depot or Docker depending on configuration.

[Read detailed Build System docs →](./build)
### Deployment Service

The deployment service orchestrates the complete deployment lifecycle through durable workflows. It provides four key operations: `CreateDeployment` initiates a new deployment, `GetDeployment` queries the current status, `Promote` promotes a deployment to live, and `Rollback` rolls back to a previous deployment.
The deployment workflow progresses through several phases. It first builds the container image if building from source, then creates the deployment in Krane, our Kubernetes abstraction layer. Next it polls for instance readiness for up to 5 minutes, checking every second whether all pods are running. Once instances are ready, it registers them in the partition database so gateways can route traffic to them. It attempts to scrape an OpenAPI spec from the running service, though this step is optional. Finally, it assigns domains and creates gateway configurations via the routing service, then marks the deployment as ready.

Restate implements [durable execution](https://www.restate.dev/what-is-durable-execution) by recording progress in a distributed persistent log managed by the Restate server. If ctrl crashes during a deployment, Restate resumes from the last completed phase rather than restarting from the beginning. This ensures deployments complete reliably even during system failures.
Deployments are keyed by `project_id` in Restate's virtual object model. This ensures only one deployment operation per project runs at a time, preventing race conditions during concurrent deploy, rollback, or promote operations that could leave the system in an inconsistent state.

[Read detailed Deployment Workflow docs →](/docs/architecture/workflows/deployment-service)
### ACME Service

The ACME service handles ACME protocol coordination for TLS certificate provisioning. It provides three key operations: `CreateACMEUser` registers an ACME account for a workspace, `ValidateDomain` validates domain ownership, and `GetCertificate` retrieves issued certificates.

The service coordinates with the Certificate workflow for actual certificate issuance. It supports both HTTP-01 challenges for custom domains and DNS-01 challenges via the Cloudflare provider for wildcard certificates on the default domain.

Private keys are encrypted using the vault service before storage. Certificates are stored in the partition database for fast gateway access without encryption overhead. Challenge records track certificate expiry with 90-day validity periods.

[Read detailed Certificate docs →](./certificates)
Review comment: the `./certificates` link above has no corresponding `certificates.mdx` (or `certificates/index.mdx`) in the Ctrl services documentation directory. Create the file or remove the link to avoid a broken cross-reference.
### OpenAPI Service

The OpenAPI service manages OpenAPI specifications scraped from deployed applications. It provides two key operations: `GetDiff` compares OpenAPI specs between deployments to detect breaking changes, and `GetSpec` retrieves the spec for a specific deployment.

Specs are scraped from `GET /openapi.yaml` on running instances during the deployment workflow. They are stored in the database and used for API documentation generation, request validation in gateways, and breaking-change detection between deployments.
## Workflows

Workflows are implemented as Restate services for durable execution. The Deployment Workflow handles deploy, rollback, and promote operations. The Routing Workflow manages domain assignment and gateway configuration. The Certificate Workflow processes ACME challenges for TLS certificate provisioning. See the individual workflow documentation pages for detailed implementation specifics.
## Database Schema

The ctrl service uses two MySQL databases. The main database (`unkey`) stores projects, environments, and workspaces, along with deployments and deployment history, domains and SSL certificates, and ACME users and challenges. The partition database (`partition_*`) stores VMs representing container instances, gateway configurations as JSON blobs, and certificates in PEM format.

The partition database is designed for horizontal sharding. Each partition can live on a separate database server, and gateway instances only need access to their assigned partition. This reduces the blast radius if a partition is compromised and allows scaling the gateway infrastructure independently.
## Monitoring

The ctrl service exposes metrics and logs through OpenTelemetry. Key metrics include deployment duration broken down by phase, build success and failure rates, the number of Krane poll iterations required for deployments to become ready, domain assignment latency, and ACME challenge success rates.

All operations include structured logging fields for correlation and debugging. Common fields include `deployment_id`, `project_id`, and `workspace_id` across all operations. Build operations add `build_id` and `depot_project_id`. System-level logs include `instance_id`, `region`, and `platform` to identify which ctrl instance handled the operation.

Logs are shipped to Grafana Loki in production for centralized log aggregation and querying.
| { | ||
| "title": "Ctrl", | ||
| "icon": "Pencil", | ||
| "root": false, | ||
| "pages": ["index", "build"] | ||
| } |
Review comment: of the six service card links in the new Cards section, five resolve to existing pages (`./services/ctrl` → ctrl/index.mdx, `./services/krane` → krane.mdx, `./services/api/config` → api/config.mdx, `./services/clickhouse` → clickhouse.mdx, `./services/vault` → vault.mdx). The `./services/gateway` link is broken — neither gateway.mdx nor gateway/index.mdx exists in the services directory. Create the gateway page or update the card's href to point to an existing page.