Description
Feature Summary
Texera currently lacks systematic observability instrumentation, which makes it difficult to monitor and debug live services and distributed workflows. This feature request proposes an observability solution based on OpenTelemetry standards that enables centralized logging, metrics collection, and distributed tracing across backend services. The resulting data can then be exported to open-source observability tools.
Proposed Solution or Design
Current Observability Gaps
Logging:
- Logback (Scala) and loguru (Python) with file/console output only
- Logs are ephemeral in Kubernetes (lost when pods restart)
- Cannot correlate logs across services for a single workflow execution
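The correlation gap above can be illustrated with a minimal sketch that uses only the Python standard library rather than loguru or OpenTelemetry. It stamps every log record with a workflow ID and trace ID so that entries from one execution can be searched together. The context variable names, the logger name `texera.example`, and the sample IDs are assumptions made for illustration, not existing Texera code.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variables; the names are illustrative, not Texera's.
workflow_id_var = contextvars.ContextVar("workflow_id", default="-")
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current correlation IDs onto every log record."""

    def filter(self, record):
        record.workflow_id = workflow_id_var.get()
        record.trace_id = trace_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "workflow_id": record.workflow_id,
            "trace_id": record.trace_id,
        })


def build_logger(name, stream):
    """Create a logger that writes correlated JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(CorrelationFilter())
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger


# Usage: set the IDs once per execution, then log as usual.
buffer = io.StringIO()
log = build_logger("texera.example", buffer)
workflow_id_var.set("wf-42")  # sample value
trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")  # sample value
log.info("operator started")
print(buffer.getvalue().strip())
```

With a real OpenTelemetry setup the trace ID would come from the active span instead of being set by hand; the sketch only shows the shape of a correlated log entry.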
Metrics:
- No application-level metrics (request rates, error rates, latency, database query times)
Tracing:
- No distributed tracing implementation
- Cannot trace a workflow execution across multiple services, Python workers, database queries, or external API calls
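Tracing a request across services depends on every hop forwarding the same trace ID. OpenTelemetry propagates it by default through the W3C Trace Context `traceparent` header. The sketch below hand-rolls that header format with the standard library purely to show the mechanism; in practice the OpenTelemetry SDK propagators would do this, and the function names here are hypothetical.

```python
import re
import secrets

# W3C Trace Context header: version-traceid-parentid-flags, lowercase hex.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace ID and 8-byte span ID."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return (trace_id, span_id, flags), or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero IDs are not allowed by the specification.
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, flags


def child_traceparent(incoming):
    """Keep the caller's trace ID and mint a fresh span ID for this hop."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        return new_traceparent()  # missing or broken context: start a new trace
    trace_id, _parent_span_id, flags = parsed
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# Usage: the receiving service derives its own header from the incoming one.
root = new_traceparent()
hop = child_traceparent(root)
print(root)
print(hop)
```

Because the trace ID survives each hop while the span ID changes, a tracing backend can reassemble the spans emitted by different services into one trace.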
Health Checks:
- Basic /api/healthcheck endpoints return {"status": "ok"} only
- No real health checks or detailed status
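A more detailed health check could report the status of each dependency instead of a fixed "ok". The following is a minimal sketch under assumptions: the check names (`database`, `worker_pool`), the response fields, and the "degraded" status value are hypothetical and not part of the current endpoint.

```python
import json
import time


def run_health_checks(checks):
    """Run each named check and collect per-dependency results."""
    results = {}
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            status, detail = "ok", None
        except Exception as exc:  # a failing check must not crash the endpoint
            status, detail = "fail", str(exc)
        results[name] = {
            "status": status,
            "latency_ms": round((time.perf_counter() - started) * 1000, 2),
        }
        if detail is not None:
            results[name]["detail"] = detail
    all_ok = all(r["status"] == "ok" for r in results.values())
    return {"status": "ok" if all_ok else "degraded", "checks": results}


# Hypothetical checks; real ones would ping the database, workers, and so on.
def check_database():
    pass  # pretend the query succeeded


def check_worker_pool():
    raise ConnectionError("no workers reachable")


report = run_health_checks({
    "database": check_database,
    "worker_pool": check_worker_pool,
})
print(json.dumps(report, indent=2))
```

Separating the overall status from the per-check details lets a Kubernetes probe look only at the top-level field while operators read the breakdown.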
Proposed Solution
High-Level Approach
Add OpenTelemetry instrumentation throughout the codebase to emit logs, metrics, and traces in a standardized format. These signals can then be collected and exported to various open-source observability tools.
Implementation Strategy
Instrumentation Layer:
- Add OpenTelemetry SDK to all services (Scala/Java and Python)
- Use auto-instrumentation, which requires no code changes, where possible (HTTP, JDBC, Akka)
- Migrate current logging to use OpenTelemetry
- Based on need and use cases, add manual instrumentation for metrics and traces
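To make "manual instrumentation" concrete, here is a rough sketch of a decorator that records request count, error count, and latency for one operation. It deliberately uses plain in-memory dictionaries as a stand-in for OpenTelemetry counters and histograms so that it runs without extra dependencies; the operation name `workflow.execute` and the `execute_workflow` function are hypothetical.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-in for a metrics backend (assumption: a real setup would
# use OpenTelemetry counter and histogram instruments instead).
request_count = defaultdict(int)
error_count = defaultdict(int)
latency_ms = defaultdict(list)


def instrumented(operation):
    """Record calls, failures, and duration for the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            started = time.perf_counter()
            request_count[operation] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                error_count[operation] += 1
                raise  # instrumentation must not swallow the error
            finally:
                elapsed = (time.perf_counter() - started) * 1000
                latency_ms[operation].append(elapsed)
        return wrapper
    return decorator


@instrumented("workflow.execute")
def execute_workflow(workflow_id, fail=False):
    """Hypothetical workflow entry point used only for this example."""
    if fail:
        raise RuntimeError(f"workflow {workflow_id} failed")
    return f"workflow {workflow_id} finished"


# Usage: one successful run and one failed run.
execute_workflow("wf-1")
try:
    execute_workflow("wf-2", fail=True)
except RuntimeError:
    pass

error_rate = error_count["workflow.execute"] / request_count["workflow.execute"]
print(request_count["workflow.execute"], error_count["workflow.execute"], error_rate)
```

Keeping the measurement in a decorator leaves the business logic untouched, which is the main reason to reserve manual instrumentation for the few code paths that auto-instrumentation cannot see.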
Collection Layer:
- Deploy OpenTelemetry Collector (as DaemonSet in Kubernetes) to collect logs, metrics, and traces
- Collector can export to various backends (configurable, not hardcoded)
Observability Backends:
- The standardized OpenTelemetry data can be integrated with open-source tools such as Grafana and Elastic.
Benefits
- Faster debugging: Search logs by workflow_id or trace_id to see complete request flow across all services.
- Proactive issue detection: Monitor errors and set alerts before users are affected, for example when workflows fail to run.
- Operational insights: Track which workflows and operators are used most, their average execution time, etc.
Impact / Priority
(P2) Medium – useful enhancement
Affected Area
Deployment / Infrastructure