Enhancement Request: Modern Observability for Texera #4070

@zuozhiw

Feature Summary

Texera currently lacks systematic observability instrumentation, which makes it difficult to monitor and debug live services and distributed workflows. This feature request proposes an observability solution based on OpenTelemetry standards to enable centralized logging, metrics collection, and distributed tracing across backend services. The resulting data can then be integrated with open-source observability tools.

Proposed Solution or Design

Current Observability Gaps

Logging:

  • Logback (Scala) and loguru (Python) with file/console output only
  • Logs are ephemeral in Kubernetes (lost when pods restart)
  • Cannot correlate logs across services for a single workflow execution
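To illustrate the correlation gap, here is a minimal sketch of structured logs that carry a workflow identifier on every record. It uses only the Python standard library rather than loguru or the OpenTelemetry SDK, and the names `workflow_id_var`, `build_logger`, and the logger name `texera.demo` are hypothetical, not taken from the Texera codebase.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the ID of the workflow being executed.
workflow_id_var = contextvars.ContextVar("workflow_id", default=None)


class WorkflowContextFilter(logging.Filter):
    """Copy the current workflow ID onto every log record."""

    def filter(self, record):
        record.workflow_id = workflow_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "workflow_id": getattr(record, "workflow_id", None),
        })


def build_logger(name, stream):
    """Create a logger that writes JSON lines with the workflow ID attached."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(WorkflowContextFilter())
    logger.handlers = [handler]
    return logger


def demo():
    """Log one message inside a workflow context and return the parsed record."""
    stream = io.StringIO()
    logger = build_logger("texera.demo", stream)
    token = workflow_id_var.set("wf-123")  # illustrative ID
    try:
        logger.info("operator started")
    finally:
        workflow_id_var.reset(token)
    return json.loads(stream.getvalue())
```

If every service emitted records in a shape like this, a log backend could filter on `workflow_id` to reassemble one execution across services. A real migration would presumably attach `trace_id` through OpenTelemetry instead of a hand-rolled filter.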

Metrics:

  • No application-level metrics (request rates, error rates, latency, database query times)
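As a rough illustration of what application-level metrics would capture, the sketch below tracks request count, error count, and latency per endpoint using only the standard library. The class `RequestMetrics` and the endpoint name in the usage note are hypothetical. An actual implementation would more likely use OpenTelemetry counters and histograms than in-memory dictionaries.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RequestMetrics:
    """In-memory request, error, and latency bookkeeping (illustrative only)."""

    def __init__(self):
        self.request_count = defaultdict(int)
        self.error_count = defaultdict(int)
        self.latency_seconds = defaultdict(list)

    @contextmanager
    def track(self, endpoint):
        """Time the wrapped block and record whether it raised."""
        started = time.perf_counter()
        try:
            yield
        except Exception:
            self.error_count[endpoint] += 1
            raise
        finally:
            # Runs for both successful and failed requests.
            self.request_count[endpoint] += 1
            self.latency_seconds[endpoint].append(time.perf_counter() - started)

    def error_rate(self, endpoint):
        """Fraction of tracked requests to this endpoint that raised."""
        total = self.request_count[endpoint]
        return self.error_count[endpoint] / total if total else 0.0
```

A handler would wrap its body as `with metrics.track("/api/workflow"): ...`, where the path is a made-up example.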

Tracing:

  • No distributed tracing implementation
  • Cannot trace a workflow execution across multiple services, Python workers, database queries, or external API calls
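Tracing across services depends on each hop forwarding a shared trace identifier. The sketch below shows the idea using the W3C Trace Context `traceparent` header format (`version-traceid-parentid-flags`), which OpenTelemetry uses by default for propagation. The helper names are hypothetical, and in practice the OpenTelemetry SDK handles this propagation itself.

```python
import re
import secrets

# traceparent: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace and return its traceparent header value."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the Trace Context specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {"trace_id": trace_id, "parent_id": parent_id, "flags": flags}


def child_traceparent(incoming):
    """Build the header for an outgoing call made while handling `incoming`.

    The trace ID is preserved so both spans belong to one trace. Only the
    span ID changes. A malformed incoming header starts a fresh trace.
    """
    ctx = parse_traceparent(incoming)
    if ctx is None:
        return new_traceparent()
    return f"00-{ctx['trace_id']}-{secrets.token_hex(8)}-{ctx['flags']}"
```

A service receiving a request would call `child_traceparent` on the incoming header before calling a Python worker or another service, so that all spans for one workflow execution share a `trace_id`.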

Health Checks:

  • Basic /api/healthcheck endpoints return {"status": "ok"} only
  • No checks of actual service health, and no detailed status reporting
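For comparison, a more detailed health report might look like the sketch below, which runs one probe per dependency and aggregates the results. The function `run_health_checks`, the component names, the `"degraded"` status value, and the placeholder probes are all assumptions for illustration. The issue does not specify which dependencies should be checked.

```python
import time


def run_health_checks(checks):
    """Run each named check and aggregate the results into one report.

    `checks` maps a component name to a zero-argument callable that
    raises an exception when the component is unhealthy.
    """
    components = {}
    for name, check in checks.items():
        started = time.perf_counter()
        try:
            check()
            status, error = "ok", None
        except Exception as exc:
            status, error = "fail", str(exc)
        components[name] = {
            "status": status,
            "latency_ms": round((time.perf_counter() - started) * 1000, 2),
            "error": error,
        }
    all_ok = all(c["status"] == "ok" for c in components.values())
    return {"status": "ok" if all_ok else "degraded", "components": components}


def check_database():
    """Placeholder probe. A real one might run a trivial query."""


def check_storage():
    """Placeholder probe that simulates an unreachable dependency."""
    raise ConnectionError("storage unreachable")
```

Calling `run_health_checks({"database": check_database, "storage": check_storage})` yields a report whose top-level status is `"degraded"`, with a per-component breakdown that an endpoint could return instead of a bare `{"status": "ok"}`.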

Proposed Solution

High-Level Approach

Add OpenTelemetry instrumentation throughout the codebase to emit logs, metrics, and traces in a standardized format. These signals can then be collected and exported to various open-source observability tools.

Implementation Strategy

Instrumentation Layer:

  • Add OpenTelemetry SDK to all services (Scala/Java and Python)
  • Add auto-instrumentation (no code changes) where possible (HTTP, JDBC, Akka)
  • Migrate current logging to use OpenTelemetry
  • Based on need and use cases, add manual instrumentation for metrics and traces

Collection Layer:

  • Deploy the OpenTelemetry Collector (as a DaemonSet in Kubernetes) to collect logs, metrics, and traces
  • The Collector can export to various backends (configurable, not hardcoded)

Observability Backends:

  • The standardized OpenTelemetry data can be integrated with open-source tools such as Grafana and Elastic.

Benefits

  1. Faster debugging: Search logs by workflow_id or trace_id to see the complete request flow across all services.
  2. Proactive issue detection: Monitor errors and set alerts before users are affected, for example when workflows fail to run.
  3. Operational insights: Track which workflows and operators are used most, their average execution time, etc.

Impact / Priority

(P2) Medium – useful enhancement

Affected Area

Deployment / Infrastructure
