Enhancement Request: Modern Observability for Texera

### Feature Summary

Texera currently lacks systematic observability instrumentation, which makes it difficult to monitor and debug live services and distributed workflows. This feature request proposes an observability solution based on OpenTelemetry standards that enables centralized logging, metrics collection, and distributed tracing across backend services. The resulting data can then be integrated with open-source observability tools.


### Proposed Solution or Design

### Current Observability Gaps

**Logging**:
- Logback (Scala) and loguru (Python) write to file/console output only
- Logs are ephemeral in Kubernetes (lost when pods restart)
- Cannot correlate logs across services for a single workflow execution
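To make the correlation gap concrete, here is a minimal sketch of attaching a workflow ID to every log line so that logs can be searched per execution. It uses only the Python standard library (`logging`, `contextvars`) rather than loguru or the OpenTelemetry SDK, and the names `current_workflow_id`, `WorkflowIdFilter`, `JsonFormatter`, and the `wf-123` ID are assumptions made for illustration, not existing Texera code.

```python
import contextvars
import io
import json
import logging

# Hypothetical context variable holding the ID of the workflow being executed.
current_workflow_id = contextvars.ContextVar("workflow_id", default="-")


class WorkflowIdFilter(logging.Filter):
    """Attach the current workflow ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.workflow_id = current_workflow_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so it is easy to index."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "workflow_id": record.workflow_id,
            "message": record.getMessage(),
        })


def build_logger(name: str, stream) -> logging.Logger:
    """Create a logger that writes correlated JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.addFilter(WorkflowIdFilter())
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def run_demo() -> list:
    """Log inside and outside a workflow context and return the parsed lines."""
    stream = io.StringIO()
    log = build_logger("texera.demo", stream)
    token = current_workflow_id.set("wf-123")
    try:
        log.info("operator started")
        log.info("operator finished")
    finally:
        current_workflow_id.reset(token)
    log.info("idle")
    return [json.loads(line) for line in stream.getvalue().splitlines()]
```

Calling `run_demo()` returns three parsed records: the first two carry the workflow ID that was set, and the last one falls back to the default. With OpenTelemetry in place, the same idea would likely carry a trace ID as well, so that logs and traces can be joined.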

**Metrics**:
- No application-level metrics (request rates, error rates, latency, database query times)

**Tracing**:
- No distributed tracing implementation
- Cannot trace a workflow execution across multiple services, Python workers, database queries, or external API calls
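As a rough illustration of what tracing across services involves, the sketch below builds and propagates a W3C Trace Context `traceparent` header value, which is the format OpenTelemetry uses by default for context propagation over HTTP. It is standard-library Python only; the function names are hypothetical, and a real deployment would rely on the OpenTelemetry propagators instead of hand-written code like this.

```python
import re
import secrets

# version "00", 32-hex trace ID, 16-hex parent span ID, 2-hex trace flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Start a new trace: fresh trace ID, fresh span ID, sampled flag set."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def child_traceparent(parent: str) -> str:
    """Continue an incoming trace: keep the trace ID, mint a new span ID.

    If the incoming header is malformed, start a new trace instead.
    This sketch does not reject all-zero IDs, which the spec treats as invalid.
    """
    match = TRACEPARENT_RE.match(parent)
    if match is None:
        return new_traceparent()
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"
```

A service receiving a request would read the incoming header, call something like `child_traceparent`, and forward the result on its outgoing calls. Every hop then shares one trace ID, which is what lets a backend stitch the spans of a single workflow execution together.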

**Health Checks**:
- Basic `/api/healthcheck` endpoints return `{"status": "ok"}` only
- No checks of actual service health and no detailed status reporting



## Proposed Solution

### High-Level Approach

Add **OpenTelemetry instrumentation** throughout the codebase to emit logs, metrics, and traces in a standardized format. These signals can then be collected and exported to various open-source observability tools.

### Implementation Strategy

**Instrumentation Layer**:
- Add OpenTelemetry SDK to all services (Scala/Java and Python)
- Use auto-instrumentation (no code changes) where possible (HTTP, JDBC, Akka)
- Migrate current logging to use OpenTelemetry
- Based on need and use cases, add manual instrumentation for metrics and traces
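To give a feel for what manual instrumentation for metrics might record, here is a small sketch of a decorator that counts calls and errors and accumulates latency. The in-memory `METRICS` dictionary is a stand-in for a real metrics backend, and `instrumented`, `compile_workflow`, and the metric name `workflow.compile` are hypothetical names chosen for the example. In practice the OpenTelemetry metrics API would replace the dictionary.

```python
import functools
import time
from collections import defaultdict

# In-memory stand-in for a metrics backend (not thread-safe; illustration only).
METRICS = defaultdict(lambda: {"calls": 0, "errors": 0, "total_seconds": 0.0})


def instrumented(name: str):
    """Record call count, error count, and total latency under the given name."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            stats = METRICS[name]
            stats["calls"] += 1
            started = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise  # re-raise so callers still see the failure
            finally:
                stats["total_seconds"] += time.perf_counter() - started

        return wrapper

    return decorator


@instrumented("workflow.compile")
def compile_workflow(plan: dict) -> int:
    """Hypothetical operation: return the number of operators in a plan."""
    if not plan.get("operators"):
        raise ValueError("workflow has no operators")
    return len(plan["operators"])
```

After one successful and one failing call to `compile_workflow`, `METRICS["workflow.compile"]` holds two calls and one error, from which request rate, error rate, and average latency can be derived.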

**Collection Layer**:
- Deploy the OpenTelemetry Collector (as a DaemonSet in Kubernetes) to collect logs, metrics, and traces
- Collector can export to various backends (configurable, not hardcoded)

**Observability Backends**:
- The standardized OpenTelemetry data can be integrated with open-source tools such as Grafana and Elastic.

## Benefits

1. **Faster debugging**: Search logs by workflow_id or trace_id to see the complete request flow across all services.
2. **Proactive issue detection**: Monitor errors and set alerts before users are affected, for example when workflows fail to run.
3. **Operational insights**: Track which workflows and operators are used most, average execution time, and similar usage statistics.

### Impact / Priority

(P2) Medium – useful enhancement

### Affected Area

Deployment / Infrastructure

Enhancement Request: Modern Observability for Texera #4070
