Description
Component
Instrumentation: host
Problem Statement
Linux kernel 4.20+ provides Pressure Stall Information (PSI) metrics that show resource contention and system performance bottlenecks. PSI tracks how much time tasks spend stalled waiting for CPU, memory, and I/O, and reports two values per resource: "some" (at least one task stalled) and "full" (all non-idle tasks stalled at the same time).
Currently, the OpenTelemetry Go host metrics instrumentation does not collect PSI metrics, which are increasingly used by modern observability platforms and are particularly valuable for:
- Detecting resource saturation before traditional utilization metrics show problems
- Identifying performance degradation in containerized environments
- Understanding the real impact of resource limits on application performance
- Proactive capacity planning and alerting
These metrics are available at /proc/pressure/{cpu,memory,io} and are already widely adopted by tools like systemd, Facebook's resource management systems, and various monitoring solutions.
Proposed Solution
Implement a new PSI metrics collector within the host metrics instrumentation package that:
- Reads PSI files from /proc/pressure/ for CPU, memory, and I/O
- Parses the format:
    some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    full avg10=0.00 avg60=0.00 avg300=0.00 total=67890
- Exposes metrics following OpenTelemetry semantic conventions:
  - system.psi.cpu.some.pct - Percentage of time some processes stalled on CPU
  - system.psi.cpu.full.pct - Percentage of time all processes stalled on CPU
  - system.psi.memory.some.pct - Memory pressure (some)
  - system.psi.memory.full.pct - Memory pressure (full)
  - system.psi.io.some.pct - I/O pressure (some)
  - system.psi.io.full.pct - I/O pressure (full)
- Implementation considerations:
  - Make PSI collection opt-in or configurable
  - Add appropriate unit tests and documentation
  - Consider rate of collection (PSI files are updated every 2 seconds by the kernel)
Alternatives
No response
Prior Art
No response
Additional Context
- Linux PSI documentation: https://docs.kernel.org/accounting/psi.html
- PSI has been backported to some older LTS kernels
- Prometheus node_exporter already exposes these metrics: https://github.com/prometheus/node_exporter