Wall-clock profiling builds on decades of research in performance analysis that distinguishes on-CPU computation from off-CPU waiting. Curtsinger and Berger's Coz (SOSP'15) introduced causal profiling, which experimentally determines which code regions, if optimized, actually reduce end-to-end latency, addressing the fundamental question of where optimization effort pays off. Zhou et al.'s wPerf (OSDI'18) presented a generic off-CPU analysis framework that identifies, with low overhead, the critical waiting events (locks, I/O) that bound throughput. More recently, Ahn et al. (OSDI'24) unified on- and off-CPU analysis through blocked-sample profiling, which captures both running and blocked thread states.

The visualization techniques we employ draw from Gregg's flame graph methodology (CACM'16, USENIX ATC'17), which transforms stack-trace aggregations into intuitive hierarchical diagrams; his off-CPU flame graphs highlight blocking patterns by rendering sleep stacks in contrasting colors. Timing accuracy itself poses challenges: Najafi et al. (HotOS'21) argue that modern systems research increasingly depends on precise wall-clock measurements, and earlier work on time-sensitive Linux (Goel et al., OSDI'02) explored kernel techniques for low-latency timing under load.

Practical eBPF-based profiling has been demonstrated in production contexts, including Java profiling with off-CPU "offwaketime" analysis (ICPE'19) and comprehensive workflows outlined in recent eBPF performance tutorials (Gregg, SIGCOMM'24). Together, these techniques and tools provide the foundation for understanding where applications spend time and how to optimize holistically across both compute and blocking dimensions.
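To make the wall-clock (on-CPU plus off-CPU) sampling and flame-graph aggregation ideas concrete, the following is a minimal Python sketch, not taken from any of the cited tools; all function and variable names are illustrative. It periodically captures every thread's stack and aggregates the samples into the semicolon-delimited "folded stack" lines that Gregg's flamegraph.pl consumes. Because sampling is paced by wall-clock time, threads blocked in sleeps, I/O, or locks are counted just like running threads, so the resulting graph reflects both compute and blocking time.

```python
"""Toy wall-clock stack sampler emitting folded stacks (illustrative sketch)."""
import sys
import threading
import time
import traceback
from collections import Counter


def sample_folded_stacks(duration_s=5.0, interval_s=0.01):
    """Sample all thread stacks for `duration_s` seconds and return
    {folded_stack: sample_count}, suitable as input to flamegraph.pl."""
    counts = Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        for _thread_id, frame in sys._current_frames().items():
            # Walk the frame chain outermost-first and join function names
            # with ';' -- the folded-stack convention used by flamegraph.pl.
            frames = traceback.extract_stack(frame)
            folded = ";".join(f.name for f in frames)
            counts[folded] += 1
        time.sleep(interval_s)  # wall-clock pacing, not CPU-time pacing
    return counts


if __name__ == "__main__":
    # A worker that alternates between computing (on-CPU) and sleeping
    # (off-CPU); both phases accumulate wall-clock samples.
    def worker():
        end = time.monotonic() + 5.0
        while time.monotonic() < end:
            sum(i * i for i in range(50_000))  # on-CPU work
            time.sleep(0.02)                   # off-CPU wait

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    for stack, n in sample_folded_stacks(duration_s=5.0).items():
        print(f"{stack} {n}")  # pipe this output into flamegraph.pl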
```