
Commit 29e8113

docs: add tutorial for tracking Intel NPU kernel driver operations using eBPF
1 parent 8c38c01 commit 29e8113

File tree

3 files changed (+295, -2 lines)


src/xpu/gpu-kernel-driver/README.md

Lines changed: 7 additions & 1 deletion
@@ -191,9 +191,15 @@ Verify tracepoints exist on your system before running scripts:
```
sudo cat /sys/kernel/debug/tracing/available_events | grep -E '(gpu_scheduler|i915|amdgpu|^drm:)'
```

## Limitations: Kernel Tracing vs GPU-Side Observability
This tutorial focuses on kernel-side GPU driver tracing, which provides visibility into job scheduling, memory management, and driver-firmware communication. However, kernel tracepoints have fundamental limitations. When `drm_run_job` fires, we know a job started executing on GPU hardware, but we cannot observe what happens inside the GPU itself. The execution of thousands of parallel threads, their memory access patterns, branch divergence, warp occupancy, and instruction-level behavior remain invisible. These details are critical for understanding performance bottlenecks - whether memory coalescing is failing, whether thread divergence is killing efficiency, or whether shared memory bank conflicts are stalling execution.
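
As a minimal illustration of how coarse this kernel-side view is, the bpftrace sketch below only reports that a job reached the hardware and how deep the queues are; nothing about the threads running inside the GPU is visible. The `gpu_scheduler:drm_run_job` event and its `id`, `name`, `job_count`, and `hw_job_count` fields are assumptions based on recent kernels - confirm them in `/sys/kernel/debug/tracing/events/gpu_scheduler/drm_run_job/format` before relying on it.

```bpftrace
// Sketch only: report each job as it starts on a hardware ring.
// Field names are assumed from the gpu_scheduler trace events on recent
// kernels; check the event's format file if this fails to attach or parse.
tracepoint:gpu_scheduler:drm_run_job
{
    // We learn which job started on which ring and the queue depths,
    // but nothing about warp occupancy, divergence, or memory behavior.
    printf("%s: run job id=%llu ring=%s queued=%u hw=%d\n",
        comm, args->id, str(args->name), args->job_count, args->hw_job_count);
}
```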
To achieve fine-grained GPU observability, eBPF programs must run directly on the GPU. This is the direction explored by the eGPU paper and the [bpftime GPU examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu). bpftime converts eBPF bytecode to PTX instructions that GPUs can execute, then dynamically patches CUDA binaries at runtime to inject these eBPF programs at CUDA kernel entry/exit points. This enables observing GPU-specific information such as block indices, thread indices, global timers, and warp-level metrics, so developers can instrument critical paths inside GPU kernels to measure execution behavior and diagnose performance issues that kernel-side tracing cannot reach. This GPU-internal observability complements kernel tracepoints: together they provide end-to-end visibility from API calls through kernel drivers to GPU execution.
## Summary

GPU kernel tracepoints provide zero-overhead visibility into driver internals. DRM scheduler's stable uAPI tracepoints work across all vendors for production monitoring. Vendor-specific tracepoints expose detailed memory management and command submission pipelines. The bpftrace script demonstrates tracking job scheduling, measuring latency, and identifying dependency stalls - all critical for diagnosing performance issues in games, ML training, and cloud GPU workloads. For GPU-internal observability beyond kernel tracing, explore bpftime's GPU eBPF capabilities.
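
The latency measurement mentioned above can be condensed into a short bpftrace sketch: timestamp each job when the scheduler accepts it (`drm_sched_job`) and build a per-ring histogram of queue-to-run latency when it starts on hardware (`drm_run_job`). The tracepoint and field names (`entity`, `id`, `name`) are assumptions based on the gpu_scheduler events discussed in this tutorial; verify the event names with the `available_events` check shown earlier and the field names in the corresponding `format` files.

```bpftrace
// Sketch: queue-to-run latency per ring, keyed by (entity, job id).
tracepoint:gpu_scheduler:drm_sched_job
{
    // Job accepted by the DRM scheduler; remember when it was queued.
    @queued[args->entity, args->id] = nsecs;
}

tracepoint:gpu_scheduler:drm_run_job
{
    // Job starts on the hardware; record how long it sat in the queue.
    if (@queued[args->entity, args->id]) {
        @queue_to_run_us[str(args->name)] =
            hist((nsecs - @queued[args->entity, args->id]) / 1000);
        delete(@queued[args->entity, args->id]);
    }
}
```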

> If you'd like to dive deeper into eBPF, check out our tutorial repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> or visit our website at <https://eunomia.dev/tutorials/>.

src/xpu/gpu-kernel-driver/README.zh.md

Lines changed: 7 additions & 1 deletion
@@ -191,9 +191,15 @@ Waits per ring:
```
sudo cat /sys/kernel/debug/tracing/available_events | grep -E '(gpu_scheduler|i915|amdgpu|^drm:)'
```

## Limitations: Kernel Tracing vs GPU-Side Observability
This tutorial focuses on kernel-side GPU driver tracing, which gives us visibility into job scheduling, memory management, and driver-firmware communication. However, kernel tracepoints have fundamental limitations. When `drm_run_job` fires, we know a job has started executing on the GPU hardware, but we cannot observe what actually happens inside the GPU. The execution of thousands of parallel threads, their memory access patterns, branch divergence, warp occupancy, and instruction-level behavior are all invisible. These details are critical for understanding performance bottlenecks - whether memory coalescing is failing, whether thread divergence is hurting efficiency, or whether shared memory bank conflicts are stalling execution.

To achieve fine-grained GPU observability, eBPF programs must run directly on the GPU. This is the direction explored by the eGPU paper and the [bpftime GPU examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu). bpftime converts eBPF bytecode into PTX instructions that the GPU can execute, then dynamically patches CUDA binaries at runtime to inject these eBPF programs at CUDA kernel entry/exit points. This lets developers observe GPU-specific information such as block indices, thread indices, global timers, and warp-level metrics, instrument critical paths inside GPU kernels, and diagnose performance problems that kernel-side tracing cannot reach. This GPU-internal observability complements kernel tracepoints: together they provide end-to-end visibility from API calls through the kernel driver to GPU execution.
## Summary

GPU kernel tracepoints provide zero-overhead visibility into driver internals. The DRM scheduler's stable uAPI tracepoints work across all vendors and are suitable for production monitoring. Vendor-specific tracepoints expose detailed memory management and command submission pipelines. The bpftrace script demonstrates tracking job scheduling, measuring latency, and identifying dependency stalls - all of which are critical for diagnosing performance problems in games, ML training, and cloud GPU workloads. For GPU-internal observability beyond kernel tracing, explore bpftime's GPU eBPF capabilities.

> If you'd like to dive deeper into eBPF, check out our tutorial repository at <https://github.com/eunomia-bpf/bpf-developer-tutorial> or visit our website at <https://eunomia.dev/tutorials/>.
