
Commit c9d3d65

docs: enhance README with detailed explanation of CUDA API tracing and eBPF integration
1 parent 7045170

2 files changed: +17 −33 lines

src/47-cuda-events/README.md

Lines changed: 7 additions & 15 deletions
@@ -4,25 +4,15 @@ Have you ever wondered what's happening under the hood when your CUDA applicatio
 ## Introduction to CUDA and GPU Tracing

-CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose processing. When you run a CUDA application, several things happen behind the scenes:
+CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model that enables developers to use NVIDIA GPUs for general-purpose processing. When you run a CUDA application, a typical workflow begins with the host (CPU) allocating memory on the device (GPU); data is then transferred from host memory to device memory, GPU kernels (functions) are launched to process the data, the results are transferred back from device to host, and finally the device memory is freed.
 
-1. The host (CPU) allocates memory on the device (GPU)
-2. Data is transferred from host to device memory
-3. GPU kernels (functions) are launched to process the data
-4. Results are transferred back from device to host
-5. Device memory is freed
+Each operation in this process involves CUDA API calls, such as `cudaMalloc` for memory allocation, `cudaMemcpy` for data transfer, and `cudaLaunchKernel` for kernel execution. Tracing these calls can provide valuable insight for debugging and performance optimization, but it isn't straightforward: GPU operations are asynchronous, meaning the CPU can continue executing after submitting work to the GPU without waiting, and traditional debugging tools often can't penetrate this asynchronous boundary to access GPU internal state.
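To make this workflow concrete, here is a minimal CUDA program (an editorial illustration, not part of the commit) that exercises exactly the API calls named above; the kernel, sizes, and launch configuration are illustrative assumptions:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Trivial kernel whose launch goes through cudaLaunchKernel in the runtime.
__global__ void scale(float *data, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= factor;
}

int main(void) {
    const int n = 1024;
    float host[1024];
    for (int i = 0; i < n; i++) host[i] = 1.0f;

    float *dev = NULL;
    cudaMalloc((void **)&dev, n * sizeof(float));   // traced: cudaMalloc
    cudaMemcpy(dev, host, n * sizeof(float),
               cudaMemcpyHostToDevice);             // traced: cudaMemcpy (host to device)
    scale<<<n / 256, 256>>>(dev, 2.0f);             // traced: cudaLaunchKernel
    cudaMemcpy(host, dev, n * sizeof(float),
               cudaMemcpyDeviceToHost);             // traced: cudaMemcpy (device to host)
    cudaFree(dev);                                  // traced: cudaFree
    printf("host[0] = %f\n", host[0]);              // expect 2.0
    return 0;
}
```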
 
-Each of these operations involves CUDA API calls like `cudaMalloc`, `cudaMemcpy`, and `cudaLaunchKernel`. Tracing these calls can provide valuable insights for debugging and performance optimization, but this isn't straightforward. GPU operations happen asynchronously, and traditional debugging tools often can't access GPU internals.
+This is where eBPF comes to the rescue! By using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (`libcudart.so`) before they reach the GPU driver, capturing critical information along the way. This approach gives us deep insight into memory allocation sizes and patterns, data transfer directions and volumes, kernel launch parameters, error codes and failure reasons returned by the API, and precise timing for each operation. By intercepting these calls on the CPU side, we can build a complete view of an application's GPU usage without modifying application code or relying on proprietary profiling tools.
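As a sketch of how such a uprobe can be expressed with libbpf (an editorial illustration, not the commit's code; the library path and the choice of `cudaMalloc` as the target are assumptions):

```c
// cuda_trace.bpf.c: minimal uprobe sketch targeting cudaMalloc.
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// The attach target is resolved by the user-space loader, e.g. against
// /usr/local/cuda/lib64/libcudart.so (the path is an assumption and
// varies by installation).
SEC("uprobe")
int BPF_KPROBE(trace_cuda_malloc, void **dev_ptr, size_t size)
{
    // Log the requested allocation size on every cudaMalloc call.
    bpf_printk("cudaMalloc: size=%lu", (unsigned long)size);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

For quick experiments, a bpftrace one-liner attached to the same symbol in `libcudart.so` works just as well.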
 
-This is where eBPF comes to the rescue! By using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (`libcudart.so`) before they reach the GPU. This gives us visibility into:
+This tutorial primarily focuses on CPU-side CUDA API tracing, which provides a macro-level view of how an application interacts with the GPU. CPU-side tracing alone, however, has clear limitations. When a CUDA API function like `cudaLaunchKernel` is called, it merely submits a work request to the GPU: we can see when the kernel was launched, but not what actually happens inside the GPU. Critical details, such as how thousands of threads access memory, their execution patterns, branching behavior, and synchronization operations, remain invisible. These details are crucial for understanding performance bottlenecks, for example whether memory access patterns prevent coalesced access or whether severe thread divergence reduces execution efficiency.
 
-- Memory allocation sizes and patterns
-- Data transfer directions and sizes
-- Kernel launch parameters
-- Error codes and failures
-- Timing of operations
-
-This blog mainly focuses on the CPU side of the CUDA API calls, for fined-grained tracing of GPU operations, you can see [eGPU](https://dl.acm.org/doi/10.1145/3723851.3726984) paper and [bpftime](https://github.com/eunomia-bpf/bpftime) project.
+To achieve fine-grained tracing of GPU operations, eBPF programs need to run directly on the GPU. This is exactly what the eGPU paper and the [bpftime GPU examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu) explore. bpftime converts eBPF programs into PTX instructions that GPUs can execute, then dynamically modifies CUDA binaries at runtime to inject these eBPF programs at kernel entry and exit points, enabling observation of GPU-internal behavior. This approach lets developers access GPU-specific information such as block indices, thread indices, and global timers, and take measurements on critical paths during kernel execution. Such GPU-internal observability is essential for diagnosing complex performance issues, understanding kernel execution behavior, and optimizing GPU computation, and it is something CPU-side tracing simply cannot provide.
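For a flavor of what GPU-side probes look like, the sketch below (an editorial illustration, not part of the commit) mimics the style of the bpftime GPU examples; the mangled kernel symbol is an assumed example target, and the real helper API should be checked against the bpftime repository:

```c
// gpu_probe.bpf.c: sketch of probes that bpftime would compile to PTX and
// inject at the entry and exit of a matched CUDA kernel. The symbol
// _Z9vectorAddPKfS0_Pf (a mangled vectorAdd kernel) is an assumption.
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("kprobe/_Z9vectorAddPKfS0_Pf")
int probe__vector_add_entry(void *ctx)
{
    // Runs at kernel entry once injected; bpftime also exposes GPU-specific
    // helpers for block/thread indices and global timers, omitted here
    // because their exact names are not confirmed.
    bpf_printk("vectorAdd: kernel entry");
    return 0;
}

SEC("kretprobe/_Z9vectorAddPKfS0_Pf")
int probe__vector_add_exit(void *ctx)
{
    bpf_printk("vectorAdd: kernel exit");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```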
 
 ## Key CUDA Functions We Trace

@@ -504,5 +494,7 @@ The code of this tutorial is in [https://github.com/eunomia-bpf/bpf-developer-tu
 - NVIDIA CUDA Runtime API: [https://docs.nvidia.com/cuda/cuda-runtime-api/](https://docs.nvidia.com/cuda/cuda-runtime-api/)
 - libbpf Documentation: [https://libbpf.readthedocs.io/](https://libbpf.readthedocs.io/)
 - Linux uprobes Documentation: [https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt](https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt)
+- eGPU: eBPF on GPUs: <https://dl.acm.org/doi/10.1145/3723851.3726984>
+- bpftime GPU Examples: <https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu>

 If you'd like to dive deeper into eBPF, check out our tutorial repository at [https://github.com/eunomia-bpf/bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial) or visit our website at [https://eunomia.dev/tutorials/](https://eunomia.dev/tutorials/).

src/47-cuda-events/README.zh.md

Lines changed: 10 additions & 18 deletions
@@ -1,28 +1,14 @@
-# eBPF and Machine Learning Observability: Tracing CUDA GPU Operations
+# eBPF Tutorial: Tracing CUDA GPU Operations
 
 Have you ever wondered what happens under the hood when your CUDA application runs? GPU operations take place on a device with its own separate memory space, which makes debugging and performance analysis extremely difficult. In this tutorial, we build a powerful eBPF-based tracing tool that lets you watch CUDA API calls in real time.
 
 ## Introduction to CUDA and GPU Tracing

-CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model, which enables developers to use NVIDIA GPUs for general-purpose computing. When you run a CUDA application, the following steps happen behind the scenes:
+CUDA (Compute Unified Device Architecture) is NVIDIA's parallel computing platform and programming model, which enables developers to use NVIDIA GPUs for general-purpose computing. When you run a CUDA application, a typical workflow begins with the host (CPU) allocating memory on the device (GPU); data is then transferred from host memory to device memory, GPU kernels (functions) are launched to process the data, the results are copied back from device to host, and finally the device memory is freed.
 
-1. The host (CPU) allocates memory on the device (GPU)
-2. Data is transferred from host memory to device memory
-3. GPU kernels (functions) are launched to process the data
-4. Results are transferred back from device to host
-5. Device memory is freed
+Each operation in this process involves CUDA API calls, such as `cudaMalloc` for memory allocation, `cudaMemcpy` for data transfer, and `cudaLaunchKernel` for launching kernels. Tracing these calls can yield valuable information for debugging and performance optimization, but doing so is not simple: GPU operations are asynchronous, meaning the CPU can continue executing after submitting work to the GPU, and traditional debugging tools usually cannot penetrate this asynchronous boundary to reach GPU internal state.
 
-Each of these operations involves CUDA API calls such as `cudaMalloc`, `cudaMemcpy`, and `cudaLaunchKernel`. Tracing these calls can yield valuable information for debugging and performance optimization, but this is not simple. GPU operations are asynchronous, and traditional debugging tools usually cannot access GPU internals.
-
-This is where eBPF comes in! Using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (`libcudart.so`) before they reach the GPU. This lets us learn about:
-
-- Memory allocation sizes and patterns
-- Data transfer directions and sizes
-- Kernel launch parameters
-- Error codes and failure reasons
-- Timing information for operations
-
-This tutorial mainly focuses on CPU-side CUDA API calls; for fine-grained tracing of GPU operations, see the [eGPU](https://dl.acm.org/doi/10.1145/3723851.3726984) paper and the [bpftime](https://github.com/eunomia-bpf/bpftime) project.
+This is where eBPF comes in! Using uprobes, we can intercept CUDA API calls in the user-space CUDA runtime library (`libcudart.so`) and capture key information before the calls reach the GPU driver. This approach gives us deep insight into memory allocation sizes and patterns, the direction and size of data transfers, the parameters used when launching kernels, the error codes and failure reasons returned by the API, and precise timing for each operation. By intercepting these calls on the CPU side, we can build a complete view of an application's GPU usage without modifying application code or relying on proprietary profiling tools.
 
 ## eBPF Background and the Challenges of GPU Tracing

@@ -32,6 +18,10 @@ eBPF (Extended Berkeley Packet Filter) was originally designed for network packet filtering
 
 GPU tracing poses unique challenges. Modern GPUs are highly parallel processors containing thousands of small compute cores that can execute tens of thousands of threads simultaneously. GPUs have their own memory hierarchy, including global, shared, constant, and texture memory, and the access patterns to these memories have a huge impact on performance. To complicate matters further, GPU operations are usually asynchronous: once the CPU launches a GPU operation, it can continue with other tasks without waiting for the operation to complete. The asynchronous nature of the CUDA programming model also makes debugging particularly difficult. While a kernel function executes on the GPU, the CPU cannot directly observe the GPU's internal state. Errors may occur on the GPU but only be detected at a later synchronization point (such as cudaDeviceSynchronize or cudaStreamSynchronize), making it hard to locate the source of an error. In addition, GPU memory errors (such as out-of-bounds array accesses) can cause silent data corruption rather than an immediate crash, which further increases the complexity of debugging.

+This tutorial focuses mainly on CPU-side CUDA API calls, which gives us a macro-level view of how an application interacts with the GPU. However, tracing only on the CPU side has clear limitations. When a CUDA API function like `cudaLaunchKernel` is called, it merely submits a work request to the GPU; we can see when the kernel was launched, but we cannot observe what actually happens inside the GPU. Critical details, such as how the thousands of threads running on the GPU access memory, their execution patterns, branching behavior, and synchronization operations, all remain invisible. These details are crucial for understanding performance bottlenecks, for example whether memory access patterns prevent coalesced access or whether severe thread divergence reduces execution efficiency.
+
+Fine-grained tracing of GPU operations requires running eBPF programs directly on the GPU. This is exactly the direction explored by the eGPU paper and the [bpftime GPU examples](https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu). bpftime converts eBPF programs into PTX instructions that the GPU can execute, then dynamically modifies the CUDA binary at runtime to inject these eBPF programs at the entry and exit points of GPU kernels, enabling observation of GPU-internal behavior. With this approach, developers can access GPU-specific information such as block indices, thread indices, and global timers, and take measurements along the critical paths of kernel execution. This GPU-internal observability is essential for diagnosing complex performance problems, understanding kernel execution behavior, and optimizing GPU computation, and it is beyond the reach of CPU-side tracing.
 
 ## Key CUDA Functions We Trace
 
 Our tracing tool monitors several key CUDA functions that represent the main operations in GPU computing. Understanding these functions helps interpret the trace results and diagnose problems in CUDA applications:
@@ -511,5 +501,7 @@ cudaFree: 0.00 µs
 - NVIDIA CUDA Runtime API: [https://docs.nvidia.com/cuda/cuda-runtime-api/](https://docs.nvidia.com/cuda/cuda-runtime-api/)
 - libbpf Documentation: [https://libbpf.readthedocs.io/](https://libbpf.readthedocs.io/)
 - Linux uprobes Documentation: [https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt](https://www.kernel.org/doc/Documentation/trace/uprobetracer.txt)
+- eGPU: eBPF on GPUs: <https://dl.acm.org/doi/10.1145/3723851.3726984>
+- bpftime GPU Examples: <https://github.com/eunomia-bpf/bpftime/tree/master/example/gpu>
 
 If you'd like to dive deeper into eBPF, check out our tutorial repository at [https://github.com/eunomia-bpf/bpf-developer-tutorial](https://github.com/eunomia-bpf/bpf-developer-tutorial) or visit our website at [https://eunomia.dev/tutorials/](https://eunomia.dev/tutorials/).
