Tutorial Github repo: https://github.com/eunomia-bpf/nvbit-tutorial
Official NVBit repo: https://github.com/NVlabs/NVBit
This repository provides a comprehensive, unofficial, blog-style tutorial for learning the NVIDIA Binary Instrumentation Tool (NVBit). It offers detailed, step-by-step guidance with in-depth explanations of the code to help you understand GPU binary instrumentation concepts and techniques.
NVBit is covered by the same End User License Agreement as the NVIDIA CUDA Toolkit. By using NVBit you agree to the End User License Agreement described in the EULA.txt file.
- Quick Start (5 Minutes)
- About This Tutorial Repository
- Prerequisites
- Introduction to NVBit
- Requirements
- Building the Tools
- Using NVBit Tools
- Key Concepts
- Creating Your Own Tools
- FAQ
- Troubleshooting
- Contributing
Get started with NVBit in just a few minutes:
```bash
# Ensure CUDA toolkit is installed and nvdisasm is in PATH
export PATH=/usr/local/cuda/bin:$PATH

# Clone this repository (if you haven't already)
git clone https://github.com/eunomia-bpf/nvbit-tutorial.git
cd nvbit-tutorial

# Build all tools
cd tools && make && cd ..

# Build test applications
cd test-apps && make && cd ..

# Run your first instrumentation (instruction counting)
LD_PRELOAD=./tools/instr_count/instr_count.so ./test-apps/vectoradd/vectoradd
```

You should see output like:

```
kernel 0 - vecAdd(...) - #thread-blocks 98, kernel instructions 50077, total instructions 50077
```

Next Steps: Read the instr_count tutorial, try opcode_hist to analyze the instruction mix, or see the FAQ if you encounter issues.
This tutorial repository goes beyond basic examples with:
- Detailed blog-style documentation for each tool, with comprehensive code explanations
- Step-by-step implementation guides showing how each tool is built
- Visual diagrams and examples to illustrate key concepts
- Best practices and performance considerations
- Extension ideas for developing your own custom tools
The repository contains:
- core/: the core NVBit library and header files
- tools/: practical instrumentation tools with detailed explanations
- test-apps/: simple CUDA applications to demonstrate the tools
Each tool in the tools/ directory includes a comprehensive tutorial README that walks through the code line-by-line, explains the build process, and describes how to interpret the output.
CUDA Toolkit (>= 12.0) - Download from the NVIDIA CUDA Toolkit page. Verify the installation: `nvcc --version`
nvdisasm (included with CUDA Toolkit)
```bash
# Add CUDA bin directory to PATH
export PATH=/usr/local/cuda/bin:$PATH

# Or for a specific version:
export PATH=/usr/local/cuda-12.8/bin:$PATH

# Verify nvdisasm is accessible
which nvdisasm
```

NVBit requires nvdisasm to be in your PATH. Without it, tools will fail to run.
GCC (>= 8.5.0 for x86_64; >= 8.5.0 for aarch64) - Verify: gcc --version
Make - Verify: make --version
- GPU Compute Capability: SM 3.5 to SM 12.1
- Architecture: x86_64, aarch64 (ARM64)
Add these to your ~/.bashrc or ~/.zshrc:
```bash
# CUDA paths
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Then reload: `source ~/.bashrc`
NVBit (NVIDIA Binary Instrumentation Tool) is a research prototype of a dynamic binary instrumentation library for NVIDIA GPUs.
NVBit provides a set of simple APIs that enable writing a variety of instrumentation tools. Examples of instrumentation tools are dynamic instruction counters, instruction tracers, memory reference tracers, and profiling tools.
NVBit allows writing instrumentation tools (which we call NVBit tools) that can inspect and modify the assembly code (SASS) of a GPU application without requiring recompilation; this is what makes it dynamic. NVBit lets instrumentation tools inspect the SASS instructions of each function (__global__ or __device__) as it is loaded into the GPU for the first time. During this phase it is possible to inject one or more instrumentation calls to arbitrary device functions before (or after) a SASS instruction. It is also possible to remove SASS instructions, although in that case NVBit does not guarantee that the application will continue to work correctly.
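To make the idea of "injecting a call to a device function" concrete, here is a sketch of what such an injected function looks like, modeled on the instr_count example's inject_funcs.cu (treat it as an illustrative outline, not the tutorial's exact code; `get_laneid` is a helper from NVBit's utils headers):

```cuda
// Injected before every SASS instruction; arguments are supplied at
// instrumentation time by the host-side tool code.
extern "C" __device__ __noinline__ void count_instrs(int predicate,
                                                     int count_warp_level,
                                                     uint64_t pcounter) {
    // Masks of active threads and of threads whose guard predicate is true
    const int active_mask = __ballot_sync(__activemask(), 1);
    const int predicate_mask = __ballot_sync(__activemask(), predicate);
    const int laneid = get_laneid();
    const int first_laneid = __ffs(active_mask) - 1;
    const int num_threads = __popc(predicate_mask);
    // Only the first active lane performs the atomic update
    if (first_laneid == laneid) {
        if (count_warp_level) {
            if (num_threads > 0) atomicAdd((unsigned long long*)pcounter, 1);
        } else {
            atomicAdd((unsigned long long*)pcounter, num_threads);
        }
    }
}
NVBIT_EXPORT_FUNC(count_instrs);
```

Because this code runs between every pair of application instructions, it is deliberately minimal: one ballot, one popcount, and at most one atomic per warp.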
NVBit tries to keep overhead as low as possible, although any injected instrumentation function has an associated cost due to saving and restoring application state before and after jumping to/from it.
Because NVBit does not require application source code, any pre-compiled GPU application should work regardless of which compiler (or version) was used (e.g., nvcc, pgicc).
- SM compute capability: >= 3.5 && <= 12.1
- Host CPU: x86_64, aarch64
- OS: Linux
- GCC version: >= 8.5.0 for x86_64; >= 8.5.0 for aarch64
- CUDA version: >= 12.0
- CUDA driver version: <= 575.xx
| Component | Supported Range | Tested Version |
|---|---|---|
| SM Architecture | 3.5 - 12.1 | 12.0 (RTX 5090) |
| CUDA Toolkit | 12.0+ | 12.8 |
| NVIDIA Driver | 520.xx - 575.xx | 575.57.08 |
| GCC | 8.5.0 - 14.x | 14.2.0 |
| Operating System | Linux (Ubuntu 20.04+, RHEL 8+, etc.) | Ubuntu 24.04.3 LTS |
This repository has been successfully tested on the following configuration:
| Component | Version/Details |
|---|---|
| GPU | NVIDIA GeForce RTX 5090 (SM 12.0) |
| CUDA Toolkit | 12.8 (V12.8.93) |
| NVIDIA Driver | 575.57.08 |
| NVBit Version | 1.7.6 |
| Operating System | Ubuntu 24.04.3 LTS |
| Kernel | Linux 6.14.0-1007-intel |
| GCC | 14.2.0 |
| Architecture | x86_64 |
Build Configuration:
- Tools linked with g++ (not nvcc) to avoid CUDA 12.8+ device linking issues
- Test applications compiled with the `-cudart shared` flag
- All 8 example tools compile successfully
- 7 out of 8 tools tested and working (mem_printf2 has a known device function issue; see FAQ)
This repository uses NVBit v1.7.6 which includes support for newer CUDA versions and SM architectures up to SM_120.
NVBit is provided in this repository:
- core/: the main static library libnvbit.a and header files (including nvbit.h, which declares all the main NVBit APIs)
- tools/: various source code examples with detailed tutorial documentation
- test-apps/: simple applications to test NVBit tools (such as a vector addition program)

After learning from these examples, you can copy and modify one to create your own tool.
```bash
cd tools
make
```

This compiles all 8 NVBit tools into shared libraries (.so files).
```bash
cd test-apps
make
```

This compiles the test applications that you can use to try out the tools.
To build a single tool:

```bash
cd tools/instr_count
make
```

To clean up:

```bash
# Clean all tools
cd tools && make clean

# Clean test apps
cd test-apps && make clean
```

Why we use g++ instead of nvcc for linking:
CUDA 12.8+ introduced stricter device code linking requirements. To avoid "undefined reference to device functions" errors, we:
- Compile device code with `nvcc -dc` (device code compilation)
- Link the final shared library with g++ (not nvcc)
- Include the necessary CUDA libraries explicitly: `-lcuda -lcudart_static -lpthread -ldl`
Makefile Pattern:
```makefile
# Compile device functions
$(NVCC) -dc inject_funcs.cu -o inject_funcs.o

# Link with g++ (NOT nvcc)
g++ -shared -o tool.so $(OBJECTS) -L$(CUDA_LIB) -lcuda -lcudart_static
```

The provided Makefiles handle this automatically, but if you create your own tool, follow this pattern.
Test Application Compilation:
- Use the `-cudart shared` flag: `nvcc -cudart shared your_app.cu -o your_app`
- This ensures the CUDA runtime is dynamically linked
If you modify device code (inject_funcs.cu), you must rebuild the entire tool:
```bash
cd tools/your_tool
make clean
make
```

Before running an NVBit tool, make sure nvdisasm is in your PATH. On Ubuntu distributions, this is typically done by adding /usr/local/cuda/bin or /usr/local/cuda-"version"/bin to the PATH environment variable.
To use an NVBit tool, either LD_PRELOAD the tool before the application command:
```bash
LD_PRELOAD=./tools/instr_count/instr_count.so ./test-apps/vectoradd/vectoradd
```

Or use CUDA_INJECTION64_PATH:

```bash
CUDA_INJECTION64_PATH=./tools/instr_count/instr_count.so ./test-apps/vectoradd/vectoradd
```

NOTE: NVBit uses the same injection mechanism as nvprof, Nsight Systems, and Nsight Compute, so it cannot be used together with them.
Throughout the tutorial, you'll learn important concepts in GPU binary instrumentation:
- SASS instruction analysis: understanding GPU assembly
- Function instrumentation: adding code to existing GPU functions
- Basic block analysis: working with control flow graphs
- Memory access tracking: capturing and analyzing memory patterns
- Efficient GPU-CPU communication
- Register manipulation: reading and writing GPU registers directly
- Instruction replacement: modifying GPU code behavior
- Performance optimization: minimizing instrumentation overhead
After working through the examples, you'll be ready to create your own custom instrumentation tools. The repository includes templates and guidance for:
- Tool structure: understanding host/device code organization
- Build systems: setting up Makefiles
- Common patterns: reusing code for frequently needed functionality
- Debugging techniques: troubleshooting instrumentation issues
What is NVBit? NVBit is a research tool that lets you analyze and modify GPU code at the binary level without recompiling.
Do I need source code? No! NVBit works on compiled CUDA binaries.
Can I use NVBit with nvprof or Nsight? No. They use the same injection mechanism and cannot run simultaneously.
Which GPUs are supported? SM 3.5 to SM 12.1 (Kepler to Blackwell architecture).
Which tool should I start with?
Start with instr_count, then try opcode_hist.
The tool produces too much output. How do I reduce it?
```bash
# Instrument only the first kernel
KERNEL_BEGIN=0 KERNEL_END=1 LD_PRELOAD=./tools/instr_count/instr_count.so ./app

# Instrument only the first 100 instructions
INSTR_END=100 LD_PRELOAD=./tools/instr_count/instr_count.so ./app
```

My application runs 100x slower with instrumentation
Normal for instruction-level tools. Use instr_count_bb for lower overhead, or instrument selectively.
What's the overhead of each tool?
- instr_count_bb: 2-5x
- instr_count_cuda_graph: 5-20x
- instr_count/opcode_hist: 20-100x
- mem_trace/mov_replace/record_reg_vals: 100-1000x
Can I use these tools in production?
Only instr_count_bb and instr_count_cuda_graph have low enough overhead.
mem_printf2 doesn't work - why? Known device function call issue. It's included as an educational example.
What's the difference between warp-level and thread-level counting?
- Warp-level (default): Counts 1 per warp (32 threads)
- Thread-level: Counts each thread separately (32x higher)
Set `COUNT_WARP_LEVEL=0` for thread-level counting.
How do I create my own tool?
- Copy an existing tool directory (e.g., instr_count)
- Modify the host code and the device code (inject_funcs.cu)
- Update the Makefile
- Rebuild with `make`
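To make those steps concrete, here is a heavily abridged sketch of the host side of an instruction-counting tool, using the core NVBit callbacks (modeled on the instr_count example; bookkeeping, counter allocation, and error handling are omitted, so treat it as a structural outline rather than a working tool):

```cpp
#include "nvbit.h"       // core NVBit APIs (see core/nvbit.h)
#include "nvbit_tool.h"

// Instrument every SASS instruction of a function the first time we see it
void instrument_function(CUcontext ctx, CUfunction func) {
    for (Instr* instr : nvbit_get_instrs(ctx, func)) {
        // Insert a call to the device function "count_instrs"
        // (defined in inject_funcs.cu) before the instruction
        nvbit_insert_call(instr, "count_instrs", IPOINT_BEFORE);
        // Pass the guard predicate value and a constant argument
        nvbit_add_call_arg_guard_pred_val(instr);
        nvbit_add_call_arg_const_val32(instr, /*count_warp_level=*/1);
    }
}

// NVBit callback: intercept CUDA driver events such as kernel launches
void nvbit_at_cuda_event(CUcontext ctx, int is_exit, nvbit_api_cuda_t cbid,
                         const char* name, void* params, CUresult* pStatus) {
    if (cbid == API_CUDA_cuLaunchKernel && !is_exit) {
        cuLaunchKernel_params* p = (cuLaunchKernel_params*)params;
        instrument_function(ctx, p->f);
        // Run the instrumented version of the kernel
        nvbit_enable_instrumented(ctx, p->f, true);
    }
}
```

A real tool also passes the address of a counter (e.g., via nvbit_add_call_arg_const_val64), caches which functions are already instrumented, and prints results from nvbit_at_term; the instr_count README walks through those details.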
Problem: NVBit cannot find the nvdisasm tool.
Solution:
```bash
# Add CUDA bin to PATH
export PATH=/usr/local/cuda/bin:$PATH

# Verify it works
which nvdisasm
```

Add this to your ~/.bashrc to make it permanent.
Problem: CUDA 12.8+ linking error when building tools.
Solution: Make sure you're using g++ for linking (not nvcc). Our Makefiles already handle this, but if you're creating a custom tool:
```bash
# Use g++ for linking, NOT nvcc
g++ -shared -o mytool.so $(OBJECTS) -L$(CUDA_LIB) -lcuda -lcudart_static
```

Problem: The tool produces no output.

Possible Causes: Check if nvdisasm is in PATH, verify that the tool compiled successfully (check for the .so file), or try with TOOL_VERBOSE=1:

```bash
TOOL_VERBOSE=1 LD_PRELOAD=./tools/instr_count/instr_count.so ./your_app
```

Problem: The application crashes or fails to run under instrumentation.

Possible Causes: GPU compute capability not supported (need SM 3.5+), driver version too old or too new (need <= 575.xx), or concurrent use with nvprof/nsight (NVBit cannot run alongside these tools).
Solution:

```bash
# Check GPU compute capability
nvidia-smi --query-gpu=compute_cap --format=csv

# Check driver version
nvidia-smi
```

Problem: CUDA libraries are not found at runtime.

Solution:

```bash
# Add CUDA library path
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```

Debugging Steps: Verify the app works without instrumentation (`./your_app`), test with the simplest tool (`LD_PRELOAD=./tools/instr_count/instr_count.so ./your_app`), or use selective instrumentation:

```bash
KERNEL_BEGIN=0 KERNEL_END=1 LD_PRELOAD=./tools/instr_count/instr_count.so ./your_app
```

Read the tool-specific READMEs in tools/TOOLNAME/README.md, examine core/nvbit.h for API documentation, or report issues at the GitHub repository.
Before asking for help, try with TOOL_VERBOSE=1 to get diagnostic output, verify your setup meets all requirements in Prerequisites, and test with provided test applications first.
We welcome contributions to improve the tutorial! If you find issues or have suggestions, open an issue describing the problem or enhancement, or submit a pull request with your proposed changes, following the coding style of the existing examples.
For more details on the NVBit APIs, see the comments in core/nvbit.h.
Happy learning!