During my usual work with GPU inference, I noticed that load_model was incredibly slow every time. Since ncnn's GPU backend is implemented with Vulkan, I wondered if it shared any common ground with video games. Every time I launch the game Victoria 3 on a new computer, it shows a "compiling shaders" message, but this only happens on the first launch. In contrast, I saw that ncnn's code recompiles GLSL to SPIR-V and then creates a pipeline from scratch every single time. This got me thinking: could I cache these compilation results?

1. SPIR-V Cache

First, I looked into nihui's article, ncnn generating spirv at runtime.
At this point, a single piece of ncnn shader code would be offline-compiled into 10 different versions of SPIR-V binaries (fp32, fp16p, fp16pa, fp16s, fp16sa, and their image-storage counterparts). All these binaries would be packed into the ncnn Vulkan library, causing its size to explode to tens of megabytes, which is unacceptable for an app's installation package.
That's what nihui said. However, in a real-world application, the runtime configuration and GPU environment are relatively fixed. Moreover, a model typically only uses a handful of operators, which means the storage overhead from offline caching is quite low. In comparison, the trade-off of a slight increase in storage for a significant reduction in startup time is well worth it.
Even as a fallback, caching SPIR-V purely in memory, without saving it offline to local storage, still prevents the same shader from being recompiled repeatedly within a process.
The solution is simple: every time a SPIR-V module is compiled, store the result keyed by its shader index and compile options. Then, check the cache before any new compilation:
// Scan the in-memory cache for a module compiled with the same
// shader index and option bits before compiling anything new.
for (size_t i = 0; i < d->cache_spirv_module.size(); i++)
{
    if (d->cache_spirv_module[i].first.d0 == PipelineCachePrivate::spv_param({shader_type_index, opt_bits}).d0) // hit cache
    {
        spirv = d->cache_spirv_module[i].second;
        goto hit_cache;
    }
}
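On a miss, the module is compiled as usual and the result is appended to the cache. Roughly like this (a sketch only; the exact member names come from the PR, and opt_bits is assumed to be derived from opt):

// Cache miss: compile GLSL to SPIR-V via ncnn's compile_spirv_module helper,
// then remember the result under the same (shader_type_index, opt_bits) key.
int retc = compile_spirv_module(shader_type_index, opt, spirv);
if (retc == 0)
{
    d->cache_spirv_module.push_back(std::make_pair(PipelineCachePrivate::spv_param({shader_type_index, opt_bits}), spirv));
}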
2. Pipeline Cache
The SPIR-V cache only saves time on the GLSL-to-SPIR-V compilation step. But SPIR-V can't run directly on the GPU; we've only taken the first step of a long journey.
Fortunately, after consulting the Vulkan documentation, I found that implementing a pipeline cache is very straightforward. You don't need to know anything about the pipeline's internal construction: you create a cache object with the vkCreatePipelineCache API and pass it in when creating each pipeline, and the driver handles all the details for you.
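In plain Vulkan terms the flow looks roughly like this (a minimal sketch of the standard API usage, not ncnn's actual code; device, blob, blob_size, pipeline_create_info, and pipeline are assumed to exist):

// Create a pipeline cache, optionally seeded with data saved from a previous run.
VkPipelineCacheCreateInfo cache_info = {};
cache_info.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
cache_info.initialDataSize = blob_size; // 0 on the first run
cache_info.pInitialData = blob;         // previously saved bytes, or NULL

VkPipelineCache pipeline_cache = 0;
vkCreatePipelineCache(device, &cache_info, 0, &pipeline_cache);

// Pass the cache instead of VK_NULL_HANDLE when building each pipeline;
// the driver transparently reuses previously compiled results.
vkCreateComputePipelines(device, pipeline_cache, 1, &pipeline_create_info, 0, &pipeline);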
For convenience, I created a new wrapper for this API in gpu.h. The cache data can then be exported with the matching API, vkGetPipelineCacheData.
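The export follows Vulkan's usual two-call pattern (again a sketch of the standard usage, not the wrapper's actual code):

// Query the size first, then fetch the driver-serialized cache blob.
size_t data_size = 0;
vkGetPipelineCacheData(device, pipeline_cache, &data_size, 0);

std::vector<unsigned char> data(data_size);
vkGetPipelineCacheData(device, pipeline_cache, &data_size, data.data());
// `data` can now be written to disk, prefixed with the compatibility header below.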
However, this high level of abstraction is a double-edged sword: it prevents the user from controlling individual pipelines. Then again, creating a separate pipeline cache for each and every pipeline would be completely unnecessary, since a single cache object is designed to hold them all:
Implementations should not internally limit the total number of entries added to a pipeline cache object or the total host memory consumed.
(From the official Vulkan specification)
Still, we need to ensure the pipeline cache is compatible with the target device. To achieve this, we can define a file header for our pipeline cache.
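Something along these lines (an illustrative layout; the actual struct lives in the PR):

// On-disk header written in front of the raw vkGetPipelineCacheData blob
// (illustrative field layout; the PR's struct may differ).
struct pipeline_cache_file_header
{
    uint32_t magic;             // fixed constant marking an ncnn pipeline cache file
    uint32_t vendor_id;         // VkPhysicalDeviceProperties::vendorID
    uint32_t device_id;         // VkPhysicalDeviceProperties::deviceID
    uint32_t driver_version;    // VkPhysicalDeviceProperties::driverVersion
    uint8_t uuid[VK_UUID_SIZE]; // VkPhysicalDeviceProperties::pipelineCacheUUID
};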
Compatibility is ensured by validating the magic number, vendorID, deviceID, driverVersion, and uuid every time the cache is loaded.
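Loading then reduces to a field-by-field comparison against the running device (a hypothetical helper for illustration; PIPELINE_CACHE_MAGIC is an assumed constant):

// Reject a cached blob unless every identity field matches the current device.
bool is_cache_compatible(const pipeline_cache_file_header& h, const VkPhysicalDeviceProperties& p)
{
    return h.magic == PIPELINE_CACHE_MAGIC
        && h.vendor_id == p.vendorID
        && h.device_id == p.deviceID
        && h.driver_version == p.driverVersion
        && memcmp(h.uuid, p.pipelineCacheUUID, VK_UUID_SIZE) == 0;
}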
Conclusion
Results
Test Environment: Intel Core i9-13900HX with its integrated GPU
Model: mobilenet_v3
==================================================
Verification and Summary
==================================================
Output verification: SUCCESS
--------------------------------------------------
Performance Summary:
- Without Cache: 423.934 ms
- With Cache: 99.2737 ms
- Speedup: 76.5828%
Feel free to check out my PR: #6221