During my usual work with GPU inference, I noticed that load_model was incredibly slow every time. Since ncnn's GPU backend is implemented with Vulkan, I wondered if it shared any common ground with video games. Every time I launch the game Victoria 3 on a new computer, it shows a "compiling shaders" message, but this only happens on the first launch. In contrast, I saw that ncnn's code recompiles GLSL to SPIR-V and then creates a pipeline from scratch every single time. This got me thinking: could I cache these compilation results?

1. SPIR-V Cache

First, I looked into nihui's article, ncnn generating spirv at runtime.
At this point, a single piece of ncnn shader code would be offline-compiled into 10 different versions of SPIR-V binaries (fp32, fp16p, fp16pa, fp16s, fp16sa, and their image-storage counterparts). All these binaries would be packed into the ncnn Vulkan library, causing its size to explode to tens of megabytes, which is unacceptable for an app's installation package.
That's what nihui said. However, in a real-world application, the runtime configuration and GPU environment are relatively fixed. Moreover, a model typically only uses a handful of operators, which means the storage overhead from offline caching is quite low. In comparison, the trade-off of a slight increase in storage for a significant reduction in startup time is well worth it.
Even as a fallback, caching SPIR-V purely in memory, without saving it offline to local storage, still prevents the same shader from being recompiled repeatedly within a process.
The solution is simple: every time a SPIR-V module is compiled, store the result keyed by its shader index and compile options. Then, check the cache before any new compilation:
// Scan the in-memory cache for a module compiled with the same
// shader index and option bits before compiling anything new.
for (size_t i = 0; i < d->cache_spirv_module.size(); i++)
{
    if (d->cache_spirv_module[i].first.d0 == PipelineCachePrivate::spv_param({shader_type_index, opt_bits}).d0) // hit cache
    {
        spirv = d->cache_spirv_module[i].second;
        goto hit_cache;
    }
}
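On a miss, the module is compiled as usual and the result is appended to the cache. Roughly like this (a sketch only; the exact member names come from the PR, and opt_bits is assumed to be derived from opt):

// Cache miss: compile GLSL to SPIR-V via ncnn's compile_spirv_module helper,
// then remember the result under the same (shader_type_index, opt_bits) key.
int retc = compile_spirv_module(shader_type_index, opt, spirv);
if (retc == 0)
{
    d->cache_spirv_module.push_back(std::make_pair(PipelineCachePrivate::spv_param({shader_type_index, opt_bits}), spirv));
}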
2. Pipeline Cache
The SPIR-V cache only saves time on the GLSL-to-SPIR-V compilation step. But SPIR-V can't run directly on the GPU; we've only taken the first step of a long journey.
Fortunately, after consulting the Vulkan documentation, I found that implementing a pipeline cache is very straightforward. You don't need to know anything about the pipeline's internal construction: you create a cache object with the vkCreatePipelineCache API and pass it in when creating each pipeline, and the driver handles all the details for you.
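In plain Vulkan terms the flow looks roughly like this (a minimal sketch of the standard API usage, not ncnn's actual code; device, blob, blob_size, pipeline_create_info, and pipeline are assumed to exist):

// Create a pipeline cache, optionally seeded with data saved from a previous run.
VkPipelineCacheCreateInfo cache_info = {};
cache_info.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
cache_info.initialDataSize = blob_size; // 0 on the first run
cache_info.pInitialData = blob;         // previously saved bytes, or NULL

VkPipelineCache pipeline_cache = 0;
vkCreatePipelineCache(device, &cache_info, 0, &pipeline_cache);

// Pass the cache instead of VK_NULL_HANDLE when building each pipeline;
// the driver transparently reuses previously compiled results.
vkCreateComputePipelines(device, pipeline_cache, 1, &pipeline_create_info, 0, &pipeline);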
For convenience, I created a new wrapper for this API in gpu.h. The cache data can then be exported with the matching API, vkGetPipelineCacheData.
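The export follows Vulkan's usual two-call pattern (again a sketch of the standard usage, not the wrapper's actual code):

// Query the size first, then fetch the driver-serialized cache blob.
size_t data_size = 0;
vkGetPipelineCacheData(device, pipeline_cache, &data_size, 0);

std::vector<unsigned char> data(data_size);
vkGetPipelineCacheData(device, pipeline_cache, &data_size, data.data());
// `data` can now be written to disk, prefixed with the compatibility header below.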
However, this high level of abstraction is a double-edged sword: it prevents the user from controlling individual pipelines. Then again, creating a separate pipeline cache for each and every pipeline would be completely unnecessary, since a single cache object is designed to hold them all:
Implementations should not internally limit the total number of entries added to a pipeline cache object or the total host memory consumed.
(From the official Vulkan specification)
Still, we need to ensure the pipeline cache is compatible with the target device. To achieve this, we can define a file header for our pipeline cache.
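Something along these lines (an illustrative layout; the actual struct lives in the PR):

// On-disk header written in front of the raw vkGetPipelineCacheData blob
// (illustrative field layout; the PR's struct may differ).
struct pipeline_cache_file_header
{
    uint32_t magic;             // fixed constant marking an ncnn pipeline cache file
    uint32_t vendor_id;         // VkPhysicalDeviceProperties::vendorID
    uint32_t device_id;         // VkPhysicalDeviceProperties::deviceID
    uint32_t driver_version;    // VkPhysicalDeviceProperties::driverVersion
    uint8_t uuid[VK_UUID_SIZE]; // VkPhysicalDeviceProperties::pipelineCacheUUID
};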
Compatibility is ensured by validating the magic number, vendorID, deviceID, driverVersion, and uuid every time the cache is loaded.
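Loading then reduces to a field-by-field comparison against the running device (a hypothetical helper for illustration; PIPELINE_CACHE_MAGIC is an assumed constant):

// Reject a cached blob unless every identity field matches the current device.
bool is_cache_compatible(const pipeline_cache_file_header& h, const VkPhysicalDeviceProperties& p)
{
    return h.magic == PIPELINE_CACHE_MAGIC
        && h.vendor_id == p.vendorID
        && h.device_id == p.deviceID
        && h.driver_version == p.driverVersion
        && memcmp(h.uuid, p.pipelineCacheUUID, VK_UUID_SIZE) == 0;
}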
Conclusion
Results
Test Environment: Intel Core i9-13900HX with its integrated GPU
Model: mobilenet_v3
==================================================
Verification and Summary
==================================================
Output verification: SUCCESS
--------------------------------------------------
Performance Summary:
- Without Cache: 423.934 ms
- With Cache: 99.2737 ms
- Speedup: 76.5828%
Feel free to check out my PR: #6221