Workaround: Build failure with CUDA 13.0: "Unsupported CUDA version: 13" error in onnxruntime-node #26586
erkkimon started this conversation in Show & Tell
Environment:
Problem Description:
When installing dependencies on a system with CUDA 13, `npm install` fails during the `onnxruntime-node` post-install script with the following error: `Failed to detect CUDA version from nvcc --version: Unsupported CUDA version: 13`. This occurs because onnxruntime-node doesn't yet have pre-built binaries for CUDA 13.

Workaround - Force CPU Mode Installation:
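A minimal sketch of the CPU-only install. I believe the `--onnxruntime-node-install-cuda=skip` flag (and the matching `ONNXRUNTIME_NODE_INSTALL_CUDA` environment variable) is what onnxruntime-node's install script accepts for skipping the CUDA binary download, but double-check it against the onnxruntime-node version you have pinned:

```shell
# Skip the CUDA binary download so the post-install script
# never probes `nvcc --version` for a supported CUDA release.
npm install --onnxruntime-node-install-cuda=skip

# Environment-variable form, handy for CI or Dockerfiles:
ONNXRUNTIME_NODE_INSTALL_CUDA=skip npm install
```

Either form leaves onnxruntime-node with only the CPU execution provider available.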
Then rebuild the `better-sqlite3` native bindings manually (required for database operations), e.g. with `npm rebuild better-sqlite3`.

Important Limitation:
This workaround installs onnxruntime-node in CPU-only mode, so any local model inference using `@huggingface/transformers` will run on the CPU, which may be slower for embeddings and other operations.

Alternative Solution for GPU Acceleration:
If you need GPU acceleration for embeddings but cannot downgrade from CUDA 13, I recommend using Ollama for embeddings while running vllama for LLM inference; vllama is a drop-in replacement for Ollama that runs inference on the GPU while maintaining full API compatibility.
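As a sketch of the Ollama side of this split, embeddings can be requested over Ollama's HTTP API. The port and endpoint below are Ollama's documented defaults; the model name is the embedding model used later in this post and is assumed to have been pulled already:

```shell
# Request an embedding from a locally running Ollama instance
# (assumes Ollama is listening on its default port 11434 and
# that nomic-embed-text:latest has been pulled).
curl http://localhost:11434/api/embeddings \
  -d '{"model": "nomic-embed-text:latest", "prompt": "Hello, world"}'
```

The response contains an `embedding` array; any Ollama-compatible server (such as vllama, per the claim above) should answer the same request.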
My Working Configuration:
I'm successfully using this setup with Perplexica with no speed penalty:

- `nomic-embed-text:latest` on CPU (embeddings)
- `huihui_ai/deepseek-r1-abliterated:14b` on GPU (LLM inference)

This separation provides optimal performance.
To configure this in Perplexica:
```
ollama pull nomic-embed-text:latest
ollama pull huihui_ai/deepseek-r1-abliterated:14b
```

I hope Perplexica will work with CUDA 13 soon, but at least here is a workaround until then!