Issue description
Qwen3-Embedding-8B-Q4_K_M.gguf (downloaded from Hugging Face) works with `gpu: false` (CPU) but crashes when run on Vulkan.
Expected Behavior
`await embedContext.getEmbeddingFor(...)` returns successfully, or throws an error that can be caught and understood.
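For illustration, this is roughly the behavior I'd expect (hypothetical handling code, not part of the repro below; `longText` stands for the 2000+ character input):

```javascript
// Hypothetical expectation: a failure should surface as a catchable JS error
// rather than a native crash that kills the process.
try {
    const embedding = await embedContext.getEmbeddingFor(longText);
    console.log("vector length:", embedding.vector.length);
} catch (err) {
    console.error("embedding failed:", err); // currently never reached; the process dies instead
}
```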
Actual Behavior
Logs:
```
Embedding model loaded, priming with large data.
[node-llama-cpp] state_write_data: writing state
[node-llama-cpp] state_write_data: - writing model info
[node-llama-cpp] state_write_data: - writing output ids
[node-llama-cpp] state_write_data: - writing logits
[node-llama-cpp] state_write_data: - writing embeddings
[node-llama-cpp] state_write_data: - writing memory module
[node-llama-cpp] init: embeddings required but some input tokens were not marked as outputs -> overriding
[node-llama-cpp] output_reserve: reallocating output buffer from size 0.59 MiB to 152.11 MiB
[node-llama-cpp] init: embeddings required but some input tokens were not marked as outputs -> overriding
[node-llama-cpp] init: embeddings required but some input tokens were not marked as outputs -> overriding
D:/a/node-llama-cpp/node-llama-cpp/llama/llama.cpp/src/llama-context.cpp:622: fatal error
```
The process then exits with code -1073740791 (0xC0000409).
I get the impression that the `init: embeddings required but some input tokens were not marked as outputs -> overriding` warnings can be safely ignored; I ignore the same warning with the chat version of the Qwen model and it hasn't caused issues there (and despite my best efforts I haven't found a way to get rid of it).
Steps to reproduce
I've omitted the text I was using to test; I just copied text from my internal wiki, and text copied from any public wiki should work as long as it's between 2000 and 3000 characters. Running the embedder on a sample of ~100 tokens works, so I suspect the crash is related to the output buffer reallocation.
I did manage to get a similar setup working reliably on the CPU (but very slowly) on a forked thread in my development program, but I haven't been able to replicate that setup in my sample test. The only differences I can think of are that `{gpu: false}` was passed to `getLlama`, the context size was 8192, and the batch size wasn't set.
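A minimal sketch of that working CPU-only variant, assuming the same model file (the forked-thread plumbing from my development program is omitted and the input text is a placeholder):

```javascript
import { getLlama } from "node-llama-cpp";

(async function main() {
    // CPU only: gpu disabled, contextSize 8192, batchSize left at its default
    const llama = await getLlama({gpu: false});
    const model = await llama.loadModel({
        modelPath: "../Models/Qwen3-Embedding-8B-Q4_K_M.gguf",
        useMlock: true,
        useMmap: true
    });
    const context = await model.createEmbeddingContext({contextSize: 8192});

    const embedding = await context.getEmbeddingFor(`
A really long text of 2000+ characters.
    `.trim());
    console.log("Embedding vector length:", embedding.vector.length);
})();
```

The Vulkan repro below (`vulcan.js`) is the same idea with the default GPU selection, `contextSize: 4096`, and `batchSize: 256`.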
vulcan.js

```javascript
import { getLlama, Llama, LlamaLogLevel } from "node-llama-cpp";

(async function main() {
    let llamacpp = await getLlama();
    llamacpp.logLevel = LlamaLogLevel.debug;

    let embedModel = await llamacpp.loadModel({
        modelPath: "../Models/Qwen3-Embedding-8B-Q4_K_M.gguf",
        useMlock: true,
        useMmap: true,
        gpuLayers: "auto",
        metadataOverrides: {
            general: {}
        }
    });

    let embedContext = await embedModel.createEmbeddingContext({contextSize: 4096, batchSize: 256 /*, threads: 4*/});

    console.log("\nEmbedding model loaded, priming with large data.");
    await embedContext.getEmbeddingFor(`
A really long text of 2000+ characters.
    `.trim()); // pre-reserve big outputs/KV
    console.log("Embedding model primed.");
})();
```

package.json

```json
{
    "type": "module",
    "scripts": {
        "start": "node vulcan.js",
        "test": "echo \"Error: no test specified\" && exit 1",
        "build": "tsc --build",
        "rebuild": "tsc --build --force"
    },
    "dependencies": {
        "@types/node": "^18.0.0",
        "node-llama-cpp": "^3.14.2",
        "typescript": "^5.4.5"
    }
}
```

tsconfig

```json
{
    "compileOnSave": true,
    "compilerOptions": {
        "module": "ESNext",                        /* Specify what module code is generated. */
        "target": "ES2020",                        /* Set the JavaScript language version for emitted JavaScript and include compatible library declarations. */
        "moduleResolution": "node",
        "types": ["node"],
        "sourceMap": true,
        "esModuleInterop": true,                   /* Emit additional JavaScript to ease support for importing CommonJS modules. This enables 'allowSyntheticDefaultImports' for type compatibility. */
        "forceConsistentCasingInFileNames": true,  /* Ensure that casing is correct in imports. */
        "strict": true,                            /* Enable all strict type-checking options. */
        "skipLibCheck": true,                      /* Skip type checking all .d.ts files. */
        "useDefineForClassFields": false,
        "experimentalDecorators": true,
        "emitDecoratorMetadata": false
    }
}
```
My Environment
| Dependency | Version |
|---|---|
| Operating System | Windows 11 Pro (10.0.22621) |
| CPU | Intel i7-11700 |
| Node.js version | 20.11.1 |
| Typescript version | 5.4.5 |
| node-llama-cpp version | 3.14.2 |
`npx --yes node-llama-cpp inspect gpu` output:
```
OS: Windows 10.0.22621 (x64)
Node: 20.11.1 (x64)
node-llama-cpp: 3.14.2
Prebuilt binaries: b6845
Vulkan: available
Vulkan device: AMD Radeon RX 6800 XT
Vulkan used VRAM: 4.88% (786.42MB/15.73GB)
Vulkan free VRAM: 95.11% (14.97GB/15.73GB)
CPU model: 11th Gen Intel(R) Core(TM) i7-11700 @ 2.50GHz
Math cores: 0
Used RAM: 70.61% (22.5GB/31.86GB)
Free RAM: 29.38% (9.36GB/31.86GB)
Used swap: 51.28% (33.78GB/65.86GB)
Max swap size: 65.86GB
mmap: supported
```
Additional Context
A slightly different setup results in the program crashing without throwing the error.
Relevant Features Used
- Metal support
- CUDA support
- Vulkan support
- Grammar
- Function calling
Are you willing to resolve this issue by submitting a Pull Request?
No, I don’t have the time and I’m okay to wait for the community / maintainers to resolve this issue.