A high-performance Go binding for llama.cpp using purego and libffi for cross-platform compatibility without CGO.
- Pure Go: No CGO required, uses purego and libffi for C interop
 - Cross-Platform: Supports macOS (CPU/Metal), Linux (CPU/NVIDIA/AMD), Windows (CPU/NVIDIA/AMD)
 - Struct Support: Uses libffi for calling C functions with struct parameters/returns on all platforms
 - Performance: Direct bindings to llama.cpp shared libraries
 - Compatibility: Version-synchronized with llama.cpp releases
 - Easy Integration: Simple Go API for LLM inference
 - GPU Acceleration: Supports Metal, CUDA, HIP, Vulkan, OpenCL, SYCL, and other backends
 - Embedded Runtime Libraries: Optional go:embed bundle for all supported platforms
 - GGML Bindings: Low-level GGML tensor library bindings for advanced use cases
 
Gollama.cpp uses a platform-specific architecture with build tags to ensure optimal compatibility and performance across all operating systems.
macOS
- CPU: Intel x64, Apple Silicon (ARM64)
 - GPU: Metal (Apple Silicon)
 - Status: Full feature support with purego
 - Build Tags: Uses the `!windows` build tag
Linux
- CPU: x86_64, ARM64
 - GPU: NVIDIA (CUDA/Vulkan), AMD (HIP/ROCm/Vulkan), Intel (SYCL/Vulkan)
 - Status: Full feature support with purego and libffi
 - Build Tags: Uses the `!windows` build tag
Windows
- CPU: x86_64, ARM64
 - GPU: NVIDIA (CUDA/Vulkan), AMD (HIP/Vulkan), Intel (SYCL/Vulkan), Qualcomm Adreno (OpenCL)
 - Status: Full feature support with libffi
 - Build Tags: Uses the `windows` build tag with syscall-based library loading
Current State:
- ✅ Compiles without errors on Windows
 - ✅ Cross-compilation from other platforms works
 - ✅ Runtime functionality fully enabled via libffi and GetProcAddress
 - ✅ Full struct parameter/return support through function registration
 - 🚧 GPU acceleration being tested
 
 
Windows runtime notes
- The loader now adds the DLL's directory to the Windows DLL search path and uses `LoadLibraryExW` with safe search flags to reliably resolve sibling dependencies (ggml, libomp, libcurl, etc.).
 - If you see “The specified module could not be found.” while loading `llama.dll`, it often indicates a missing system runtime (e.g., Microsoft Visual C++ Redistributable 2015–2022). Installing the latest x64/x86 redistributable typically resolves it.
 - CI runners set PATH for later steps, but the downloader verifies loading immediately after download; the improved loader handles dependency resolution without relying on PATH.
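For reference, this loading strategy can be sketched with golang.org/x/sys/windows (illustrative only, not gollama.cpp's internal loader code):

```go
//go:build windows

package main

import (
	"fmt"

	"golang.org/x/sys/windows"
)

// loadWithSafeSearch sketches the strategy described above: load llama.dll
// with flags that make Windows resolve the DLL and its sibling dependencies
// (ggml, libomp, libcurl, ...) from the DLL's own directory plus the default
// safe locations, instead of relying on PATH. The real gollama.cpp loader
// additionally registers the DLL's directory in the process search path.
func loadWithSafeSearch(dllPath string) (windows.Handle, error) {
	h, err := windows.LoadLibraryEx(dllPath, 0,
		windows.LOAD_LIBRARY_SEARCH_DLL_LOAD_DIR|windows.LOAD_LIBRARY_SEARCH_DEFAULT_DIRS)
	if err != nil {
		return 0, fmt.Errorf("load %s: %w", dllPath, err)
	}
	return h, nil
}

func main() {
	if _, err := loadWithSafeSearch(`C:\path\to\llama.dll`); err != nil {
		fmt.Println(err)
	}
}
```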
 
Our platform abstraction layer uses Go build tags to provide:
- Unix-like systems (`!windows`): Uses purego for dynamic library loading
 - Windows (`windows`): Uses native Windows syscalls (`LoadLibraryW`, `FreeLibrary`, `GetProcAddress`)
 - All platforms: Uses libffi for calling C functions with struct parameters/returns
 - Cross-compilation: Supports building for any platform from any platform
 - Automatic detection: Runtime platform capability detection
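For illustration, the build-tag split can look like the following sketch (the file, package, and function names here are hypothetical, not the actual gollama.cpp sources):

```go
//go:build !windows

// Hypothetical loader_unix.go: the Unix-like half of the split described
// above. A matching loader_windows.go (tagged //go:build windows) would wrap
// LoadLibraryW/GetProcAddress/FreeLibrary via syscalls instead.
package platform

import "github.com/ebitengine/purego"

// OpenLibrary resolves a shared library handle with purego on macOS and Linux.
func OpenLibrary(path string) (uintptr, error) {
	return purego.Dlopen(path, purego.RTLD_NOW|purego.RTLD_GLOBAL)
}
```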
 
go get github.com/dianlight/gollama.cpp

The Go module automatically downloads pre-built llama.cpp libraries from the official ggml-org/llama.cpp releases on first use. No manual compilation required!
For reproducible builds you can embed the pre-built libraries directly into the Go module. A helper Makefile target downloads the configured llama.cpp build (LLAMA_CPP_BUILD) for every supported platform and synchronises the ./libs directory which is picked up by go:embed:
# Download all platform builds for the configured llama.cpp version and populate ./libs
make populate-libs
# Alternatively, use the CLI directly
go run ./cmd/gollama-download -download-all -version b6862 -copy-libs

Only a single llama.cpp version is stored in ./libs at a time. Running populate-libs removes outdated directories automatically. Subsequent go build invocations embed the freshly synchronised libraries, and LoadLibraryWithVersion("") will prefer the embedded bundle.
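For example, based on the behaviour described above, a program can opt into the embedded bundle by passing an empty version string:

```go
// Passing an empty version prefers the embedded go:embed bundle over a
// downloaded copy, as described above.
if err := gollama.LoadLibraryWithVersion(""); err != nil {
    log.Fatal(err)
}
```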
Our CI system tests compilation across all platforms:
| Target Platform | Build From Linux | Build From macOS | Build From Windows | 
|---|---|---|---|
| Linux (amd64) | ✅ | ✅ | ✅ | 
| Linux (arm64) | ✅ | ✅ | ✅ | 
| macOS (amd64) | ✅ | ✅ | ✅ | 
| macOS (arm64) | ✅ | ✅ | ✅ | 
| Windows (amd64) | ✅ | ✅ | ✅ | 
| Windows (arm64) | ✅ | ✅ | ✅ | 
# Test cross-compilation for all platforms
make test-cross-compile
# Build for specific platform
GOOS=windows GOARCH=amd64 go build ./...
GOOS=linux GOARCH=arm64 go build ./...
GOOS=darwin GOARCH=arm64 go build ./...
# Run platform-specific tests
go test -v -run TestPlatformSpecific ./...

package main
import (
    "fmt"
    "log"
    "github.com/dianlight/gollama.cpp"
)
func main() {
    // Initialize the library
    gollama.Backend_init()
    defer gollama.Backend_free()
    // Load model
    params := gollama.Model_default_params()
    model, err := gollama.Model_load_from_file("path/to/model.gguf", params)
    if err != nil {
        log.Fatal(err)
    }
    defer gollama.Model_free(model)
    // Create context
    ctxParams := gollama.Context_default_params()
    ctx, err := gollama.Init_from_model(model, ctxParams)
    if err != nil {
        log.Fatal(err)
    }
    defer gollama.Free(ctx)
    // Tokenize and generate
    prompt := "The future of AI is"
    tokens, err := gollama.Tokenize(model, prompt, true, false)
    if err != nil {
        log.Fatal(err)
    }
    // Create batch and decode
    batch := gollama.Batch_init(len(tokens), 0, 1)
    defer gollama.Batch_free(batch)
    for i, token := range tokens {
        gollama.Batch_add(batch, token, int32(i), []int32{0}, false)
    }
    if err := gollama.Decode(ctx, batch); err != nil {
        log.Fatal(err)
    }
    // Sample next token
    logits := gollama.Get_logits_ith(ctx, -1)
    _ = logits // retrieved for illustration; not used further in this example
    candidates := gollama.Token_data_array_init(model)
    
    sampler := gollama.Sampler_init_greedy()
    defer gollama.Sampler_free(sampler)
    
    newToken := gollama.Sampler_sample(sampler, ctx, candidates)
    
    // Convert token to text
    text := gollama.Token_to_piece(model, newToken, false)
    fmt.Printf("Generated: %s\n", text)
}

For advanced use cases, gollama.cpp provides direct access to GGML (the tensor library powering llama.cpp):
// Check GGML type information
typeSize, err := gollama.Ggml_type_size(gollama.GGML_TYPE_F32)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("F32 type size: %d bytes\n", typeSize)
// Check if a type is quantized
isQuantized, err := gollama.Ggml_type_is_quantized(gollama.GGML_TYPE_Q4_0)
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Q4_0 is quantized: %v\n", isQuantized)
// Enumerate backend devices
devCount, err := gollama.Ggml_backend_dev_count()
if err == nil && devCount > 0 {
    for i := uint64(0); i < devCount; i++ {
        dev, _ := gollama.Ggml_backend_dev_get(i)
        name, _ := gollama.Ggml_backend_dev_name(dev)
        fmt.Printf("Device %d: %s\n", i, name)
    }
}

Supported GGML Features:
- 31 tensor type definitions (F32, F16, Q4_0, Q8_0, BF16, etc.)
 - Type size and quantization utilities
 - Backend device enumeration and management
 - Buffer allocation and management
 - Type information queries
 
Note: GGML functions may not be exported in all llama.cpp builds. The library gracefully handles missing functions without errors.
Gollama.cpp automatically downloads the appropriate pre-built binaries with GPU support and configures the optimal backend:
// Automatic GPU detection and configuration
params := gollama.Context_default_params()
params.n_gpu_layers = 32 // Offload layers to GPU (if available)
// Detect available GPU backend
backend := gollama.DetectGpuBackend()
fmt.Printf("Using GPU backend: %s\n", backend.String())
// Platform-specific optimizations:
// - macOS: Uses Metal when available  
// - Linux: Supports CUDA, HIP, Vulkan, and SYCL
// - Windows: Supports CUDA, HIP, Vulkan, OpenCL, and SYCL
params.split_mode = gollama.LLAMA_SPLIT_MODE_LAYER

| Platform | GPU Type | Backend | Status | 
|---|---|---|---|
| macOS | Apple Silicon | Metal | ✅ Supported | 
| macOS | Intel/AMD | CPU only | ✅ Supported | 
| Linux | NVIDIA | CUDA | ✅ Available in releases | 
| Linux | NVIDIA | Vulkan | ✅ Available in releases | 
| Linux | AMD | HIP/ROCm | ✅ Available in releases | 
| Linux | AMD | Vulkan | ✅ Available in releases | 
| Linux | Intel | SYCL | ✅ Available in releases | 
| Linux | Intel/Other | Vulkan | ✅ Available in releases | 
| Linux | Intel/Other | CPU | ✅ Fallback | 
| Windows | NVIDIA | CUDA | ✅ Available in releases | 
| Windows | NVIDIA | Vulkan | ✅ Available in releases | 
| Windows | AMD | HIP | ✅ Available in releases | 
| Windows | AMD | Vulkan | ✅ Available in releases | 
| Windows | Intel | SYCL | ✅ Available in releases | 
| Windows | Qualcomm Adreno | OpenCL | ✅ Available in releases | 
| Windows | Intel/Other | Vulkan | ✅ Available in releases | 
| Windows | Intel/Other | CPU | ✅ Fallback | 
The library automatically downloads pre-built binaries from the official llama.cpp releases with the appropriate GPU support for your platform. The download happens automatically on first use!
params := gollama.Model_default_params()
params.n_ctx = 4096           // Context size
params.use_mmap = true        // Memory mapping
params.use_mlock = true       // Memory locking
params.vocab_only = false     // Load full model

Gollama.cpp automatically downloads pre-built binaries from the official llama.cpp releases. You can also manage libraries manually:
// Load a specific version
err := gollama.LoadLibraryWithVersion("b6862")
// Clean cache to force re-download
err := gollama.CleanLibraryCache()

# Download libraries for current platform
make download-libs
# Download libraries for all platforms  
make download-libs-all
# Test download functionality
make test-download
# Test GPU detection and functionality
make test-gpu
# Detect available GPU backends
make detect-gpu
# Clean library cache
make clean-libs

The downloader automatically selects the best variant for your platform:
- macOS: Metal-enabled binaries (arm64/x64)
 - Linux: CPU-optimized binaries (CUDA/HIP/Vulkan/SYCL versions available)
 - Windows: CPU-optimized binaries (CUDA/HIP/Vulkan/OpenCL/SYCL versions available)
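A simplified, illustrative sketch of this per-OS default (the real downloader also takes detected GPUs and explicit flags into account):

```go
package main

import (
	"fmt"
	"runtime"
)

// defaultVariant mirrors the rule of thumb described above: Metal builds on
// macOS, CPU builds elsewhere unless a GPU-specific variant is requested.
func defaultVariant() string {
	if runtime.GOOS == "darwin" {
		return "metal"
	}
	return "cpu" // CUDA/HIP/Vulkan/OpenCL/SYCL variants can be selected explicitly
}

func main() {
	fmt.Println("default variant:", defaultVariant())
}
```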
 
Downloaded libraries are cached in platform-specific locations:
- Linux/macOS: `~/.cache/gollama/libs/`
 - Windows: `%LOCALAPPDATA%/gollama/libs/`
You can customize the cache directory in several ways:
Environment Variable:
export GOLLAMA_CACHE_DIR=/custom/path/to/cache

Configuration File:
{
  "cache_dir": "/custom/path/to/cache"
}

Programmatically:
config := gollama.DefaultConfig()
config.CacheDir = "/custom/path/to/cache"
gollama.SetGlobalConfig(config)

To get the current cache directory:
cacheDir, err := gollama.GetLibraryCacheDir()
if err != nil {
    log.Fatal(err)
}
fmt.Printf("Using cache directory: %s\n", cacheDir)- Go 1.21 or later
 - Make
 
# Clone and build
git clone https://github.com/dianlight/gollama.cpp
cd gollama.cpp
# Build for current platform
make build
# Run tests (downloads libraries automatically)
make test
# Build examples
make build-examples
# Run tests
make test
# Generate release packages
make release

make test

Tests use github.com/stretchr/testify/suite along with a shared BaseSuite (see test_base_suite_test.go) that automatically snapshots/restores configuration and environment variables and unloads the llama library after each test. See the Contributing guide for details.
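For orientation, a suite following this pattern typically looks like the sketch below (hypothetical names; the project's own tests embed the shared BaseSuite rather than suite.Suite directly):

```go
package gollama_test

import (
	"testing"

	"github.com/stretchr/testify/suite"
)

// ExampleSuite is a hypothetical suite illustrating the testify pattern
// described above. The real tests embed BaseSuite, which snapshots and
// restores configuration/environment and unloads the llama library
// around each test.
type ExampleSuite struct {
	suite.Suite
}

func (s *ExampleSuite) TestBackendLifecycle() {
	s.Require().True(true) // placeholder assertion
}

func TestExampleSuite(t *testing.T) {
	suite.Run(t, new(ExampleSuite))
}
```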
The Makefile implements intelligent GPU detection:
- CUDA Detection: Checks for the `nvcc` compiler and CUDA toolkit
 - HIP Detection: Checks for `hipconfig` and a ROCm installation
 - Priority Order: CUDA > HIP > CPU (on Linux/Windows)
 - Metal: Always enabled on macOS when Xcode is available
 
No manual configuration or environment variables required!
This library tracks llama.cpp versions. The version number format is:
vX.Y.Z-llamacpp.ABCD
Where:
- `X.Y.Z` is the gollama.cpp semantic version
 - `ABCD` is the corresponding llama.cpp build number
For example: v0.2.0-llamacpp.b6862 uses llama.cpp build b6862.
- API Reference
 - High-Level API: `gollama.go` - Complete llama.cpp bindings
 - Low-Level API: `goggml.go` - GGML tensor library bindings
 - GGML Low-Level API Guide - Detailed GGML bindings documentation
 - Examples
 - Build Guide
 - GPU Setup
 
See the examples directory for complete examples:
- Simple Chat
 - Chat with History
 - Embedding Generation
 - Model Quantization
 - Batch Processing
 - GPU Acceleration
 
Contributions are welcome! Please read our Contributing Guide for details.
If you find this project helpful and would like to support its development, you can:
- ⭐ Star this repository on GitHub
 - 🐛 Report bugs and suggest improvements
 - 📖 Improve documentation
 
Your support helps maintain and improve this project for the entire community!
This project is licensed under the MIT License - see the LICENSE file for details.
This license is compatible with llama.cpp's MIT license.
- llama.cpp - The underlying C++ library
 - purego - Pure Go C interop library
 - ggml - Machine learning tensor library
 
- Issues - Bug reports and feature requests
 - Discussions - Questions and community support