
pyannote.audio 4.0.3 uses 6x more VRAM than 3.3.2 (9.54GB vs 1.59GB peak) #1963

@DevAtShopot

Tested versions

| Version | Model | Peak VRAM | Reproducible |
|---|---|---|---|
| pyannote.audio 3.3.2 | speaker-diarization-3.1 | 1.59GB | ✅ No issue |
| pyannote.audio 4.0.3 | speaker-diarization-community-1 | 9.54GB | ✅ Bug present |
| pyannote.audio 4.0.3 | speaker-diarization-3.1 | 9.54GB | ✅ Bug present |

The VRAM spike occurs regardless of which model is used with 4.0.3.

System information

  • OS: Ubuntu 22.04 LTS
  • GPU: NVIDIA RTX A5000 (24GB VRAM)
  • Python: 3.12
  • PyTorch: 2.5.1+cu124
  • CUDA: 12.x
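
For completeness, a quick way to print the exact combination under test (standard torch/pyannote attributes only):

```python
# Environment report, to confirm which version combination is being measured.
import torch
import pyannote.audio

print("pyannote.audio:", pyannote.audio.__version__)
print("torch:", torch.__version__, "| CUDA:", torch.version.cuda)
print("GPU:", torch.cuda.get_device_name(0))
```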

Issue description

When running speaker diarization on a 72-minute audio file, pyannote.audio 4.0.3 uses 6x more VRAM than 3.3.2.
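
For anyone without a long recording at hand, a hedged sketch that fabricates a 72-minute file. Synthetic noise will not diarize meaningfully and may not hit exactly the same code paths as real speech, but it at least provides an input of the right length:

```python
# Create a 72-minute mono WAV as a stand-in input.
# NOTE: noise is an assumption for convenience; real multi-speaker audio is
# needed to confirm the exact peaks reported below.
import torch
import torchaudio

sample_rate = 16_000
waveform = 0.1 * torch.randn(1, sample_rate * 60 * 72)  # (channels, samples)
torchaudio.save("your_audio.wav", waveform, sample_rate)
```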

Step-by-step VRAM comparison

| Processing step | 3.3.2 + diarization-3.1 | 4.0.3 + community-1 |
|---|---|---|
| segmentation | 0.40GB | 0.43GB |
| embeddings | 0.05GB | 0.05GB |
| discrete_diarization | 1.59GB | 9.54GB |

The spike occurs during the discrete_diarization step (after clustering, during reconstruction).
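
To pin down which call stack makes the big allocation inside that step, PyTorch's allocator history can be recorded around the run. These are the private-but-documented `torch.cuda.memory` debugging hooks; `pipeline` and `waveform` are the ones from the MRE below:

```python
# Record allocator history around the run, then inspect the snapshot at
# https://pytorch.org/memory_viz to see the stack behind the ~9.5GB spike.
import torch

torch.cuda.memory._record_memory_history(max_entries=100_000)
result = pipeline({"waveform": waveform, "sample_rate": sample_rate})
torch.cuda.memory._dump_snapshot("diarization_vram.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```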

Expected behavior

VRAM usage should be comparable between versions for similar workloads.

Actual behavior

4.0.3 allocates ~8GB more VRAM during reconstruction, making it impractical for GPUs with less than 12GB or for concurrent processing.
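
A pre-flight guard along these lines makes the constraint concrete, treating the ~9.5GB measured above as the per-job requirement (an assumption for other file lengths and speaker counts):

```python
# Refuse to start a diarization job unless the observed per-job peak fits.
# PEAK_PER_JOB_GB is the value measured above; treat it as an assumption
# for other inputs.
import torch

PEAK_PER_JOB_GB = 9.54
free, total = torch.cuda.mem_get_info()  # bytes
if free / 1024**3 < PEAK_PER_JOB_GB:
    raise RuntimeError(
        f"only {free / 1024**3:.1f}GB free, need ~{PEAK_PER_JOB_GB}GB"
    )
```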

Additional observations

  1. Both versions use identical settings:
    - Embedding model: pyannote/wespeaker-voxceleb-resnet34-LM
    - Clustering: AgglomerativeClustering
    - batch_size=32
  2. The issue persists whether audio is passed as a file path or a preloaded waveform (see the verification sketch after this list)
  3. In 4.0.3, exclusive_speaker_diarization is always computed even when legacy=True, potentially contributing to overhead
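
A small check for observation 2, reusing `pipeline` from the MRE below; the path and the peak-measurement pattern are the same as there:

```python
# Compare peak VRAM for the two input forms (file path vs preloaded waveform).
import torch
import torchaudio

def peak_gb(inputs):
    torch.cuda.reset_peak_memory_stats()
    pipeline(inputs)
    return torch.cuda.max_memory_allocated() / 1024**3

waveform, sample_rate = torchaudio.load("your_audio.wav")
print(f"path:     {peak_gb('your_audio.wav'):.2f}GB")
print(f"waveform: {peak_gb({'waveform': waveform, 'sample_rate': sample_rate}):.2f}GB")
```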

Minimal reproduction example (MRE)

```python
import os

import torch
import torchaudio

from pyannote.audio import Pipeline

torch.cuda.set_device(0)
torch.cuda.empty_cache()


def log_mem(label):
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{label}: peak={peak:.2f}GB")


# Load model
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token=os.environ.get("HF_TOKEN"),  # or use_auth_token for 3.3.2
)
pipeline.to(torch.device("cuda"))

# Load audio (use any 60+ minute audio file)
audio_file = "your_audio.wav"
waveform, sample_rate = torchaudio.load(audio_file)

# Track peak memory at each pipeline step
last_step = [None]


def hook(step_name, step_artefact, file, completed=None, total=None):
    if step_name != last_step[0]:
        log_mem(f"Step: {step_name}")
        last_step[0] = step_name
        torch.cuda.reset_peak_memory_stats()


torch.cuda.reset_peak_memory_stats()
result = pipeline({"waveform": waveform, "sample_rate": sample_rate}, hook=hook)
print(f"\nFinal peak VRAM: {torch.cuda.max_memory_allocated()/1024**3:.2f}GB")
```

Results

pyannote.audio 3.3.2:

```
Step: segmentation: peak=0.40GB
Step: embeddings: peak=0.05GB
Step: discrete_diarization: peak=1.59GB

Final peak VRAM: 0.04GB
```

pyannote.audio 4.0.3:

```
Step: segmentation: peak=0.43GB
Step: embeddings: peak=0.05GB
Step: discrete_diarization: peak=9.54GB

Final peak VRAM: 0.04GB
```

Impact

  • Users with 8-12GB GPUs cannot run 4.0.3 on long audio files
  • Concurrent processing is severely limited (only ~2 jobs vs ~6+ with 3.3.2 on a 24GB GPU)
  • Forces users to stay on 3.3.2 (which requires pinning huggingface_hub<=0.23.5)
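
One extra datapoint that may help triage: whether the 9.54GB is live tensors or allocator caching. Printing the reserved-memory counter next to the allocated one at the end of the MRE would distinguish the two (standard torch.cuda stats, nothing pyannote-specific):

```python
# Live tensors vs allocator caching: if max_reserved is much larger than
# max_allocated, fragmentation is part of the story; if they match, the
# pipeline really holds ~9.5GB of tensors at once.
print(f"max allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f}GB")
print(f"max reserved:  {torch.cuda.max_memory_reserved() / 1024**3:.2f}GB")
```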
