Description
Tested versions
| Version | Model | Peak VRAM | Result |
|---|---|---|---|
| pyannote.audio 3.3.2 | speaker-diarization-3.1 | 1.59GB | No issue |
| pyannote.audio 4.0.3 | speaker-diarization-community-1 | 9.54GB | Bug reproduced |
| pyannote.audio 4.0.3 | speaker-diarization-3.1 | 9.54GB | Bug reproduced |
The VRAM spike occurs regardless of which model is used with 4.0.3.
System information
- OS: Ubuntu 22.04 LTS
- GPU: NVIDIA RTX A5000 (24GB VRAM)
- Python: 3.12
- PyTorch: 2.5.1+cu124
- CUDA: 12.x
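For reference, a quick way to capture the same environment details (standard `torch`/`platform` calls, shown as a convenience sketch):

```python
import platform

import torch

# Print the environment details listed above (Python, PyTorch, CUDA, GPU).
print(f"Python:  {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA:    {torch.version.cuda}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU:     {props.name} ({props.total_memory / 1024**3:.0f}GB VRAM)")
```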
Issue description
When running speaker diarization on a 72-minute audio file, pyannote.audio 4.0.3 uses 6x more VRAM than 3.3.2.
Step-by-step VRAM comparison
| Processing step | Peak VRAM (3.3.2 + diarization-3.1) | Peak VRAM (4.0.3 + community-1) |
|---|---|---|
| segmentation | 0.40GB | 0.43GB |
| embeddings | 0.05GB | 0.05GB |
| discrete_diarization | 1.59GB | 9.54GB |
The spike occurs during the discrete_diarization step (after clustering, during reconstruction).
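If it helps with triage, PyTorch's allocator history (a private but documented API, `torch.cuda.memory._record_memory_history`) can show exactly which call stack owns the spike. A minimal sketch, meant to wrap the pipeline call from the MRE below:

```python
import torch

# Start recording allocator events (private but documented PyTorch API).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the diarization pipeline here (see the MRE below) ...

# Dump a snapshot viewable at https://pytorch.org/memory_viz to see the
# call stack behind the ~8GB allocation, then stop recording.
torch.cuda.memory._dump_snapshot("diarization_vram.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```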
Expected behavior
VRAM usage should be comparable between versions for similar workloads.
Actual behavior
4.0.3 allocates ~8GB more VRAM during reconstruction, making it impractical on GPUs with less than 12GB of VRAM and for concurrent processing.
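To demonstrate the failure mode without a smaller card, the allocator can be capped to emulate one. A sketch using `torch.cuda.set_per_process_memory_fraction` (a stable PyTorch API; the 8/24 fraction is an assumption chosen to emulate an 8GB GPU on the 24GB A5000):

```python
import torch

# Cap this process at 8GB of the A5000's 24GB to emulate a smaller GPU.
# With 4.0.3, the discrete_diarization step (9.54GB peak) then raises
# torch.cuda.OutOfMemoryError, while 3.3.2 (1.59GB peak) completes normally.
torch.cuda.set_per_process_memory_fraction(8 / 24, device=0)
```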
Additional observations
- Both versions use identical:
  - Embedding model: pyannote/wespeaker-voxceleb-resnet34-LM
  - Clustering: AgglomerativeClustering
  - batch_size=32
- The issue persists whether audio is passed as a file path or a preloaded waveform
- In 4.0.3, exclusive_speaker_diarization is always computed even when legacy=True, potentially contributing to the overhead (see the back-of-envelope sketch after this list)
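For scale, a back-of-envelope comparison (pure arithmetic; the frame rate and speaker count below are illustrative assumptions, not pyannote internals) shows the spike is orders of magnitude larger than the final per-frame speaker matrix, which points at large intermediate allocations:

```python
# Back-of-envelope: the ~8GB peak delta expressed in float32 elements,
# versus the size of a plausible (num_frames, num_speakers) matrix.
delta_gib = 9.54 - 1.59                      # peak difference from the tables above
elements = delta_gib * 1024**3 / 4           # float32 = 4 bytes
print(f"{elements:.2e} float32 values")      # ~2.1e9

# Assumed, purely illustrative resolution: ~60 frames/s, 10 speakers, 72 min.
frames, speakers = 72 * 60 * 60, 10
print(f"{frames * speakers * 4 / 1024**3:.4f} GiB")  # ~0.0097 GiB
```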
Minimal reproduction example (MRE)
```python
import os

import torch
import torchaudio

from pyannote.audio import Pipeline

torch.cuda.set_device(0)
torch.cuda.empty_cache()


def log_mem(label):
    # Report the peak VRAM allocated since the last reset, in GB.
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{label}: peak={peak:.2f}GB")


# Load the pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token=os.environ.get("HF_TOKEN"),  # or use_auth_token for 3.3.2
)
pipeline.to(torch.device("cuda"))

# Load audio (use any 60+ minute audio file)
audio_file = "your_audio.wav"
waveform, sample_rate = torchaudio.load(audio_file)

# Track peak memory per pipeline step: log and reset whenever a new step starts
last_step = [None]


def hook(step_name, step_artefact, file, completed=None, total=None):
    if step_name != last_step[0]:
        log_mem(f"Step: {step_name}")
        last_step[0] = step_name
        torch.cuda.reset_peak_memory_stats()


torch.cuda.reset_peak_memory_stats()
result = pipeline({"waveform": waveform, "sample_rate": sample_rate}, hook=hook)

print(f"\nFinal peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f}GB")
```

Results

pyannote.audio 3.3.2:

```
Step: segmentation: peak=0.40GB
Step: embeddings: peak=0.05GB
Step: discrete_diarization: peak=1.59GB

Final peak VRAM: 0.04GB
```

pyannote.audio 4.0.3:

```
Step: segmentation: peak=0.43GB
Step: embeddings: peak=0.05GB
Step: discrete_diarization: peak=9.54GB

Final peak VRAM: 0.04GB
```

---

Impact

- Users with 8-12GB GPUs cannot run 4.0.3 on long audio files
- Concurrent processing is severely limited (only ~2 jobs vs ~6+ with 3.3.2 on a 24GB GPU)
- Forces users to stay on 3.3.2 (which requires pinning huggingface_hub<=0.23.5)