Description
Tested versions
| Version | Model | Peak VRAM | Result |
|---|---|---|---|
| pyannote.audio 3.3.2 | speaker-diarization-3.1 | 1.59GB | No issue |
| pyannote.audio 4.0.3 | speaker-diarization-community-1 | 9.54GB | Bug reproduced |
| pyannote.audio 4.0.3 | speaker-diarization-3.1 | 9.54GB | Bug reproduced |
The VRAM spike occurs regardless of which model is used with 4.0.3.
System information
- OS: Ubuntu 22.04 LTS
- GPU: NVIDIA RTX A5000 (24GB VRAM)
- Python: 3.12
- PyTorch: 2.5.1+cu124
- CUDA: 12.x
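For reference, a quick way to capture the same environment details (standard `torch`/`platform` calls, shown as a convenience sketch):

```python
import platform

import torch

# Print the environment details listed above (Python, PyTorch, CUDA, GPU).
print(f"Python:  {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA:    {torch.version.cuda}")
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU:     {props.name} ({props.total_memory / 1024**3:.0f}GB VRAM)")
```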
Issue description
When running speaker diarization on a 72-minute audio file, pyannote.audio 4.0.3 uses 6x more VRAM than 3.3.2.
Step-by-step VRAM comparison
| Processing step | Peak VRAM (3.3.2 + diarization-3.1) | Peak VRAM (4.0.3 + community-1) |
|---|---|---|
| segmentation | 0.40GB | 0.43GB |
| embeddings | 0.05GB | 0.05GB |
| discrete_diarization | 1.59GB | 9.54GB |
The spike occurs during the discrete_diarization step (after clustering, during reconstruction).
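If it helps with triage, PyTorch's allocator history (a private but documented API, `torch.cuda.memory._record_memory_history`) can show exactly which call stack owns the spike. A minimal sketch, meant to wrap the pipeline call from the MRE below:

```python
import torch

# Start recording allocator events (private but documented PyTorch API).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run the diarization pipeline here (see the MRE below) ...

# Dump a snapshot viewable at https://pytorch.org/memory_viz to see the
# call stack behind the ~8GB allocation, then stop recording.
torch.cuda.memory._dump_snapshot("diarization_vram.pickle")
torch.cuda.memory._record_memory_history(enabled=None)
```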
Expected behavior
VRAM usage should be comparable between versions for similar workloads.
Actual behavior
4.0.3 allocates ~8GB more VRAM during reconstruction, making it impractical on GPUs with less than 12GB of VRAM and for concurrent processing.
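To demonstrate the failure mode without a smaller card, the allocator can be capped to emulate one. A sketch using `torch.cuda.set_per_process_memory_fraction` (a stable PyTorch API; the 8/24 fraction is an assumption chosen to emulate an 8GB GPU on the 24GB A5000):

```python
import torch

# Cap this process at 8GB of the A5000's 24GB to emulate a smaller GPU.
# With 4.0.3, the discrete_diarization step (9.54GB peak) then raises
# torch.cuda.OutOfMemoryError, while 3.3.2 (1.59GB peak) completes normally.
torch.cuda.set_per_process_memory_fraction(8 / 24, device=0)
```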
Additional observations
- Both versions use identical:
  - Embedding model: pyannote/wespeaker-voxceleb-resnet34-LM
  - Clustering: AgglomerativeClustering
  - batch_size=32
- The issue persists whether audio is passed as a file path or a preloaded waveform
- In 4.0.3, exclusive_speaker_diarization is always computed even when legacy=True, potentially contributing to the overhead (see the back-of-envelope sketch after this list)
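For scale, a back-of-envelope comparison (pure arithmetic; the frame rate and speaker count below are illustrative assumptions, not pyannote internals) shows the spike is orders of magnitude larger than the final per-frame speaker matrix, which points at large intermediate allocations:

```python
# Back-of-envelope: the ~8GB peak delta expressed in float32 elements,
# versus the size of a plausible (num_frames, num_speakers) matrix.
delta_gib = 9.54 - 1.59                      # peak difference from the tables above
elements = delta_gib * 1024**3 / 4           # float32 = 4 bytes
print(f"{elements:.2e} float32 values")      # ~2.1e9

# Assumed, purely illustrative resolution: ~60 frames/s, 10 speakers, 72 min.
frames, speakers = 72 * 60 * 60, 10
print(f"{frames * speakers * 4 / 1024**3:.4f} GiB")  # ~0.0097 GiB
```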
Minimal reproduction example (MRE)
```python
import os

import torch
import torchaudio

from pyannote.audio import Pipeline

torch.cuda.set_device(0)
torch.cuda.empty_cache()


def log_mem(label):
    # Report the peak VRAM allocated since the last reset, in GB.
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{label}: peak={peak:.2f}GB")


# Load the pipeline
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    token=os.environ.get("HF_TOKEN"),  # or use_auth_token for 3.3.2
)
pipeline.to(torch.device("cuda"))

# Load audio (use any 60+ minute audio file)
audio_file = "your_audio.wav"
waveform, sample_rate = torchaudio.load(audio_file)

# Track peak memory per pipeline step: log and reset whenever a new step starts
last_step = [None]


def hook(step_name, step_artefact, file, completed=None, total=None):
    if step_name != last_step[0]:
        log_mem(f"Step: {step_name}")
        last_step[0] = step_name
        torch.cuda.reset_peak_memory_stats()


torch.cuda.reset_peak_memory_stats()
result = pipeline({"waveform": waveform, "sample_rate": sample_rate}, hook=hook)

print(f"\nFinal peak VRAM: {torch.cuda.max_memory_allocated() / 1024**3:.2f}GB")
```

Results

pyannote.audio 3.3.2:

```
Step: segmentation: peak=0.40GB
Step: embeddings: peak=0.05GB
Step: discrete_diarization: peak=1.59GB

Final peak VRAM: 0.04GB
```

pyannote.audio 4.0.3:

```
Step: segmentation: peak=0.43GB
Step: embeddings: peak=0.05GB
Step: discrete_diarization: peak=9.54GB

Final peak VRAM: 0.04GB
```

---

Impact

- Users with 8-12GB GPUs cannot run 4.0.3 on long audio files
- Concurrent processing is severely limited (only ~2 jobs vs ~6+ with 3.3.2 on a 24GB GPU)
- Forces users to stay on 3.3.2 (which requires pinning huggingface_hub<=0.23.5)