Tested versions
Reproducible in: pyannote.audio 3.1.1, 3.2.0.
System information
OS: Ubuntu 22.04
Python: 3.10
pyannote.audio: 3.1.1
PyTorch: 2.x (CUDA enabled)
GPU: NVIDIA RTX 4070 Ti
CUDA: 12.x
Audio format: WAV, 16 kHz, mono
HuggingFace authentication: via HF_TOKEN
Issue description
I implemented a manual speaker diarization pipeline:
Audio → pyannote/segmentation → binarization → pyannote/embedding → AgglomerativeClustering → merge
Works well on:
Short audio (≤ 1 minute)
Fails on:
Long audio (> 1 minute, up to ~20 minutes, 2 speakers)
Observed issues on long audio:
Unstable speaker labels (frequent speaker flipping)
Severe over-segmentation
Global clustering becomes unreliable
The final diarization is not usable in production
Current setup
Segmentation: sliding window, duration=2.0, step=0.25
Binarization: onset=0.5, offset=0.5, min_duration_on=0.1
Embedding: Inference(window="whole")
Clustering: AgglomerativeClustering(n_clusters=2, metric="cosine")
Post-processing: merge same-speaker segments separated by gaps shorter than 0.5 s
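For concreteness, here is a minimal sketch of how this setup maps onto pyannote.audio / scikit-learn objects. The model names, the HF_TOKEN handling, and the linkage="average" choice are my assumptions (scikit-learn's AgglomerativeClustering does not accept metric="cosine" with its default ward linkage); the full end-to-end sketch is in the MRE section below.

```python
import os

from pyannote.audio import Model, Inference
from pyannote.audio.utils.signal import Binarize
from sklearn.cluster import AgglomerativeClustering

HF_TOKEN = os.environ["HF_TOKEN"]

# Segmentation: sliding 2.0 s window, 0.25 s step
segmentation = Inference(
    Model.from_pretrained("pyannote/segmentation", use_auth_token=HF_TOKEN),
    duration=2.0, step=0.25,
)

# Binarization of frame-level activations into active regions
binarize = Binarize(onset=0.5, offset=0.5, min_duration_on=0.1)

# One embedding per region
embed = Inference(
    Model.from_pretrained("pyannote/embedding", use_auth_token=HF_TOKEN),
    window="whole",
)

# Global clustering into 2 speakers; cosine requires a non-default linkage in scikit-learn
clustering = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
```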
Questions
Is global clustering on long conversations a known limitation?
Is chunking long audio recommended before segmentation/embedding?
Are there recommended hyperparameters for long 2-speaker dialogue audio?
Minimal reproduction example (MRE)
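The original snippet did not survive, so below is a minimal sketch of the pipeline as described above, not the exact original code. The model names, the audio file name, the 0.3 s minimum-duration guard, the linkage="average" choice, and the merge loop are assumptions on my part.

```python
import os

import numpy as np
from pyannote.audio import Model, Inference
from pyannote.audio.utils.signal import Binarize
from pyannote.core import Segment
from sklearn.cluster import AgglomerativeClustering

HF_TOKEN = os.environ["HF_TOKEN"]
AUDIO = "dialogue_16k_mono.wav"  # hypothetical 16 kHz mono WAV, ~20 min, 2 speakers

# 1. Frame-level speaker activations with a sliding 2.0 s window, 0.25 s step
seg_model = Model.from_pretrained("pyannote/segmentation", use_auth_token=HF_TOKEN)
activations = Inference(seg_model, duration=2.0, step=0.25)(AUDIO)

# 2. Binarize activations into active regions
binarize = Binarize(onset=0.5, offset=0.5, min_duration_on=0.1)
regions = binarize(activations)  # pyannote.core.Annotation

# 3. One embedding per region (window="whole"); skip very short regions
emb_model = Model.from_pretrained("pyannote/embedding", use_auth_token=HF_TOKEN)
embed = Inference(emb_model, window="whole")

segments, embeddings = [], []
for segment in regions.itersegments():
    if segment.duration < 0.3:  # guard (assumption): very short regions give unreliable embeddings
        continue
    segments.append(segment)
    embeddings.append(embed.crop(AUDIO, segment))
embeddings = np.vstack(embeddings)

# 4. Global agglomerative clustering into exactly 2 speakers
#    (cosine distance needs a non-default linkage in scikit-learn)
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"
).fit_predict(embeddings)

# 5. Merge same-speaker segments separated by gaps shorter than 0.5 s
merged = []
for segment, label in sorted(zip(segments, labels), key=lambda sl: sl[0].start):
    if merged and merged[-1][1] == label and segment.start - merged[-1][0].end < 0.5:
        previous, _ = merged.pop()
        segment = Segment(previous.start, segment.end)
    merged.append((segment, label))

for segment, label in merged:
    print(f"{segment.start:8.2f} {segment.end:8.2f}  SPEAKER_{label}")
```

On long audio, this script reproduces the behaviour described above: segments are plausible locally, but the global clustering flips speakers between nearby segments.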