Manual diarization pipeline (segmentation + embedding + clustering) unstable on long audio (>1 min) but works on short audio #1966

@JanaSaadawi

Description

Tested versions

Reproducible in pyannote.audio 3.1.1 and 3.2.0.

System information

OS: Ubuntu 22.04
Python: 3.10
pyannote.audio: 3.1.1
PyTorch: 2.x (CUDA enabled)
GPU: NVIDIA RTX 4070 Ti
CUDA: 12.x
Audio format: WAV, 16 kHz, mono
Hugging Face authentication: via HF_TOKEN

Issue description

I implemented a manual speaker diarization pipeline:

Audio → pyannote/segmentation → binarization → pyannote/embedding → AgglomerativeClustering → merge
Works well on: short audio (≤ 1 minute)

Fails on: long audio (> 1 minute, up to ~20 minutes, 2 speakers)

Observed issues on long audio:

Unstable speaker labels (frequent speaker flipping)

Severe over-segmentation

Global clustering becomes unreliable

The final diarization output is not usable in production

Current setup

Segmentation: sliding window, duration=2.0, step=0.25

Binarization: onset=0.5, offset=0.5, min_duration_on=0.1

Embedding: Inference(window="whole")

Clustering: AgglomerativeClustering(n_clusters=2, metric="cosine")

Post-merge gap: 0.5s
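
The setup above, sketched in code (model names and hyperparameters as listed; HF_TOKEN is read from the environment as noted under System information):

```python
import os

from sklearn.cluster import AgglomerativeClustering

from pyannote.audio import Inference, Model
from pyannote.audio.utils.signal import Binarize

HF_TOKEN = os.environ["HF_TOKEN"]

# Segmentation: 2.0 s sliding window, 0.25 s step
seg_model = Model.from_pretrained("pyannote/segmentation", use_auth_token=HF_TOKEN)
segmentation = Inference(seg_model, duration=2.0, step=0.25)

# Binarization of frame-level speaker activations into speech turns
binarize = Binarize(onset=0.5, offset=0.5, min_duration_on=0.1)

# One embedding per speech turn
emb_model = Model.from_pretrained("pyannote/embedding", use_auth_token=HF_TOKEN)
embed = Inference(emb_model, window="whole")

# Global clustering into 2 speakers
clustering = AgglomerativeClustering(n_clusters=2, metric="cosine", linkage="average")
```

Note that linkage="average" is set explicitly: scikit-learn's default linkage="ward" only supports Euclidean distances and raises an error with metric="cosine".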

Questions

Is global clustering on long conversations a known limitation?

Is chunking long audio recommended before segmentation/embedding?

Are there recommended hyperparameters for long 2-speaker dialogue audio?

Minimal reproduction example (MRE)

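A minimal sketch of the end-to-end run, reusing the components configured under Current setup above (the audio path is a placeholder for a 16 kHz mono WAV):

```python
import numpy as np

from pyannote.core import Annotation

AUDIO = "audio.wav"  # placeholder: 16 kHz mono WAV

# 1. Frame-level speaker activations, binarized into speech turns
activations = segmentation(AUDIO)  # SlidingWindowFeature
turns = binarize(activations)      # Annotation

# 2. One "whole-window" embedding per speech turn
segments = list(turns.itersegments())
X = np.vstack([embed.crop(AUDIO, segment) for segment in segments])

# 3. Global agglomerative clustering into 2 speakers
labels = clustering.fit_predict(X)

# 4. Relabel turns by cluster, then merge same-speaker turns
#    separated by less than 0.5 s
diarization = Annotation(uri=AUDIO)
for segment, label in zip(segments, labels):
    diarization[segment] = f"SPEAKER_{label:02d}"
diarization = diarization.support(collar=0.5)

print(diarization)
```

On clips under a minute this produces stable labels; on longer files the label flipping and over-segmentation described above appear.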
