Speaker diarization

The process of partitioning an audio recording by speaker — determining "who spoke when" — and labelling each segment (Speaker 1, Speaker 2, …).

Updated 2026-06-17

Speaker diarization answers the question “who spoke when?” A diarization model segments audio by speaker and assigns each segment a generic label. A transcription pipeline then attaches those labels to the text, so you get Speaker 1: … / Speaker 2: … rather than an undifferentiated wall of words.
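One common way to attach labels to text is to give each transcript segment the speaker whose turns overlap it the longest. The sketch below illustrates that idea under stated assumptions: the `Turn` and `Segment` types, the `label_segments` helper, and the sample timestamps are all hypothetical and do not come from any particular diarization or transcription library.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """A diarization result: one speaker talking from start to end (seconds)."""
    start: float
    end: float
    speaker: str


@dataclass
class Segment:
    """A transcription result: a piece of text with its time span (seconds)."""
    start: float
    end: float
    text: str


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(segments, turns, unknown="Unknown"):
    """Pair each segment with the speaker who overlaps it the longest."""
    labelled = []
    for seg in segments:
        totals = {}  # speaker -> total overlapping seconds for this segment
        for turn in turns:
            dur = overlap(seg.start, seg.end, turn.start, turn.end)
            if dur > 0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + dur
        # Fall back to a placeholder when no turn overlaps the segment.
        speaker = max(totals, key=totals.get) if totals else unknown
        labelled.append((speaker, seg.text))
    return labelled


def format_transcript(labelled):
    """Render (speaker, text) pairs as 'Speaker N: text' lines."""
    return "\n".join(f"{speaker}: {text}" for speaker, text in labelled)


# Hypothetical sample data, for illustration only.
turns = [Turn(0.0, 4.0, "Speaker 1"), Turn(4.0, 9.0, "Speaker 2")]
segments = [
    Segment(0.2, 3.8, "Shall we start?"),
    Segment(3.9, 8.5, "Yes, go ahead."),
]
print(format_transcript(label_segments(segments, turns)))
```

Assigning by longest overlap tolerates small boundary mismatches between the two models. A real pipeline might instead align at the word level, or split a segment that spans a speaker change.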

It is essential for multi-speaker recordings (meetings, interviews, calls), where knowing who said something is as important as what was said. Open-source toolkits such as pyannote.audio are commonly used.