ASR Lecture Log 18 - Speaker diarization

This is the first of two lectures on speaker recognition. This lecture concerns speaker diarization - the task of determining "who spoke when", in which a recording is split into segments, where each segment corresponds to the speech of a single speaker. Unlike the settings we have previously considered, speaker diarization assumes there are multiple speakers in a recording. A good description of a current approach to speaker diarization is the ICASSP 2017 paper by Garcia-Romero et al., "Speaker diarization using deep neural network embeddings".

Speaker diarization

Basic approach: Split the recording into short (~2 s) fixed-length segments, then perform speaker verification between all pairs of segments.
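
As a rough illustration of the basic approach, the sketch below cuts a recording into fixed-length segments and enumerates every segment pair that would be passed to a speaker verification scorer. The function name `fixed_length_segments` and the 7-second duration are assumptions made for this example; they do not come from the lecture.

```python
from itertools import combinations


def fixed_length_segments(duration, seg_len=2.0):
    """Return (start, end) times in seconds covering [0, duration].

    Hypothetical helper: the final segment is shorter when the
    duration is not a multiple of seg_len.
    """
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + seg_len, duration)
        segments.append((start, end))
        start = end
    return segments


# A 7-second recording cut into ~2 s segments.
segments = fixed_length_segments(7.0)

# Every unordered pair of segment indices; each pair would be scored
# by a speaker verification model (same speaker or not).
pairs = list(combinations(range(len(segments)), 2))

print(segments)
print(len(pairs), "pairs to score")
```

The number of pairs grows quadratically with the number of segments, which is one reason the segments are kept short but not arbitrarily short.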

Diarization error rate: Speaker diarization is measured by the diarization error rate (DER), which combines three types of error: missed speech, false alarm speech, and incorrect speaker labelling. Each error is measured as a duration and expressed as a fraction of the total time.
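
The DER computation can be sketched as a sum of three error durations divided by the total time. The function name and the example durations below are illustrative assumptions; in common scoring practice the denominator is the total reference speech time, so `total_time` should be read in that sense if it matches the evaluation being used.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_time):
    """Combine the three error durations (seconds) into a single DER.

    total_time is the normalising duration (assumed here to be supplied
    by the caller, e.g. the total scored speech time).
    """
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return (missed + false_alarm + speaker_error) / total_time


# Made-up durations: 12 s missed speech, 6 s false alarm speech,
# 18 s attributed to the wrong speaker, out of 600 s in total.
der = diarization_error_rate(12.0, 6.0, 18.0, 600.0)
print(f"DER = {der:.1%}")
```

Because false alarm time is counted in the numerator, DER can in principle exceed 100% on recordings with very little speech.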

Diarization framework: Perform diarization by running the following pipeline: (1) split the recording into fixed-length segments; (2) speech activity detection (is a segment speech or non-speech?); (3) represent each segment by an embedding (x-vector or i-vector); (4) compare pairs of segments using a scoring function (e.g. PLDA); (5) hierarchical clustering to merge segments judged to be the same speaker.
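
Steps (4) and (5) of the pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the system described in the lecture: cosine similarity stands in for PLDA scoring, the two-dimensional vectors stand in for x-vectors or i-vectors, and the average-linkage rule and the 0.5 stopping threshold are arbitrary choices made for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity, used here as a simple stand-in for PLDA."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def agglomerative_cluster(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering of segment embeddings.

    Repeatedly merges the two most similar clusters and stops when the
    best available merge scores below the threshold. Returns one
    cluster (speaker) label per segment.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        scores = [cosine(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        best_score, (a, b) = max(
            (linkage(clusters[a], clusters[b]), (a, b))
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best_score < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings for five speech segments (two underlying "speakers").
embeddings = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0), (0.2, 0.9), (1.0, 0.0)]
labels = agglomerative_cluster(embeddings)
print(labels)
```

The stopping threshold determines how many speakers are found, so in practice it would be tuned on held-out data rather than fixed by hand.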

Speech activity detection (SAD): SAD is typically carried out using an LSTM or TDNN neural network trained on a large amount of diverse data.
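
To show the kind of output a SAD stage produces (one speech/non-speech decision per frame), the sketch below uses a simple energy threshold. This is deliberately not the neural approach described above: it is a swapped-in baseline chosen because it needs no trained model, and the function name, frame length, and -30 dB threshold are all assumptions for illustration.

```python
import math


def energy_sad(samples, frame_len=4, threshold_db=-30.0):
    """Label each non-overlapping frame as speech (True) or not (False).

    A frame counts as speech when its mean energy in dB exceeds the
    threshold. Trailing samples that do not fill a frame are ignored.
    """
    decisions = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        energy_db = 10.0 * math.log10(energy + 1e-12)  # avoid log(0)
        decisions.append(energy_db > threshold_db)
    return decisions


# Toy signal: a quiet frame, a loud frame, then near silence.
signal = [0.001, -0.001, 0.002, -0.002,
          0.5, -0.4, 0.6, -0.5,
          0.0, 0.0, 0.001, 0.0]
decisions = energy_sad(signal)
print(decisions)
```

An energy threshold of this kind breaks down in noisy or music-heavy recordings, which is the motivation for training a neural SAD model on diverse data instead.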

DIHARD challenge: Speaker diarization research has been very domain-dependent in the past. DIHARD is an initiative to measure speaker diarization performance using a diverse set of data sets.
