ASR 2018-19
Lecture 17 - Speaker verification
This is the first of two lectures on speaker recognition. Speaker recognition includes a number of different tasks: speaker verification (determine if a test speaker matches a specific target speaker), speaker identification (determine which of a set of enrolled speakers a test speaker matches), and speaker diarization (determine "who spoke when" in a recording). This lecture concerns speaker verification; the next lecture concerns speaker diarization.
Hansen and Hasan (2015) provide a tutorial overview on Speaker Recognition by Machines and Humans; for i-vectors, Dehak et al (2011) is probably still the best reference (Front-End Factor Analysis for Speaker Verification); for neural network x-vector approaches see Snyder et al (2018), X-Vectors: Robust DNN Embeddings for Speaker Recognition.
Speaker verification
- Training:
Use enrollment data (set of target speakers) and background data (representing all possible speakers) to estimate the models - typically a background model, and a model for each target speaker
- Testing:
Score a test speaker against the named target speaker, and issue an accept or reject decision.
- Two types of error:
Need to take into account false accept (false alarm) and false reject (miss) errors. Setting a decision threshold changes the balance between the two types of error.
- EER:
Can represent error at a specific decision threshold - e.g. equal error rate (EER) is the error when the threshold is set such that the two types of error are equal.
- DET curve:
Can also represent error as a curve plotting false alarm (false accept) probability against miss (false reject) probability - e.g. the DET curve.
- Detection cost function:
Combine the two types of error, weighting the importance of each (task-dependent) and taking into account the prior probability of the target speaker.
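As a concrete illustration, the EER can be computed from a set of target and nontarget trial scores by sweeping the decision threshold over the pooled scores; a minimal numpy sketch (the function name is illustrative, not from any toolkit):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Sweep the decision threshold and return the error rate at the point
    where P(miss) and P(false alarm) are (approximately) equal."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]          # labels in score order
    n_tar = len(target_scores)
    n_non = len(nontarget_scores)
    # Setting the threshold at the i-th sorted score rejects trials 0..i:
    miss = np.cumsum(labels) / n_tar             # targets rejected so far
    fa = 1.0 - np.cumsum(1 - labels) / n_non     # nontargets still accepted
    i = np.argmin(np.abs(miss - fa))
    return float((miss[i] + fa[i]) / 2)
```

With perfectly separated scores the EER is zero; as the target and nontarget score distributions overlap, the EER rises towards chance.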
GMM-based speaker verification
- UBM:
Universal background model (UBM) is a large GMM (typically about 2000 components) trained on a large set of speakers
- MAP:
Train a speaker-specific model using MAP adaptation starting from the UBM.
- Log-likelihood ratio:
Make a decision by computing the log likelihood ratio between the speaker-specific model and the UBM, using a threshold to determine acceptance.
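The accept/reject decision can be sketched as follows, using small diagonal-covariance GMMs specified directly by their parameters (a toy illustration; in practice the speaker model is a MAP-adapted UBM with many components):

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X (frames, dim) under a
    diagonal-covariance GMM with the given component parameters."""
    diff = X[:, None, :] - means[None, :, :]                    # (T, C, D)
    comp_ll = -0.5 * np.sum(diff ** 2 / variances
                            + np.log(2 * np.pi * variances), axis=2)
    comp_ll += np.log(weights)                                  # (T, C)
    m = comp_ll.max(axis=1, keepdims=True)                      # log-sum-exp
    frame_ll = m[:, 0] + np.log(np.exp(comp_ll - m).sum(axis=1))
    return frame_ll.mean()

def verify(X, speaker_gmm, ubm, threshold=0.0):
    """Accept if the average log-likelihood ratio exceeds the threshold."""
    llr = diag_gmm_loglik(X, *speaker_gmm) - diag_gmm_loglik(X, *ubm)
    return llr > threshold
```

Frames drawn near the speaker model's means give a positive log-likelihood ratio (accept); frames near the UBM but far from the speaker model give a negative one (reject).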
i-Vectors
- GMM supervector:
Represent a speaker using a (high-dimensional) supervector that concatenates the mean vectors of the GMM for that speaker.
- i-vector:
Decompose the supervector for an utterance into the sum of the UBM supervector and the product of a low-rank matrix T and the i-vector w (dimension about 400). Estimate the T matrix using the EM algorithm; then estimate the most probable value of the i-vector for that utterance.
- Scoring using i-vectors:
Treat the i-vector as a speaker embedding. Score the target i-vector against the test i-vector to perform verification, using cosine distance or (better) the probabilistic linear discriminant analysis (PLDA) approach.
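Cosine scoring of i-vectors is simply the normalised dot product between the target and test embeddings, thresholded to make the accept/reject decision (a minimal sketch; PLDA scoring is more involved):

```python
import numpy as np

def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors: 1 for identical direction,
    0 for orthogonal vectors. Accept if above a tuned threshold."""
    return float(np.dot(w_target, w_test)
                 / (np.linalg.norm(w_target) * np.linalg.norm(w_test)))
```

Because the score depends only on direction, cosine scoring is invariant to the length of the i-vectors, which is one reason length normalisation pairs naturally with it.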
Neural network approaches
- d-vector:
Train a DNN to recognise speakers, and extract a speaker embedding from the last hidden layer by averaging its activations across all frames of the utterance.
- x-vector:
Extract a similar embedding to the d-vector, but (1) based on a TDNN architecture, and (2) computing the embedding across frames using a statistics pooling layer.
- x-vector and i-vector embeddings are the current state of the art.
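The pooling step that distinguishes the x-vector from the d-vector can be sketched as follows (a numpy illustration of the pooling operations only, not a full TDNN):

```python
import numpy as np

def statistics_pooling(H):
    """x-vector style pooling: map frame-level activations H (frames, dim)
    to a fixed-length utterance-level vector by concatenating the
    per-dimension mean and standard deviation -> shape (2 * dim,)."""
    return np.concatenate([H.mean(axis=0), H.std(axis=0)])

def average_pooling(H):
    """d-vector style pooling: just the per-dimension mean -> shape (dim,)."""
    return H.mean(axis=0)
```

Both map a variable number of frames to a fixed-length vector; statistics pooling additionally captures the spread of the activations, not only their average.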
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
licence.txt
This page maintained by Steve Renals.
Last updated: 2019/04/26 09:27:10UTC