ASR 2018-19
Lecture 17 - Speaker verification
This is the first of two lectures on speaker recognition. Speaker recognition includes a number of different tasks: speaker verification (determine if a test speaker matches a specific target speaker), speaker identification (determine which of a set of enrolled speakers a test speaker matches), and speaker diarization (determine "who spoke when" in a recording). This lecture concerns speaker verification; the next lecture concerns speaker diarization.
Hansen and Hasan (2015) provide a tutorial overview on Speaker Recognition by Machines and Humans; for i-vectors, Dehak et al (2011) is probably still the best reference (Front-End Factor Analysis for Speaker Verification); for neural network x-vector approaches see Snyder et al (2018), X-Vectors: Robust DNN Embeddings for Speaker Recognition.
Speaker verification
- Training:
Use enrollment data (set of target speakers) and background data (representing all possible speakers) to estimate the models - typically a background model, and a model for each target speaker
- Testing:
Score a test speaker against the named target speaker, and issue an accept or reject decision.
- Two types of error:
Need to take into account false accept (false alarm) and false reject (miss) errors. Setting a decision threshold changes the balance between the two types of error.
- EER:
Can represent error at a specific decision threshold - e.g. equal error rate (EER) is the error when the threshold is set such that the two types of error are equal.
- DET curve:
Can also represent error as a curve plotting false alarm (false accept) probability against miss (false reject) probability - e.g. the DET curve.
- Detection cost function:
Combine the two types of error, weighting the importance of each (task-dependent) and taking into account the prior probability of the target speaker.
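As a concrete illustration, the EER can be computed from a set of target and nontarget trial scores by sweeping the decision threshold over the pooled scores; a minimal numpy sketch (the function name is illustrative, not from any toolkit):

```python
import numpy as np

def compute_eer(target_scores, nontarget_scores):
    """Sweep the decision threshold and return the error rate at the point
    where P(miss) and P(false alarm) are (approximately) equal."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)),
                             np.zeros(len(nontarget_scores))])
    labels = labels[np.argsort(scores)]          # labels in score order
    n_tar = len(target_scores)
    n_non = len(nontarget_scores)
    # Setting the threshold at the i-th sorted score rejects trials 0..i:
    miss = np.cumsum(labels) / n_tar             # targets rejected so far
    fa = 1.0 - np.cumsum(1 - labels) / n_non     # nontargets still accepted
    i = np.argmin(np.abs(miss - fa))
    return float((miss[i] + fa[i]) / 2)
```

With perfectly separated scores the EER is zero; as the target and nontarget score distributions overlap, the EER rises towards chance.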
GMM-based speaker verification
- UBM:
Universal background model (UBM) is a large GMM (typically about 2000 components) trained on a large set of speakers
- MAP:
Train a speaker-specific model using MAP adaptation starting from the UBM.
- Log-likelihood ratio:
Make a decision by computing the log likelihood ratio between the speaker-specific model and the UBM, using a threshold to determine acceptance.
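The accept/reject decision can be sketched as follows, using small diagonal-covariance GMMs specified directly by their parameters (a toy illustration; in practice the speaker model is a MAP-adapted UBM with many components):

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of frames X (frames, dim) under a
    diagonal-covariance GMM with the given component parameters."""
    diff = X[:, None, :] - means[None, :, :]                    # (T, C, D)
    comp_ll = -0.5 * np.sum(diff ** 2 / variances
                            + np.log(2 * np.pi * variances), axis=2)
    comp_ll += np.log(weights)                                  # (T, C)
    m = comp_ll.max(axis=1, keepdims=True)                      # log-sum-exp
    frame_ll = m[:, 0] + np.log(np.exp(comp_ll - m).sum(axis=1))
    return frame_ll.mean()

def verify(X, speaker_gmm, ubm, threshold=0.0):
    """Accept if the average log-likelihood ratio exceeds the threshold."""
    llr = diag_gmm_loglik(X, *speaker_gmm) - diag_gmm_loglik(X, *ubm)
    return llr > threshold
```

Frames drawn near the speaker model's means give a positive log-likelihood ratio (accept); frames near the UBM but far from the speaker model give a negative one (reject).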
i-Vectors
- GMM supervector:
Represent a speaker using a (high-dimensional) supervector that concatenates the mean vectors of the GMM for that speaker.
- i-vector:
Decompose the supervector for an utterance into the sum of the UBM supervector and the product of a low-rank matrix T and the i-vector w (dimension about 400). Estimate the T matrix using the EM algorithm; then estimate the most probable value of the i-vector for that utterance.
- Scoring using i-vectors:
Treat the i-vector as a speaker embedding. Score the target i-vector against the test i-vector to perform verification, using cosine distance or (better) the probabilistic linear discriminant analysis (PLDA) approach.
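Cosine scoring of i-vectors is simply the normalised dot product between the target and test embeddings, thresholded to make the accept/reject decision (a minimal sketch; PLDA scoring is more involved):

```python
import numpy as np

def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors: 1 for identical direction,
    0 for orthogonal vectors. Accept if above a tuned threshold."""
    return float(np.dot(w_target, w_test)
                 / (np.linalg.norm(w_target) * np.linalg.norm(w_test)))
```

Because the score depends only on direction, cosine scoring is invariant to the length of the i-vectors, which is one reason length normalisation pairs naturally with it.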
Neural network approaches
- d-vector:
Train a DNN to recognise speakers, and extract a speaker embedding from the last hidden layer by averaging its activations across all frames of the utterance.
- x-vector:
Extract a similar embedding to the d-vector, but (1) based on a TDNN architecture, and (2) computing the embedding across frames using a statistics pooling layer.
- x-vector and i-vector embeddings are the current state of the art.
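The pooling step that distinguishes the x-vector from the d-vector can be sketched as follows (a numpy illustration of the pooling operations only, not a full TDNN):

```python
import numpy as np

def statistics_pooling(H):
    """x-vector style pooling: map frame-level activations H (frames, dim)
    to a fixed-length utterance-level vector by concatenating the
    per-dimension mean and standard deviation -> shape (2 * dim,)."""
    return np.concatenate([H.mean(axis=0), H.std(axis=0)])

def average_pooling(H):
    """d-vector style pooling: just the per-dimension mean -> shape (dim,)."""
    return H.mean(axis=0)
```

Both map a variable number of frames to a fixed-length vector; statistics pooling additionally captures the spread of the activations, not only their average.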
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
licence.txt
This page maintained by Steve Renals.
Last updated: 2019/04/26 09:27:10UTC