ASR 2018-19
Lecture 13 - Speaker adaptation
The aim of speaker adaptation is to use a small amount of speaker-specific data to adapt a speaker-independent speech recognition system so that it is tuned to the speech of a specific speaker. One can imagine adaptation at different levels: language model adaptation, pronunciation model adaptation, and acoustic model adaptation. Although pronunciation model adaptation is a compelling idea, and has been explored, no method has (yet?) had consistent success. Language model adaptation has focused more on the domain of use and the topic, and has had some success. This lecture focuses on acoustic model adaptation: there has been a lot of work in this area, and several successful, heavily used techniques.
Woodland's review article from 2001 still gives good coverage of adaptation for HMM/GMM systems. Swietojanski et al's LHUC paper is a good account of adapting DNN-based systems; see also Saon et al's paper on i-vector based adaptation.
- An ideal speaker adaptation approach would be compact (limited set of parameters adapted), unsupervised (no need for a human transcription of the adaptation data), efficient (low compute requirement), flexible (not specific to a particular model).
- Approaches:
Three approaches to adaptation:
- Model-based: adapt the model's parameters to better match the speaker-specific data (eg MLLR, LHUC)
- Speaker normalisation: adapt the acoustic data to reduce the mismatch (eg cMLLR)
- Speaker codes: model the speaker space or learn specific codes for the speaker which affect the model (eg iVector)
- MLLR:
MLLR (maximum likelihood linear regression) is a model-based technique for GMM-based systems. The model parameters are adapted indirectly, by learning linear transforms of the mean and covariance parameters. MLLR has been demonstrated to work consistently well, down to just tens of seconds of adaptation data.
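The mean update can be sketched as follows: each Gaussian mean mu is replaced by W xi, where xi = [1, mu] is the extended mean vector and W = [b ; A] is a regression matrix estimated by maximising the likelihood of the adaptation data. This is a minimal numpy illustration of applying an already-estimated transform (the function name and toy values are illustrative, not from the lecture):

```python
import numpy as np

def mllr_adapt_means(means, W):
    """Apply an MLLR transform W (d x (d+1)) to an array of GMM means (n x d).

    Each mean mu becomes W @ [1, mu], i.e. A @ mu + b, where W = [b ; A].
    Estimating W itself (via the EM accumulators) is not shown here.
    """
    n, d = means.shape
    xi = np.hstack([np.ones((n, 1)), means])  # extended mean vectors [1, mu]
    return xi @ W.T                           # adapted means: A mu + b

# Toy example: a slight rescaling plus a bias shift, applied to 4 means.
d = 3
A = 1.1 * np.eye(d)
b = np.full(d, 0.5)
W = np.hstack([b[:, None], A])
means = np.zeros((4, d))
adapted = mllr_adapt_means(means, W)  # zero means map to the bias b
```

In practice the Gaussians are grouped into regression classes, with one transform per class, so the amount of adaptation data controls how many transforms can be robustly estimated.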
- cMLLR:
Constrained MLLR (cMLLR) is a version of MLLR in which the same linear transform is used for the mean and the covariance. This is very interesting because it also corresponds to a linear transform of the features. This means that, given a GMM system, a cMLLR adaptation transform can be estimated and used as a feature space adaptation transform (speaker normalisation) for any acoustic modelling approach (e.g. a neural network).
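Because the constrained transform acts in feature space, applying it is just an affine map of each acoustic frame, independent of the downstream model. A minimal sketch (assumed function name; the transform A, b would come from a GMM-based estimation step such as Kaldi's fMLLR):

```python
import numpy as np

def cmllr_transform_features(X, A, b):
    """Apply a constrained MLLR (feature-space) transform to acoustic frames.

    X : (T, d) array of feature frames; each frame x becomes A @ x + b.
    The transformed frames can be fed to any acoustic model (GMM or NN).
    """
    return X @ A.T + b

# Toy example: the identity transform leaves the features unchanged.
rng = np.random.default_rng(0)
X = rng.standard_normal((10, 3))   # 10 frames of 3-dimensional features
A = np.eye(3)
b = np.zeros(3)
Xp = cmllr_transform_features(X, A, b)
```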
- SAT:
In speaker adaptive training (SAT), adaptation transforms are computed at training time as well as at test time. This has the advantage of consistency, since an adaptation transform is learned for every speaker seen by the system.
- LHUC:
LHUC (learning hidden unit contributions) is a model-based transform used with neural network acoustic models. Rather than adapting the many weights of a neural network, a scale parameter is defined for each hidden unit, and it is these parameters that are learned (by gradient descent) for each speaker. This is the most effective model-based adaptation for NNs.
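The per-unit scaling can be sketched as follows: each hidden activation h_k is multiplied by an amplitude a(r_k), with the speaker-dependent parameters r learned by gradient descent while the network weights stay fixed. The 2*sigmoid re-parameterisation (used in the LHUC paper to keep amplitudes in (0, 2)) is shown here as a minimal numpy sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc_scale(h, r):
    """Scale each hidden unit activation h_k by a(r_k) = 2 * sigmoid(r_k).

    h : hidden-layer activations for one frame (or a batch).
    r : per-speaker LHUC parameters, one per hidden unit, learned by
        gradient descent; the network weights themselves are untouched.
    """
    return 2.0 * sigmoid(r) * h

# Toy example: r = 0 gives amplitude 1, i.e. the unadapted network.
h = np.ones(5)
r = np.zeros(5)
out = lhuc_scale(h, r)
```

Since only one parameter per hidden unit is learned, the adaptation is compact and can be estimated from very little speaker data.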
- i-vectors:
With i-vectors, an i-vector speaker code is estimated for each speaker, and these codes are appended to the input units (and optionally the hidden units).
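The network side of this is simple: the same fixed-dimensional speaker code is concatenated to every acoustic frame before it enters the network. A minimal sketch (assumed function name; estimating the i-vector itself, from a GMM universal background model, is not shown):

```python
import numpy as np

def append_ivector(features, ivector):
    """Concatenate a speaker i-vector to every frame of acoustic features.

    features : (T, d) array of frames for one utterance/speaker.
    ivector  : (k,) speaker code, identical for all frames of that speaker.
    Returns a (T, d + k) array used as the NN input.
    """
    T = features.shape[0]
    tiled = np.tile(ivector, (T, 1))   # repeat the code for every frame
    return np.hstack([features, tiled])

# Toy example: 100 frames of 40-dim features plus a 10-dim speaker code.
feats = np.zeros((100, 40))
ivec = np.ones(10)
augmented = append_ivector(feats, ivec)
```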
- Multiple approaches:
cMLLR, LHUC, i-vector auxiliary features are complementary and can be used together.
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
licence.txt
This page maintained by Steve Renals.
Last updated: 2019/04/23 17:41:15UTC