Lecture 14 - Sequence discriminative training
This lecture discussed sequence discriminative training for both GMM- and NN-based systems. In sequence discriminative training the objective function is discriminative (adjust the model to increase the probability of the correct sequence and decrease the probability of competing sequences) and operates at the sequence level (discriminating between sequences rather than between frames).
- Maximum likelihood training adjusts the parameters to maximise the likelihood of the correct sequence; discriminative training instead maximises a ratio of the probability of the correct sequence to the probability of competing sequences.
- One such sequence discriminative approach is maximum mutual information (MMI) estimation, whose objective is expressed as a numerator part (clamped to the reference word sequence) divided by a denominator part (free, summing over all possible word sequences).
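A standard formulation of these two objectives (notation assumed here: acoustic observations O_u for utterance u, composite HMM M_w for word sequence w, reference w_u, model parameters lambda, language model P(w)):

```latex
% ML: maximise the log-likelihood of the observations given the
% composite HMM for the reference word sequence
\mathcal{F}_{\mathrm{ML}}(\lambda) = \sum_u \log p_\lambda(O_u \mid M_{w_u})

% MMI: numerator clamped to the reference, denominator free over all
% word sequences w', weighted by the language model P(w')
\mathcal{F}_{\mathrm{MMI}}(\lambda) = \sum_u \log
  \frac{p_\lambda(O_u \mid M_{w_u})\, P(w_u)}
       {\sum_{w'} p_\lambda(O_u \mid M_{w'})\, P(w')}
```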
- To make MMI training practical, the denominator sum over all competing word sequences is approximated using a lattice of likely hypotheses.
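The denominator statistics come from a forward-backward pass over the lattice arcs. A minimal sketch, assuming a toy hand-built lattice (the node numbering, word labels, and scores below are all hypothetical) with arcs carrying combined acoustic + language model log scores:

```python
import math
from collections import defaultdict

# Hypothetical toy lattice: nodes 0..3; each arc is
# (start_node, end_node, word, combined log score), topologically sorted.
arcs = [
    (0, 1, "a", math.log(0.6)),
    (0, 1, "b", math.log(0.4)),
    (1, 2, "c", math.log(0.7)),
    (1, 2, "d", math.log(0.3)),
    (2, 3, "e", math.log(1.0)),
]

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def arc_posteriors(arcs, start=0, end=3):
    # Forward: alpha[n] = log total score of all paths from start to n
    alpha = defaultdict(lambda: float("-inf"))
    alpha[start] = 0.0
    for s, e, _, w in arcs:
        alpha[e] = logsumexp([alpha[e], alpha[s] + w])
    # Backward: beta[n] = log total score of all paths from n to end
    beta = defaultdict(lambda: float("-inf"))
    beta[end] = 0.0
    for s, e, _, w in reversed(arcs):
        beta[s] = logsumexp([beta[s], w + beta[e]])
    total = alpha[end]
    # Arc posterior = share of total path mass passing through that arc
    return [math.exp(alpha[s] + w + beta[e] - total) for s, e, _, w in arcs]

posteriors = arc_posteriors(arcs)
```

These arc posteriors are the denominator occupancies used to accumulate the MMI statistics; the numerator statistics come from the same computation on the lattice clamped to the reference.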
- An alternative objective function, closer to the evaluation metric we actually care about, is the minimum phone error (MPE) criterion, which explicitly weights hypotheses by a phone accuracy term.
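In a standard formulation (notation assumed as above, with acoustic scale kappa), the MPE criterion is the expected phone accuracy of the hypotheses under the model:

```latex
% A(w, w_u) is the raw phone accuracy of hypothesis w against the
% reference w_u; maximising F_MPE minimises the expected phone error
\mathcal{F}_{\mathrm{MPE}}(\lambda) = \sum_u
  \frac{\sum_{w} p_\lambda(O_u \mid M_{w})^{\kappa}\, P(w)\, A(w, w_u)}
       {\sum_{w'} p_\lambda(O_u \mid M_{w'})^{\kappa}\, P(w')}
```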
- NN acoustic models are discriminative, but at the frame level (trained with a cross-entropy objective). Sequence discriminative training can also be applied to NN acoustic models: use a CE-trained model to generate the alignments and lattices for sequence training and to initialise the weights, then train by back-propagation with a sequence training objective function (e.g. MMI).
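For MMI, the error signal back-propagated into the network is the difference between numerator (clamped) and denominator (free) state occupancies. A toy sketch, with assumed shapes (T frames, S states) and made-up occupancies standing in for real forward-backward output:

```python
import numpy as np

T, S = 4, 3
rng = np.random.default_rng(0)

# Numerator occupancies: one-hot per frame, from the forced alignment
# (this alignment is invented for illustration)
align = np.array([0, 1, 1, 2])
gamma_num = np.eye(S)[align]

# Denominator occupancies: soft posteriors over competing states,
# as would come from lattice forward-backward (rows sum to 1)
logits = rng.normal(size=(T, S))
gamma_den = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# MMI gradient w.r.t. the NN's per-frame log acoustic scores:
# the difference of occupancies, back-propagated through the network
error_signal = gamma_num - gamma_den
```

Because both occupancy matrices sum to one per frame, the error signal sums to zero per frame: the objective pushes probability mass from competing states towards the reference states.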
- Results on Switchboard compare ML- and discriminatively-trained GMM systems, and framewise- and sequence-trained NN systems: sequence discriminative training gives a 10-15% relative reduction in word error rate.
- Lattice-free MMI (LF-MMI) can be used for NN sequence training. It avoids the need to pre-compute lattices for the denominator, and removes the requirement to first train with a frame-based CE loss function before sequence training.
- LF-MMI applies forward-backward computations directly to the denominator graph, using WFST computations; several approximations are applied for efficiency. It is currently the state-of-the-art training approach for NN acoustic models.
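The key point is that the denominator is evaluated by a full-sum forward pass over a compact graph rather than over a pre-computed lattice. A minimal sketch, assuming a toy denominator graph represented as a dense row-stochastic transition matrix and per-frame NN state probabilities (all values below are randomly generated for illustration):

```python
import numpy as np

S, T = 3, 5  # states in the toy denominator graph, frames
rng = np.random.default_rng(1)

log_init = np.log(np.full(S, 1.0 / S))               # uniform start distribution
log_trans = np.log(rng.dirichlet(np.ones(S), size=S))  # row-stochastic transitions
log_obs = np.log(rng.dirichlet(np.ones(S), size=T))    # per-frame state log-probs

def logsumexp(x, axis):
    # Numerically stable log-sum-exp along the given axis
    m = x.max(axis=axis, keepdims=True)
    return (m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))).squeeze(axis)

# Forward algorithm in log space: sums over ALL state sequences,
# which is the full-sum denominator computation LF-MMI relies on
log_alpha = log_init + log_obs[0]
for t in range(1, T):
    log_alpha = logsumexp(log_alpha[:, None] + log_trans, axis=0) + log_obs[t]
den_loglik = logsumexp(log_alpha, axis=0)
```

In the real implementation the graph is a phone-level n-gram LM compiled with WFST operations and the recursion runs on GPU over minibatches; this sketch only shows the underlying full-sum recursion.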
Copyright (c) University of Edinburgh 2015-2018
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
This page maintained by Steve Renals.
Last updated: 2018/04/30 21:13:34UTC