Lecture 10 - Neural Network Acoustic Models 4: LSTM acoustic models; Sequence discriminative training
Speech recognition is about modelling sequences: given a sequence of acoustic frames, what is the corresponding sequence of symbols? HMMs are a surprisingly strong sequence model, and have been at the heart of speech recognition since they were introduced in the 1970s. They were covered earlier in the course along with powerful algorithms such as Viterbi and EM; they are trained using a maximum likelihood criterion, which differs from the classification criterion we are primarily interested in - building the best model of the speech is merely a means to the end of classifying it as the correct sequence of symbols.
One of the powerful aspects of the neural network methods introduced in the three previous lectures is that they enable discriminative training using a softmax output layer and the cross-entropy error function. However, their sequence modelling is limited. Using a context window over the input, and in particular using a TDNN architecture, enables the local matching score to take account of a wide receptive field of acoustic context. However, these systems still use a frame-level loss function - in contrast to the sequence-level loss function used when training HMMs.
Probably the best paper on LSTM acoustic models is the one by Graves et al: Hybrid speech recognition with deep bidirectional LSTM. Sequence discriminative training for HMM/GMMs is covered in Young's handbook article. Sequence training for DNNs is well covered by Veselý et al, Sequence-discriminative training of deep neural networks.
In this lecture we explored two ways of better modelling sequences:
- Using LSTM recurrent neural networks (RNNs) to allow potentially infinite context to be incorporated
- Using a sequence-level discriminative loss function
RNNs and LSTM acoustic models
- RNNs:
An RNN is a network in which the hidden units at a particular time use recurrent connections so that they can take input from their value at the previous time step
- State units:
Recurrent hidden units can be thought of as state units which summarise the history of the sequence until the current time
- Unfolding in time:
A good way to view a recurrent network is to "unfold" the recurrent connections through time, thus representing the network as a deep feed-forward network. This unfolding is used for training by the back-propagation through time algorithm
- LSTM:
A vanilla RNN uses the same types of units as a feed-forward network (e.g., weighted sum followed by a ReLU transfer function). However, richer sequence modelling can be obtained using units with gates - weights that are dependent on the current input and the previous state.
- LSTM gates:
An LSTM unit has three gates (and an internal state):
- An input gate which controls how much input to the unit is written to its internal state
- A forget gate which controls how much of the previous internal state is written to the current internal state
- An output gate which controls how much of the unit's activation is broadcast to the rest of the network.
LSTM RNNs can be trained using back-propagation through time (see the first code sketch below).
- Bidirectional RNN:
Each hidden layer has two components: one with recurrent connections going forward in time, and one going backward in time. These recurrent state units can be combined to enable information from the entire sequence to be incorporated.
- Deep RNNs:
For better modelling, recurrent layers can be stacked, so that a recurrent layer takes its input from the previous recurrent layer and from its own state at the previous time step
- Acoustic modelling with RNNs:
Deep bidirectional LSTM acoustic models have achieved state-of-the-art performance on tasks such as Switchboard (see the second code sketch below).
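The following is a minimal NumPy sketch (not from the lecture; the weight names and initialisation are illustrative, and biases and peephole connections are omitted) of a single LSTM layer unrolled over the frames of an utterance, showing the input, forget and output gates acting on an internal cell state.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_forward(X, d_hidden, rng=np.random.default_rng(0)):
        """X: (T, d_in) acoustic frames; returns (T, d_hidden) hidden states."""
        T, d_in = X.shape
        # One weight matrix per gate (plus one for the candidate state), each
        # acting on the concatenation [current input, previous hidden state].
        W_i, W_f, W_o, W_c = [rng.standard_normal((d_hidden, d_in + d_hidden)) * 0.1
                              for _ in range(4)]
        h = np.zeros(d_hidden)   # hidden state broadcast to the rest of the network
        c = np.zeros(d_hidden)   # internal cell state
        outputs = []
        for t in range(T):       # unfolding in time: one copy of the layer per frame
            z = np.concatenate([X[t], h])
            i = sigmoid(W_i @ z)               # input gate: how much new input is written
            f = sigmoid(W_f @ z)               # forget gate: how much old state is kept
            o = sigmoid(W_o @ z)               # output gate: how much state is broadcast
            c = f * c + i * np.tanh(W_c @ z)   # update the internal state
            h = o * np.tanh(c)
            outputs.append(h)
        return np.stack(outputs)

Back-propagating errors through this unrolled loop is exactly back-propagation through time.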
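Below is a minimal sketch, assuming PyTorch, of a deep bidirectional LSTM acoustic model for a hybrid HMM/DNN system; the feature dimension, layer sizes and number of tied HMM states are illustrative rather than those of any particular published system.

    import torch
    import torch.nn as nn

    class BiLSTMAcousticModel(nn.Module):
        def __init__(self, n_feats=40, n_hidden=512, n_layers=4, n_states=6000):
            super().__init__()
            # Stacked (deep) bidirectional LSTM: each layer takes its input from
            # the layer below and has forward- and backward-in-time components.
            self.lstm = nn.LSTM(input_size=n_feats, hidden_size=n_hidden,
                                num_layers=n_layers, bidirectional=True,
                                batch_first=True)
            # Per-frame output layer over tied HMM states.
            self.output = nn.Linear(2 * n_hidden, n_states)

        def forward(self, feats):          # feats: (batch, T, n_feats)
            h, _ = self.lstm(feats)        # h: (batch, T, 2 * n_hidden)
            return self.output(h)          # per-frame logits over HMM states

    # Frame-level cross-entropy training against forced-alignment state targets:
    model = BiLSTMAcousticModel()
    feats = torch.randn(8, 300, 40)              # 8 utterances of 300 frames each
    targets = torch.randint(0, 6000, (8, 300))   # aligned tied-state labels
    logits = model(feats)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 6000), targets.reshape(-1))

Such a model is trained at the frame level here; the sequence discriminative objectives below replace this per-frame loss with a sequence-level one.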
Sequence discriminative training
- Maximum mutual information (MMI):
MMI can be used as a loss function for HMMs - optimising MMI corresponds to maximising the posterior probability of the word sequence given the acoustics. Compare with ML training which maximises the probability of the acoustics given the word sequence.
- MMI numerator and denominator:
Optimising MMI corresponds to maximising a numerator, which is the (clamped) likelihood of the data given the correct word sequence, while minimising a denominator, which is the (unclamped) likelihood of the data given all possible word sequences; the criterion is written out after this list
- Discriminative criteria:
The MMI loss function is optimised by making the correct sequence more likely while making the competing sequences less likely - thus MMI is sequence discriminative training
- Lattices:
To make the computations required for MMI tractable - in particular the denominator, which requires summing over all possible word sequences - lattices are used as an approximation
- Minimum phone error (MPE):
MPE adjusts the optimisation criterion so that it is more closely related to word error rate, by weighting each hypothesis by its phone accuracy with respect to the reference utterance.
- Sequence training of DNNs:
So far DNNs have been discriminatively trained, but only at the frame level. MMI-type objective functions can also be applied to DNN systems to enable sequence discriminative training (a sketch of the resulting error signal is given after this list).
- ASR results:
Sequence discriminative training, when applied to either HMM/GMM or HMM/DNN systems, results in about a 10% relative decrease in error.
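Written out explicitly (in the usual notation, e.g. that of Veselý et al), the two sequence-level criteria mentioned above are:

    F_MMI = sum_u log [ p(O_u | M(W_u))^k P(W_u) / sum_W p(O_u | M(W))^k P(W) ]

    F_MPE = sum_u [ sum_W p(O_u | M(W))^k P(W) A(W, W_u) ] / [ sum_W' p(O_u | M(W'))^k P(W') ]

where O_u is the acoustic sequence for utterance u, W_u its reference word sequence, M(W) the HMM for word sequence W, P(W) the language model probability, k the acoustic scaling factor, and A(W, W_u) the raw phone accuracy of hypothesis W against the reference. In F_MMI the numerator is the clamped likelihood of the reference and the denominator is the unclamped likelihood summed over all word sequences (approximated by a lattice); F_MPE is the expected phone accuracy under the model's posterior over word sequences.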
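As a concrete illustration of the last two points, the error signal back-propagated through the DNN under an MMI-type objective is, per frame and per state, proportional to the difference between the numerator and denominator state occupancies obtained by forward-backward over the reference and over the denominator lattice. A minimal sketch, assuming those occupancies have already been computed (the function name is a placeholder, not a real toolkit API):

    import numpy as np

    def mmi_error_signal(gamma_num, gamma_den, acoustic_scale=0.1):
        """Per-frame error signal at the DNN outputs for MMI sequence training.

        gamma_num, gamma_den: (T, n_states) state occupancies from forward-backward
        over the numerator (reference) graph and the denominator lattice.
        """
        # Frames and states where the lattice posterior disagrees with the
        # reference push the network towards the reference states and away
        # from the competing ones - sequence discriminative training.
        return acoustic_scale * (gamma_num - gamma_den)

This signal is back-propagated through the network exactly as a frame-level cross-entropy error would be, but it is derived from a sequence-level, lattice-based objective rather than from per-frame targets.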
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License