ASR 2018-19
Lecture 16 - End-to-end systems 2: Sequence-to-sequence models
In this lecture we reviewed the pros and cons of CTC-based systems and looked at two sequence-to-sequence models:
- RNN transducer
- Encoder-decoder architecture using attention
Probably the best reading on this is the Google paper on Listen, Attend and Spell by Chan et al, along with the Interspeech 2017 paper by Prabhavalkar et al, A comparison of sequence-to-sequence models for speech recognition.
CTC recap
- Another view of CTC:
View a CTC network as having three components: an Encoder (a bidirectional RNN which maps the input sequence to an embedding sequence), a Softmax (which computes per-frame label probabilities), and CTC (which sums over alignments to compute the probability of the subword sequence); a minimal sketch of this view follows this list.
- CTC Pros:
Preserves monotonic relationship between input and output; alignment-free (sums over possible alignments)
- CTC Cons:
Assumes outputs are independent - so requires an additional language model / pronunciation model to introduce dependencies; end-to-end training does not update the language and pronunciation models
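As a concrete illustration of this three-component view, here is a minimal sketch using PyTorch's built-in CTC loss. The model sizes, names, and data are illustrative assumptions, not the course's reference implementation:

```python
# Minimal sketch of the CTC view (encoder -> softmax -> CTC sum over alignments),
# using PyTorch. Shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class CTCModel(nn.Module):
    def __init__(self, num_feats=40, hidden=256, num_labels=30):  # label 0 = blank
        super().__init__()
        # Encoder: bidirectional RNN mapping input frames to embeddings
        self.encoder = nn.LSTM(num_feats, hidden, num_layers=3,
                               bidirectional=True, batch_first=True)
        # Softmax layer: per-frame label (log-)probabilities, including blank
        self.output = nn.Linear(2 * hidden, num_labels)

    def forward(self, x):                      # x: (batch, time, num_feats)
        h, _ = self.encoder(x)                 # (batch, time, 2*hidden)
        return self.output(h).log_softmax(-1)  # (batch, time, num_labels)

model = CTCModel()
ctc_loss = nn.CTCLoss(blank=0)                 # sums over all alignments

x = torch.randn(4, 200, 40)                    # 4 utterances, 200 frames each
targets = torch.randint(1, 30, (4, 25))        # subword label sequences (no blanks)
input_lengths = torch.full((4,), 200, dtype=torch.long)
target_lengths = torch.full((4,), 25, dtype=torch.long)

log_probs = model(x).transpose(0, 1)           # CTCLoss expects (time, batch, labels)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```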
RNN Transducer
- Architecture
An encoder and a prediction network are combined by a joint network, followed by a softmax and a CTC-style sum over alignments, as in a CTC network (a sketch of these components follows this list).
- Encoder:
RNN which maps input sequence to embedding sequence (as in CTC)
- Prediction network:
Recurrent network which takes the previous output subword label as input and predicts the next subword label. Acts as a language model over subwords.
- Joint network:
Feed-forward network which combines encoder and prediction network outputs
- Left-to-right:
If the encoder is unidirectional, then an RNN transducer can operate in an online, frame-synchronous way.
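To make the encoder / prediction network / joint network split concrete, here is a hedged sketch in PyTorch; layer sizes and names are illustrative assumptions, and the transducer loss (the sum over alignments) is omitted:

```python
# Sketch of the RNN transducer components (encoder, prediction network, joint
# network). Dimensions and names are illustrative assumptions, not a reference
# implementation; the transducer loss itself is omitted.
import torch
import torch.nn as nn

class Transducer(nn.Module):
    def __init__(self, num_feats=40, num_labels=30, hidden=256, joint=512):
        super().__init__()
        # Encoder: unidirectional RNN, so the model can run frame-synchronously
        self.encoder = nn.LSTM(num_feats, hidden, num_layers=3, batch_first=True)
        # Prediction network: RNN language model over previously emitted subwords
        self.embed = nn.Embedding(num_labels, hidden)
        self.predictor = nn.LSTM(hidden, hidden, batch_first=True)
        # Joint network: feed-forward combination of the two streams
        self.joint = nn.Sequential(
            nn.Linear(2 * hidden, joint), nn.Tanh(), nn.Linear(joint, num_labels))

    def forward(self, x, y):                         # x: acoustics, y: label history
        f, _ = self.encoder(x)                       # (batch, T, hidden)
        g, _ = self.predictor(self.embed(y))         # (batch, U, hidden)
        # Combine every (frame, label-position) pair: (batch, T, U, 2*hidden)
        z = torch.cat([f.unsqueeze(2).expand(-1, -1, g.size(1), -1),
                       g.unsqueeze(1).expand(-1, f.size(1), -1, -1)], dim=-1)
        return self.joint(z).log_softmax(-1)         # per-(t, u) label log-probs

model = Transducer()
logits = model(torch.randn(2, 100, 40), torch.randint(1, 30, (2, 20)))
print(logits.shape)                                  # torch.Size([2, 100, 20, 30])
```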
Attention-based encoder-decoder
- Architecture
An encoder maps the input to embeddings, a decoder produces a distribution over labels based on previously predicted labels and the embeddings, and an attention mechanism constructs a context vector for the decoder network from attention weights computed over all frames in the encoder output.
- Encoder:
Stacked RNN which maps the input sequence to an embedding sequence (as before). For efficiency a pyramid architecture is often used, in which each RNN layer takes the concatenation of pairs of consecutive hidden states from the previous layer, halving the time resolution. Three such layers reduce the resolution by a factor of 8 (see the pyramid layer sketch after this list).
- Decoder:
Generates the output subword sequence, using the previously generated output, the previous decoder hidden state, and the previous context vector to compute its hidden state. The context vector is computed by the attention mechanism.
- Attention mechanism:
An alignment vector is computed from the current decoder hidden state and the complete sequence of encoder hidden states. The alignment vector is used to weight the sequence of encoder hidden states to compute the context vector used by the decoder (see the attention sketch after this list).
- "Matching clocks":
The attention mechanism can be seen as matching the input "clock" (over acoustics) with the output "clock" (over subwords)
- Learning:
Model trained to maximise the log probability of correct sequences
- Decoding:
Simply decode the generated subword sequence. Given enough training data, additional language and pronunciation models are not required (see the Google results). Various other refinements are described in Google's ICASSP-2018 paper.
- Hybrid CTC/Attention:
Use CTC and attention jointly during training and recognition - regularises the system to favour monotonic alignments
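The pyramid reduction in the encoder can be illustrated with a short sketch of a single pyramidal layer (PyTorch; names and sizes are illustrative assumptions, with pairs of consecutive frames concatenated before the next RNN):

```python
# Sketch of one layer of a pyramidal (Listen-style) encoder: pairs of consecutive
# hidden states from the layer below are concatenated, halving the time resolution.
# Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class PyramidLSTMLayer(nn.Module):
    def __init__(self, input_size, hidden):
        super().__init__()
        # Input is a pair of concatenated frames from the previous layer
        self.rnn = nn.LSTM(2 * input_size, hidden, bidirectional=True,
                           batch_first=True)

    def forward(self, h):                    # h: (batch, T, input_size)
        b, t, d = h.shape
        if t % 2:                            # drop an odd trailing frame
            h = h[:, :-1]
        h = h.reshape(b, t // 2, 2 * d)      # concatenate consecutive pairs
        out, _ = self.rnn(h)                 # (batch, T//2, 2*hidden)
        return out

# Three stacked pyramid layers reduce the time resolution by a factor of 8.
layer = PyramidLSTMLayer(input_size=512, hidden=256)
print(layer(torch.randn(2, 200, 512)).shape)   # torch.Size([2, 100, 512])
```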
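The attention step itself can be sketched as follows. This uses simple dot-product scoring for brevity, whereas Listen, Attend and Spell uses a learned scoring function; all names and shapes are illustrative assumptions:

```python
# Sketch of the attention step: an alignment (attention weight) vector over all
# encoder hidden states is computed from the current decoder state, then used to
# weight the encoder states into a context vector. Shapes are illustrative.
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    # decoder_state:  (batch, hidden)       current decoder hidden state
    # encoder_states: (batch, T', hidden)   full (downsampled) encoder sequence
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)
    alignment = F.softmax(scores, dim=-1)   # (batch, T') attention weights
    context = torch.bmm(alignment.unsqueeze(1), encoder_states).squeeze(1)
    return context, alignment               # context: (batch, hidden)

context, alignment = attend(torch.randn(2, 256), torch.randn(2, 50, 256))
```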
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (licence.txt).
This page maintained by Steve Renals.
Last updated: 2019/04/24 17:50:47 UTC