ASR 2018-19
Lecture 15 - End-to-end systems 1: CTC
End-to-end systems are systems which learn to directly map an input sequence X to an output sequence Y. Sequence-trained HMM systems (using either a maximum likelihood or a discriminative objective function) are a kind of end-to-end system, involving a sequence-trained acoustic model and a language model. However, "end-to-end" usually refers to neural network approaches which map input sequences directly to output sequences, in the purest case without a separate language model or lexicon. There are two main approaches:
- CTC (Connectionist Temporal Classification)
- Encoder-decoder architectures using attention
In this lecture we looked at CTC-based systems; we will look at encoder-decoder systems in the next lecture. Probably the best reading on this is the paper on the Deep Speech system by Hannun et al., and for an in-depth tutorial on CTC, see the Distill article, also by Hannun.
CTC
- CTC mapping:
CTC may be used to train an RNN to map an input sequence to an output sequence (typically of a different length) without requiring frame-level alignment (matching each input frame to an output token)
- Alignments:
CTC is "alignment-free": rather than committing to a single alignment between input and output, it sums over all possible alignments.
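The idea of many alignments mapping to one output can be illustrated with the CTC collapsing function (a minimal sketch; the underscore as blank symbol and the function name are illustrative choices, not from the lecture): repeated symbols are merged, then blanks are removed.

```python
BLANK = "_"  # assumed blank symbol for illustration

def collapse(alignment):
    """Map a frame-level alignment to an output sequence:
    merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in alignment:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# Several different alignments collapse to the same output "cat":
print(collapse("__c_aa_t_"))   # -> cat
print(collapse("cc_a_ttt__"))  # -> cat
print(collapse("c_aa__t"))     # -> cat
```

Note that the blank also allows repeated output characters (e.g. "ll" in "hello") to be distinguished from a single repeated frame label: "l_l" collapses to "ll", while "lll" collapses to "l".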
- Training:
Training by gradient descent is possible because gradients can be back-propagated through the CTC loss.
- Blank symbol:
CTC introduces an additional blank output symbol, which allows an input frame to be not mapped to an output symbol (useful for silence, noise, ...)
- Loss function:
The CTC loss function is the negative log of the total probability of the output sequence, summed over all possible alignments. The sum over alignments is performed by dynamic programming, with a similar structure to the forward-backward and Viterbi algorithms in HMMs.
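The dynamic programming can be sketched as the standard CTC forward recursion over an extended target sequence with blanks interleaved between labels (a minimal single-sequence sketch, assuming the blank has index 0; a practical implementation would operate over batches):

```python
import math

BLANK = 0  # assumed index of the blank symbol

def ctc_forward_loss(log_probs, target):
    """CTC negative log-likelihood via the forward recursion.

    log_probs: list of T lists; log_probs[t][k] = log P(symbol k at frame t)
    target:    list of target label indices (no blanks)
    """
    # Extended target: blanks interleaved between labels and at both ends
    ext = [BLANK]
    for y in target:
        ext += [y, BLANK]
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logadd(a, b):  # log(exp(a) + exp(b)), numerically stable
        if a == NEG_INF: return b
        if b == NEG_INF: return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    # Initialisation: start in the first blank or the first label
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                       # stay in the same state
            if s >= 1:
                a = logadd(a, alpha[s - 1])    # advance one state
            # Skip a blank, unless that would merge two identical labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or the final blank
    total = alpha[S - 1]
    if S > 1:
        total = logadd(total, alpha[S - 2])
    return -total
```

The recursion has the same trellis structure as the HMM forward algorithm; the extra "skip" transition is what lets blanks be optional between distinct labels.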
- CTC assumption:
CTC has a conditional independence assumption: Given the inputs, each output is independent of the other outputs.
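Writing $\mathcal{B}$ for the CTC collapsing function and $a_{1:T}$ for a frame-level alignment (notation assumed), the alignment sum and the conditional independence assumption can be expressed together as:

```latex
P_{\mathrm{CTC}}(Y \mid X) \;=\; \sum_{A \in \mathcal{B}^{-1}(Y)} \; \prod_{t=1}^{T} P(a_t \mid X)
```

The per-frame factorisation inside the product is the independence assumption: each output depends on the inputs $X$ but not on the other outputs.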
- HMM interpretation:
CTC can be interpreted as an HMM with additional (skippable) blank states, trained discriminatively.
Deep Speech
- DeepSpeech architecture:
Baidu's Deep Speech (now also implemented by Mozilla) is an example of an end-to-end system using CTC: it maps from acoustic features to a character sequence, using a bidirectional recurrent layer.
- Deep Speech results:
Deep Speech achieves competitive (close to state-of-the-art) results on Switchboard.
- Training:
Deep Speech has several optimisations in the training set-up, including augmenting the training data with additional synthetic data obtained by jittering the signal and adding noise.
- Language model:
Raw CTC maps to a character sequence without an additional lexicon or language model. A language model can be applied by constraining character sequences using a lexicon, and then interpolating the CTC probability with a language model probability.
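This interpolation can be sketched as a rescoring of candidate transcriptions (a hedged sketch: the weights alpha and beta are assumed tunable hyperparameters, and the candidate scores below are made-up illustrative numbers, not results from the paper). A word-count bonus offsets the language model's preference for short hypotheses.

```python
import math

def combined_score(log_p_ctc, log_p_lm, num_words, alpha=1.0, beta=0.5):
    """Interpolate the CTC log probability with a language model
    log probability, plus a word-count bonus (weights assumed)."""
    return log_p_ctc + alpha * log_p_lm + beta * num_words

def rescore(candidates, alpha=1.0, beta=0.5):
    """candidates: list of (text, log_p_ctc, log_p_lm) tuples.
    Returns the text of the highest-scoring candidate."""
    return max(
        candidates,
        key=lambda c: combined_score(c[1], c[2], len(c[0].split()),
                                     alpha, beta),
    )[0]

# A more fluent hypothesis with a slightly lower CTC score can win
# once the language model score is added (illustrative numbers):
candidates = [
    ("their are two", math.log(0.30), math.log(1e-6)),
    ("there are two", math.log(0.25), math.log(1e-3)),
]
print(rescore(candidates))  # -> there are two
```

In a practical decoder this interpolation would be applied inside a beam search over character prefixes rather than as a final rescoring pass, but the scoring function is the same.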
- WFST implementation:
The CTC network can be used to create an FST mapping acoustics to characters (or subwords), which can then be composed with lexicon (L) and grammar (G) transducers.
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
This page maintained by Steve Renals.
Last updated: 2019/04/24 17:11:26UTC