ASR 2018-19
Lecture 15 - End-to-end systems 1: CTC
End-to-end systems are systems which learn to directly map an input sequence X to an output sequence Y. Sequence-trained HMM systems (using either a maximum likelihood or a discriminative objective function) are a kind of end-to-end system, involving a sequence-trained acoustic model and a language model. However, "end-to-end" usually refers to neural network approaches which map input sequences directly to output sequences, in the purest case without a separate language model or lexicon. There are two main approaches:
- CTC (Connectionist Temporal Classification)
- Encoder-decoder architectures using attention
In this lecture we looked at CTC-based systems; we will look at encoder-decoder systems in the next lecture. Probably the best reading on this is the paper on the Deep Speech system by Hannun et al., and for an in-depth tutorial on CTC, see the Distill article, also by Hannun.
CTC
- CTC mapping:
CTC may be used to train an RNN to map an input sequence to an output sequence (typically of a different length) without requiring frame-level alignment (matching each input frame to an output token)
- Alignments:
CTC is "alignment-free": rather than committing to a single alignment between input and output, it sums over all possible alignments.
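The idea of many alignments mapping to one output can be illustrated with the CTC collapsing function (a minimal sketch; the underscore as blank symbol and the function name are illustrative choices, not from the lecture): repeated symbols are merged, then blanks are removed.

```python
BLANK = "_"  # assumed blank symbol for illustration

def collapse(alignment):
    """Map a frame-level alignment to an output sequence:
    merge repeated symbols, then drop blanks."""
    out = []
    prev = None
    for sym in alignment:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# Several different alignments collapse to the same output "cat":
print(collapse("__c_aa_t_"))   # -> cat
print(collapse("cc_a_ttt__"))  # -> cat
print(collapse("c_aa__t"))     # -> cat
```

Note that the blank also allows repeated output characters (e.g. "ll" in "hello") to be distinguished from a single repeated frame label: "l_l" collapses to "ll", while "lll" collapses to "l".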
- Training:
Training by gradient descent is possible because gradients can be back-propagated through the CTC loss.
- Blank symbol:
CTC introduces an additional blank output symbol, which allows an input frame to be not mapped to an output symbol (useful for silence, noise, ...)
- Loss function:
The CTC loss function is the negative log of the total probability of the output sequence, summed over all possible alignments. The sum over alignments is performed by dynamic programming, with a similar structure to the forward-backward and Viterbi algorithms in HMMs.
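The dynamic programming can be sketched as the standard CTC forward recursion over an extended target sequence with blanks interleaved between labels (a minimal single-sequence sketch, assuming the blank has index 0; a practical implementation would operate over batches):

```python
import math

BLANK = 0  # assumed index of the blank symbol

def ctc_forward_loss(log_probs, target):
    """CTC negative log-likelihood via the forward recursion.

    log_probs: list of T lists; log_probs[t][k] = log P(symbol k at frame t)
    target:    list of target label indices (no blanks)
    """
    # Extended target: blanks interleaved between labels and at both ends
    ext = [BLANK]
    for y in target:
        ext += [y, BLANK]
    S, T = len(ext), len(log_probs)
    NEG_INF = float("-inf")

    def logadd(a, b):  # log(exp(a) + exp(b)), numerically stable
        if a == NEG_INF: return b
        if b == NEG_INF: return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    # Initialisation: start in the first blank or the first label
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            a = alpha[s]                       # stay in the same state
            if s >= 1:
                a = logadd(a, alpha[s - 1])    # advance one state
            # Skip a blank, unless that would merge two identical labels
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a = logadd(a, alpha[s - 2])
            new[s] = a + log_probs[t][ext[s]]
        alpha = new

    # Valid paths end in the last label or the final blank
    total = alpha[S - 1]
    if S > 1:
        total = logadd(total, alpha[S - 2])
    return -total
```

The recursion has the same trellis structure as the HMM forward algorithm; the extra "skip" transition is what lets blanks be optional between distinct labels.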
- CTC assumption:
CTC has a conditional independence assumption: Given the inputs, each output is independent of the other outputs.
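Writing $\mathcal{B}$ for the CTC collapsing function and $a_{1:T}$ for a frame-level alignment (notation assumed), the alignment sum and the conditional independence assumption can be expressed together as:

```latex
P_{\mathrm{CTC}}(Y \mid X) \;=\; \sum_{A \in \mathcal{B}^{-1}(Y)} \; \prod_{t=1}^{T} P(a_t \mid X)
```

The per-frame factorisation inside the product is the independence assumption: each output depends on the inputs $X$ but not on the other outputs.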
- HMM interpretation:
CTC can be interpreted as an HMM with additional (skippable) blank states, trained discriminatively.
Deep Speech
- DeepSpeech architecture:
Baidu's Deep Speech (now also implemented by Mozilla) is an example of an end-to-end system using CTC: it maps from acoustic features to a character sequence, using a bidirectional recurrent layer.
- Deep Speech results:
Deep Speech achieves competitive (close to state-of-the-art) results on Switchboard.
- Training:
Deep Speech has several optimisations in the training set-up, including augmenting the training data with additional synthetic data obtained by jittering the signal and adding noise.
- Language model:
Raw CTC maps to a character sequence without an additional lexicon or language model. A language model can be applied by constraining character sequences using a lexicon, and then interpolating the CTC probability with a language model probability.
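This interpolation can be sketched as a rescoring of candidate transcriptions (a hedged sketch: the weights alpha and beta are assumed tunable hyperparameters, and the candidate scores below are made-up illustrative numbers, not results from the paper). A word-count bonus offsets the language model's preference for short hypotheses.

```python
import math

def combined_score(log_p_ctc, log_p_lm, num_words, alpha=1.0, beta=0.5):
    """Interpolate the CTC log probability with a language model
    log probability, plus a word-count bonus (weights assumed)."""
    return log_p_ctc + alpha * log_p_lm + beta * num_words

def rescore(candidates, alpha=1.0, beta=0.5):
    """candidates: list of (text, log_p_ctc, log_p_lm) tuples.
    Returns the text of the highest-scoring candidate."""
    return max(
        candidates,
        key=lambda c: combined_score(c[1], c[2], len(c[0].split()),
                                     alpha, beta),
    )[0]

# A more fluent hypothesis with a slightly lower CTC score can win
# once the language model score is added (illustrative numbers):
candidates = [
    ("their are two", math.log(0.30), math.log(1e-6)),
    ("there are two", math.log(0.25), math.log(1e-3)),
]
print(rescore(candidates))  # -> there are two
```

In a practical decoder this interpolation would be applied inside a beam search over character prefixes rather than as a final rescoring pass, but the scoring function is the same.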
- WFST implementation:
The CTC network can be used to create an FST mapping acoustics to characters (or subwords), which can then be composed with lexicon (L) and grammar (G) transducers.
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
This page maintained by Steve Renals.
Last updated: 2019/04/24 17:11:26UTC