Lecture 12 - Decoding, alignment, and WFSTs
In this lecture we discussed search, alignment, and decoding - and, in particular, the weighted finite state transducer (WFST) framework.
Search and decoding
- The search problem in ASR: finding the most likely word sequence given the observed acoustics
- Viterbi decoding is the optimal approach to obtaining the most probable state sequence, and it is straightforward to include a bigram language model. Longer-span language models (e.g. trigram) require storing a word history in some way.
- In continuous speech recognition, with a large vocabulary and longer-span language models, computational cost becomes important and approximations (e.g. beam search) are required.
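The Viterbi recursion mentioned above can be sketched in a few lines of Python. This is a toy log-space implementation over a fully specified HMM; the state names and probability tables in the usage example are illustrative, not from the lecture:

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Find the most probable state sequence for an observation
    sequence under an HMM, working in log-space (toy sketch)."""
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backptr = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            # Viterbi recursion: best predecessor for state s
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            delta[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        backptr.append(ptr)
    # trace back from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[last]
```

In a real decoder the inner maximisation runs over HMM states for all active words, and beam pruning drops hypotheses whose score falls too far below the current best.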
WFSTs
Weighted finite state transducers (WFSTs) are a general formulation for computing with HMM-type systems.
- WFSTs consist of states connected by transitions, each with an input label, an output label, and a weight; examples were given for a language model and a pronunciation lexicon
- There are three important algorithms on WFSTs: composition (combine transducers at different levels, e.g. a grammar with a lexicon), determinisation (ensure that each state has at most one outgoing transition for each input label), and minimisation (transform to a transducer with the same input/output behaviour but the minimum number of states)
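Composition can be sketched directly: states of the result are pairs of states, an arc is created whenever the first transducer's output label matches the second's input label, and weights add (the tropical semiring over negative log probabilities). This toy version assumes epsilon-free transducers and an illustrative data layout; a production tool such as OpenFst handles epsilons and much larger machines:

```python
def compose(t1, t2):
    """Compose two epsilon-free WFSTs in the tropical semiring.
    Each transducer is (arcs, start, finals), where arcs maps a state
    to a list of (in_label, out_label, weight, next_state) tuples."""
    arcs1, start1, finals1 = t1
    arcs2, start2, finals2 = t2
    arcs, finals = {}, set()
    stack, seen = [(start1, start2)], {(start1, start2)}
    while stack:
        q1, q2 = stack.pop()
        out = arcs.setdefault((q1, q2), [])
        if q1 in finals1 and q2 in finals2:
            finals.add((q1, q2))
        for i1, o1, w1, n1 in arcs1.get(q1, []):
            for i2, o2, w2, n2 in arcs2.get(q2, []):
                if o1 == i2:  # T1's output must match T2's input
                    out.append((i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return arcs, (start1, start2), finals
```

The pair construction is also why naive composition can blow up: the result can have up to |Q1| x |Q2| states, which motivates interleaving determinisation and minimisation as described below.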
- Applying WFSTs to speech recognition: HCLG, the composition of grammar (G), lexicon (L), context-dependency (C), and HMM (H) transducers
- The combined HCLG transducer gives a complete search graph for an ASR system. Naive composition can blow up in size, so determinisation and minimisation need to be applied multiple times during the composition, in a careful order.
Alignment
- "Noisy" alignment: match an audio recording to a transcript which may not include every word spoken, or may include paraphrasing (the typical case for, e.g., TV subtitles)
- Biased language model for alignment: transcribe the recording using a language model biased towards the transcript (interpolate an LM trained only on the transcript with a general LM)
- Factor transducer: decode using a WFST which matches any substring of the subtitles; more generally, also allow word skips
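The biased language model above amounts to a weighted mixture of two probability estimates. A minimal sketch, using unigram models for brevity (real systems interpolate n-gram models, and the interpolation weight `lam` here is an assumed illustrative value, not one given in the lecture):

```python
def interpolate_lm(p_transcript, p_general, lam=0.9):
    """Biased LM for alignment (sketch): interpolate an LM estimated
    from the transcript alone with a general-purpose LM.
    A high weight `lam` on the transcript LM biases decoding towards
    the transcript while still allowing other words to be recognised.
    Both models are word -> probability dicts (unigram, for brevity)."""
    vocab = set(p_transcript) | set(p_general)
    return {w: lam * p_transcript.get(w, 0.0)
               + (1 - lam) * p_general.get(w, 0.0)
            for w in vocab}
```

Because both components are valid distributions and the weights sum to one, the interpolated model is a valid distribution as well, so it can be plugged into the decoder unchanged.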
Copyright (c) University of Edinburgh 2015-2018
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
This page maintained by Steve Renals.
Last updated: 2018/04/30 20:59:23UTC