ASR 2018-19
Lecture 11 - Decoding, alignment, and WFSTs
In this lecture we discussed search, alignment, and decoding - and, in particular, the weighted finite state transducer (WFST) framework.
Mohri et al.'s handbook article is the best review of the WFST approach, although it goes into more detail than is required for this course. Many other articles on this topic follow the treatment of Mohri et al.
Search and decoding
- Search in ASR:
The search problem in ASR is finding the most likely word sequence given the observed acoustics.
- Viterbi:
Viterbi decoding is the optimal approach to obtaining the most probable state sequence, and it is straightforward to incorporate a bigram language model. Longer-span language models (e.g. trigrams) require storing the word history in some way.
- Beam search:
In continuous speech recognition, with a large vocabulary and longer-span language models, computational issues become important and approximations (e.g. beam search) are required.
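The Viterbi recursion and beam pruning described above can be sketched in a few lines of Python. This is a minimal illustration, not lecture material: the toy HMM, the log-probability dictionaries, and the `beam` parameter name are all assumptions made for the example.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit, beam=None):
    """Return (log prob, path) of the most probable state sequence for `obs`.

    If `beam` is set, hypotheses scoring more than `beam` below the best
    one at each time step are pruned (beam search, an approximation).
    """
    # trellis[state] = (log probability of best path ending here, that path)
    trellis = {s: (log_start[s] + log_emit[(s, obs[0])], [s]) for s in states}
    for sym in obs[1:]:
        new = {}
        for cur in states:
            # pick the best predecessor under the transition model
            prev, (lp, path) = max(
                trellis.items(),
                key=lambda kv: kv[1][0] + log_trans[(kv[0], cur)])
            new[cur] = (lp + log_trans[(prev, cur)] + log_emit[(cur, sym)],
                        path + [cur])
        trellis = new
        if beam is not None:  # prune hypotheses far below the current best
            top = max(v[0] for v in trellis.values())
            trellis = {s: v for s, v in trellis.items() if v[0] >= top - beam}
    return max(trellis.values(), key=lambda v: v[0])
```

With `beam=None` this is exact Viterbi; a finite beam trades optimality for speed, which is the approximation needed at scale.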
WFSTs
Weighted finite state transducers (WFSTs) are a general formulation for computing with HMM-type systems.
- WFSTs:
WFSTs consist of states connected by transitions, each carrying an input label, an output label, and a weight; examples were given for a language model and a pronunciation lexicon.
- WFST algorithms:
There are three important algorithms on WFSTs: composition (combine transducers at different levels, e.g. a grammar with a lexicon), determinisation (ensure that each state has at most one outgoing transition per input label), and minimisation (transform to a transducer with the same input/output behaviour but the minimum number of states).
- HCLG:
Applying WFSTs to speech recognition gives the HCLG transducer: the composition of grammar (G), lexicon (L), context-dependency (C), and HMM (H) transducers.
- Applying WFSTs at scale:
The combined HCLG transducer gives a complete search graph for an ASR system. Naive composition can blow up in size, so determinisation and minimisation must be applied repeatedly during composition, in a careful order.
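To make the composition operation concrete, here is a toy epsilon-free WFST composition in the tropical semiring (weights are negative log probabilities, added along a path). The tuple-based transducer format is an illustrative assumption for this sketch, not the OpenFst representation.

```python
from collections import deque

# A transducer is (start_state, {final_state: final_weight},
#                  [(src, in_label, out_label, weight, dst), ...]).

def compose(t1, t2):
    """Compose two epsilon-free WFSTs: match T1's output labels with
    T2's input labels; composed states are pairs, weights add."""
    start1, finals1, arcs1 = t1
    start2, finals2, arcs2 = t2
    start = (start1, start2)
    arcs, finals = [], {}
    seen, queue = {start}, deque([start])
    while queue:
        q1, q2 = queue.popleft()
        if q1 in finals1 and q2 in finals2:
            finals[(q1, q2)] = finals1[q1] + finals2[q2]
        for s1, a, b, w1, d1 in arcs1:
            if s1 != q1:
                continue
            for s2, b2, c, w2, d2 in arcs2:
                # T1's output label must match T2's input label
                if s2 != q2 or b2 != b:
                    continue
                dst = (d1, d2)
                arcs.append(((q1, q2), a, c, w1 + w2, dst))
                if dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
    return start, finals, arcs
```

Real toolkits also handle epsilon labels and interleave determinisation and minimisation with composition, which this sketch omits.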
Alignment
- "Noisy" alignment:
Match an audio recording to a transcript that may not include every word spoken, or may paraphrase what was said (the typical case for, e.g., TV subtitles).
- Biased language model:
Transcribe the recording using a language model biased towards the transcript (an LM trained only on the transcript, interpolated with a general LM).
- Factor transducer:
Decode using a WFST that matches any substring of the subtitles; more generally, also allow word skips.
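The biased language model idea above can be sketched with unigram probabilities for simplicity; the function names and the interpolation weight `lam` are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

def transcript_lm(transcript_words):
    """Maximum-likelihood unigram probabilities from the transcript alone."""
    counts = Counter(transcript_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def biased_prob(word, transcript_probs, background_probs, lam=0.9):
    """Interpolate the transcript LM with a general background LM:
    P(w) = lam * P_transcript(w) + (1 - lam) * P_background(w)."""
    return (lam * transcript_probs.get(word, 0.0)
            + (1 - lam) * background_probs.get(word, 0.0))
```

A large `lam` steers the decoder towards the words of the transcript while the background LM still allows words the transcript omits.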
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
licence.txt
This page maintained by Steve Renals.
Last updated: 2019/04/23 17:10:51 UTC