ASR 2018-19
Lecture 11 - Decoding, alignment, and WFSTs
In this lecture we discussed search, alignment, and decoding - and, in particular, the weighted finite state transducer (WFST) framework.
Mohri et al.'s handbook article is the best review of the WFST approach, although it goes into more detail than is required for this course. Many other articles on this topic follow the treatment of Mohri et al.
Search and decoding
- Search in ASR:
The search problem in ASR is finding the most likely word sequence given the observed acoustics.
- Viterbi:
Viterbi decoding is the optimal approach to obtaining the most probable state sequence, and it is straightforward to incorporate a bigram language model. Longer-span language models (e.g. trigrams) require storing the word history in some way.
- Beam search:
In continuous speech recognition, with a large vocabulary and longer-span language models, computational issues become important and approximations (e.g. beam search) are required.
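The Viterbi recursion and beam pruning described above can be sketched in a few lines of Python. This is a minimal illustration, not lecture material: the toy HMM, the log-probability dictionaries, and the `beam` parameter name are all assumptions made for the example.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit, beam=None):
    """Return (log prob, path) of the most probable state sequence for `obs`.

    If `beam` is set, hypotheses scoring more than `beam` below the best
    one at each time step are pruned (beam search, an approximation).
    """
    # trellis[state] = (log probability of best path ending here, that path)
    trellis = {s: (log_start[s] + log_emit[(s, obs[0])], [s]) for s in states}
    for sym in obs[1:]:
        new = {}
        for cur in states:
            # pick the best predecessor under the transition model
            prev, (lp, path) = max(
                trellis.items(),
                key=lambda kv: kv[1][0] + log_trans[(kv[0], cur)])
            new[cur] = (lp + log_trans[(prev, cur)] + log_emit[(cur, sym)],
                        path + [cur])
        trellis = new
        if beam is not None:  # prune hypotheses far below the current best
            top = max(v[0] for v in trellis.values())
            trellis = {s: v for s, v in trellis.items() if v[0] >= top - beam}
    return max(trellis.values(), key=lambda v: v[0])
```

With `beam=None` this is exact Viterbi; a finite beam trades optimality for speed, which is the approximation needed at scale.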
WFSTs
Weighted finite state transducers (WFSTs) are a general formulation for computing with HMM-type systems.
- WFSTs:
WFSTs consist of states connected by transitions, each carrying an input label, an output label, and a weight; examples were given for a language model and a pronunciation lexicon.
- WFST algorithms:
There are three important algorithms on WFSTs: composition (combine transducers at different levels, e.g. a grammar with a lexicon), determinisation (ensure that each state has at most one outgoing transition per input label), and minimisation (transform to a transducer with the same input/output behaviour but the minimum number of states).
- HCLG:
Applying WFSTs to speech recognition gives the HCLG transducer: the composition of grammar (G), lexicon (L), context-dependency (C), and HMM (H) transducers.
- Applying WFSTs at scale:
The combined HCLG transducer gives a complete search graph for an ASR system. Naive composition can blow up in size, so determinisation and minimisation must be applied repeatedly during composition, in a careful order.
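To make the composition operation concrete, here is a toy epsilon-free WFST composition in the tropical semiring (weights are negative log probabilities, added along a path). The tuple-based transducer format is an illustrative assumption for this sketch, not the OpenFst representation.

```python
from collections import deque

# A transducer is (start_state, {final_state: final_weight},
#                  [(src, in_label, out_label, weight, dst), ...]).

def compose(t1, t2):
    """Compose two epsilon-free WFSTs: match T1's output labels with
    T2's input labels; composed states are pairs, weights add."""
    start1, finals1, arcs1 = t1
    start2, finals2, arcs2 = t2
    start = (start1, start2)
    arcs, finals = [], {}
    seen, queue = {start}, deque([start])
    while queue:
        q1, q2 = queue.popleft()
        if q1 in finals1 and q2 in finals2:
            finals[(q1, q2)] = finals1[q1] + finals2[q2]
        for s1, a, b, w1, d1 in arcs1:
            if s1 != q1:
                continue
            for s2, b2, c, w2, d2 in arcs2:
                # T1's output label must match T2's input label
                if s2 != q2 or b2 != b:
                    continue
                dst = (d1, d2)
                arcs.append(((q1, q2), a, c, w1 + w2, dst))
                if dst not in seen:
                    seen.add(dst)
                    queue.append(dst)
    return start, finals, arcs
```

Real toolkits also handle epsilon labels and interleave determinisation and minimisation with composition, which this sketch omits.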
Alignment
- "Noisy" alignment:
Match an audio recording to a transcript that may not include every word spoken, or may paraphrase what was said (the typical case for, e.g., TV subtitles).
- Biased language model:
Transcribe the recording using a language model biased towards the transcript (an LM trained only on the transcript, interpolated with a general LM).
- Factor transducer:
Decode using a WFST that matches any substring of the subtitles; more generally, also allow word skips.
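The biased language model idea above can be sketched with unigram probabilities for simplicity; the function names and the interpolation weight `lam` are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

def transcript_lm(transcript_words):
    """Maximum-likelihood unigram probabilities from the transcript alone."""
    counts = Counter(transcript_words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def biased_prob(word, transcript_probs, background_probs, lam=0.9):
    """Interpolate the transcript LM with a general background LM:
    P(w) = lam * P_transcript(w) + (1 - lam) * P_background(w)."""
    return (lam * transcript_probs.get(word, 0.0)
            + (1 - lam) * background_probs.get(word, 0.0))
```

A large `lam` steers the decoder towards the words of the transcript while the background LM still allows words the transcript omits.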
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
licence.txt
This page maintained by Steve Renals.
Last updated: 2019/04/23 17:10:51 UTC