Lecture 12 - Decoding, alignment, and WFSTs
In this lecture we discussed search, alignment, and decoding - and, in particular, the weighted finite state transducer (WFST) framework.
Search and decoding
- The search problem in ASR: finding the most likely word sequence given the observed acoustics
- Viterbi decoding is the optimal approach to obtaining the most probable state sequence, and it is straightforward to include a bigram language model. Longer-span language models (e.g. trigram) require storing a word history in some way.
- In continuous speech recognition, with a large vocabulary and longer-span language models, computational cost becomes important and approximations (e.g. beam search) are required.
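The Viterbi recursion mentioned above can be sketched in a few lines of Python. This is a toy log-space implementation over a fully specified HMM; the state names and probability tables in the usage example are illustrative, not from the lecture:

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Find the most probable state sequence for an observation
    sequence under an HMM, working in log-space (toy sketch)."""
    # delta[s] = best log-probability of any path ending in state s
    delta = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    backptr = []
    for o in obs[1:]:
        prev, delta, ptr = delta, {}, {}
        for s in states:
            # Viterbi recursion: best predecessor for state s
            best = max(states, key=lambda p: prev[p] + log_trans[p][s])
            delta[s] = prev[best] + log_trans[best][s] + log_emit[s][o]
            ptr[s] = best
        backptr.append(ptr)
    # trace back from the best final state
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return list(reversed(path)), delta[last]
```

In a real decoder the inner maximisation runs over HMM states for all active words, and beam pruning drops hypotheses whose score falls too far below the current best.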
WFSTs
Weighted finite state transducers (WFSTs) are a general formulation for computing with HMM-type systems.
- WFSTs consist of states connected by transitions, each with an input label, an output label, and a weight; examples were given for a language model and a pronunciation lexicon
- There are three important algorithms on WFSTs: composition (combine transducers at different levels, e.g. a grammar with a lexicon), determinisation (ensure that each state has at most one outgoing transition for each input label), and minimisation (transform to a transducer with the same input/output behaviour but the minimum number of states)
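Composition can be sketched directly: states of the result are pairs of states, an arc is created whenever the first transducer's output label matches the second's input label, and weights add (the tropical semiring over negative log probabilities). This toy version assumes epsilon-free transducers and an illustrative data layout; a production tool such as OpenFst handles epsilons and much larger machines:

```python
def compose(t1, t2):
    """Compose two epsilon-free WFSTs in the tropical semiring.
    Each transducer is (arcs, start, finals), where arcs maps a state
    to a list of (in_label, out_label, weight, next_state) tuples."""
    arcs1, start1, finals1 = t1
    arcs2, start2, finals2 = t2
    arcs, finals = {}, set()
    stack, seen = [(start1, start2)], {(start1, start2)}
    while stack:
        q1, q2 = stack.pop()
        out = arcs.setdefault((q1, q2), [])
        if q1 in finals1 and q2 in finals2:
            finals.add((q1, q2))
        for i1, o1, w1, n1 in arcs1.get(q1, []):
            for i2, o2, w2, n2 in arcs2.get(q2, []):
                if o1 == i2:  # T1's output must match T2's input
                    out.append((i1, o2, w1 + w2, (n1, n2)))
                    if (n1, n2) not in seen:
                        seen.add((n1, n2))
                        stack.append((n1, n2))
    return arcs, (start1, start2), finals
```

The pair construction is also why naive composition can blow up: the result can have up to |Q1| x |Q2| states, which motivates interleaving determinisation and minimisation as described below.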
- Applying WFSTs to speech recognition: HCLG, the composition of grammar (G), lexicon (L), context-dependency (C), and HMM (H) transducers
- The combined HCLG transducer gives a complete search graph for an ASR system. Naive composition can blow up in size, so determinisation and minimisation need to be applied multiple times during the composition, in a careful order.
Alignment
- "Noisy" alignment: match an audio recording to a transcript which may not include every word spoken, or may include paraphrasing (the typical case for, e.g., TV subtitles)
- Biased language model for alignment: transcribe the recording using a language model biased towards the transcript (interpolate an LM trained only on the transcript with a general LM)
- Factor transducer: decode using a WFST which matches any substring of the subtitles; more generally, also allow word skips
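The biased language model above amounts to a weighted mixture of two probability estimates. A minimal sketch, using unigram models for brevity (real systems interpolate n-gram models, and the interpolation weight `lam` here is an assumed illustrative value, not one given in the lecture):

```python
def interpolate_lm(p_transcript, p_general, lam=0.9):
    """Biased LM for alignment (sketch): interpolate an LM estimated
    from the transcript alone with a general-purpose LM.
    A high weight `lam` on the transcript LM biases decoding towards
    the transcript while still allowing other words to be recognised.
    Both models are word -> probability dicts (unigram, for brevity)."""
    vocab = set(p_transcript) | set(p_general)
    return {w: lam * p_transcript.get(w, 0.0)
               + (1 - lam) * p_general.get(w, 0.0)
            for w in vocab}
```

Because both components are valid distributions and the weights sum to one, the interpolated model is a valid distribution as well, so it can be plugged into the decoder unchanged.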
Copyright (c) University of Edinburgh 2015-2018
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
This page maintained by Steve Renals.
Last updated: 2018/04/30 20:59:23UTC