Lecture 17 - Current Progress in Acoustic Modelling

This lecture discussed a recent paper by George Saon et al. from IBM, which probably represents the current state of the art in speech recognition.

The paper reported experiments on conversational telephone speech, training on about 2000 hours of data and testing on five different test sets.
Before turning to the automatic techniques, we looked at an experiment to measure human speech recognition performance on conversational telephone speech: people have a word error rate of 5-7%.
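
Since all the results in this lecture are reported as word error rates, it may help to recall how the measure is computed: the minimum number of word substitutions, deletions and insertions needed to turn the reference transcript into the recogniser output, divided by the number of reference words. The following is a minimal sketch in plain Python. The function name and the whitespace tokenisation are assumptions made for illustration; real evaluations use a scoring tool and apply text normalisation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()  # assumption: words are separated by whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the cat sat on mat" against the reference "the cat sat on the mat" counts one deletion over six reference words.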
Acoustic modelling was primarily based on RNN and CNN architectures:
- Bidirectional LSTM - a recurrent neural network architecture
- Speaker-adversarial multi-task training - a technique which aims to force the network to perform well on classifying context-dependent phone states, while explicitly not learning to identify the talker
- Feature fusion - combining PLP-based features (fMLLR) with filter bank features
- Deep convolutional networks - residual networks (ResNets) use layer-skipping connections, which make it possible to train very deep convolutional networks
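
The layer-skipping idea in the last item can be illustrated with a toy residual block, which computes y = x + F(x) so that the input is passed straight through and the layer only has to learn a correction to it. This is a sketch under simplifying assumptions, not the architecture from the paper: F is a single fully connected layer with a ReLU rather than a stack of convolutions, and all function names are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def linear(weights, bias, x):
    """Fully connected layer: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def residual_block(x, weights, bias):
    """Toy residual block: y = x + F(x), with F = ReLU(Wx + b).

    Assumption: F keeps the dimensionality of x, so the skip
    connection can simply add the input to the layer output.
    """
    fx = relu(linear(weights, bias, x))
    return [xi + fi for xi, fi in zip(x, fx)]
```

With all weights and biases set to zero the block reduces to the identity mapping, which is one intuition for why stacking many such blocks does not make optimisation harder in the way that stacking plain layers can.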
Language modelling experiments combined n-grams with LSTM recurrent networks.
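
These notes do not record how the paper combined the two kinds of language model. One common option is linear interpolation of the word probabilities, sketched below purely as an illustration. The function names, the weight `lam` and its default value are assumptions, not details taken from the paper.

```python
import math


def interpolate(p_ngram, p_lstm, lam=0.5):
    """Linear interpolation: lam * P_lstm + (1 - lam) * P_ngram.

    lam is a hypothetical interpolation weight, which would
    normally be tuned on held-out data.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_lstm + (1.0 - lam) * p_ngram


def sentence_log_prob(word_probs, lam=0.5):
    """Sum of log interpolated probabilities over a sentence.

    word_probs is a list of (p_ngram, p_lstm) pairs, one per word,
    each giving the probability of that word under the two models.
    """
    return sum(math.log(interpolate(p_ngram, p_lstm, lam))
               for p_ngram, p_lstm in word_probs)
```

Setting `lam` to 0 recovers the n-gram model alone and setting it to 1 recovers the LSTM alone, so the weight controls how much each model is trusted.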
Final results give word error rates of 5.5% on Switchboard and 10.3% on CallHome. Extensive model and system combination was used in these systems.

This page maintained by Steve Renals.
Last updated: 2017/03/28 12:58:24UTC
