Lecture 17 - Current Progress in Acoustic Modelling
This lecture discussed a recent paper by George Saon et al. from IBM, which probably represents the current state of the art in speech recognition.
This paper discussed experiments on conversational telephone speech, training on about 2000 hours of data and testing on five different test sets.
Before turning to the automatic techniques, we discussed an experiment to ascertain human speech recognition performance on conversational telephone speech - it turns out that people have a word error rate of 5-7%.
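Word error rate, the metric quoted above and throughout the results, is the word-level Levenshtein (edit) distance between the reference and hypothesis transcripts, divided by the reference length. A minimal sketch (not the scoring tool used in the paper):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / reference length,
    computed via word-level Levenshtein distance with dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting i reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)
```

For example, one substitution in a three-word reference gives a WER of 1/3, matching how a 5-7% human error rate is counted.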
Acoustic modelling was primarily based on RNN and CNN architectures:
- Bidirectional LSTM - a recurrent neural network architecture
- Speaker-adversarial multi-task training - a technique which aims to force the network to perform well on classifying context-dependent phone states, while explicitly not learning to identify the speaker
- Feature fusion - combining PLP-based features (FMLLR) with filter-bank features
- Deep convolutional networks - residual networks (ResNets) enable deep convolutional networks to be trained using layer-skipping (shortcut) connections
- Language model experiments combining n-grams and LSTM recurrent networks
- Final results give word error rates of 5.5% on Switchboard and 10.3% on CallHome. Extensive model and system combination was used in these systems.
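The speaker-adversarial idea above is usually implemented with a gradient-reversal trick: the speaker head's gradient is negated before it reaches the shared layers, so shared features become useful for phone classification but useless for speaker identification. A toy single-example sketch with manual backprop (dimensions, learning rate, and the scaling `lam` are illustrative choices, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes: acoustic feature -> shared representation -> two softmax heads.
D, H, PHONES, SPEAKERS = 8, 16, 5, 3
W_shared = rng.normal(scale=0.1, size=(H, D))
W_phone = rng.normal(scale=0.1, size=(PHONES, H))
W_spk = rng.normal(scale=0.1, size=(SPEAKERS, H))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(i, n):
    v = np.zeros(n)
    v[i] = 1.0
    return v

x = rng.normal(size=D)          # one acoustic feature vector
phone_target, spk_target = 2, 1  # labels for the two tasks
lam = 0.5                        # gradient-reversal scaling (hypothetical)

# Forward pass through the shared layer and both heads.
h = W_shared @ x
p_phone = softmax(W_phone @ h)
p_spk = softmax(W_spk @ h)

# Cross-entropy backward pass: dL/dlogits = probs - one_hot(target).
g_phone = W_phone.T @ (p_phone - one_hot(phone_target, PHONES))
g_spk = W_spk.T @ (p_spk - one_hot(spk_target, SPEAKERS))

# Gradient reversal: the speaker gradient is NEGATED (scaled by -lam)
# before reaching the shared weights, pushing the shared representation
# to be phone-discriminative but speaker-invariant.
g_shared = np.outer(g_phone - lam * g_spk, x)
W_shared -= 0.1 * g_shared
```

In a real system the reversal is a layer in the autograd graph rather than hand-written gradients, but the sign flip on the speaker branch is the whole mechanism.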
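The layer-skipping connections mentioned for ResNets compute y = x + F(x): because the input is added back to the block's output, gradients can flow straight through the skip path, which is what makes very deep convolutional stacks trainable. A minimal fully-connected sketch of the idea (the paper's blocks are convolutional):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x), where F is a small two-layer transform.
    If F collapses to zero, the block is the identity - so adding
    more blocks can never make the network's mapping harder to learn."""
    return x + W2 @ relu(W1 @ x)
```

With zero weights the block passes its input through unchanged, illustrating why stacking many such blocks is safe.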
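Combining n-gram and LSTM language models is typically done by linear interpolation of their next-word probabilities. A toy sketch with made-up probabilities over a three-word vocabulary (the weight `lam` would be tuned on held-out data in practice):

```python
# Hypothetical next-word distributions P(w | history) from the two models.
p_ngram = {"the": 0.5, "cat": 0.3, "sat": 0.2}  # count-based n-gram LM
p_lstm = {"the": 0.4, "cat": 0.4, "sat": 0.2}   # LSTM recurrent LM

lam = 0.6  # interpolation weight (illustrative)

def interpolate(p_ngram, p_lstm, lam):
    """P(w|h) = lam * P_ngram(w|h) + (1 - lam) * P_lstm(w|h)."""
    return {w: lam * p_ngram[w] + (1 - lam) * p_lstm[w] for w in p_ngram}

p_mix = interpolate(p_ngram, p_lstm, lam)
```

Since both inputs are valid distributions and the weights sum to one, the interpolated probabilities also sum to one.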
This page maintained by Steve Renals.
Last updated: 2017/03/28 12:58:24UTC