Lecture 9 - Neural Networks for Acoustic Modelling 2
The previous lecture introduced the hybrid HMM/NN approach, showing how a neural network can replace a GMM in an HMM system. In this lecture we looked at how this approach was used to develop accurate acoustic models for TIMIT phone recognition and Switchboard conversational speech recognition.
- First we recapped the hybrid HMM/NN approach introduced in the previous lecture, and then looked at a typical deep neural network architecture that could be used for TIMIT phone recognition.
- After a digression to explain the idea of pretraining, we discussed the Mohamed et al. (2012) paper, which carried out a careful set of experiments on TIMIT, varying the depth and width of the hidden layers and comparing MFCC with mel-scale filter bank (FBANK) acoustic features. These experiments indicated that wider layers improved accuracy, as did increasing the depth up to about 6 hidden layers. FBANK features (which have correlated components) were somewhat more accurate than MFCCs.
- Hidden layer representations can be visualised using t-SNE, which projects the high-dimensional features (the dimension is the number of units in the layer) down to 2 or 3 dimensions for display. These visualisations showed that the learned representations for FBANK features had slightly more structure than those for MFCCs.
- Finally we discussed a DNN acoustic model for Switchboard. The main difference in this model is that it uses context-dependent HMMs, so the neural network output layer has a unit for each state-clustered context-dependent HMM state. This can result in wide output layers (a dimension of over 9000 in the experiments discussed).
- Both the TIMIT and Switchboard experiments relied on first training a context-dependent HMM/GMM system. The context-dependent states inferred for that system, together with the frame-state alignment from the trained HMM/GMM system, were then used to generate the target label sequence required to train the neural network.
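
To give a feel for what an output layer of over 9000 units means in practice, the sketch below counts the parameters of a fully connected network layer by layer. Only the output dimension (9000) and the depth (6 hidden layers) are taken from the summary above; the input dimension of 440 and the hidden width of 2048 are assumed values chosen for illustration, and the function names are hypothetical.

```python
def layer_sizes(n_inputs, hidden_width, n_hidden, n_outputs):
    """Unit counts for the input, each hidden layer, and the output layer."""
    return [n_inputs] + [hidden_width] * n_hidden + [n_outputs]


def count_parameters(sizes):
    """Weights plus biases for each fully connected layer, in order."""
    return [(fan_in + 1) * fan_out for fan_in, fan_out in zip(sizes, sizes[1:])]


# Assumed for illustration: 440 inputs and 2048 units per hidden layer.
# Taken from the summary: 6 hidden layers and roughly 9000 output units.
sizes = layer_sizes(n_inputs=440, hidden_width=2048, n_hidden=6, n_outputs=9000)
per_layer = count_parameters(sizes)
output_share = per_layer[-1] / sum(per_layer)

print("parameters per layer:", per_layer)
print("total parameters:", sum(per_layer))
print("fraction in output layer:", round(output_share, 2))
```

Under these assumed sizes, the output layer alone holds a large fraction of all the parameters, which is one reason the width of the context-dependent output layer matters.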
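
The use of an HMM/GMM alignment to produce training targets can be sketched as follows. This is a minimal illustration, not the procedure used in the experiments: the alignment is represented as hypothetical (state, number of frames) segments for a toy 3-state system, and all function names are made up for this example. The sketch also shows the standard hybrid step of dividing the network's state posteriors by state priors (estimated here from the alignment) to obtain scaled likelihoods for the HMM.

```python
import math
from collections import Counter


def alignment_to_targets(segments):
    """Expand (state_id, n_frames) segments into one target label per frame."""
    targets = []
    for state_id, n_frames in segments:
        targets.extend([state_id] * n_frames)
    return targets


def state_log_priors(targets, n_states, floor=1e-8):
    """Estimate log P(s) from the relative frequency of each state label."""
    counts = Counter(targets)
    total = len(targets)
    return [math.log(max(counts.get(s, 0) / total, floor)) for s in range(n_states)]


def scaled_log_likelihoods(log_posteriors, log_priors):
    """log P(s|x) - log P(s), equal to log p(x|s) up to a per-frame constant."""
    return [lp - pr for lp, pr in zip(log_posteriors, log_priors)]


# Hypothetical alignment: state 0 for 3 frames, state 1 for 5, state 2 for 2.
segments = [(0, 3), (1, 5), (2, 2)]
targets = alignment_to_targets(segments)
log_priors = state_log_priors(targets, n_states=3)

# Hypothetical network output for one frame (posteriors over the 3 states).
frame_log_posteriors = [math.log(p) for p in (0.3, 0.5, 0.2)]
scaled = scaled_log_likelihoods(frame_log_posteriors, log_priors)
```

The per-frame `targets` list is what the network would be trained to predict with a softmax output, one class per HMM state. In this toy frame the posteriors happen to equal the priors, so every scaled log-likelihood is zero: the network has provided no evidence beyond the state frequencies.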
Copyright (c) University of Edinburgh 2015-2018
The ASR course material is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (licence.txt).
This page maintained by Steve Renals.
Last updated: 2018/04/30 20:59:23UTC