ASR 2018-19
Lecture 8 - Neural Network Acoustic Models 2: Hybrid HMM/DNN Systems
After the introduction to neural networks in the previous lecture, this lecture was about using neural networks for acoustic modelling. The main idea was to show how a neural network can be trained as a phone classifier, and how such a classifier can then be used either to (1) replace the GMM output distributions in an HMM system (the hybrid HMM/NN approach), or (2) generate discriminative features, either from the neural network outputs (tandem or posteriorgram features) or from a narrow hidden layer in the network (bottleneck features).
A good paper to read for this lecture is Understanding how deep belief networks perform acoustic modelling by Mohamed et al.
Introduction
- Recap:
First we recapped the previous lecture and looked at how the hidden units in a multi-layer network can be thought of as learned feature extractors.
- Back-propagation of error:
Training a multi-layer network can also be achieved with gradient descent, using the back-propagation algorithm, which gives a way to propagate gradients backwards through the network.
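The backward pass described above can be sketched for a single hidden layer with NumPy. This is a minimal illustration, not code from the course; the network sizes, learning rate, and initialisation are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy dimensions: 4-dim input, 5 hidden units, 3 classes
x = rng.standard_normal(4)            # one input frame
t = np.array([0.0, 1.0, 0.0])         # one-hot target label
W1 = rng.standard_normal((5, 4)) * 0.1
b1 = np.zeros(5)
W2 = rng.standard_normal((3, 5)) * 0.1
b2 = np.zeros(3)

# Forward pass
h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # sigmoid hidden layer
z = W2 @ h + b2
y = np.exp(z - z.max()); y /= y.sum()      # softmax output

# Backward pass: propagate the error gradient layer by layer
dz = y - t                                 # output gradient (softmax + cross-entropy)
dW2 = np.outer(dz, h)
dh = W2.T @ dz                             # back-propagate through W2
dpre = dh * h * (1.0 - h)                  # through the sigmoid nonlinearity
dW1 = np.outer(dpre, x)

# One gradient-descent step
lr = 0.1
W2 -= lr * dW2; b2 -= lr * dz
W1 -= lr * dW1; b1 -= lr * dpre
```

The key point is that each layer's gradient is computed from the gradient of the layer above it, which is what makes training deep networks tractable.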
- Neural network for phone classification:
A feed forward network may be trained to map acoustic input frames to phone labels using one or more hidden layers. The input layer can have acoustic context, using a window of frames either side of the current frame.
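The windowed input described above is often implemented by "splicing" each frame together with its neighbours. A minimal sketch (the utterance length, feature dimension, and context width are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical utterance: 100 frames of 40-dim acoustic features
feats = rng.standard_normal((100, 40))

def splice(feats, context=4):
    """Stack +/- context frames around each frame (edges padded by repetition)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    windows = [padded[i:i + len(feats)] for i in range(2 * context + 1)]
    return np.hstack(windows)

spliced = splice(feats, context=4)
# Each network input is now 9 frames x 40 dims = 360-dimensional
```

With a context of ±4 frames the network sees 9 frames at once, letting it exploit the correlations between adjacent frames.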
- TIMIT:
Phone recognition task with 61 phones, often reduced to a set of 48 or 39. 630 speakers each reading 10 sentences. The final task of phone recognition is to output the sequence of phone labels for an utterance. Measured using phone error rate (PER) analogous to word error rate.
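Phone error rate is computed in the same way as word error rate: an edit distance between the reference and hypothesised phone sequences, divided by the reference length. A minimal sketch (the example sequences are made up):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution/match
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def phone_error_rate(ref, hyp):
    """PER = (substitutions + deletions + insertions) / reference length."""
    return edit_distance(ref, hyp) / len(ref)

# Hypothetical example: one substitution (ae -> ah) in 7 reference phones
ref = "sil dh ax k ae t sil".split()
hyp = "sil dh ax k ah t sil".split()
```

Here `phone_error_rate(ref, hyp)` gives 1/7, since the only error is a single substitution.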
Hybrid HMM/NN systems
- Posterior probabilities and scaled likelihoods:
A network trained to classify HMM states estimates the posterior probability P(state|acoustics). For an HMM we require a likelihood of the form P(acoustics|state). We obtain such (scaled) likelihoods by dividing the network outputs by the relative frequency of each class in the training set, giving a scaled likelihood that can be used in an HMM.
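The posterior-to-likelihood conversion is just a division by the state priors, usually done in the log domain. A minimal sketch with made-up numbers for three states:

```python
import numpy as np

# Hypothetical values: network posteriors for one frame over 3 states,
# and state priors estimated as relative frequencies in the training alignment
posteriors = np.array([0.7, 0.2, 0.1])   # P(state | acoustics)
priors = np.array([0.5, 0.3, 0.2])       # P(state)

# By Bayes' rule: P(acoustics | state) / P(acoustics) = P(state | acoustics) / P(state)
scaled_likelihoods = posteriors / priors

# In practice the division is done in the log domain for numerical stability
log_scaled = np.log(posteriors) - np.log(priors)
```

The unknown factor P(acoustics) is constant across states for a given frame, so it does not affect the Viterbi decoding and can be ignored.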
- Neural network for phone recognition:
In a hybrid HMM/NN system the neural network is used to generate the state output likelihoods in place of a GMM. In this case the network outputs correspond to HMM states - if we have 3 states per phone model and 61 phones, this results in 183 states, and thus the network would have 183 outputs.
- NN vs GMM:
Advantages of neural networks include being able to model correlated inputs - can use multiple frames of context (adjacent frames are correlated) and correlated feature components (e.g. log spectral features). Also NNs are more flexible than GMMs and can learn richer representations. Initially NNs were constrained for computational reasons - it was hard to model context dependent phones, and NN systems tended to have fewer parameters since the training process was not so straightforward to parallelise. However, with the advent of GPUs this has changed.
Deep neural networks
- DNNs:
Hybrid HMM/NN systems have been in use since the early 1990s, but it is only in recent years that compute power has been sufficient to train the large neural network models which are now the state of the art in acoustic modelling. Modern systems are deeper (multiple hidden layers) and wider (using a context-dependent HMM state alignment to define the outputs).
- FBANK features:
For NN-based systems, log mel-scaled filter bank features (FBANK) can result in greater accuracy than MFCCs. Correlation between feature components means that these features are harder to use in GMM systems (as covariance modelling would be necessary).
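FBANK features are the log energies of a bank of triangular filters spaced evenly on the mel scale, applied to the power spectrum of each frame. A minimal sketch (the filter count, FFT size, sample rate, and random test signal are all hypothetical choices, not the course's exact configuration):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=40, n_fft=512, sr=16000):
    """Triangular filters laid out evenly on the mel scale from 0 to sr/2."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

# FBANK features for one frame: log of the filterbank outputs on a power spectrum
rng = np.random.default_rng(0)
power_spectrum = np.abs(np.fft.rfft(rng.standard_normal(512))) ** 2
fbank_feats = np.log(mel_filterbank() @ power_spectrum + 1e-10)
```

Because neighbouring filters overlap, adjacent FBANK components are correlated - which is fine for a neural network, but is exactly why GMM systems with diagonal covariances prefer the decorrelated MFCCs.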
- t-SNE:
Hidden layer representations can be visualised using t-SNE, which projects the high-dimensional features (the dimension is the number of units in the layer) down to 2 or 3 dimensions for visualisation. These visualisations showed that the learned representations for FBANK features had slightly more structure than those for MFCCs.
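Such a projection can be sketched with scikit-learn's t-SNE implementation (assumed available); here random activations stand in for real hidden-layer outputs, and all sizes are hypothetical:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Hypothetical stand-in for hidden-layer activations: 50 frames x 20 units
activations = rng.standard_normal((50, 20))

# Project to 2 dimensions for plotting; perplexity must be below the point count
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
embedding = tsne.fit_transform(activations)
```

In the real analysis the rows would be frames labelled with their phone class, so that clustering by phone becomes visible in the 2-D scatter plot.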
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
This page maintained by Steve Renals.
Last updated: 2019/04/16 15:56:23UTC