ASR 2018-19
Lecture 7 - Neural Network Acoustic Models 1: Introduction
This lecture was an introduction to neural networks. Students have had different levels of exposure to neural networks, and there have been introductory lectures on this topic in other courses this semester, such as NLU+, so this lecture focused on aspects particular to speech recognition. If you are unfamiliar with this material, it is recommended to read chapters 1 and 2, and the first part of chapter 3, of Michael Nielsen's online book Neural Networks and Deep Learning, as well as chapter 7 (sections 7.1-7.3) of the draft third edition of Jurafsky and Martin.
Basics
- Local phonetic scores:
Neural networks can be motivated using the notion of local phonetic scores. Dynamic time warping and hidden Markov models can both be interpreted as having a sequence modelling part (the state sequence in an HMM) and a local phonetic score part: the output/emission probabilities in an HMM (typically Gaussians or Gaussian mixtures), or the Euclidean distance between frames in dynamic time warping. We can motivate the use of neural networks as a way of estimating such local scores.
- Single layer network:
If the score for each phoneme is a weighted sum of the input features (which might be MFCCs), then this corresponds to a linear (single layer) network.
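As a rough sketch (not code from the course), such a linear single-layer network can be written in a few lines of NumPy; the sizes used here (39-dimensional MFCC input, 40 phone classes) are illustrative assumptions:

    import numpy as np

    n_features, n_phones = 39, 40    # assumed sizes: 39-dim MFCC input, 40 phone classes
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(n_phones, n_features))   # one weight vector per phone
    b = np.zeros(n_phones)                                    # one bias per phone

    def phone_scores(x):
        # Linear (single-layer) network: each phone score is a weighted sum of the input features.
        return W @ x + b

    x = rng.normal(size=n_features)   # stand-in for one frame of MFCCs
    scores = phone_scores(x)          # one score per phone, shape (40,)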
- Error function:
To train a network we use the concept of an error function, a measure of the difference between the output of the network and the target output. Because the output of the network depends on the weights, the error function can also be considered a function of the weights. Optimising the network consists of adjusting the weights so as to reduce the error function.
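As a sketch of one such error function, the mean square error for the hypothetical linear network above could be written as:

    import numpy as np

    def mse(output, target):
        # Mean square error: half the summed squared difference between output and target.
        return 0.5 * np.sum((output - target) ** 2)

    def error_for_weights(W, b, x, target):
        # Because the output depends on the weights, the error is also a function of the weights.
        return mse(W @ x + b, target)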
- Output vector:
If the network is trained to identify phonemes, then the output and target vectors each have a length equal to the number of phonemes. The target vector is one-hot, with the component corresponding to the target phone being 1 and the other components being 0. You can think of the output vector as giving an estimate of the probability of each phone given the input.
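For example (the phone-set size of 40 and the target index are made-up values), a one-hot target vector could be built as:

    import numpy as np

    n_phones = 40                 # assumed size of the phone set
    target_index = 7              # hypothetical index of the correct phone for this frame
    target = np.zeros(n_phones)
    target[target_index] = 1.0    # 1 for the target phone, 0 for every other phone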
- Gradient descent:
Gradient descent is used to optimise the network. The key to this is estimating the gradient (first derivative) of the error function with respect to the weights. We can then use this to change the weights so that the error function decreases (we move downhill). For a weight in a single-layer network (with the mean square error function) the gradient has a simple form: the difference between the output and the target, multiplied by the input.
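A minimal sketch of one gradient-descent update for the linear network with the mean square error (the learning rate of 0.01 is an arbitrary choice):

    import numpy as np

    def sgd_step(W, b, x, target, lr=0.01):
        # One gradient-descent update for a linear network trained with mean square error.
        output = W @ x + b
        delta = output - target       # (output - target): the error signal for each phone
        W -= lr * np.outer(delta, x)  # dE/dW: the error signal multiplied by the input
        b -= lr * delta               # dE/db: the error signal itself
        return W, b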
- Softmax:
For a network that is trained as a classifier with 1-from-K outputs, we can use a specific output function called softmax, which forces the output values to act like probabilities (all outputs between 0 and 1, summing to 1). You can think of softmax as a "soft argmax" function, in that the class with the largest pre-activation gets the largest output (but less than 1), and the other classes get smaller outputs (but greater than 0).
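A standard softmax can be written as follows (subtracting the maximum is purely for numerical stability, not part of the definition):

    import numpy as np

    def softmax(a):
        # Softmax: outputs lie between 0 and 1 and sum to 1;
        # the largest pre-activation gets the largest output.
        e = np.exp(a - np.max(a))   # subtract the max for numerical stability
        return e / np.sum(e)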
- Cross-entropy error function:
If we are using softmax and 1-from-K outputs, then cross-entropy is a well-matched error function: minimising the cross-entropy corresponds to maximising the log likelihood of the target class.
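A sketch of the cross-entropy error for a one-hot target (the small added constant only avoids taking log of zero):

    import numpy as np

    def cross_entropy(output, target):
        # Cross-entropy for 1-from-K targets. With a one-hot target this reduces to
        # -log(output for the target class), so minimising it maximises the
        # log likelihood of the target class.
        return -np.sum(target * np.log(output + 1e-12))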
Extending the model
- Acoustic context:
When using neural networks for speech recognition we are not restricted to using a single frame for input. We can also use multiple frames of acoustic context before and after the target frame.
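One common way to do this is to splice a window of frames into a single input vector; a sketch (the 4 frames of context on each side and the 39-dimensional MFCCs are illustrative choices, not values from the lecture):

    import numpy as np

    def splice(frames, context=4):
        # Stack `context` frames either side of each frame into one long input vector.
        # frames: (n_frames, n_features) -> (n_frames, (2*context + 1) * n_features)
        padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
        return np.concatenate(
            [padded[i:i + len(frames)] for i in range(2 * context + 1)], axis=1)

    mfccs = np.random.randn(100, 39)   # stand-in for 100 frames of 39-dim MFCCs
    spliced = splice(mfccs)            # each row now covers a 9-frame window, shape (100, 351)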
- Hidden layers:
So far we have discussed linear single layer networks. We can construct more powerful networks if we add hidden layers. Each hidden layer applies a non-linearity (e.g. ReLU), and when the network is trained the hidden layers learn representations from the layer below. This hierarchical, layer-by-layer learning of richer representations is not something achievable with linear networks. (Note that a GMM can be considered as a network with a single hidden layer, the Gaussians.)
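A sketch of the forward pass through a network with one hidden layer and a ReLU non-linearity (the parameter shapes and other details are assumptions for illustration):

    import numpy as np

    def relu(a):
        # ReLU non-linearity applied to the hidden layer.
        return np.maximum(0.0, a)

    def mlp_forward(x, W1, b1, W2, b2):
        # One hidden layer: the hidden units learn a representation of the input,
        # and the output layer maps that representation to phone scores.
        h = relu(W1 @ x + b1)   # hidden representation
        return W2 @ h + b2      # pre-activations, to be passed through softmax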
- Back-propagation of error:
Training a multi-layer network can also be achieved with gradient descent, using the back-propagation algorithm, which gives a way to propagate gradients backwards through the network.
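A minimal sketch of back-propagation for the one-hidden-layer network above, with softmax outputs and the cross-entropy error (again an illustration, not the course's reference implementation):

    import numpy as np

    def backprop_step(x, target, W1, b1, W2, b2, lr=0.01):
        # Forward pass.
        h_pre = W1 @ x + b1
        h = np.maximum(0.0, h_pre)                    # ReLU hidden layer
        a = W2 @ h + b2
        y = np.exp(a - np.max(a)); y = y / y.sum()    # softmax outputs

        # Backward pass: propagate the error gradient from the output layer backwards.
        delta_out = y - target                        # output gradient (softmax + cross-entropy)
        grad_W2, grad_b2 = np.outer(delta_out, h), delta_out
        delta_hid = (W2.T @ delta_out) * (h_pre > 0)  # back-propagate through the ReLU
        grad_W1, grad_b1 = np.outer(delta_hid, x), delta_hid

        # Gradient-descent updates.
        W1 -= lr * grad_W1; b1 -= lr * grad_b1
        W2 -= lr * grad_W2; b2 -= lr * grad_b2
        return W1, b1, W2, b2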
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License
This page maintained by Steve Renals.
Last updated: 2019/04/16 12:56:43 UTC