ASR 2018-19
Lecture 9 - Neural Network Acoustic Models 3: CD DNNs and TDNNs
This lecture discussed the DNN- and TDNN-based acoustic models used in state-of-the-art systems. These are context-dependent deep neural networks with very wide output layers (typically 10,000 units or more) corresponding to the context-dependent tied states of an HMM/GMM system. We also introduced the time-delay neural network (TDNN), an architecture which builds up a wide receptive field over the input by having each hidden layer process a window of activations from the previous layer.
There are surprisingly few clear and comprehensive recent articles about DNN acoustic models. The best is probably Maas et al (2017), Building DNN acoustic models for large vocabulary speech recognition. For TDNNs the best paper to read is Peddinti et al (2015), A time delay neural network architecture for efficient modeling of long temporal contexts.
Context-dependent DNNs
- CD DNN for Switchboard:
The HMM/DNN system for the Switchboard corpus (conversational telephone speech) uses context-dependent HMMs: the neural network output layer has a unit for each state-clustered context-dependent HMM state. This can result in very wide output layers (a dimension of over 9,000 in the experiments discussed).
- Using GMM-based alignments to train DNNs:
Both the TIMIT and Switchboard experiments relied on first training a context-dependent HMM/GMM system, then using its state-clustered context-dependent states, together with the frame-state alignment from the trained HMM/GMM system, to generate the target label sequences required to train the neural network (see the sketch after this list).
- Switchboard results (2012):
Using 9k tied states and 7 hidden layers each with 2048 units, the HMM/DNN system obtained a word error rate reduction compared with a state-of-the-art HMM/GMM system.
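To make the shape of such a model concrete, here is a minimal sketch of a CD-DNN acoustic model trained on frame-level targets from an HMM/GMM alignment. It uses PyTorch purely for illustration (the course does not prescribe a toolkit); the layer sizes follow the Switchboard recipe above, while the input dimension and the tensors standing in for spliced frames and alignments are hypothetical.

```python
import torch
import torch.nn as nn

# Sketch of a CD-DNN acoustic model: 7 hidden layers of 2048 units and a
# wide softmax output with one unit per state-clustered CD HMM state (~9k).
# Input: a window of acoustic frames spliced into one vector, e.g.
# 11 frames x 40 filterbank features = 440 dimensions (illustrative).
class CDDNN(nn.Module):
    def __init__(self, input_dim=440, hidden_dim=2048,
                 num_hidden=7, num_tied_states=9000):
        super().__init__()
        layers, dim = [], input_dim
        for _ in range(num_hidden):
            layers += [nn.Linear(dim, hidden_dim), nn.Sigmoid()]
            dim = hidden_dim
        # Wide output layer: one unit per context-dependent tied state.
        layers.append(nn.Linear(dim, num_tied_states))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)  # logits over tied states

model = CDDNN()
# Frame-level targets come from the frame-state alignment produced by a
# previously trained HMM/GMM system (random tensors here, for illustration).
frames = torch.randn(32, 440)            # batch of spliced input frames
targets = torch.randint(0, 9000, (32,))  # aligned CD-state indices
loss = nn.CrossEntropyLoss()(model(frames), targets)
loss.backward()
```

The cross-entropy loss against per-frame CD-state indices is exactly the role the HMM/GMM alignment plays in training: it converts the word-level transcription into the frame-by-frame label sequence the network needs.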
TDNNs
- Time-delay neural networks (TDNNs):
Each hidden layer processes a context window from the previous layer, so each successive hidden layer has a wider receptive field over the input. This is in contrast to a standard DNN architecture, in which only the first hidden layer uses a context window (onto the input).
- ConvNet interpretation:
One can view a TDNN as a one-dimensional convolutional neural network (convolution in time).
- TDNN architecture:
A vanilla TDNN has many more weights than a DNN with the same number of layers and units per layer, because each TDNN hidden unit looks at a context window over the previous hidden layer, including past and future hidden activations. A TDNN layer using a context of [-2,+2] will have 5x as many weights as a similar DNN layer.
- Sub-sampled TDNN:
Sub-sample the window of hidden unit activations: e.g. include just t-2 and t+2 (written as {-2,2}) rather than t-2,t-1,t,t+1,t+2 (written as [-2,2]). Sub-sampling reduces the model size (number of weights). Asymmetric context windows such as {-7,2} can also be used. In the ConvNet view, a sub-sampled window corresponds to a dilated convolution; see the sketch after this list.
- Sub-sampled TDNN results:
Peddinti et al (2015) performed various experiments, obtaining an improvement over regular DNNs across several datasets. A typical sub-sampled TDNN configuration for a 5-layer network is [-2,2], {-1,2}, {-3,3}, {-7,2}, {0}; the output layer then has a receptive field over the input with context width (-13,9).
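The following sketch shows one way to realize that quoted splice configuration as a stack of dilated 1-D convolutions, which makes both the ConvNet interpretation and the effect of sub-sampling concrete. Again PyTorch is used only for illustration, and the hidden width (512) is a made-up size; the kernel/dilation mapping is our reading of the splice offsets, not code from the paper.

```python
import torch
import torch.nn as nn

# The 5-layer sub-sampled TDNN configuration quoted above, viewed as 1-D
# convolutions in time (channels = hidden units). Each splice maps onto
# (kernel_size, dilation), since only the spacing between taps matters:
#   [-2,2]  -> kernel 5, dilation 1   (full window)
#   {-1,2}  -> kernel 2, dilation 3   (two taps, 3 frames apart)
#   {-3,3}  -> kernel 2, dilation 6
#   {-7,2}  -> kernel 2, dilation 9
#   {0}     -> kernel 1               (no splicing)
# Asymmetric windows like {-7,2} are handled by how output frames are
# aligned to input time; the tap spacing is all the convolution needs.
feat_dim, hidden, num_tied_states = 40, 512, 9000   # illustrative sizes
tdnn = nn.Sequential(
    nn.Conv1d(feat_dim, hidden, kernel_size=5, dilation=1), nn.ReLU(),
    nn.Conv1d(hidden, hidden, kernel_size=2, dilation=3), nn.ReLU(),
    nn.Conv1d(hidden, hidden, kernel_size=2, dilation=6), nn.ReLU(),
    nn.Conv1d(hidden, hidden, kernel_size=2, dilation=9), nn.ReLU(),
    nn.Conv1d(hidden, num_tied_states, kernel_size=1),  # {0}: per-frame output
)

x = torch.randn(1, feat_dim, 100)   # (batch, features, frames)
y = tdnn(x)
print(y.shape)  # (1, 9000, 78): each output frame sees 23 input frames,
                # i.e. the (-13,+9) receptive field quoted above.
```

The parameter counts also make the earlier weight-count point concrete: the kernel-5 first layer has 5x the weights per unit of a comparable DNN layer, while the sub-sampled kernel-2 layers have only 2x.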
Copyright (c) University of Edinburgh 2015-2019
The ASR course material is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License.