Lecture 19 - WaveNet
This lecture discussed a recent paper by Aäron van den Oord and colleagues at DeepMind, which presents an exciting approach to generative models of speech at the sample level.
- WaveNet is a probabilistic autoregressive model of speech at the sample level, based on very deep convolutional networks.
- The deep convolutional architecture (without pooling) stacks convolutional layers, with the aim of providing a long (but finite) context of samples for predicting the next sample. To achieve a long temporal context efficiently, stacked dilated convolutions are employed.
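As a rough illustration (not the paper's code), the context length of such a stack can be computed from the kernel size and the dilation rates:

```python
def receptive_field(dilations, kernel_size=2):
    # Each dilated causal conv layer adds (kernel_size - 1) * dilation
    # samples of left context; the input sample itself contributes 1.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# Doubling dilations 1, 2, 4, ..., 512, as used in the WaveNet paper
dilations = [2 ** i for i in range(10)]
print(receptive_field(dilations))  # -> 1024 samples per stack
```

Without dilation, reaching the same 1024-sample context with kernel size 2 would require over a thousand layers; doubling the dilation rate reaches it in ten.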
- The output layer of WaveNet is a softmax over possible audio sample values. In the paper each sample is coded in 8 bits (256 possible values) using mu-law encoding. WaveNet can thus be viewed as a language model over audio samples.
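A minimal sketch of the mu-law companding and quantisation step (following the standard G.711-style formula; exact rounding conventions vary between implementations):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # Compand x in [-1, 1] with the mu-law curve, then quantise to
    # mu + 1 = 256 integer levels in 0..255 (the softmax targets).
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((compressed + 1) / 2 * mu + 0.5).astype(int)

print(mu_law_encode(np.array([-1.0, 0.0, 1.0])))  # -> [  0 128 255]
```

The companding allocates more quantisation levels to low-amplitude samples, which is where most of the perceptually relevant detail in speech lies.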
- Skip and residual connections are also used to help with training such deep networks.
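One residual block combines a gated activation unit with 1x1 projections onto the residual and skip paths. A minimal numpy sketch (the dilated convolutions are abstracted to matrix products, and all weight names are illustrative stand-ins, not the paper's parameters):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def residual_block(x, w_filter, w_gate, w_res, w_skip):
    # Gated activation unit: a tanh "filter" branch multiplied
    # elementwise by a sigmoid "gate" branch.
    z = np.tanh(w_filter @ x) * sigmoid(w_gate @ x)
    residual = x + w_res @ z   # residual connection eases deep training
    skip = w_skip @ z          # skip outputs are summed across blocks
    return residual, skip
```

The residual output feeds the next block, while the skip outputs from all blocks are summed before the final softmax layers.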
- To generate speech, auxiliary inputs are used to encode the talker and, for TTS, linguistic features and F0.
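Such conditioning can be sketched by projecting an auxiliary vector h (e.g. a speaker embedding; all names here are illustrative) into both branches of the gated activation and broadcasting it over time:

```python
import numpy as np

def conditioned_gate(x, h, w_f, w_g, v_f, v_g):
    # Global conditioning: h is projected by v_f, v_g and broadcast
    # over the time axis of x (channels x time) in both branches.
    filt = np.tanh(w_f @ x + (v_f @ h)[:, None])
    gate = 1.0 / (1.0 + np.exp(-(w_g @ x + (v_g @ h)[:, None])))
    return filt * gate
```

For time-varying conditioning (such as linguistic features in TTS), the projected term would itself vary over time rather than being broadcast.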
- WaveNet has also been used in ASR experiments, essentially as a trainable front end. In this case multi-task learning was used to simultaneously optimise the next sample prediction, and the phonetic class.
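A hedged sketch of how the two objectives might be combined in such a multi-task setup (the weighting `lam` is an assumed hyperparameter, not a value from the paper):

```python
def multitask_loss(sample_nll, phone_nll, lam=0.5):
    # Weighted sum of the next-sample prediction loss and the
    # phone-classification loss; lam trades off the two tasks.
    return sample_nll + lam * phone_nll

print(multitask_loss(2.0, 4.0))  # -> 4.0
```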
This page maintained by Steve Renals.
Last updated: 2017/03/31 11:55:45UTC