Lecture 19 - WaveNet
This lecture discussed a recent paper by Aäron van den Oord and colleagues at DeepMind, which presents an exciting approach to generative models of speech at the sample level.
WaveNet is a probabilistic autoregressive model of speech at the sample level, based on very deep convolutional networks.
The deep convolutional architecture (without pooling) stacks convolutional layers, with the aim of providing a long (but finite) context of samples for predicting the next sample. To achieve a long temporal context efficiently, stacks of dilated convolutions are employed.
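As a back-of-the-envelope sketch (not the paper's exact configuration), each dilated layer with kernel size k and dilation d adds (k−1)·d samples of context, so a stack whose dilations double layer by layer grows the receptive field exponentially with depth:

```python
def receptive_field(kernel_size, dilations):
    # each layer adds (kernel_size - 1) * dilation samples of context
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# one WaveNet-style stack: kernel size 2, dilations doubling from 1 to 512
dilations = [2 ** i for i in range(10)]
print(receptive_field(2, dilations))  # 1024 samples of context
```

Repeating such a stack several times extends the receptive field further while keeping the layer count, and hence the computation, modest.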
The output layer of WaveNet is a softmax over possible audio sample values. In the paper each sample is coded in 8 bits (256 possible values) using mu-law encoding. WaveNet can thus be viewed as a language model over audio samples.
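A minimal sketch of mu-law companding and 8-bit quantisation, assuming samples are normalised to [-1, 1] (the function names are illustrative, not from the paper):

```python
import numpy as np

def mu_law_encode(x, mu=255):
    # companding transform: fine resolution near zero, coarse near +/-1
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # quantise [-1, 1] to 256 integer levels (0..255)
    return ((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mu_law_decode(q, mu=255):
    # invert the quantisation, then the companding transform
    y = 2 * (q.astype(np.float64) / mu) - 1
    return np.sign(y) * np.expm1(np.abs(y) * np.log1p(mu)) / mu
```

The non-linear companding means the 256 softmax classes cover the perceptually important low-amplitude region more densely than uniform quantisation would.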
- Skip and residual connections are also used to help train such deep networks.
- To generate speech, auxiliary inputs are used to encode the speaker identity and, for TTS, linguistic features and F0.
- WaveNet has also been used in ASR experiments, essentially as a trainable front end. In this case multi-task learning was used to simultaneously optimise next-sample prediction and phonetic classification.
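The gated residual block described in the paper can be sketched roughly as follows (a single-channel NumPy simplification with illustrative weights, not the paper's multi-channel architecture):

```python
import numpy as np

def causal_dilated_conv(x, w, dilation):
    # y[t] = sum_i w[i] * x[t - i*dilation]; zero-padded so y[t] never sees the future
    k = len(w)
    pad = np.concatenate([np.zeros((k - 1) * dilation), x])
    return sum(w[i] * pad[(k - 1 - i) * dilation:(k - 1 - i) * dilation + len(x)]
               for i in range(k))

def residual_block(x, w_filter, w_gate, dilation):
    # gated activation unit: tanh "filter" modulated by a sigmoid "gate"
    f = np.tanh(causal_dilated_conv(x, w_filter, dilation))
    g = 1.0 / (1.0 + np.exp(-causal_dilated_conv(x, w_gate, dilation)))
    z = f * g
    # residual connection eases optimisation; z also feeds the skip path
    return x + z, z

# toy forward pass through four blocks with doubling dilations
rng = np.random.default_rng(0)
x = rng.standard_normal(32)
skip_total = np.zeros_like(x)
for d in [1, 2, 4, 8]:
    x, skip = residual_block(x, rng.standard_normal(2), rng.standard_normal(2), d)
    skip_total += skip  # summed skip connections feed the output layers
```

In the paper the summed skip connections pass through further 1x1 convolutions and ReLUs before the softmax over sample values.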
This page maintained by Steve Renals.
Last updated: 2017/03/31 11:55:45UTC