Automatic Speech Recognition (ASR) 2023-24: Lectures
There are 18 lectures, taking place in weeks 1-9. Lectures are held on Mondays and Thursdays at 14:10, starting Monday 15 January. Monday lectures are held in the Adam House Basement Lecture Theatre and Thursday lectures are held in the HRB Lecture Theatre in the Hugh Robson Building on George Square. Future lecture topics are subject to change.
Lecture live streaming is available via Media Hopper Replay for students not able to attend in person – the link can be found on Learn under “Course Materials”.
All listed reading is optional and will not be examinable. Works listed as Reading may be useful to improve your understanding of the lecture content; Background reading is for interest only.
-
Monday 15 January 2024.
Introduction to Speech Recognition
Slides
Reading:
J&M: chapter 7, section 9.1; R&H review chapter (sec 1).
-
Thursday 18 January 2024.
Speech Signal Analysis 1
Slides
Reading:
O'Shaughnessy (2000), Speech Communications: Human and Machine, chapter 2;
J&M: Sec 9.3; Paul Taylor (2009), Text-to-Speech Synthesis: Ch 10 and Ch 12.
SparkNG: MATLAB realtime/interactive tools for speech science research and education
-
Monday 22 January 2024.
Speech Signal Analysis 2
Slides
Reading:
O'Shaughnessy (2000), Speech Communications: Human and Machine, chapters 3-4
-
Thursday 25 January 2024.
Introduction to Hidden Markov Models
Slides (updated references 1 Feb)
Reading:
Rabiner & Juang (1986) tutorial; J&M: Secs 6.1-6.5, 9.2, 9.4;
R&H review chapter (secs 2.1, 2.2).
-
Monday 29 January 2024.
HMM algorithms
Slides (updated references 1 Feb; corrections 14 Feb and 25 Apr; errata)
and introduction to the labs
Reading:
J&M: Sec 9.7;
G&Y review (sections 1, 2.1, 2.2);
(J&M: Secs 9.5, 9.6, 9.8 for an introduction to decoding).
-
Thursday 1 February 2024.
Gaussian mixture models
Slides (updated 20 Feb and 25 Apr; errata)
Reading:
R&H review chapter (sec 2.2)
-
Monday 5 February 2024.
HMM acoustic modelling 3: Context-dependent phone modelling
Slides (updated 20 Feb; errata)
Reading:
J&M: Sec 10.3;
R&H review chapter (sec 2.3); Young (2008).
-
Thursday 8 February 2024.
Large vocabulary ASR
Slides
Background reading: Ortmanns & Ney; Young (2008), sec 27.2.4
-
Monday 12 February 2024.
ASR with WFSTs
Slides (updated 3 Mar; errata)
Reading:
Mohri et al (2008), Speech recognition with weighted finite-state transducers, in Springer Handbook of Speech Processing (sections 1 and 2)
-
Thursday 15 February 2024.
Neural network acoustic models 1: Introduction
Slides (updated 25 Apr; errata)
Reading:
Jurafsky and Martin (draft 3rd edition), chapter 7 (secs 7.1 - 7.4)
Background Reading:
M Nielsen (2014), Neural networks and deep learning - chapter 1 (introduction), chapter 2 (back-propagation algorithm), chapter 3 (the parts on cross-entropy and softmax).
Monday 19 - Friday 23 February 2024.
NO LECTURES OR LABS - FLEXIBLE LEARNING WEEK.
-
Monday 26 February 2024.
Neural network acoustic models 2: Hybrid HMM/DNN systems
Slides (updated 25 Apr; errata)
Background Reading:
Morgan and Bourlard (May 1995). Continuous speech recognition: Introduction to the hybrid HMM/connectionist approach, IEEE Signal Processing Mag., 12(3):24-42
Mohamed et al (2012). Understanding how deep belief networks perform acoustic modelling, ICASSP-2012.
-
Thursday 29 February 2024.
Neural Networks for Acoustic Modelling 3: DNN architectures
Slides
Reading:
Maas et al (2017), Building DNN acoustic models for large vocabulary speech recognition, Computer Speech and Language, 41:195-213.
Background reading: Peddinti et al (2015). A time delay neural network architecture for efficient modeling of long temporal contexts, Interspeech-2015
Graves et al (2013), Hybrid speech recognition with deep bidirectional LSTM, ASRU-2013.
-
Monday 4 March 2024.
Speaker Adaptation
Slides
Reading:
G&Y review, sec. 5
Woodland (2001), Speaker adaptation for continuous density HMMs: A review, ISCA Workshop on Adaptation Methods for Speech Recognition
Bell et al (2021), Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview, IEEE Open Journal of Signal Processing, 2:33-36.
-
Thursday 7 March 2024.
Multilingual and low-resource speech recognition
Slides (updated 29 Apr; errata)
Background reading:
Besacier et al (2014), Automatic speech recognition for under-resourced languages: A survey, Speech Communication, 56:85-100.
Huang et al (2013). Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers, ICASSP-2013.
-
Monday 11 March 2024
Discriminative training
Slides
Reading:
Sec 27.3.1 of Young (2008), HMMs and Related Speech Recognition Technologies.
-
Thursday 14 March 2024.
End-to-end systems 1: CTC
Slides
Reading:
A Hannun et al (2014), Deep Speech: Scaling up end-to-end speech recognition, arXiv:1412.5567.
A Hannun (2017), Sequence Modeling with CTC, Distill.
Background Reading:
Y Miao et al (2015), EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding, ASRU-2015.
A Maas et al (2015). Lexicon-free conversational speech recognition with neural networks, NAACL HLT 2015.
-
Monday 18 March 2024.
End-to-end systems 2: Encoder-decoder models
Slides (added RNN-T forward recursion slides 25 Apr)
Reading:
W Chan et al (2015), Listen, attend and spell: A neural network for large vocabulary conversational speech recognition, ICASSP.
R Prabhavalkar et al (2017), A Comparison of Sequence-to-Sequence Models for Speech Recognition, Interspeech.
Background Reading:
C-C Chiu et al (2018), State-of-the-art sequence recognition with sequence-to-sequence models, ICASSP.
S Watanabe et al (2017), Hybrid CTC/Attention Architecture for End-to-End Speech Recognition, IEEE STSP, 11:1240-1252.
-
Thursday 21 March 2024.
Guest lecture: self-supervised learning for speech
Slides
Background Reading:
A van den Oord et al (2018), Representation learning with contrastive predictive coding
S Schneider et al (2019), wav2vec: Unsupervised pre-training for speech recognition, Interspeech.
-
Dates to be confirmed.
Revision tutorials will be scheduled to take place 1-2 weeks prior to the exam.
Reading
All listed reading is optional and will not be examinable. Works listed as Reading may be useful to improve your understanding of the lecture content; Background reading is for interest only.
Textbook
- J&M: Daniel Jurafsky and James H. Martin (2008). Speech and Language Processing, Pearson Education (2nd edition).
You can also look at the draft 3rd edition online – we take a much broader view of ASR than covered in this edition, but material in Appendix A and Chapter 16 is useful.
Review and Tutorial Articles
- G&Y: MJF Gales and SJ Young (2007). The Application of Hidden Markov Models in Speech Recognition, Foundations and Trends in Signal Processing, 1 (3), 195-304.
- S Young (1996). A review of large-vocabulary continuous-speech recognition, IEEE Signal Processing Magazine 13 (5), 45-57.
- R&H: S Renals and T Hain (2010). Speech Recognition, in Computational Linguistics and Natural Language Processing Handbook, A Clark, C Fox and S Lappin (eds.), Blackwells, chapter 12, 299-332.
- G Hinton et al (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, 29(6):82-97.
- S Young (2008). HMMs and Related Speech Recognition Technologies, in Springer Handbook of Speech Processing, J Benesty, MM Sondhi and Y Huang (eds), chapter 27, 539-557.
Other supplementary materials
- In case you need more introductory articles on speech signal analysis (Lectures 2 and 3): Daniel P.W. Ellis, "An introduction to signal processing for speech", chapter 22 in The Handbook of Phonetic Sciences, 2nd ed., ed. Hardcastle, Laver, and Gibbon, pp. 757-780, Blackwell, 2008.
- Speech.zone by Prof Simon King at the University of Edinburgh.
Copyright (c) University of Edinburgh 2015-2024
The ASR course material is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International Public License (licence.txt).
This page maintained by Peter Bell.
Last updated: 2024/04/29 21:41:43UTC