References for ASR

Textbook

Daniel Jurafsky and James H. Martin (2008). Speech and Language Processing, Pearson Education (2nd edition). (Errata) [chapters 6, 9, 10]

Wikipedia

Wikipedia coverage of most ASR topics is very poor. However, the following entries on some basic pattern recognition and density estimation topics are OK:

Review and Tutorial Articles

S Renals and T Hain (2010). Speech Recognition, to appear in Computational Linguistics and Natural Language Processing Handbook, A Clark, C Fox and S Lappin (eds.), Blackwells.
MJF Gales and SJ Young (2007). The Application of Hidden Markov Models in Speech Recognition, Foundations and Trends in Signal Processing, 1 (3), 195-304.
S Young (1996). A review of large-vocabulary continuous-speech recognition, IEEE Signal Processing Magazine 13 (5), 45--57.
J-L Gauvain and L Lamel (2000). Large-vocabulary continuous speech recognition: advances and applications, Proceedings of the IEEE, 88 (8), 1181-1200.
PC Woodland (2002). The development of the HTK Broadcast News transcription system: An overview, Speech Communication, 37(1--2), 47-67.
S Young (2008). HMMs and Related Speech Recognition Technologies, in Springer Handbook of Speech Processing, J Benesty, MM Sondhi and Y Huang (eds), chapter 27, 539--557.

HMM

L Rabiner and B Juang (1986). An introduction to hidden Markov models, IEEE ASSP Magazine, 3 (1), 4--16.
JR Bellegarda and D Nahamoo (1990). Tied mixture continuous parameter modeling for speech recognition, IEEE Trans ASSP, 38 (12), 2033-2045.
XD Huang (1992). Phoneme classification using semicontinuous hidden Markov models, IEEE Trans Signal Processing, 40 (5), 1062-1067.
SJ Young and PC Woodland (1994). State clustering in hidden Markov model-based continuous speech recognition, Computer Speech and Language, 8 (4), 369-383.

Context-dependent phone models

R. Schwartz, Y. Chow, O. Kimball, S. Roucos, M. Krasner and J. Makhoul (1985). Context-dependent modeling for acoustic-phonetic recognition of continuous speech, Proc IEEE ICASSP-85, 1205-1208.
K-F Lee (1990). Context-dependent phonetic hidden Markov models for speaker-independent continuous speech recognition, IEEE Trans on ASSP, 38(4), 599-609.
LR Bahl, PV de Souza, PS Gopalakrishnan, D Nahamoo and MA Picheny (1991). Context Dependent Modeling of Phones in Continuous Speech Using Decision Trees, Proc DARPA Speech and Natural Language Processing Workshop, 264-270.
S Young, J Odell and P Woodland (1994). Tree-based state tying for high accuracy acoustic modelling, Proc HLT Workshop, 307-312.

Pronunciation modelling

E. Fosler-Lussier (2003). A tutorial on pronunciation modeling for large vocabulary speech recognition, in S. Renals and G. Grefenstette (eds) Text and Speech Triggered Information Access, LNAI 2705, Springer.
T. Hain (2002). Implicit pronunciation modelling in ASR, in Proc. ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation (PMLA--2002), pp. 129-134.

Language modelling

C. Manning and H. Schütze (1999). Foundations of Statistical Natural Language Processing, MIT Press. (Chapter 6)
Y. Gotoh and S. Renals (2003). Language modelling, in S. Renals and G. Grefenstette (eds) Text and Speech Triggered Information Access, LNAI 2705, Springer.
F. Jelinek (1991). Up from trigrams! The struggle for improved language models, Proc. Eurospeech.
C. Chelba and F. Jelinek (2000). Structured language modeling, Computer Speech and Language, 14:283-332.
J. Goodman (2001). A bit of progress in language modeling, Computer Speech and Language, 15:403-434.

Search

X Aubert (2002). An overview of decoding techniques for large vocabulary continuous speech recognition, Computer Speech and Language, 16:89-114.
M Mohri, F Pereira and M Riley (2002). Weighted finite-state transducers in speech recognition, Computer Speech and Language, 16:69-88.
D Moore et al (2006). Juicer: A WFST speech decoder, in Proc MLMI-06, Springer LNCS 4299, 285-296.

Speaker Adaptation

P Woodland (2001). Speaker adaptation for continuous density HMMs: A review, Proceedings of the ISCA workshop on adaptation methods for speech recognition, 11-19.
M Gales and P Woodland (1996). Mean and variance adaptation within the MLLR framework, Computer Speech and Language, 10:249-264.
M Gales (1998). Maximum likelihood linear transformations for HMM-based speech recognition, Computer Speech and Language, 12:75-98.
M Gales (2000). Cluster adaptive training of hidden Markov models, IEEE Trans Speech and Audio Processing, 8:417-428.
R Kuhn, JC Junqua, P Nguyen and N Niedzielski (2000). Rapid speaker adaptation in eigenvoice space, IEEE Trans Speech and Audio Processing, 8:695-707.
G Garau, S Renals and T Hain (2005). Applying vocal tract length normalization to meeting recordings, Proc Interspeech'05.

Large Vocabulary Systems

P Woodland (2002). The development of the HTK Broadcast News transcription system: An overview, Speech Communication, 37(1-2):47-67.
SF Chen et al (2006). Advances in speech transcription at IBM under the DARPA EARS program, IEEE Trans Speech and Audio Processing, 14(5):1596-1608.
S. Renals, T. Hain and H. Bourlard (2007). Recognition and understanding of meetings: The AMI and AMIDA projects, Proc. IEEE ASRU.
T. Hain et al (2005). The development of the AMI system for the transcription of speech in meetings, Proc. MLMI '05.

Discriminative Training

A Nadas (1983). A decision theoretic formulation of a training problem in speech recognition and a comparison of training by unconditional versus conditional maximum likelihood, IEEE Trans ASSP, 31(4):814-817.
L Bahl, P Brown, P de Souza and R Mercer (1986). Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proc IEEE ICASSP '86.
Y Normandin and SD Morgera (1992). An improved MMIE training algorithm for speaker-independent, small vocabulary, continuous speech recognition, Proc IEEE ICASSP '92.
PC Woodland and D Povey (2002). Large scale discriminative training of hidden Markov models for speech recognition, Computer Speech and Language, 16(1):25-47.
D. Povey (2003). Discriminative Training for Large Vocabulary Speech Recognition, PhD thesis, University of Cambridge.

Robustness

J Droppo and A Acero (2008). Environmental Robustness, in Springer Handbook of Speech Processing, J Benesty, MM Sondhi and Y Huang (eds), chapter 33, 653--680.

(Deep) neural networks

N Morgan and H Bourlard (May 1995). Continuous speech recognition: An introduction to the hybrid HMM/connectionist approach, IEEE Signal Processing Magazine, 12(3), 24-42.
N Morgan et al (Sep 2005). Pushing the envelope - aside, IEEE Signal Processing Magazine, 22(5), 81-88.
F Grezl and P Fousek (2008). Optimizing bottleneck features for LVCSR, Proc ICASSP-2008.
G Hinton et al (Nov 2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups, IEEE Signal Processing Magazine, 29(6), 82--97.

Transcribing TED data

M. Federico et al (2012). Overview of the IWSLT 2012 evaluation campaign, in Proc. IWSLT-2012.
H. Yamamoto et al (2012). The NICT ASR system for the IWSLT2012, in Proc. IWSLT-2012.
E. Hasler et al (2012). The UEDIN systems for the IWSLT 2012 evaluation, in Proc. IWSLT-2012.
P. Bell et al (2013). Multi-level adaptive networks in tandem and hybrid ASR systems, in Proc. ICASSP-2013.
T. Mikolov et al (2010). Recurrent neural network based language model, in Proc. Interspeech-2010.

Introductory texts (now getting rather old)

Comp.Speech FAQ
Speech Analysis
"Spoken Language Input", which is the first chapter of Survey of the State of the Art in Human Language Technology

Language Models

Statistical Methods in Computational Linguistics by Mark Gawron @ San Diego State Univ.
SRILM Manual Pages
Good-Turing Smoothing Without Tears by William A. Gale @ AT&T Bell Labs, 1994.
A Survey of Smoothing Techniques for ME Models by Stanley F. Chen, Ronald Rosenfeld, IEEE Trans SAP, Vol.8, No.1, January 2000.
A study of smoothing methods for language models applied to information retrieval by Chengxiang Zhai and John Lafferty @ CMU, ACM Trans on Information Systems, Vol. 22, Issue 2, pp.179-214, April 2004.
