Q: My script, which was supposed to run for many hours, stopped running properly (or produced lots of error messages such as "... cannot access file ...") after it had run properly for a while. Why?
A: It could be due to the expiration of your file-access authentication (i.e. your Kerberos ticket for the AFS file system). For details, see http://computing.help.inf.ed.ac.uk/afs-top-ten-tips#Tip07, which describes a wrapper script, "longjob", available on DICE to avoid the problem.

Q: Why do we need to start with HCompV? (In the Speech Processing course we started the process with HInit.) What is the advantage of starting with this one?
A: Recall that the HMM training algorithm (i.e. the EM/Baum-Welch algorithm) does not guarantee a global optimum, meaning initial values do matter. See the HTK manual entry for HCompV, check how it is used in the script prep_monophone, and look at the output files. It would be interesting to investigate how much effect it has on recognition performance.

Q: What is the difference between HRest and HERest? The HTK documentation says that HERest performs an "embedded training version of Baum-Welch" and that it uses a "composite model". I understand the general procedure of Baum-Welch, but I don't understand what is meant by "embedded" and "composite model".
A: HRest trains each HMM separately, using the segments identified by the phone labels, whereas HERest trains the whole set of models on entire utterances (phone or word sequences): for each utterance, a very long composite HMM is constructed virtually by concatenating the HMMs in the order given by the transcription, so that the composite HMM corresponds to the whole utterance. (This should have been mentioned in the lecture slides.) For details, see the HTK manual (Sections 8.5 and 17.7) and Jurafsky and Martin's Speech and Language Processing (Section 9.7).

Q: Why do we train the sp (short pause) model in the middle of the whole training phase and not with the rest of the monophone models? Is it actually replacing the "sil" model?
A: "sp" is not given in the original hand-labelled transcriptions, but we would like an sp model to handle the short pauses that may appear (e.g. between words) in real speech. It does not replace "sil": in the standard recipe, sp is a skippable "tee" model whose single emitting state is tied to the centre state of sil. See HTK manual Section 3.2.2, and look at the output model, i.e. hmm5/MODEL, after running mk_sp_model.

Q: Why do we need to re-align the labels with the sentences in the middle of training using HVite? I understand that HInit also uses the Viterbi algorithm to perform alignment, so what is the difference between the two? Is it a valid experiment to skip this step and see what effect it has on recognition?
A: See HTK manual Section 3.2.3. Yes, it would be interesting to look into this.

Q: Why do we use three different dictionaries: one for training, one for HVite recognition, and one for HResults?
A: The dictionary for HVite and the one for HResults can be basically the same; we just need the word information for HResults. I don't remember whether the dictionary for HResults causes any errors if it contains additional phone information. You could merge the dictionaries for training and recognition.

Q: Is it possible to see the language model in a more meaningful way (i.e. to see what kind of word combinations are being allowed, e.g. whether all possible word combinations are permitted)?
A: There is a file "2gram.arpa" in the language-model directory, which is the original bigram from which 2gram.net was built. A bit more information can be found in Org/language_modeles.

Some illustrative commands and file snippets for the answers above are collected below.
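Regarding the expiring-ticket question: a minimal usage sketch for longjob, assuming your experiment is wrapped in a script called run_training.sh (a hypothetical name; check the help page above or longjob's own usage message for the exact options on your system). longjob asks for your password once and then keeps renewing your Kerberos/AFS credentials while the job runs:

    longjob -28day -c ./run_training.sh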
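Regarding the HCompV question: HCompV performs a "flat start", i.e. it computes the global mean and variance of the whole training set and copies them into every Gaussian of a prototype model, so that all monophones begin from the same data-derived starting point; HInit, by contrast, needs phone-level segmentations to initialise each model individually. A typical invocation, with file names taken from the HTK book tutorial rather than from the course scripts:

    HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

Here -m makes HCompV update the means as well as the variances, and -f 0.01 additionally writes a variance-floor macro file (vFloors) set to 0.01 times the global variance.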
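Regarding the HRest/HERest question: the difference shows up directly in the command lines. HRest re-estimates one HMM from the segments bearing that model's label, while HERest takes the complete model set plus utterance-level transcriptions and re-estimates all models at once through the composite HMM. Sketches using HTK-book-style file names (the course scripts may use different paths and options):

    # isolated-unit training: re-estimate only the model for phone "aa"
    HRest -S train.scp -L labels -l aa -M hmm1 hmm0/aa

    # embedded training: re-estimate all monophones from utterance-level labels
    HERest -C config -I phones.mlf -t 250.0 150.0 1000.0 -S train.scp \
           -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones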
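Regarding the sp question: the HTK book (Section 3.2.2) builds sp as a one-emitting-state tee model and then ties that state to the centre state of sil with an HHEd edit script, sil.hed, containing:

    AT 2 4 0.2 {sil.transP}
    AT 4 2 0.2 {sil.transP}
    AT 1 3 0.3 {sp.transP}
    TI silst {sil.state[3],sp.state[2]}

The AT commands add extra transitions (forward and backward skips within sil, and the entry-to-exit "tee" transition that lets sp be skipped entirely), and TI ties sp's emitting state to sil's centre state. It is applied with something like the following (directory names are the book's; mk_sp_model may differ):

    HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones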
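Regarding the re-alignment question: in alignment mode, HVite takes the current models, the word-level transcriptions, and the dictionary, and outputs new phone-level labels, choosing between alternative pronunciations (and deciding where to insert sp/sil) by Viterbi score; the earlier labels were fixed before any models existed. A sketch following the HTK book (Section 3.2.3; paths are the book's, not necessarily the course's):

    HVite -a -m -l '*' -o SWT -b silence -C config -H hmm7/macros -H hmm7/hmmdefs \
          -i aligned.mlf -I words.mlf -t 250.0 -y lab -S train.scp dict monophones

-a switches HVite to alignment mode (the network is built per utterance from the word-level MLF given with -I), and -i writes the resulting phone-level MLF.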
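Regarding the dictionary question: an HTK dictionary has one entry per line, mapping a word to an optional output symbol (in square brackets) and a pronunciation. Training and recognition need the phone sequences; HResults scores on the word labels only, so the phones are irrelevant there. Illustrative entries (the words and phones are made up for this example):

    CALL  [CALL]  k ao l sp
    CALL  [CALL]  k ao l sil
    ZERO  [ZERO]  z ih r ow sp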
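Regarding the language-model question: 2gram.arpa is plain text in the standard ARPA n-gram format (a log10 probability followed by the words of each n-gram), so you can simply read it. To see what word sequences the compiled network actually allows, you can also sample random sentences from it with HSGen; to the best of my recollection the invocation is along these lines (the file names are assumed from the lab setup, and -n gives the number of sentences to generate):

    HSGen -n 10 2gram.net dict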