Q: My script, which was supposed to run for many hours, stopped running properly (or produced lots of error messages such as "... cannot access file ...") after it had run properly for a while. Why?
A: It could be due to the expiration of your file-access authentication (i.e. your Kerberos ticket for the AFS file system). For details, see http://computing.help.inf.ed.ac.uk/afs-top-ten-tips#Tip07, which describes a wrapper script, "longjob", available on DICE to avoid the problem.

Q: Why do we need to start with HCompV? (In the Speech Processing course we started the process with HInit.) What is the advantage of starting with this one?
A: Recall that the HMM training algorithm (i.e. the EM/Baum-Welch algorithm) does not guarantee a global optimum, meaning initial values do matter. See the HTK manual entry for HCompV, check how it is used in the script prep_monophone, and look at the output files. It would be interesting to investigate how much effect it has on recognition performance.

Q: What is the difference between HRest and HERest? The HTK documentation says that HERest performs an "embedded training version of Baum-Welch" and that it uses a "composite model". I understand the general procedure of Baum-Welch, but I don't understand what is meant by "embedded" and "composite model".
A: HRest trains each HMM separately, using the segments identified by the phone labels, whereas HERest trains the whole set of models on entire utterances (phone or word sequences): for each utterance, a very long composite HMM is constructed virtually by concatenating the HMMs in the order given by the transcription, so that the composite HMM corresponds to the whole utterance. (This should have been mentioned in the lecture slides.) For details, see the HTK manual (Sections 8.5 and 17.7) and Jurafsky and Martin's Speech and Language Processing (Section 9.7).

Q: Why do we train the sp (short pause) model in the middle of the whole training phase and not with the rest of the monophone models? Is it actually replacing the "sil" model?
A: "sp" is not given in the original hand-labelled transcriptions, but we would like an sp model to handle the short pauses that may appear (e.g. between words) in real speech. It does not replace "sil": in the standard recipe, sp is a skippable "tee" model whose single emitting state is tied to the centre state of sil. See HTK manual Section 3.2.2, and look at the output model, i.e. hmm5/MODEL, after running mk_sp_model.

Q: Why do we need to re-align the labels with the sentences in the middle of training using HVite? I understand that HInit also uses the Viterbi algorithm to perform alignment, so what is the difference between the two? Is it a valid experiment to skip this step and see what effect it has on recognition?
A: See HTK manual Section 3.2.3. Yes, it would be interesting to look into this.

Q: Why do we use three different dictionaries: one for training, one for HVite recognition, and one for HResults?
A: The dictionary for HVite and the one for HResults can be basically the same; we just need the word information for HResults. I don't remember whether the dictionary for HResults causes any errors if it contains additional phone information. You could merge the dictionaries for training and recognition.

Q: Is it possible to see the language model in a more meaningful way (i.e. to see what kind of word combinations are being allowed, e.g. whether all possible word combinations are permitted)?
A: There is a file "2gram.arpa" in the language-model directory, which is the original bigram from which 2gram.net was built. A bit more information can be found in Org/language_modeles.

Some illustrative commands and file snippets for the answers above are collected below.
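Regarding the expiring-ticket question: a minimal usage sketch for longjob, assuming your experiment is wrapped in a script called run_training.sh (a hypothetical name; check the help page above or longjob's own usage message for the exact options on your system). longjob asks for your password once and then keeps renewing your Kerberos/AFS credentials while the job runs:

    longjob -28day -c ./run_training.sh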
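Regarding the HCompV question: HCompV performs a "flat start", i.e. it computes the global mean and variance of the whole training set and copies them into every Gaussian of a prototype model, so that all monophones begin from the same data-derived starting point; HInit, by contrast, needs phone-level segmentations to initialise each model individually. A typical invocation, with file names taken from the HTK book tutorial rather than from the course scripts:

    HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto

Here -m makes HCompV update the means as well as the variances, and -f 0.01 additionally writes a variance-floor macro file (vFloors) set to 0.01 times the global variance.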
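Regarding the HRest/HERest question: the difference shows up directly in the command lines. HRest re-estimates one HMM from the segments bearing that model's label, while HERest takes the complete model set plus utterance-level transcriptions and re-estimates all models at once through the composite HMM. Sketches using HTK-book-style file names (the course scripts may use different paths and options):

    # isolated-unit training: re-estimate only the model for phone "aa"
    HRest -S train.scp -L labels -l aa -M hmm1 hmm0/aa

    # embedded training: re-estimate all monophones from utterance-level labels
    HERest -C config -I phones.mlf -t 250.0 150.0 1000.0 -S train.scp \
           -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones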
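Regarding the sp question: the HTK book (Section 3.2.2) builds sp as a one-emitting-state tee model and then ties that state to the centre state of sil with an HHEd edit script, sil.hed, containing:

    AT 2 4 0.2 {sil.transP}
    AT 4 2 0.2 {sil.transP}
    AT 1 3 0.3 {sp.transP}
    TI silst {sil.state[3],sp.state[2]}

The AT commands add extra transitions (forward and backward skips within sil, and the entry-to-exit "tee" transition that lets sp be skipped entirely), and TI ties sp's emitting state to sil's centre state. It is applied with something like the following (directory names are the book's; mk_sp_model may differ):

    HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones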
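Regarding the re-alignment question: in alignment mode, HVite takes the current models, the word-level transcriptions, and the dictionary, and outputs new phone-level labels, choosing between alternative pronunciations (and deciding where to insert sp/sil) by Viterbi score; the earlier labels were fixed before any models existed. A sketch following the HTK book (Section 3.2.3; paths are the book's, not necessarily the course's):

    HVite -a -m -l '*' -o SWT -b silence -C config -H hmm7/macros -H hmm7/hmmdefs \
          -i aligned.mlf -I words.mlf -t 250.0 -y lab -S train.scp dict monophones

-a switches HVite to alignment mode (the network is built per utterance from the word-level MLF given with -I), and -i writes the resulting phone-level MLF.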
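Regarding the dictionary question: an HTK dictionary has one entry per line, mapping a word to an optional output symbol (in square brackets) and a pronunciation. Training and recognition need the phone sequences; HResults scores on the word labels only, so the phones are irrelevant there. Illustrative entries (the words and phones are made up for this example):

    CALL  [CALL]  k ao l sp
    CALL  [CALL]  k ao l sil
    ZERO  [ZERO]  z ih r ow sp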
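Regarding the language-model question: 2gram.arpa is plain text in the standard ARPA n-gram format (a log10 probability followed by the words of each n-gram), so you can simply read it. To see what word sequences the compiled network actually allows, you can also sample random sentences from it with HSGen; to the best of my recollection the invocation is along these lines (the file names are assumed from the lab setup, and -n gives the number of sentences to generate):

    HSGen -n 10 2gram.net dict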