Answers and Explanations for Lab 2

Author: Sharon Goldwater
Date: 2014-09-01
Copyright: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License: You may re-use, redistribute, or modify this work for non-commercial purposes provided you retain attribution to any previous author(s).

Examining and running the code

If you type this into the interpreter:


You will get the error message "You must provide at least one filename argument." (This is followed by a further, completely generic error message simply saying that an exception has occurred, i.e., the program has exited unexpectedly.) The error message is generated by a print statement in the main body of code (near the end of the file), which is executed if the length of sys.argv is less than 2 (i.e., no arguments are provided).
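The check described above can be sketched roughly as follows. This is only an illustration, not the lab code itself: the function name check_args and the use of SystemExit are assumptions made for this example.

```python
def check_args(argv):
    """Return the filename arguments, or exit with a message if there are none.

    Hypothetical helper for illustration; the lab code does this check inline.
    """
    # argv[0] is the script name, so a length below 2 means no filenames.
    if len(argv) < 2:
        print("You must provide at least one filename argument.")
        raise SystemExit(1)  # assumption: the lab code raises some exception here
    return argv[1:]


# Example call with an explicit argument list instead of the real sys.argv.
fnames = check_args(["lab2.py", "eth01.cha", "eth02.cha"])
print(fnames)
```

In a real script you would pass sys.argv (after import sys) rather than a hand-written list.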

fnames is a list which stores the names of the files that are to be processed.

word_counts is a dictionary (or to be precise, a defaultdict), whose keys are the words and values are their counts.
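As a small illustration of why a defaultdict is convenient for counting, here is a sketch using a made-up sentence rather than the lab data:

```python
from collections import defaultdict

# defaultdict(int) gives any missing key a starting value of 0,
# so we can increment a count without first checking the key exists.
word_counts = defaultdict(int)
for word in "the dog saw the cat".split():
    word_counts[word] += 1

print(word_counts["the"])   # 2
print(word_counts["bird"])  # 0 (looking it up also adds the key 'bird')
```

With a plain dict, the line word_counts[word] += 1 would raise a KeyError the first time each word is seen.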

In the first two Ethan files, 'the' occurs 268 times, which you can see by typing word_counts['the'] in the interpreter.

Looking at the data

The Zipf plot from eth01.cha is shown below. The main problem that can be seen (if you know what to look for, and you must look at the log-log plot) is that there are too many words with frequency one, which shows up as a very long horizontal line at the bottom. This may not be completely obvious from this plot alone, but it becomes much clearer if you generate the plot from all the Ethan files together.
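One way to check the "too many words with frequency one" observation numerically, without relying on the plot, is to count those words directly. The sketch below uses a made-up sentence; toy_counts, freqs and n_singletons are names invented for this example.

```python
from collections import Counter

# Toy data standing in for the word counts from a real file.
toy_counts = Counter("the dog saw the cat and the bird".split())

# Frequencies sorted from highest to lowest: this is what a Zipf plot
# shows on the y-axis, with rank (1, 2, 3, ...) on the x-axis.
freqs = sorted(toy_counts.values(), reverse=True)

# Number of word types that occur exactly once.
n_singletons = sum(1 for c in toy_counts.values() if c == 1)

print(freqs)
print(n_singletons, "of", len(toy_counts), "word types occur once")
```

Running the same count before and after fixing the code gives a simple way to confirm that the long flat tail has shrunk.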

In this particular case, it probably won't be obvious how many unique word types there ought to be, so doing this won't help much initially, but it's useful for helping to verify later that you have fixed the problem.
These lines will give you the first 10 and last 10 words. The last 10 look fine, but the first 10 do not look like real words; instead, they look like \x151000080_1006955\x15. This indicates the first problem -- we have included the timestamp IDs in the word counts, which we should not have done.
sorted(word_counts, key=word_counts.get)[:10]
sorted(word_counts, key=word_counts.get)[-10:]
The 10 lowest frequency words (first line above) are the same junk words we got by looking at the alphabetical sort, which makes sense because each of these only occurs once (the timestamps are unique). The 10 highest frequency words (second line above) include some actual high-frequency words like 'your' and 'the', but also some punctuation marks and the token *MOT (which is actually the highest frequency item). The latter is definitely a problem, since it isn't a word but instead identifies who is speaking. The punctuation marks may or may not be a problem depending on how we want to define a word and what we might be doing with the data.
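To see how the two kinds of sorting behave, here is a sketch on a tiny made-up dictionary (the counts are invented, not taken from the Ethan files). It assumes the timestamp tokens begin with the control character \x15, as in the example shown earlier.

```python
# Invented counts for illustration only.
toy_counts = {
    "the": 3,
    "*MOT": 5,
    "\x151000080_1006955\x15": 1,
}

# Alphabetical order: the control character \x15 sorts before
# punctuation and letters, so timestamp tokens come first.
alphabetical = sorted(toy_counts)

# Order by count, lowest first: key=toy_counts.get looks up each
# word's count and sorts on that value.
by_count = sorted(toy_counts, key=toy_counts.get)

print(alphabetical)
print(by_count)
```

Here both orderings put the timestamp token first and 'the' and '*MOT' after it, although for different reasons: one is ordering characters, the other is ordering counts.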
See our answer code.

Fixing the problem

You can take a look at our very simple solution in the answer code. It simply strips off the first token (the speaker ID) and last two tokens (final punctuation mark and timestamp ID) before adding the tokens to the word_counts dictionary. The decision to strip off final punctuation is arguably a bit inconsistent because we still have internal punctuation (commas, etc.) in the counts; we could have chosen to leave the final punctuation on.
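The stripping step can be sketched with a list slice. This is not the answer code itself; the example line below is made up, and its exact layout (speaker ID, words, final punctuation, timestamp ID, all separated by whitespace) is an assumption based on the description above.

```python
from collections import defaultdict

word_counts = defaultdict(int)

# Hypothetical utterance line in the assumed format.
line = "*MOT:\twhere is the ball ? \x151000080_1006955\x15"

tokens = line.split()  # split on whitespace
# Drop the first token (speaker ID) and the last two tokens
# (final punctuation mark and timestamp ID).
words = tokens[1:-2]

for word in words:
    word_counts[word] += 1

print(words)
```

Leaving the final punctuation in the counts instead would just mean using tokens[1:-1].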

Other possible non-words include sound effects ('boom', 'boo', 'ah', etc) and alternate spellings ('(a)bout'), as well as punctuation.

When you run the code on all the files, it looks smoother because the statistics of large data sets become closer to 'average' -- the randomness of small sample sizes begins to smooth out.


The variables ntoks and nutts represent the running total number of word tokens and number of utterances in each file. To compute the MLU for a file, we need to divide ntoks by nutts. For each line (utterance) spoken by the child, ntoks is incremented by the number of (whitespace-separated) words in the line (computed as len(tokens)) minus 3. We subtract 3 because we do not want to include the speaker ID, final punctuation, or timestamp ID in the count of words spoken by the child. (We would get a more accurate MLU by also throwing away other punctuation inside the sentence, but we are trying to keep things simple here and we'll still have a reasonable approximation.)
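The computation just described can be sketched as a small function. This is an illustration under assumptions, not the answer code: the function name, the speaker label "*CHI:" and the example lines are all invented here.

```python
def mean_length_of_utterance(lines, speaker="*CHI:"):
    """Approximate MLU: word tokens per utterance for one speaker.

    Hypothetical helper; the speaker label "*CHI:" is an assumption.
    """
    ntoks = 0  # running total of word tokens
    nutts = 0  # running total of utterances
    for line in lines:
        tokens = line.split()
        if tokens and tokens[0] == speaker:
            # Exclude speaker ID, final punctuation and timestamp ID.
            ntoks += len(tokens) - 3
            nutts += 1
    # float() avoids integer division under Python 2.
    return float(ntoks) / nutts if nutts else 0.0


example_lines = [
    "*CHI:\tmore juice . \x151000_2000\x15",
    "*MOT:\tdo you want more juice ? \x152000_3000\x15",
    "*CHI:\tball . \x153000_4000\x15",
]
mlu = mean_length_of_utterance(example_lines)
print(mlu)
```

The two child utterances contain 2 words and 1 word, so the result is 3 tokens divided by 2 utterances; the mother's line is ignored.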

See answer code for our solution.

In general, you should see that the children's MLUs increase over time, starting at around 1-1.5 and going up to 3.5 or more in most cases. There are a lot of ups and downs in the plots because each of the files is a fairly small sample of language, so there will be quite a bit of random variation in the MLU.

The key additional piece of information we would need is the actual ages of the children. For example, if we look at the plots, it appears that Alex's language is hardly developing at all, whereas Naima's MLU is around 5 by the end of the data. However, from these plots alone we do not know whether the time range spanned by the Alex files is the same as that of the Naima files. We would want to know both whether the files start at the same ages for each child, and also whether the period of time between each file is the same.

Going Further

Answer code is available for Q2 and Q5. (Note that our solution for Q2 generates an error if you run it on the William data. Why? How would you fix it?)