The aims of this lab session are (1) to explore the different uses of language in documents authored by different people, and (2) to introduce the construction of language models using Python’s Natural Language Toolkit (NLTK). Successful completion of this lab is important, as the first assignment for FNLP builds on some of the concepts and methods introduced here. By the end of this lab session, you should be able to:
This year our recommended method for running the labs is through Jupyter notebooks.
To do this you must first install Jupyter. On DICE, simply open a terminal window and type:
pip install --user jupyter
To run the notebook locally, download it from the website and cd into the directory you saved it in. Then run:
jupyter notebook
This should open a browser window showing the contents of your working directory. Click on lab1.ipynb.
Now that you are here you can run the code in any of the cells. The simplest way to do this is to hit either ctrl+enter or shift+enter (the former runs the current cell, while the latter runs the cell and moves the focus to the next cell).
Try it out by importing NLTK:
import nltk
Python provides a built-in help utility that can be run in an interactive mode. To start the interactive help, type:
help()
help() will run until interrupted. While a cell is running, it blocks any other cell from running until it has completed. You can check whether a cell is still running by looking at the In [*]: indicator to the left of the cell: if there is a * inside the brackets, the cell is still running; as soon as it finishes, the * is replaced by a number.
Before moving on you will need to interrupt help() (make it stop running). To interrupt a running cell, go to Kernel/Interrupt at the top of the page. You can also hit the big black square button right underneath (if you hover over it, it says "interrupt kernel"). This is equivalent to hitting CTRL-d to interrupt a running program in the terminal or the Python shell.
If you know the name of the module that you want to get help on, type:
import <module_name>
help(<module_name>)
Try looking at the help documentation for nltk:
help(nltk)
If you know the name of the module and the method that you want to get help on, type:
help(<module_name>.<method_name>)
(note that you must have imported <module_name> first).
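For example, to browse the documentation for NLTK’s FreqDist class (used later in this lab) and one of its methods, you could run:
import nltk
help(nltk.FreqDist)              # help on a class
help(nltk.FreqDist.most_common)  # help on a specific method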
The FNLP lab sessions will make use of the Natural Language Toolkit (NLTK) for Python. NLTK is a platform for writing programs to process human language data, and it provides both corpora and processing modules. For more information on NLTK, please visit: http://www.nltk.org/.
For each exercise, edit the corresponding function in the notebook (e.g. ex1 for Exercise 1), then run the lines which prepare for and invoke that function.
If you’re unfamiliar with developing python code, you may want to look at the second lab for ANLP, which assumes much less background experience and has a detailed step-by-step guide to using python for the first time:
http://www.inf.ed.ac.uk/teaching/courses/anlp/labs/lab2.html
NLTK provides many corpora and covers many genres of text. Some of the corpora are listed below:
To see a complete list of available corpora you can run:
import os
print os.listdir(nltk.data.find("corpora"))
Each corpus contains a number of texts. We’ll work with the inaugural corpus, and explore what the corpus contains. Make sure you have imported the nltk module first and then load the inaugural corpus by typing the following:
from nltk.corpus import inaugural
To list all of the documents in the inaugural corpus, run:
inaugural.fileids()
From this point on we’ll work with President Barack Obama’s inaugural speech from 2009 (2009-Obama.txt). The contents of each document (in a corpus) may be accessed via a number of corpus readers. The plaintext corpus reader provides methods to view the raw text (raw), a list of words (words) or a list of sentences (sents):
print inaugural.raw('2009-Obama.txt')
print inaugural.words('2009-Obama.txt')
print inaugural.sents('2009-Obama.txt')
Find the total number of words (tokens) in Obama’s 2009 speech
Find the total number of distinct words (word types) in the same speech
def ex1(doc_name):
    # Use the plaintext corpus reader to access a pre-tokenised list of words
    # for the document specified in "doc_name"
    doc_words = inaugural.words(doc_name)
    # Find the total number of words in the speech
    total_words =
    # Find the total number of DISTINCT words in the speech
    total_distinct_words =
    # Return the word counts
    return (total_words, total_distinct_words)
To test your solution:
speech_name = '2009-Obama.txt'
(tokens,types) = ex1(speech_name)
print "Total words in %s: %s"%(speech_name,tokens)
print "Total distinct words in %s: %s"%(speech_name,types)
Find the average word-type length of Obama’s 2009 speech
def ex2(doc_name):
    doc_words = inaugural.words(doc_name)
    # Construct a list that contains the word lengths for each DISTINCT word in the document
    distinct_word_lengths =
    # Find the average word type length
    avg_word_length =
    # Return the average word type length of the document
    return avg_word_length
To test your solution:
speech_name = '2009-Obama.txt'
result2 = ex2(speech_name)
print "Average word type length for %s: %.3f"%(speech_name,result2)
A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the number of times each word appears in a document:
# Obtain the words from Barack Obama’s 2009 speech
obama_words = inaugural.words('2009-Obama.txt')
# Construct a frequency distribution over the lowercased words in the document
fd_obama_words = nltk.FreqDist(w.lower() for w in obama_words)
# Find the top 50 most frequently used words in the speech
fd_obama_words.most_common(50)
You can easily plot the top 50 words (note that %matplotlib inline tells Jupyter to embed plots in the output cell after you run the code; you only need to run it once per notebook, not in every cell with a plot):
%matplotlib inline
fd_obama_words.plot(50)
Find out how many times the words peace and america were used in the speech:
print 'peace:', fd_obama_words['peace']
print 'america:', fd_obama_words['america']
Compare the top 50 most frequent words in Barack Obama’s 2009 speech with George Washington’s 1789 speech. What can knowing word frequencies tell us about different speeches at different times in history?
def ex3(doc_name, x):
    doc_words = inaugural.words(doc_name)
    # Construct a frequency distribution over the lowercased words in the document
    fd_doc_words =
    # Find the top x most frequently used words in the document
    top_words =
    # Return the top x most frequently used words
    return top_words
### Now test your code
print "Top 50 words for Obama's 2009 speech:"
result3a = ex3('2009-Obama.txt', 50)
print result3a
print "Top 50 words for Washington's 1789 speech:"
result3b = ex3('1789-Washington.txt', 50)
print result3b
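A possible completion for ex3 (a sketch following the FreqDist example above):
def ex3(doc_name, x):
    doc_words = inaugural.words(doc_name)
    # Frequency distribution over the lowercased words in the document
    fd_doc_words = nltk.FreqDist(w.lower() for w in doc_words)
    # Top x most frequently used words, as (word, count) pairs
    top_words = fd_doc_words.most_common(x)
    return top_words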
A statistical language model assigns a probability to a sequence of words, using a probability distribution. Language models have many applications in Natural Language Processing. For example, in speech recognition they may be used to predict the next word that a speaker will utter. In machine translation a language model may be used to score multiple candidate translations of an input sentence in order to find the most fluent/natural translation from the set of candidates.
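For intuition, the maximum-likelihood estimate of a bigram probability is just a ratio of counts: P(w2 | w1) = count(w1 w2) / count(w1). A minimal sketch of that computation using NLTK’s ConditionalFreqDist, before we switch to the NgramModel class below (the choice of the bigram 'united states' is just an illustrative example):
import nltk
from nltk.corpus import inaugural
words = [w.lower() for w in inaugural.words('2009-Obama.txt')]
# Condition each word on the word that precedes it
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
# MLE estimate of P(states | united) = count('united states') / count('united')
print "P(states | united) = %.3f" % cfd['united'].freq('states')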
We provide you with an NgramModel module taken from an old version of NLTK. The initialisation method looks like this:
def __init__(self, n, train, pad_left=False, pad_right=False,
estimator=None, *estimator_args, **estimator_kwargs):
Where:
n is the order of the language model (e.g. 2 for a bigram model);
train is the training data (a list of words, or a list of sentences);
pad_left and pad_right control whether the training data is padded with sentence-start and sentence-end symbols;
estimator is a function used to construct the probability distributions (if None, a maximum-likelihood estimator is used), and any estimator_args / estimator_kwargs are passed on to it.
Use NgramModel to build a language model based on the text of Sense and Sensibility by Jane Austen. The language model should be a bigram model, and you can let it use the default nltk.MLEProbDist estimator.
Hint: fill in the gaps using the information already provided in the code and comments.
# Import NLTK's NgramModel module (for building language models)
# It has been removed from standard NLTK, so we access it from a shared space
import sys
sys.path.extend(['/group/ltg/projects/fnlp', '/group/ltg/projects/fnlp/packages_2.6'])
from nltkx import NgramModel
from nltk.corpus import gutenberg
# Input: doc_name (string), n (int)
# Output: lm (NgramModel language model)
def ex4(doc_name, n):
    # Construct a list of lowercase words from the document
    words = [w.lower() for w in gutenberg.words(doc_name)]
    # Build the language model using the nltk.MLEProbDist estimator
    lm = NgramModel(<order>, <training_data>)
    # Return the language model (we'll use it in exercise 5)
    return lm
### Test your code for Exercise 4
result4 = ex4('austen-sense.txt',2)
print "Sense and Sensibility bigram language model built"
Using the language model, we can work out the probability of a word given its context. In the case of the bigram language model built in Exercise 4, we have only one word of context. To obtain probabilities from a language model, use NgramModel.prob:
lm.prob(word, [context])
Where word and context are both unigram strings when working with a bigram language model. For higher order language models, context will be a list of unigram strings of length order-1.
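For example, with a (hypothetical) trigram model lm3, the context would be the two preceding words:
lm3.prob('states', ['the', 'united'])   # P(states | the united), assuming a trigram model lm3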
Using the bigram language model built in Exercise 4, compute the following probabilities:
a) the probability of the word ‘for’ given the preceding word ‘reason’
b) the probability of the word ‘end’ given the preceding word ‘the’
c) the probability of the word ‘the’ given the preceding word ‘end’
Now uncomment the test code and check your results. The result for (c) is perhaps not what you expected. Why do you think this happened?
# Input: lm (NgramModel language model, from exercise 4), word (string), context (list)
# Output: p (float)
def ex5(lm, word, context, verbose=False):
    # Compute the probability for the word given the context
    p =
    # Return the probability
    return p
### Test your code for Exercise 5
result5a = ex5(result4,'for',['reason'])
print "Probability of \'reason\' followed by \'for\': %.3f"%result5a
result5b = ex5(result4,'end',['the'])
print "Probability of \'the\' followed by \'end\': %.5f"%result5b
result5c = ex5(result4,'the',['end'])
print "Probability of \'end\' followed by \'the\': %.1f"%result5c
Update your definition of the ex5 function to include a (boolean) verbose argument, which is passed through to NgramModel.prob. Use this to see if it gives any insight into the (end, the) bigram.
### Test your code for Exercise 6
result6 = ex5(result4,'the',['end'],True)
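A possible completion of ex5 covering both Exercise 5 and Exercise 6 (a sketch; it assumes, as the exercise states, that NgramModel.prob accepts a verbose keyword argument):
def ex5(lm, word, context, verbose=False):
    # Probability of the word given the context, optionally with
    # verbose output from the model (Exercise 6)
    p = lm.prob(word, context, verbose=verbose)
    return p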
# to see what estimators are available run:
from nltk import probability
help(probability)
So far you’ve treated the data as a flat list of ‘words’, which doesn’t fully address the place of words within sentences. Using gutenberg.sents(...), explore the impact of the pad_left and pad_right arguments to NgramModel. Compare the following:
print lm.prob('The',['<s>'])
print lm.prob('the',['<s>'])
print lm.prob('End',['<s/>'])
print lm.prob('end',['<s/>'])
print lm.prob('.',['<s/>'])
Redo the previous two sub-sections using costs instead of probabilities.
help(lm.logprob)
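As a hint, logprob is typically defined as the negative log (base 2) probability, i.e. a cost in bits: a probability of 1 gives a cost of 0, and smaller probabilities give larger costs. A sketch of redoing part (c) of Exercise 5 with costs (assuming logprob takes the same word and context arguments as prob; check the help output above):
# Cost (negative log probability) of 'the' following 'end'
print "Cost of 'end' followed by 'the': %.3f" % lm.logprob('the', ['end'])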