FNLP: Lab Session 1

Corpora and Language Models

Aim

The aims of this lab session are to 1) explore the different uses of language in documents authored by different people, and 2) introduce the construction of language models using Python’s Natural Language Toolkit (NLTK). Successful completion of this lab is important as the first assignment for FNLP builds on some of the concepts and methods that are introduced here. By the end of this lab session, you should be able to:

  • Access the corpora provided in NLTK
  • Compute a frequency distribution
  • Train a language model
  • Use a language model to compute bigram probabilities

Running NLTK, Jupyter, and Python Help

Running Jupyter and NLTK

This year our recommended method for running labs is through Jupyter Notebooks.

To do this you must first install jupyter. On DICE simply open a terminal window and type pip install --user jupyter.

To run the notebook locally, download it from the website and cd into the directory you saved it in. Then run jupyter notebook. This should open a browser window showing the contents of your working directory. Click on lab1.ipynb.

Now that you are here you can run the code in any of the cells. The simplest way to do this is by hitting either ctrl+enter or shift+enter (the former runs the current cell, while the latter runs the cell and moves the focus to the next cell).

Try it out by importing NLTK:

In [1]:
import nltk

Python Help

Python contains an inbuilt help module that runs in an interactive mode. To run the interactive help, type:

In [ ]:
help()

help() will run until interrupted. If a cell is running it will block any other cell from running until it has completed. You can check if a cell is still running by looking at In [*]: to the left of any cell. If there is a * inside the brackets the cell is still running. As soon as the cell has stopped running the * will be replaced by a number.

Before moving on you will need to interrupt help() (make it stop running). To interrupt running cells go to kernel/interrupt at the top of the webpage. You can also hit the big black square button right underneath (if you hover over it, it will say interrupt kernel). This is equivalent to hitting CTRL-C to interrupt a running program in the terminal or the Python shell.

If you know the name of the module that you want to get help on, type:

import <module_name>
help(<module_name>)

Try looking at the help documentation for nltk:

In [ ]:
help(nltk)

If you know the name of the module and the method that you want to get help on, type help(<module_name>.<method_name>) (note that you must have imported <module_name> first).
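
For example, to read the documentation for nltk.FreqDist, which we will use later in this lab, you could run:

In [ ]:
help(nltk.FreqDist)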

Introduction

The FNLP lab sessions will make use of the Natural Language Toolkit (NLTK) for Python. NLTK is a platform for writing programs that process human language data, and it provides both corpora and processing modules. For more information on NLTK, please visit: http://www.nltk.org/.

For each exercise, edit the corresponding function in the notebook (e.g. ex1 for Exercise 1), then run the lines which prepare for and invoke that function.

If you’re unfamiliar with developing python code, you may want to look at the second lab for ANLP, which assumes much less background experience and has a detailed step-by-step guide to using python for the first time:

http://www.inf.ed.ac.uk/teaching/courses/anlp/labs/lab2.html

Accessing Corpora

NLTK provides many corpora and covers many genres of text. Some of the corpora are listed below:

  • Gutenberg: out of copyright books
  • Brown: a general corpus of texts including novels, short stories and news articles
  • Inaugural: U.S. Presidential inaugural speeches

To see a complete list of available corpora you can run:

In [5]:
import os
print os.listdir(nltk.data.find("corpora"))
[u'abc.zip', u'abc', u'alpino.zip', u'alpino', u'biocreative_ppi.zip', u'biocreative_ppi', u'brown.zip', u'brown', u'brown_tei.zip', u'brown_tei', u'cess_cat.zip', u'cess_cat', u'cess_esp.zip', u'cess_esp', u'dolch.zip', u'dolch', u'conll2007.zip']

Each corpus contains a number of texts. We’ll work with the inaugural corpus, and explore what the corpus contains. Make sure you have imported the nltk module first and then load the inaugural corpus by typing the following:

In [6]:
from nltk.corpus import inaugural

To list all of the documents in the inaugural corpus, run:

In [ ]:
inaugural.fileids()

From this point on we’ll work with President Barack Obama’s inaugural speech from 2009 (2009-Obama.txt). The contents of each document (in a corpus) may be accessed via a number of corpus readers. The plaintext corpus reader provides methods to view the raw text (raw), a list of words (words) or a list of sentences (sents):

In [ ]:
print inaugural.raw('2009-Obama.txt')
In [ ]:
print inaugural.words('2009-Obama.txt')
In [ ]:
print inaugural.sents('2009-Obama.txt')

Exercise 1

  • Find the total number of words (tokens) in Obama’s 2009 speech

  • Find the total number of distinct words (word types) in the same speech

In [9]:
def ex1(doc_name):
    # Use the plaintext corpus reader to access a pre-tokenised list of words
    # for the document specified in "doc_name"
    doc_words = inaugural.words(doc_name)

    # Find the total number of words in the speech
    total_words = 

    # Find the total number of DISTINCT words in the speech
    total_distinct_words = 

    # Return the word counts
    return (total_words, total_distinct_words)

To test your solution:

In [ ]:
speech_name = '2009-Obama.txt'
(tokens,types) = ex1(speech_name)
print "Total words in %s: %s"%(speech_name,tokens)
print "Total distinct words in %s: %s"%(speech_name,types)

Exercise 2

Find the average word-type length of Obama’s 2009 speech

In [11]:
def ex2(doc_name):
    doc_words = inaugural.words(doc_name)

    # Construct a list that contains the word lengths for each DISTINCT word in the document
    distinct_word_lengths = 

    # Find the average word type length
    avg_word_length = 

    # Return the average word type length of the document
    return avg_word_length

To test your solution:

In [ ]:
speech_name = '2009-Obama.txt'
result2 = ex2(speech_name)
print "Average word type length for %s: %.3f"%(speech_name,result2)

Frequency Distribution

A frequency distribution records the number of times each outcome of an experiment has occurred. For example, a frequency distribution could be used to record the number of times each word appears in a document:

In [ ]:
# Obtain the words from Barack Obama’s 2009 speech
obama_words = inaugural.words('2009-Obama.txt')
# Construct a frequency distribution over the lowercased words in the document
fd_obama_words = nltk.FreqDist(w.lower() for w in obama_words)
# Find the top 50 most frequently used words in the speech
fd_obama_words.most_common(50)

You can easily plot the top 50 words. (Note: %matplotlib inline tells Jupyter that it should embed plots in the output cell after you run the code. You only need to run it once per notebook, not in every cell with a plot.)

In [ ]:
%matplotlib inline
fd_obama_words.plot(50)

Find out how many times the words peace and america were used in the speech:

In [ ]:
print 'peace:', fd_obama_words['peace']
print 'america:', fd_obama_words['america']

Exercise 3

Compare the top 50 most frequent words in Barack Obama’s 2009 speech with George Washington’s 1789 speech. What can knowing word frequencies tell us about different speeches at different times in history?

In [10]:
def ex3(doc_name, x):
    doc_words = inaugural.words(doc_name)
    
    # Construct a frequency distribution over the lowercased words in the document
    fd_doc_words = 

    # Find the top x most frequently used words in the document
    top_words = 

    # Return the top x most frequently used words
    return top_words
In [ ]:
### Now test your code
print "Top 50 words for Obama's 2009 speech:"
result3a = ex3('2009-Obama.txt', 50)
print result3a
print "Top 50 words for Washington's 1789 speech:"
result3b = ex3('1789-Washington.txt', 50)
print result3b

Language Models

A statistical language model assigns a probability to a sequence of words, using a probability distribution. Language models have many applications in Natural Language Processing. For example, in speech recognition they may be used to predict the next word that a speaker will utter. In machine translation a language model may be used to score multiple candidate translations of an input sentence in order to find the most fluent/natural translation from the set of candidates.
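
To make this concrete, the sketch below shows how a bigram model scores a sentence: it multiplies together one probability per word, each conditioned only on the preceding word. The probability table here is invented purely for illustration; the models you build later in this lab estimate these numbers from corpus counts instead:

In [ ]:
# Purely illustrative: a hand-made table of bigram probabilities P(word | previous word)
toy_bigram_probs = {('<s>', 'the'): 0.3, ('the', 'end'): 0.1, ('end', 'is'): 0.2, ('is', 'near'): 0.05}
sentence = ['<s>', 'the', 'end', 'is', 'near']   # <s> marks the start of the sentence
p = 1.0
for prev, word in zip(sentence, sentence[1:]):
    p *= toy_bigram_probs[(prev, word)]
print p   # probability of the whole sentence under the toy model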

Building a Language Model

We provide you with an NgramModel module taken from an old version of NLTK. The initialisation method looks like this:

def __init__(self, n, train, pad_left=False, pad_right=False,
             estimator=None, *estimator_args, **estimator_kwargs):

Where:

  • n = order of the language model: 1 = unigram; 2 = bigram; 3 = trigram, etc.
  • train = the training data (supplied as a list)
  • pad_left and pad_right = sentence-initial and sentence-final padding
  • estimator = method used to construct the probability distribution; it may or may not include smoothing. Arguments to the estimator are optional.

Exercise 4

Use NgramModel to build a language model based on the text of Sense and Sensibility by Jane Austen. The language model should be a bigram model, and you can let it use the default nltk.MLEProbDist estimator.

Hint: fill in the gaps using the information already provided in the code and comments.

In [ ]:
# Import NLTK's NgramModel module (for building language models)
# It has been removed from standard NLTK, so we access it from a shared space
import sys
sys.path.extend(['/group/ltg/projects/fnlp', '/group/ltg/projects/fnlp/packages_2.6'])
from nltkx import NgramModel
from nltk.corpus import gutenberg
In [ ]:
# Input: doc_name (string), n (int)
# Output: lm (NgramModel language model)
def ex4(doc_name, n):
    # Construct a list of lowercase words from the document
    words = [w.lower() for w in gutenberg.words(doc_name)]

    # Build the language model using the nltk.MLEProbDist estimator 
    lm = NgramModel(<order> , <training_data>)
    
    # Return the language model (we'll use it in exercise 5)
    return lm
In [ ]:
### Test your code for Exercise 4
result4 = ex4('austen-sense.txt',2)
print "Sense and Sensibility bigram language model built"

Computing Probabilities

Using the language model, we can work out the probability of a word given its context. In the case of the bigram language model built in Exercise 4, we have only one word of context. To obtain probabilities from a language model, use NgramModel.prob:

lm.prob(word, [context])

Where word and context are both unigram strings when working with a bigram language model. For higher order language models, context will be a list of unigram strings of length order-1.
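
With the default MLE estimator, a bigram probability is just a relative frequency: the number of times the context word is followed by word, divided by the number of times the context word occurs at all. As a rough sanity check (a sketch, not one of the exercises), you can recompute such an estimate directly from bigram counts; note that NgramModel may handle unseen contexts differently, so don’t expect the two to agree in every case:

In [ ]:
# Recompute an MLE bigram estimate by hand: P(word | context) = count(context, word) / count(context)
words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))
print cfd['of'].freq('the')   # relative frequency with which 'of' is followed by 'the'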

Exercise 5

Using the bigram language model built in Exercise 4, compute the following probabilities

  1. reason followed by for
  2. the followed by end
  3. end followed by the

Now run the test code below and check your results. The result for (3) above (result5c) is perhaps not what you expected. Why do you think this happened?

In [ ]:
# Input: lm (NgramModel language model, from exercise 4), word (string), context (list)
# Output: p (float)
def ex5(lm,word,context,verbose=False):
    
    # Compute the probability for the word given the context
    p = 

    # Return the probability
    return p
In [ ]:
### Test your code for Exercise 5
result5a = ex5(result4,'for',['reason'])
print "Probability of \'reason\' followed by \'for\': %.3f"%result5a
result5b = ex5(result4,'end',['the'])
print "Probability of \'the\' followed by \'end\': %.5f"%result5b
result5c = ex5(result4,'the',['end'])
print "Probability of \'end\' followed by \'the\': %.1f"%result5c

Exercise 6

Update your definition of the ex5 function so that its (boolean) verbose argument is passed through to NgramModel.prob. Use this to see if it gives any insight into the (end, the) bigram.

In [ ]:
### Test your code for Exercise 6
result6 = ex5(result4,'the',['end'],True)

Going Further

Smoothing

Try using an estimator which does do smoothing, and see what happens to all three of the bigram probabilities. Try help(NgramModel) for help with the operation of this class and how to supply estimators.

In [ ]:
# to see what estimators are available run:
from nltk import probability
help(probability)
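
As a concrete starting point, one common choice in older versions of NLTK was a Lidstone (“add-gamma”) estimator. The sketch below assumes the nltkx NgramModel accepts an estimator callable with the same (fdist, bins) signature as the old NLTK class; check help(NgramModel) to confirm how your version expects estimators to be supplied:

In [ ]:
from nltk.probability import LidstoneProbDist
# Add-0.2 smoothing; the (fdist, bins) signature follows the old NLTK NgramModel convention
est = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
lm_lidstone = NgramModel(2, words, estimator=est)
print lm_lidstone.prob('for', ['reason'])
print lm_lidstone.prob('end', ['the'])
print lm_lidstone.prob('the', ['end'])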

Tokenisation and Padding

So far you’ve treated the data as a flat list of ‘words’, which doesn’t fully address the place of words within sentences. Using gutenberg.sents(...), explore the impact of the pad_left and pad_right arguments to NgramModel. Compare the following:

In [ ]:
print lm.prob('The',['<s>'])
print lm.prob('the',['<s>'])
print lm.prob('End',['<s/>'])
print lm.prob('end',['<s/>'])
print lm.prob('.',['<s/>'])

Costs vs. Probabilities

Redo the previous two sub-sections using costs instead of probabilities.

In [ ]:
help(lm.logprob)
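
In the old NLTK code this module is based on, logprob returns the negative base-2 log probability of the word given its context, i.e. a cost in bits: the higher the cost, the less likely the event. Assuming the nltkx version keeps that behaviour (check the help output above), you can compare costs in the same way you compared probabilities, for example:

In [ ]:
# Costs are negative log2 probabilities, so larger values mean less likely events
print lm.logprob('for', ['reason'])
print lm.logprob('the', ['end'])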