# Import libraries used for this lab
import nltk
import sys
# Import the gutenberg corpus
from nltk.corpus import gutenberg
# Import NLTK's NgramModel module (for building language models)
# It has been removed from standard NLTK, so we access it in a special package installation
sys.path.extend(['/group/ltg/projects/fnlp', '/group/ltg/projects/fnlp/packages_2.6'])
from nltkx import NgramModel
# Import probability distributions
from nltk.probability import LaplaceProbDist
from nltk.probability import LidstoneProbDist
from nltk.probability import SimpleGoodTuringProbDist
The aims of this lab session are to explore (1) the Laplace, Lidstone and Good-Turing smoothing methods for language models and (2) the use of language models in authorship identification. Successful completion of this lab will help you solidify your understanding of smoothing (important not just for LMs but all over NLP), perplexity (important also for assignment 1), and one type of text classification (authorship identification). By the end of this lab session, you should be able to implement and compare these smoothing methods yourself, and use language-model cross-entropy to decide which of two texts was more likely written by a given author.
In some of the exercises, you’ll use NLTK’s NgramModel
to train language models.
The initialisation method for NgramModel is:
def __init__(self, n, train, pad_left=False, pad_right=False,
estimator=None, *estimator_args, **estimator_kwargs):
"""
Creates an ngram language model to capture patterns in n consecutive
words of training text. An estimator smooths the probabilities derived
from the text and may allow generation of ngrams not seen during
training.
:param n: the order of the language model (ngram size)
:type n: C{int}
:param train: the training text
:type train: C{iterable} of C{string} or C{iterable} of C{iterable} of C{string}
:param estimator: a function for generating a probability distribution---defaults to MLEProbDist
:type estimator: a function that takes a C{ConditionalFreqDist} and
returns a C{ConditionalProbDist}
:param pad_left: whether to pad the left of each sentence with an (n-1)-gram of <s>
:type pad_left: bool
:param pad_right: whether to pad the right of each sentence with </s>
:type pad_right: bool
:param estimator_args: Extra arguments for estimator.
These arguments are usually used to specify extra
properties for the probability distributions of individual
conditions, such as the number of bins they contain.
Note: For backward-compatibility, if no arguments are specified, the
number of bins in the underlying ConditionalFreqDist are passed to
the estimator as an argument.
:type estimator_args: (any)
:param estimator_kwargs: Extra keyword arguments for the estimator
:type estimator_kwargs: (any)
"""
In the final exercise of Lab 1, you were asked to calculate the probability of a word given its context, using a bigram language model with no smoothing. For the first two word-context pairs, these bigrams had been seen in the data used to train the language model. For the third word-context pair, the bigram had not been seen in the training data, which led to an estimated probability of 0.0.
Zero probabilities for unseen n-grams cause problems. Suppose for example you take a bigram language model and use it to score an automatically generated sentence of 10 tokens (say the output of a machine translation system). If one of the bigrams in that sentence is unseen, the probability of the sentence will be zero.
Smoothing is a method of assigning probabilities to unseen n-grams. As language models are typically trained on large amounts of data, an n-gram that was not seen in the training data is likely to be rare in other (test) data as well. A good smoothing method is therefore one that assigns a fairly small probability to unseen n-grams.
We’ll implement two different smoothing methods: Laplace (add-one) and Lidstone (add-alpha), and we will also consider the effects of backoff, which is implemented in NLTK’s NgramModel.
(NLTK also includes implementations of Laplace and Lidstone smoothing in its probability module; if you wish to look at them, they are here: http://www.nltk.org/api/nltk.html#module-nltk.probability However, the point of this lab is to implement these simple methods yourself to make sure you understand them.)
Before implementing any smoothing, you should make sure you understand how to implement maximum likelihood estimation. In last week’s lab, we used NLTK to do this for us by training a bigram language model with an MLE estimator. We could then use the language model to find the MLE probability of any word given its context. Here, you’ll do the same thing but without using NLTK, just to make sure you understand how. We will also compare the smoothed probabilities you compute later to these MLE probabilities.
Code has been provided that extracts all the words from Jane Austen’s “Sense and Sensibility”, and then computes a list of bigram tuples by pairing up each word in the corpus with the following word. Using these unigrams and bigrams, fill in the remaining code to compute the MLE probability of a word given a single word of context. Then uncomment the test code to compute the probabilities.
Make sure your answers match the MLE probability estimates from Exercise 5 of Lab 1, where we used NLTK to compute these estimates.
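If you want to check your own version of the calculation, the following sketch shows one possible way of computing the same quantity (mle_prob_sketch is a hypothetical helper, not part of the template below): the MLE estimate is simply count(context, word) / count(context).
# Sketch of the MLE calculation: P(word | context) = count(context, word) / count(context)
def mle_prob_sketch(word, context):
    words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
    bigrams = zip(words[:-1], words[1:])
    return float(bigrams.count((context, word))) / words.count(context)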
#################### EXERCISE 0 ####################
# Solution for exercise 0
# Input: word (string), context (string)
# Output: p (float)
# Compute the unsmoothed (MLE) probability for word given the single word context
def ex0(word,context):
p = 0.0
austen_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
austen_bigrams = zip(austen_words[:-1], austen_words[1:]) # list of bigrams as tuples
# (above doesn't include begin/end of corpus: but basically this is fine)
# Compute probability of word given context
p =
# Return probability
return p
### Test your code with these examples
print "MLE:"
result0a = ex0('end','the')
print "Probability of \'end\' given \'the\': " + str(result0a)
result0b = ex0('the','end')
print "Probability of \'the\' given \'end\': " + str(result0b)
Laplace smoothing adds a value of 1 to the sample count for each “bin” (possible observation, in this case each possible bigram), and then takes the maximum likelihood estimate of the resulting frequency distribution.
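In terms of bigram counts, the Laplace estimate is (count(context, word) + 1) / (count(context) + V), where V is the vocabulary size. The sketch below shows one possible way to compute it (laplace_prob_sketch is a hypothetical helper, not part of the exercise template):
# Sketch of the add-one (Laplace) estimate.
def laplace_prob_sketch(word, context):
    words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
    bigrams = zip(words[:-1], words[1:])
    V = len(set(words))   # vocabulary size: number of distinct observed words
    return (bigrams.count((context, word)) + 1.0) / (words.count(context) + V)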
Exercise 1: Assume that the size of the vocabulary is just the number of different words observed in the training data (that is, we will not deal with unseen words). Add code to the template to compute Laplace-smoothed probabilities, again without using NLTK. Hint: if you have trouble, study the equations and example in Lecture 4. Now uncomment the test code and look at the estimates for the same word-context pairs as before.
How do these probabilities differ from the MLE estimates?
# Input: word (string), context (string)
# Output: p (float)
# Compute the Laplace smoothed probability for word given the single word context
def ex1(word,context):
p = 0.0
austen_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
austen_bigrams = zip(austen_words[:-1], austen_words[1:]) # list of bigrams as tuples
# (above doesn't include begin/end of corpus: but basically this is fine)
# compute the vocabulary size
V =
# Compute probability of word given context
p =
# Return probability
return p
### Test your code with these examples
print "LAPLACE:"
result1a = ex1('end','the')
print "Probability of \'end\' given \'the\': " + str(result1a)
result1b = ex1('the','end')
print "Probability of \'the\' given \'end\': " + str(result1b)
In practice, Laplace smoothing assigns too much mass to unseen n-grams. The Lidstone method works in a similar way, but instead of adding 1, it adds a value between 0 and 1 to the sample count for each bin (in class we called this value alpha, NLTK calls it gamma).
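In terms of counts, the Lidstone estimate is (count(context, word) + alpha) / (count(context) + alpha * V). A possible sketch of the calculation (lidstone_prob_sketch is a hypothetical helper, not part of the exercise template):
# Sketch of the add-alpha (Lidstone) estimate.
def lidstone_prob_sketch(word, context, alpha):
    words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
    bigrams = zip(words[:-1], words[1:])
    V = len(set(words))   # vocabulary size: number of distinct observed words
    return (bigrams.count((context, word)) + float(alpha)) / (words.count(context) + alpha * V)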
Fill in the code to compute Lidstone smoothed probabilities, then uncomment the test code and look at the probability estimates that are computed for the same bigrams as before using various values of alpha.
What do you notice about using alpha = 0 and alpha = 1? (Compare to the probabilities computed by the previous methods.) What about when alpha = 0.01? Are the estimated probabilities more similar to MLE or Laplace smoothing in this case?
# Input: word (string), context (string), alpha (float)
# Output: p (float)
# Compute the Lidstone smoothed probability for word given the single word context
# Alpha is the smoothing parameter, normally between 0 and 1.
def ex2(word,context,alpha):
p =0.0
austen_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
austen_bigrams = zip(austen_words[:-1], austen_words[1:]) # list of bigrams as tuples
# (above doesn't include begin/end of corpus: but basically this is fine)
# Compute the vocabulary size
V =
# Compute the probability of a word given context
p =
# Return probability
return p
### Test your code with these examples
print "LIDSTONE, alpha=0.01:"
result2a = ex2('end','the',.01)
print "Probability of \'end\' given \'the\': " + str(result2a)
result2b = ex2('the','end',.01)
print "Probability of \'the\' given \'end\': " + str(result2b)
print "LIDSTONE, alpha=0:"
result2c = ex2('end','the',0)
print "Probability of \'end\' given \'the\': " + str(result2c)
result2d = ex2('the','end',0)
print "Probability of \'the\' given \'end\': " + str(result2d)
print "LIDSTONE, alpha=1:"
result2e = ex2('end','the',1)
print "Probability of \'end\' given \'the\': " + str(result2e)
result2f = ex2('the','end',1)
print "Probability of \'the\' given \'end\': " + str(result2f)
Now we will look at the effects of incorporating backoff in addition to some of these simple smoothing methods. In a bigram language model with backoff, the probability of an unseen bigram is computed by “backing off”: that is, if a word has never been seen in a particular context, then we compute its probability by using one fewer context words. Backing off from a bigram model (one word of context) therefore means we’d get estimates based on unigram frequencies (no context).
The mathematical details of backoff are a bit complex, mainly to ensure that all the probabilities still sum to 1. You needn’t understand all the details of backoff, but you should understand the basic principles: the probabilities of seen n-grams are discounted slightly, and the probability mass freed up in this way is shared out among unseen n-grams in proportion to their lower-order (here, unigram) probabilities. That is, a word with high corpus frequency will have a higher unigram backoff probability than a word with low corpus frequency. (The sketch after this paragraph illustrates the idea.) Look back at the initialisation method for NgramModel earlier in the lab. If you pass in MLEProbDist as the estimator (which we did in the last lab), then no backoff is used. However, with any other estimator (i.e., smoothing), NgramModel does use backoff.
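The sketch below illustrates these principles for a bigram model. It is a simplified, hypothetical illustration, not the actual nltkx implementation: smoothed_bigram, unigram and leftover stand in for quantities the real model computes internally.
# Conceptual sketch of backoff in a bigram model (not the real implementation).
def backoff_prob_sketch(word, context, smoothed_bigram, unigram, leftover):
    # smoothed_bigram: dict mapping (context, word) -> discounted bigram probability
    # unigram: dict mapping word -> unigram probability
    # leftover: dict mapping context -> probability mass reserved for unseen continuations
    # (the real model also renormalises so that probabilities sum to 1)
    if (context, word) in smoothed_bigram:
        return smoothed_bigram[(context, word)]   # bigram was seen: use its discounted estimate
    return leftover[context] * unigram[word]      # unseen: back off to the unigram estimate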
Use the code we have provided to train a bigram language model using Jane Austen’s “Sense and Sensibility” and Laplace smoothing. (By using Laplace as the estimator, you are also turning on backoff in NgramModel.) Use this language model to compute the probability of the same bigrams we’ve been looking at all along. Uncomment the test code to see the results.
# Input: word (string), context (string)
# Output: p (float)
def ex3(word,context):
p =0.0
# Train a bigram language model using a LAPLACE estimator AND BACKOFF
austen_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]
lm = NgramModel(2,austen_words,estimator=lambda f,b: LaplaceProbDist(f,b+1))
# b+1 is necessary to provide, as it were, a bin for unseen contexts,
# so there is some probability left for the backoff probability, i.e. so
# that alpha is > 0.
# Compute probability of word given context.
# This method takes two arguments: the word and a *list* of words
# specifying its context.
# To see messages about backoff in action, use the named argument
# verbose=True.
p = lm.prob(...)
# Return probability
return p
### Test your code with these examples
print "BACKOFF WITH LAPLACE"
result3a = ex3('end','the')
print "Probability of \'end\' given \'the\': " + str(result3a)
result3b = ex3('the','end')
print "Probability of \'the\' given \'end\': " + str(result3b)
In language modelling, a model is trained on a set of data (i.e. the training data). The cross-entropy of this model may then be measured on a test set (i.e. another set of data that is different from the training data) to assess how accurate the model is in predicting the test data.
Another way to look at this is: if we used the trained model to generate new sentences by sampling words from its probability distribution, how similar would those new sentences be to the sentences in the test data? This interpretation allows us to use cross-entropy for authorship detection, as described below.
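Concretely, for a bigram model the per-word cross-entropy of a test text w_1 ... w_N is the average of -log2 P(w_i | w_{i-1}) under the model. The sketch below spells this out (per_word_cross_entropy_sketch is a hypothetical helper, assuming lm is a trained, smoothed bigram model; the entropy method quoted below does essentially this for you).
import math

# Sketch: per-word cross-entropy of a test text under a bigram model lm.
def per_word_cross_entropy_sketch(lm, test_words):
    logprobs = [-math.log(lm.prob(w, [c]), 2)
                for c, w in zip(test_words[:-1], test_words[1:])]
    return sum(logprobs) / len(logprobs)   # average negative log2 probability per word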
nltkx.NgramModel contains the following cross-entropy method:
def entropy(self, text, pad_left=False, pad_right=False,
verbose=False, perItem=False):
"""
Calculate the approximate cross-entropy of the n-gram model for a
given evaluation text.
This is the average log probability of each item in the text.
:param text: items to use for evaluation
:type text: iterable(str)
:param pad_left: whether to pad the left of each text with an (n-1)-gram\
of <s> markers
:type pad_left: bool
:param pad_right: whether to pad the right of each sentence with an </s>\
marker
:type pad_right: bool
:param perItem: normalise for length if True
:type perItem: bool
"""
We can use cross-entropy in authorship detection. For example, suppose we have a language model trained on Jane Austen’s “Sense and Sensibility” (training data) plus the texts of two other novels (test data), one by Jane Austen and one by another author, but we don’t know which is which. We can work out the cross-entropy of the model on each of the texts and, from the scores, determine which of the two test texts was more likely written by Jane Austen. Use NgramModel’s entropy function: lm.entropy(...)
Compute both the total cross-entropy and the per-word cross-entropy of each text. (Separate function templates are provided.)
# Solution for exercise 4 - entropy calculation
# Input: lm (NgramModel language model), doc_name (string)
# Output: e (float)
def ex4_tot_entropy(lm,doc_name):
e = 0.0
# Construct a list of lowercase words from the document (test document)
doc_words =
# Compute the total cross entropy of the text in doc_name
e =
# Return the entropy
return e
# Solution for exercise 4 - per-word entropy calculation
# Input: lm (NgramModel language model), doc_name (string)
# Output: e (float)
def ex4_perword_entropy(lm,doc_name):
e = 0.0
# Construct a list of lowercase words from the document (test document)
doc_words =
# Compute the normalized cross entropy of the text in doc_name
e =
# Return the entropy
return e
# Solution for exercise 4 - language model training
# Input: doc_name (string)
# Output: l (language model)
def ex4_lm(doc_name):
l = None
# Construct a list of lowercase words from the document (training data for lm)
doc_words =
# Train a trigram language model using doc_words and a Lidstone probability distribution with +0.01 added to the sample count for each bin
l =
# Return the language model
return l
### Test your code with these examples
lm4 = ex4_lm('austen-sense.txt')
result4a = ex4_tot_entropy(lm4,'austen-emma.txt')
print "Total cross-entropy for austen-emma.txt: " + str(result4a)
result4b = ex4_tot_entropy(lm4,'chesterton-ball.txt')
print "Total cross-entropy for chesterton-ball.txt: " + str(result4b)
result4c = ex4_perword_entropy(lm4,'austen-emma.txt')
print "Per-word cross-entropy for austen-emma.txt: " + str(result4c)
result4d = ex4_perword_entropy(lm4,'chesterton-ball.txt')
print "Per-word cross-entropy for chesterton-ball.txt: " + str(result4d)