The aims of this lab session are to 1) familiarize students with pointwise mutual information (PMI), 2) show how to apply PMI to the task of finding word collocations, and 3) identify the shortcomings of this approach.
Before continuing with this lab sheet, please download a copy of the lab template (lab6.py) from the FNLP course website. The template contains code that you should use as a starting point when attempting the exercises. In this lab we consider the task of identifying word collocations in a corpus to demonstrate the use of Pointwise Mutual Information (PMI).
$\mathrm{PMI}(x_i, y_j) = \log\frac{P(X=x_i, Y=y_j)}{P(X=x_i)\,P(Y=y_j)}$
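For example, with the hypothetical values $P(X=x_i, Y=y_j) = 0.01$, $P(X=x_i) = 0.1$ and $P(Y=y_j) = 0.02$ (made-up numbers, purely for illustration, using base-2 logarithms as in the "binary orders of magnitude" comparison later in the lab), we get $\mathrm{PMI}(x_i, y_j) = \log_2\frac{0.01}{0.1 \times 0.02} = \log_2 5 \approx 2.32$: the pair co-occurs five times more often than it would if the two words were independent.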
import operator
import nltk
from nltk.probability import FreqDist,MLEProbDist
from nltk.corpus import gutenberg
from nltk import bigrams
from nltk.corpus import stopwords
from math import log
from pprint import pprint
class FasterMLEProbDist(MLEProbDist):
    '''Speed up prob() lookups for large sample sizes.'''
    def __init__(self, freqdist):
        self._N = freqdist.N()
        if self._N == 0:
            self._empty = True
        else:
            self._empty = False
            self._pq = float(self._N)  # cache the denominator as a float
        MLEProbDist.__init__(self, freqdist)

    def prob(self, sample):
        '''Use the cached quotient for the division.'''
        if self._empty:
            return 0
        else:
            return float(self._freqdist[sample]) / self._pq
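As a quick illustration of how this class is used (a toy example, not part of the template):

# Illustrative only: an MLE distribution over a toy three-token sample.
toy_fd = FreqDist(['whale', 'whale', 'sea'])
toy_pd = FasterMLEProbDist(toy_fd)
print toy_pd.prob('whale')  # 2/3 ~ 0.667
print toy_pd.prob('ahab')   # 0.0 for an unseen sample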
stopwords = stopwords.words('english')
def Filter1(word):
    '''Keep alphabetic tokens only.'''
    return word.isalpha()

def Filter2(word):
    '''Keep alphabetic tokens that are not stopwords.'''
    return word.isalpha() and word.lower() not in stopwords
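A quick check of what each filter admits (illustrative):

print Filter1("whale"), Filter1("1851"), Filter1(",")  # True False False
print Filter2("whale"), Filter2("the")                 # True False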
The data set consists of bigrams extracted from the novel Moby Dick.
sentences=gutenberg.sents('melville-moby_dick.txt')
In order to compute the PMI we need the joint probability $P(X=x, Y=y)$ and the marginal probabilities $P(X=x)$ and $P(Y=y)$. In our case $P(X=x)$ and $P(Y=y)$ will be the unigram probabilities of the two words that are considered to form a collocation, and $P(X=x, Y=y)$ will be the bigram probability.
In this exercise we will compute the joint and marginal probability distributions for the word bigrams. You will have to fill in two functions, BuildData and ex1, to achieve this. The following built-in documentation may be useful:
help(nltk.bigrams)
help(FasterMLEProbDist)
Now uncomment the test code. Compare the top 30 most frequent bigrams arising from the two different filters.
Most of the former are made up of closed-class words. If we eliminate stopwords, some interesting bigrams over content words show up.
def BuildData(sentences, Filter):
    #TODO: build the list of unigrams and the list of bigrams from the filtered sentences
    #unigrams_list =
    #bigrams_list =
    return bigrams_list, unigrams_list
#TODO: using the data, build the probability distributions over bigrams and unigrams using FasterMLEProbDist
def ex1(bigrams, unigrams):
    #TODO: build the frequency distributions over bigrams and unigrams
    bigramFreqDist =
    unigramFreqDist =
    #TODO: build the probability distributions from the above frequency distributions using the FasterMLEProbDist estimator
    bigramProbDist =
    unigramProbDist =
    return bigramProbDist, unigramProbDist
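If you get stuck, here is one possible completion of both functions (a sketch, not necessarily the intended solution; filtering each sentence separately is an assumption that keeps bigrams from crossing sentence boundaries):

# A possible completion (sketch). Filtering per sentence prevents bigrams from
# spanning sentence boundaries; filtering the flat token stream instead would
# be a different, also defensible, choice.
def BuildData(sentences, Filter):
    unigrams_list = []
    bigrams_list = []
    for sent in sentences:
        words = [w for w in sent if Filter(w)]  # keep tokens passing the filter
        unigrams_list.extend(words)
        bigrams_list.extend(bigrams(words))     # nltk.bigrams over one sentence
    return bigrams_list, unigrams_list

def ex1(bigrams_list, unigrams_list):
    # frequency distributions over the observed bigrams and unigrams
    bigramFreqDist = FreqDist(bigrams_list)
    unigramFreqDist = FreqDist(unigrams_list)
    # MLE probability distributions built with the estimator defined above
    bigramProbDist = FasterMLEProbDist(bigramFreqDist)
    unigramProbDist = FasterMLEProbDist(unigramFreqDist)
    return bigramProbDist, unigramProbDist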
def test1():
    bigrams, unigrams = BuildData(sentences, Filter1)
    bigramProbDist1, unigramProbDist1 = ex1(bigrams, unigrams)
    print "type: ", type(bigramProbDist1) # <class '__main__.FasterMLEProbDist'>
    print "type: ", type(unigramProbDist1) # <class '__main__.FasterMLEProbDist'>
    MLESorted = bigramProbDist1.freqdist().most_common(30)
    print "Using filter 1:"
    pprint(MLESorted)
    print "type: \n", type(MLESorted) # <type 'list'>
    bigrams, unigrams = BuildData(sentences, Filter2)
    bigramProbDist, unigramProbDist = ex1(bigrams, unigrams)
    MLESorted = bigramProbDist.freqdist().most_common(30)
    print "Using filter 2:"
    pprint(MLESorted)
    print "\n"
    return bigramProbDist1, unigramProbDist1
# TEST EXERCISE 1 - return values will be used for exercise 2
bigramProbDist, unigramProbDist = test1()
In the previous section we estimated the joint and marginal probability distributions from the data. In this section we use these distributions to compute the PMI for a given sample. To avoid underflow from multiplying small floating-point numbers (probabilities), we can rewrite the formula for PMI as: $\mathrm{PMI}(x_i, y_j) = \log P(x_i, y_j) - (\log P(x_i) + \log P(y_j))$
The template of the function that you have to implement takes two parameters: the bigram and unigram probability distributions.
Uncomment the test code. Look at the PMI of some bigrams. We can see that the PMI for "sperm whale" is 5 binary orders of magnitude greater than the PMI of "of the". Can you think of some reasons why the PMI for "old man" is not as low as we would expect?
What can you observe by looking at the top 10 bigrams according to the PMI? How do low counts affect the PMI?
#TODO: for each sample in the bigram distribution, compute the PMI and add the (sample, PMI) pair to PMIs
#input: bigram and unigram distributions of type FasterMLEProbDist
#output: PMI dict, PMIsorted list
def ComputePMI(bpd, upd):
    # data structure to hold the PMI score for every bigram; note that
    # dict(PMIs) below implies PMIs is a list of (sample, PMI) pairs
    #TODO: compute the PMI for each sample in bpd and add the (sample, PMI) pair to PMIs
    PMIs =
    # list of (bigram, PMI) pairs sorted by descending PMI score
    PMIsorted = sorted(PMIs, key=operator.itemgetter(1), reverse=True)
    return dict(PMIs), PMIsorted
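If you want to check your answer, here is one possible completion (a sketch; it assumes base-2 logarithms, consistent with the "binary orders of magnitude" comparison above, and since samples() iterates only over observed bigrams, all probabilities are non-zero):

# A possible completion (sketch): PMI in log space, base 2.
def ComputePMI(bpd, upd):
    PMIs = []
    for sample in bpd.samples():   # each sample is a (w1, w2) bigram tuple
        w1, w2 = sample
        pmi = log(bpd.prob(sample), 2) - (log(upd.prob(w1), 2) + log(upd.prob(w2), 2))
        PMIs.append((sample, pmi))
    # list of (bigram, PMI) pairs sorted by descending PMI score
    PMIsorted = sorted(PMIs, key=operator.itemgetter(1), reverse=True)
    return dict(PMIs), PMIsorted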
def test2(bpd, upd):
    print "type: ", type(bpd) # <class '__main__.FasterMLEProbDist'>
    print "type: ", type(upd) # <class '__main__.FasterMLEProbDist'>
    PMIs, PMIsorted = ComputePMI(bpd, upd)
    print "type: ", type(PMIs) # <type 'dict'>
    print "type: ", type(PMIsorted) # <type 'list'>
    print "sperm whale %0.2f" % PMIs[("sperm", "whale")]
    print "of the %0.2f" % PMIs[("of", "the")]
    print "old man %0.2f" % PMIs[("old", "man")] # why is this not as close to 0 as we would expect? (hint: sparse data)
    print "one side %0.2f" % PMIs[("one", "side")]
    print "\n"
    bcount = bpd.freqdist()
    print "Top 10 by PMI"
    print "%s\t%s\t%s" % ('PMI', 'n', 'pair')
    for pair in PMIsorted[:10]:
        print "%0.2f\t%d\t%s" % (pair[1], bcount[pair[0]], pair[0])
    return PMIsorted
# TEST EXERCISE 2 - return values will be used for exercise 3
PMIsorted = test2(bigramProbDist, unigramProbDist)
In the previous exercise we found that PMI is very sensitive to data sparsity: bigrams composed of low-frequency words are ranked higher by PMI than bigrams composed of high-frequency words. One way to mitigate this is to put a threshold on the bigram frequency. Edit the ex3 function to do this:
What does a negative score say about a bigram?
Optionally, you can eliminate stopwords from the corpus by applying the second Filter function, then recompute the PMI and investigate the top and bottom bigrams.
def ex3(PMIsorted, bpd):
    #TODO: build a list of (bigram, PMI) pairs restricted to bigrams with frequency greater than 30
    #the result should look like [(bigram, PMI[bigram])] where bigram is a key in bigramProbDist
    bcount =
    return [pair for pair in PMIsorted if bcount[pair[0]] > 30]
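For reference, the blank presumably wants the bigram frequency distribution; a minimal sketch:

# A possible completion (sketch): keep only bigrams seen more than 30 times.
def ex3(PMIsorted, bpd):
    bcount = bpd.freqdist()   # bigram counts
    return [pair for pair in PMIsorted if bcount[pair[0]] > 30]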
def test3(PMIsorted, bpd):
    high_freq = ex3(PMIsorted, bpd)
    print "\nTop 20 by PMI where pair count>30"
    print "%s\t%s\t%s" % ('PMI', 'n', 'pair')
    bcount = bpd.freqdist()
    for pair in high_freq[:20]:
        print "%0.2f\t%d\t%s" % (pair[1], bcount[pair[0]], pair[0])
    print "\nBottom 20 by PMI where pair count>30"
    print "%s\t%s\t%s" % ('PMI', 'n', 'pair')
    for pair in high_freq[-20:]:
        print "%0.2f\t%d\t%s" % (pair[1], bcount[pair[0]], pair[0])
# TEST EXERCISE 3
test3(PMIsorted,bigramProbDist)