FNLP 2011: Tutorial 8: Working with corpora: mutual information

Henry S. Thompson
28 February 2011

1. Mutual information: recap

Set up the Python context as follows:

import nltk
from nltk.corpus import gutenberg
from nltk import FreqDist, bigrams
text1=gutenberg.words('melville-moby_dick.txt')
# Bigram counts over lower-cased words, keeping '.' as a sentence-boundary marker
f = FreqDist(bigrams(w.lower() for w in text1 if (w.isalpha() or w=='.')))
f
from pprint import pprint as pp
pp(f.items()[:30])  # FreqDist.items() comes sorted by decreasing count

I added the full-stop special case (w=='.') above to try to handle sentence boundaries better. If we just got rid of all punctuation, all the sentences would run together, and from e.g. "... and the whale. Shipmates, I do not ..." we would get a ('whale', 'shipmates') bigram, which is just wrong.
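
We can check this concretely (a quick sketch; f2 is just a scratch name, and all we need is for the count to be nonzero):

f2 = FreqDist(bigrams(w.lower() for w in text1 if w.isalpha()))  # no '.' this time
f2[('whale','shipmates')]  # a spurious cross-sentence bigram, from passages like the one quoted above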

But the above full-stop hack really isn't right either: it mis-handles abbreviations, as we can see by trying

f[('mr','.')]

NLTK has built-in sentence tokenisation, so a better way to proceed is to make use of that:

text1=gutenberg.sents('melville-moby_dick.txt')
# Take bigrams within each sentence separately: no cross-sentence pairs
f = FreqDist(p for s in text1 for p in bigrams(w.lower() for w in s if w.isalpha()))
f
pp(f.items()[:30])

That's better: no punctuation and no cross-sentence pairs. But there's only one pair of content words in the top 30...

Let's get rid of the stop words:

from nltk.corpus import stopwords
es = stopwords.words('english')
# As before, but also excluding English stop words
f = FreqDist(p for s in text1 for p in bigrams(w.lower() for w in s if (w.isalpha() and not(w.lower() in es))))
f
pp(f.items()[:20])

A big drop in the size of the bigram table, but it looks much better...

Now let's move from raw frequency to mutual information. We need a unigram frequency tabulation as well, so that we can compute the probability a bigram would have if its two words were independent:

# Unigram counts over the same filtered tokens
u = FreqDist(w.lower() for s in text1 for w in s if (w.isalpha() and not(w.lower() in es)))
u
u['moby']
u['dick']
u['captain']
u['peleg']
u['ahab']

And a function to compute (pointwise) mutual information, I(w1,w2) = log2( P(w1,w2) / (P(w1)P(w2)) ):

from math import log
def mutInf(p,u1,u2,b):
    # log2( P(w1,w2) / (P(w1)*P(w2)) ), all probabilities MLE estimates
    return log((float(b[p])/float(b.N()))/
               ((float(u1[p[0]])*float(u2[p[1]]))/
                (float(u1.N())*float(u2.N()))),
               2)

We allow for separate unigram distributions because we will also use MI to explore bilingual corpora, where the pairs come from pairs of sentences...
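
For the monolingual case we simply pass the same distribution twice, as below. A bilingual call might look like the following purely hypothetical sketch (u_english, u_french and f_pairs are made-up names for distributions built from a sentence-aligned parallel corpus; nothing here defines them):

# Hypothetical only: an (English, French) word pair scored against
# language-specific unigram counts and a pair-frequency table:
# mutInf(('house','maison'), u_english, u_french, f_pairs)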

fmi = [(p,mutInf(p,u,u,f)) for p in f.keys()]  # MI for every bigram type
pp(fmi[:30])
fmi.sort(key=lambda p:p[1],reverse=True)  # highest MI first
pp(fmi[:30])

Hmm, that looks suspicious: they all have exactly the same MI

mi0=fmi[0]
sum(1 for p in fmi if p[1]==mi0[1])
u['alb'] # or the first half of whatever top-30 MI pair catches your eye
u['tunic'] # likewise second half
f[('alb','tunic')]

So, all hapaxes (words which occur only once)! MI is being distorted by sparse data problems.
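
It's easy to see why the scores tie: when the bigram and both of its words each occur exactly once, the formula reduces to the constant log2(u.N()**2/f.N()). A quick check, using u and f as built above:

# For a hapax pair, b[p] = u[w1] = u[w2] = 1, so
# MI = log2( (1/f.N()) / (1/u.N() * 1/u.N()) ) = log2(u.N()**2/f.N())
log(float(u.N())**2/float(f.N()),2)  # should equal mi0[1]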

We can try to find some high-MI pairs with high individual frequencies:

pp([p[0] for p in fmi if u[p[0][0]] > 30 and u[p[0][1]] > 30][:30])  # both words occur more than 30 times

NLTK provides a number of more sophisticated approaches to finding collocations:

from nltk import Text
mt = Text(gutenberg.words('melville-moby_dick.txt'))
mt.collocations()

The default scoring is based on likelihood ratios, which are less sensitive to sparse data than mutual information.
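
The collocations machinery also lets us combine a raw-frequency cutoff with PMI, along the lines of what we did by hand above. A minimal sketch using nltk.collocations, whose association measures include both pmi and likelihood_ratio:

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(
    w.lower() for w in gutenberg.words('melville-moby_dick.txt') if w.isalpha())
finder.apply_freq_filter(5)  # ignore bigrams seen fewer than 5 times
finder.nbest(measures.pmi, 20)  # top 20 by PMI
finder.nbest(measures.likelihood_ratio, 20)  # top 20 by likelihood ratio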

