FNLP 2014: Lecture 4: N-gram modelling

Henry S. Thompson
24 January 2014
Creative Commons Attribution-Share Alike

1. Bigrams, trigrams, 4-grams . . . N-grams: A hypothesis

Sequence and adjacency are important organising principles of natural languages

At the very least, the hypothesis that there is some N for which the relative frequencies of N-character, N-word, or N-part-of-speech sequences give a good model of English, or of some other language, deserves to be explored

Because collecting N-gram frequencies is relatively easy, they are often used as a baseline or first approximation for language modelling.

2. Illustration: characters

Let's see how well N-character models work for different 'N'

[linux prompt]> export PYTHONPATH=/group/ltg/projects/fnlp
from nltk.book import *
from norm import *   # course-local character-normalisation helper
# The whole of Moby Dick as a normalised character sequence
m=[x for x in norm(y for y in gutenberg.raw('melville-moby_dick.txt'))]
# Unigram (single-character) frequencies and the corresponding probabilities
uf=FreqDist(l for l in m)
from nltk import DictionaryProbDist
up=DictionaryProbDist(uf,normalize=True)
''.join(up.generate() for n in range(0,99))   # 99 characters sampled independently
# Bigram, trigram and 4-gram character models over the same data
from nltk.model import *
bm=NgramModel(2,m)
tm=NgramModel(3,m)
fm=NgramModel(4,m)
''.join(bm.generate(100,'T'))   # 100 characters generated starting from context 'T'
''.join(tm.generate(100,'T'))
''.join(fm.generate(100,'T'))

3. What is a 'language model'?

A language model is a hypothesis about what strings are in a language

So, similar to a scale model of an airplane: it reproduces some properties of the real thing (here, which strings are likely) while abstracting away from the rest

4. Language models: What are they for, anyway?

Spelling error detection

Spelling error correction

Indeed 'correction' more broadly

In these cases, the model serves to flag improbable sequences, or to select from among alternatives

Augmentation

In these cases, the model is a source of (alternative) predictions

5. Sequence probabilities

Consider two alternative sequences of English words:

years three the spent mast before I
I spent three years before the mast

Until recently, neither of these strings appeared in any document indexed by Google (and now they appear only in copies of these slides :-)

Can we nonetheless quantify our intuition that the second is much more likely than the first to turn up some day (outside a linguistics lecture)?

Comparing the observed frequency of the two sequences is no good, since in both cases that's zero.

But we can approximate the probability of these sequences, based on frequencies which are not zero

6. From joint to conditional probability

The probability of the whole sequence is the joint probability of each word in a seven-word sequence having the 'right' value

The joint probability of two events, P(X,Y), is just P(X)P(Y|X) (or, by symmetry, P(Y)P(X|Y))

We write P(X|Y) for the conditional probability of X given Y.

7. Joint and conditional probability, example

For example, consider how my geek friend Hector chooses his clothing for the week:

shirts: he has three (blue, black and brown)
trousers: he has three pairs (blue, black and brown)
socks: he has four (two black and two brown)

Every Monday morning, he picks one shirt, one pair of trousers and two socks, at random

Every weekend he washes what he wore that week, and puts them back in the closet

8. Independent probabilities, example

What are the chances Hector turns up to work on Monday with matching shirt and trousers?

9. Conditional probability, example

Now for the socks. What are the chances he turns up to work on Monday with matching socks?
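
Both questions can be answered by brute force, since the sample spaces are tiny. The sketch below (not part of the original slides) simply enumerates the equally likely outcomes described above; both probabilities come out to 1/3.

from itertools import product, combinations
from fractions import Fraction

shirts = ['blue', 'black', 'brown']
trousers = ['blue', 'black', 'brown']
# Four individual socks: two black and two brown
socks = [('black', 1), ('black', 2), ('brown', 1), ('brown', 2)]

# Shirt and trousers are chosen independently and uniformly
outfits = list(product(shirts, trousers))
p_outfit = Fraction(sum(1 for s, t in outfits if s == t), len(outfits))

# The two socks are drawn together, without replacement, so the second
# draw is conditioned on the first
pairs = list(combinations(socks, 2))
p_socks = Fraction(sum(1 for a, b in pairs if a[0] == b[0]), len(pairs))

print(p_outfit)   # 1/3
print(p_socks)    # 1/3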

10. The chain rule

When we have more than two constituents in a joint probability, we can apply the definition repeatedly.

So for a 4-word phrase, with joint probability P(w1,x2,y3,z4), we apply the definition three times, choosing to work from right to left:

  1. P(w1,x2,y3,z4) is P(w1,x2,y3)P(z4|w1,x2,y3)
  2. P(w1,x2,y3) is P(w1,x2)P(y3|w1,x2)
  3. P(w1,x2) is P(w1)P(x2|w1)

When we substitute each line into the one above it, we get:

P(w1,x2,y3,z4) is P(w1)P(x2|w1)P(y3|w1,x2)P(z4|w1,x2,y3)

Or in general

P(w_1,w_2,...,w_n) is \prod_{k=1}^{n} P(w_k | w_1^{k-1})
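
For example, applying the rule to the seven-word example sentence from earlier gives:

P(I, spent, three, years, before, the, mast) is
P(I) P(spent|I) P(three|I,spent) P(years|I,spent,three)
P(before|I,spent,three,years) P(the|I,spent,three,years,before)
P(mast|I,spent,three,years,before,the)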

11. From the chain rule to N-grams

There's a problem with the full expansion the chain rule gives us

We would still need frequency counts for very improbable things, e.g.

P(the|I,spent,three,years,before)

(One occurrence on the Web as of today, according to Google)

So we approximate the chain with N-gram probabilities

P(w_1,w_2,...,w_n) is approximately \prod_{k=1}^{n} P(w_k | w_{k-1}), using bigrams, or

P(w_1,w_2,...,w_n) is approximately \prod_{k=1}^{n} P(w_k | w_{k-2}^{k-1}), using trigrams

(In all of the above, the right thing should be understood to happen at the left margin: the first word or two of a sequence have less than the full amount of left context, so in practice we pad with start-of-sequence markers or fall back on shorter histories)
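
To make the bigram version concrete, here is a small sketch in the spirit of the session in section 2, but over words rather than characters. It uses plain, unsmoothed MLE estimates from Moby Dick (lower-casing the tokens is an assumption made here), so any sentence containing an unseen bigram gets probability zero.

from nltk.corpus import gutenberg
from nltk import bigrams, ConditionalFreqDist, FreqDist

words = [w.lower() for w in gutenberg.words('melville-moby_dick.txt')]
uni = FreqDist(words)                       # unigram counts
cfd = ConditionalFreqDist(bigrams(words))   # bigram counts, indexed by first word

def bigram_prob(sentence):
    # P(w1) times the product over k > 1 of the MLE estimate of P(wk|wk-1)
    p = uni.freq(sentence[0])
    for prev, cur in bigrams(sentence):
        p *= cfd[prev].freq(cur)
    return p

bigram_prob('i spent three years before the mast'.split())
bigram_prob('years three the spent mast before i'.split())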

12. Estimating N-gram probabilities

The simplest estimate of probability is based directly on frequency

The maximum likelihood estimate (or MLE) of the probability of an event is its normalised frequency in a sample

The sum of the MLEs of all the items in a sample must be 1

How this works is obvious for unigrams, but what about bigrams or trigrams?

Remember that what we need is, for example, P(mast|before,the)

In practice what this means is that our sample is all trigrams whose first two words are "before the", and the frequency we care about is that of the trigram "before the mast"

With a little thought, it becomes evident that the formula for the estimate we want is

P(w_n | w_{n-N+1}^{n-1}) = C(w_{n-N+1}^{n-1} w_n) / C(w_{n-N+1}^{n-1})

So for our example, that's just C(before the mast)/C(before the)

13. Estimating conditional probabilities, example

In the Gutenberg Moby Dick corpus, C(before the mast)/C(before the) turns out to be 0.0727: "before the" appears 55 times, and "before the mast" appears 4 times.

The unigram probability of "mast" in Moby Dick is only .00049, and its conditional probability given "the" is still only .0029.

Looking at this from the positive side: conditioning on the preceding words "before the" makes "mast" roughly 150 times more likely than its unigram probability would suggest
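
The counts behind these numbers can be checked against the NLTK Gutenberg corpus. The sketch below lower-cases every token, which is an assumption, so the exact figures may differ slightly from those above.

from nltk.corpus import gutenberg
from nltk import bigrams, trigrams

words = [w.lower() for w in gutenberg.words('melville-moby_dick.txt')]

c_bt = sum(1 for b in bigrams(words) if b == ('before', 'the'))
c_btm = sum(1 for t in trigrams(words) if t == ('before', 'the', 'mast'))

# MLE estimate of P(mast|before,the) = C(before the mast) / C(before the)
print(float(c_btm) / c_bt)

# For comparison, the unigram MLE estimate of P(mast)
print(float(words.count('mast')) / len(words))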

14. A practical consideration

Multiplying lots of (small) probabilities very quickly gets us into trouble

The answers become too small to be accurately represented even as double-precision floating point numbers

And although hardware is beginning to support arbitrary-precision decimal arithmetic, we cannot count on it being available, and it is much slower than ordinary floating point

So instead of multiplying probabilities, we add their logarithms and work with log probabilities throughout
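
A quick sketch of both the problem and the fix; the per-word probability of 0.01 is invented purely for illustration.

import math

p = 0.01    # an invented per-word probability, for illustration only
n = 200     # length in words of a longish text

# Multiplying: the true value is 1e-400, which underflows double precision
prod = 1.0
for _ in range(n):
    prod *= p
print(prod)       # prints 0.0

# Adding log probabilities instead: no underflow, and comparisons still work,
# because log is monotonic
logprob = n * math.log(p)
print(logprob)    # about -921.0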

15. Another practical consideration

How much data is enough?

The rule of thumb for Zipf's Law distributions is "n-squared"

Given that there are 17231 distinct lower-case word types in the Gutenberg Moby Dick, the rule of thumb says we would need on the order of 17231 squared, or roughly 300 million, tokens to model it well, far more than the roughly quarter of a million tokens the book itself contains
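
The relevant numbers are easy to reproduce with NLTK; in this sketch the type count assumes every token, punctuation included, is lower-cased.

from nltk.corpus import gutenberg

words = [w.lower() for w in gutenberg.words('melville-moby_dick.txt')]
types = set(words)

print(len(words))         # total tokens in the book
print(len(types))         # distinct lower-case types (17231 with this tokenisation)
print(len(types) ** 2)    # the 'n-squared' rule-of-thumb data requirement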

16. Reading

Read Chapter 4 of Jurafsky & Martin, from the beginning through section 4.4.

17. Administrative details

Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/

Do your own work yourself

Tutorial group timing finally stabilised: no excuses for not attending.

You really will need to have access to a copy of the text, Jurafsky and Martin 2nd edition.

Notices will come via the mailing list, so do register for the course ASAP even if you're not sure you will take it.

First assignment due a week from today.