Sequence and adjacency are important organising principles of natural languages
At the very least, the hypothesis that there is some N for which relative frequency of N-character, or N-word, or N-part-of-speech, sequences is a good model of English, or some other language, deserves to be explored
Because collecting N-gram frequencies is relatively easy, they are often used as a baseline or first-approximation for language modelling.
Let's see how well N-character models work for different 'N'
[linux prompt]> export PYTHONPATH=/group/ltg/projects/fnlp
# Load the NLTK book corpora and the course's normalisation helper (norm.py,
# found via the PYTHONPATH set above)
from nltk.book import *
from norm import *
# Normalise the raw text of Moby Dick into a list of characters
m=[x for x in norm(y for y in gutenberg.raw('melville-moby_dick.txt'))]
# Single-character frequencies, turned into a probability distribution
uf=FreqDist(l for l in m)
from nltk import DictionaryProbDist
up=DictionaryProbDist(uf,normalize=True)
# Generate 99 characters by sampling each one independently (a unigram model)
''.join(up.generate() for n in range(0,99))
# Build bigram, trigram and 4-gram character models and generate from each
from nltk.model import *
bm=NgramModel(2,m)
tm=NgramModel(3,m)
fm=NgramModel(4,m)
''.join(bm.generate(100,'T'))
''.join(tm.generate(100,'T'))
''.join(fm.generate(100,'T'))
A language model is a hypothesis about what strings are in a language
So it is a 'model' in the same sense as a scale model of an airplane: a simplified approximation of the real thing
Spelling error detection
Spelling error correction
Indeed 'correction' more broadly
In these cases, the model serves to flag improbable sequences, or to select from among alternatives
Augmentation
In these cases, the model is a source of (alternative) predictions
Consider two alternative sequences of English words:
years three the spent mast before I
I spent three years before the mast
Neither of these strings used to appear in any documents indexed by Google (and now they do, but only in copies of these slides :-)
Can we nonetheless quantify our intuition that the second is much more likely than the first to turn up some day (outside a linguistics lecture)?
Comparing the observed frequency of the two sequences is no good, since in both cases that's zero.
But we can approximate the probability of these sequences, based on frequencies which are not zero
The probability of the whole sequence is the joint probability of each word in a seven-word sequence having the 'right' value
The joint probability of two events, P(X,Y), is just P(X)P(Y|X) (or, by symmetry, P(Y)P(X|Y))
We write P(X|Y) for the conditional probability of X given Y.
For example, consider how my geek friend Hector chooses his clothing for the week:
Every Monday morning, he picks one shirt, one pair of trousers and two socks, at random
Every weekend he washes what he wore that week, and puts them back in the closet
What are the chances Hector turns up to work on Monday with matching shirt and trousers?
1/3 * 1/3 == 1/9 for all black, plus 1/9 for all blue, plus 1/9 for all brown
So the answer is 1/9 + 1/9 + 1/9 == 1/3
Now for the socks. What are the chances he turns up to work on Monday with matching socks?
Not 1/2 * 1/2 == 1/4: the two sock picks are not independent
P(first sock is black) == 1/2, but P(second sock is black | first sock is black) == 1/3
So P(both black) == 1/2 * 1/3 == 1/6, and likewise for the other colour
The answer is 1/6 + 1/6 == 1/3
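The sock arithmetic is easy to check by brute force. Here is a minimal Python sketch; the drawer contents (two black and two blue socks) are not stated above, but are an assumption consistent with the 1/2 and 1/3 figures:
# Sanity-check the sock calculation by enumerating all ordered draws of two
# socks from an assumed drawer of two black and two blue socks.
from itertools import permutations
from fractions import Fraction

socks = ['black', 'black', 'blue', 'blue']
draws = list(permutations(socks, 2))              # all ordered two-sock draws
matching = [pair for pair in draws if pair[0] == pair[1]]
print(Fraction(len(matching), len(draws)))        # 1/3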
When we have more than two constituents in a joint probability, we can apply the definition repeatedly.
So for a 4-word phrase, with joint probability P(w1,w2,w3,w4), we apply the definition three times, choosing to work from right to left:
P(w1,w2,w3,w4) = P(w4|w1,w2,w3) P(w1,w2,w3)
P(w1,w2,w3) = P(w3|w1,w2) P(w1,w2)
P(w1,w2) = P(w2|w1) P(w1)
When we substitute each line into the one above it, we get:
P(w1,w2,w3,w4) = P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3)
Or in general
P(w1,...,wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,...,wn-1)
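To make the general pattern concrete, here is a tiny Python sketch (not from the slides) that just prints the chain-rule factorisation of our example sentence:
# Print the chain-rule factorisation P(w1) P(w2|w1) ... for a word sequence.
def chain_rule_factors(words):
    return ['P(%s)' % w if i == 0 else 'P(%s | %s)' % (w, ', '.join(words[:i]))
            for i, w in enumerate(words)]

print(' * '.join(chain_rule_factors('I spent three years before the mast'.split())))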
There's a problem with the full expansion the chain rule gives us
We would still need frequency counts for very improbable things, e.g.
P(the|I,spent,three,years,before)
(One occurrence on the Web as of today, according to Google)
So we approximate the chain with N-gram probabilities
P(w1,...,wn) is approximately P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1), using bigrams, or
P(w1,...,wn) is approximately P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w2,w3) ... P(wn|wn-2,wn-1), using trigrams
(In all of the above, we mean the right thing to happen at the left margin)
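As a sketch of what the bigram approximation looks like in code, here is a toy Python example; the probability table is entirely made up for illustration, and '<s>' is just a stand-in for the left-margin handling mentioned above:
# Bigram approximation: multiply P(word | previous word) along the sequence.
# The probabilities below are invented purely for illustration.
toy_bigram_prob = {
    ('<s>', 'I'): 0.2, ('I', 'spent'): 0.01, ('spent', 'three'): 0.05,
    ('three', 'years'): 0.1, ('years', 'before'): 0.02,
    ('before', 'the'): 0.3, ('the', 'mast'): 0.001,
}

def bigram_sequence_prob(words):
    prob, prev = 1.0, '<s>'
    for w in words:
        prob *= toy_bigram_prob[(prev, w)]
        prev = w
    return prob

print(bigram_sequence_prob('I spent three years before the mast'.split()))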
The simplest estimate of probability is based directly on frequency
The maximum likelihood estimate (or MLE) of the probability of an event is its normalised frequency in a sample
The sum of the MLEs of all the items in a sample must be 1
How this works is obvious for unigrams, but what about bigrams or trigrams?
Remember that what we need is, for example, P(mast|before,the)
In practice what this means is that our sample is all trigrams whose first two words are "before the", and the frequency we care about is that of the trigram "before the mast"
With a little thought, it becomes evident that the formula for the estimate we want is P(wn|wn-2,wn-1) = C(wn-2 wn-1 wn) / C(wn-2 wn-1)
So for our example, that's just C(before the mast)/C(before the)
In the Gutenberg Moby Dick corpus, C(before the mast)/C(before the) turns out to be 0.0727: "before the" appears 55 times, and "before the mast" appears 4 times.
The unigram probability of "mast" in Moby Dick is only .00049, and its conditional probability given "the" is still only .0029.
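The counts above can be reproduced (approximately) with NLTK. The sketch below lower-cases the standard Gutenberg tokenisation, which differs from the normalisation used earlier, so the exact numbers may differ slightly from those on the slide:
# Estimate P(mast | before, the) by MLE from trigram and bigram counts.
from nltk.corpus import gutenberg
from nltk import ngrams, FreqDist

words = [w.lower() for w in gutenberg.words('melville-moby_dick.txt')]
bigram_counts = FreqDist(ngrams(words, 2))
trigram_counts = FreqDist(ngrams(words, 3))

print(trigram_counts[('before', 'the', 'mast')] /
      bigram_counts[('before', 'the')])           # roughly 0.07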
Looking at this from the positive side: the more context we condition on, the more probable the word we actually saw becomes (.00049 vs .0029 vs .0727)
Multiplying lots of (small) probabilities very quickly gets us into trouble
The answers become too small to be accurately represented even as double-precision floating point
And although hardware is beginning to support arbitrary-precision decimal arithmetic, that is not (yet) a practical solution
So instead of multiplying probabilities, we add their logarithms (log probabilities)
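A minimal Python illustration (not from the slides) of the underflow problem and the log-probability fix:
import math

p = 1e-5                          # a typical small per-word probability
probs = [p] * 100                 # pretend we have a 100-word sequence

product = 1.0
for q in probs:
    product *= q
print(product)                    # 0.0: the true value 1e-500 underflows

print(sum(math.log(q) for q in probs))   # about -1151.3, no trouble at all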
How much data is enough?
The rule of thumb for Zipf's Law distributions is "n-squared"
Given that there are 17231 distinct lower-case word types in the Gutenberg Moby Dick, the rule of thumb suggests we would need on the order of 17231 squared (roughly 300 million) words of data, far more than the novel itself provides
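The 17231 figure, and the resulting data requirement, can be checked with a short NLTK sketch; the count depends on tokenisation details, so treat it as approximate:
# Count distinct lower-cased word types in Moby Dick and apply the
# n-squared rule of thumb.
from nltk.corpus import gutenberg

types = set(w.lower() for w in gutenberg.words('melville-moby_dick.txt'))
print(len(types))          # about 17 thousand types
print(len(types) ** 2)     # roughly 3e8: the amount of data the rule suggests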
Read Chapter 4 of Jurafsky & Martin, from the beginning through section 4.4.
Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/
Do your own work yourself
Tutorial group timing finally stabilised: no excuses for not attending.
You really will need to have access to a copy of the text, Jurafsky and Martin 2nd edition.
Notices will come via the mailing list, so do register for the course ASAP even if you're not sure you will take it.
First assignment due a week from today.