Sequence and adjacency are important organising principles of natural languages
At the very least, the hypothesis that there is some N for which relative frequency of N-character, or N-word, or N-part-of-speech, sequences is a good model of English, or some other language, deserves to be explored
Because collecting N-gram frequencies is relatively easy, they are often used as a baseline or first-approximation for language modelling.
Let's see how well N-character models work for different 'N'
[linux prompt]> export PYTHONPATH=/group/ltg/projects/fnlp
# Load the NLTK book corpora and the course's normalisation helper (norm.py,
# found via the PYTHONPATH set above)
from nltk.book import *
from norm import *
# Normalise the raw text of Moby Dick into a list of characters
m=[x for x in norm(y for y in gutenberg.raw('melville-moby_dick.txt'))]
# Single-character frequencies, turned into a probability distribution
uf=FreqDist(l for l in m)
from nltk import DictionaryProbDist
up=DictionaryProbDist(uf,normalize=True)
# Generate 99 characters by sampling each one independently (a unigram model)
''.join(up.generate() for n in range(0,99))
# Build bigram, trigram and 4-gram character models and generate from each
from nltk.model import *
bm=NgramModel(2,m)
tm=NgramModel(3,m)
fm=NgramModel(4,m)
''.join(bm.generate(100,'T'))
''.join(tm.generate(100,'T'))
''.join(fm.generate(100,'T'))
A language model is a hypothesis about what strings are in a language
So it is a 'model' in the same sense as a scale model of an airplane: a simplified approximation of the real thing
Spelling error detection
Spelling error correction
Indeed 'correction' more broadly
In these cases, the model serves to flag improbable sequences, or to select from among alternatives
Augmentation
In these cases, the model is a source of (alternative) predictions
Consider two alternative sequences of English words:
years three the spent mast before I
I spent three years before the mast
Neither of these strings used to appear in any documents indexed by Google (and now they do, but only in copies of these slides :-)
Can we nonetheless quantify our intuition that the second is much more likely than the first to turn up some day (outside a linguistics lecture)?
Comparing the observed frequency of the two sequences is no good, since in both cases that's zero.
But we can approximate the probability of these sequences, based on frequencies which are not zero
The probability of the whole sequence is the joint probability of each word in a seven-word sequence having the 'right' value
The joint probability of two events, P(X,Y), is just P(X)P(Y|X) (or, by symmetry, P(Y)P(X|Y))
We write P(X|Y) for the conditional probability of X given Y.
For example, consider how my geek friend Hector chooses his clothing for the week:
Every Monday morning, he picks one shirt, one pair of trousers and two socks, at random
Every weekend he washes what he wore that week, and puts them back in the closet
What are the chances Hector turns up to work on Monday with matching shirt and trousers?
1/3 * 1/3 == 1/9 for all black, plus 1/9 for all blue, plus 1/9 for all brown
So the answer is 1/9 + 1/9 + 1/9 == 1/3
Now for the socks. What are the chances he turns up to work on Monday with matching socks?
Not 1/2 * 1/2 == 1/4: the two sock picks are not independent
P(first sock is black) == 1/2, but P(second sock is black | first sock is black) == 1/3
So P(both black) == 1/2 * 1/3 == 1/6, and likewise for the other colour
The answer is 1/6 + 1/6 == 1/3
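The sock arithmetic is easy to check by brute force. Here is a minimal Python sketch; the drawer contents (two black and two blue socks) are not stated above, but are an assumption consistent with the 1/2 and 1/3 figures:
# Sanity-check the sock calculation by enumerating all ordered draws of two
# socks from an assumed drawer of two black and two blue socks.
from itertools import permutations
from fractions import Fraction

socks = ['black', 'black', 'blue', 'blue']
draws = list(permutations(socks, 2))              # all ordered two-sock draws
matching = [pair for pair in draws if pair[0] == pair[1]]
print(Fraction(len(matching), len(draws)))        # 1/3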
When we have more than two constituents in a joint probability, we can apply the definition repeatedly.
So for a 4-word phrase, with joint probability P(w1,w2,w3,w4), we apply the definition three times, choosing to work from right to left:
P(w1,w2,w3,w4) = P(w4|w1,w2,w3) P(w1,w2,w3)
P(w1,w2,w3) = P(w3|w1,w2) P(w1,w2)
P(w1,w2) = P(w2|w1) P(w1)
When we substitute each line into the one above it, we get:
P(w1,w2,w3,w4) = P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w1,w2,w3)
Or in general
P(w1,...,wn) = P(w1) P(w2|w1) P(w3|w1,w2) ... P(wn|w1,...,wn-1)
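To make the general pattern concrete, here is a tiny Python sketch (not from the slides) that just prints the chain-rule factorisation of our example sentence:
# Print the chain-rule factorisation P(w1) P(w2|w1) ... for a word sequence.
def chain_rule_factors(words):
    return ['P(%s)' % w if i == 0 else 'P(%s | %s)' % (w, ', '.join(words[:i]))
            for i, w in enumerate(words)]

print(' * '.join(chain_rule_factors('I spent three years before the mast'.split())))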
There's a problem with the full expansion the chain rule gives us
We would still need frequency counts for very improbable things, e.g.
P(the|I,spent,three,years,before)
(One occurrence on the Web as of today, according to Google)
So we approximate the chain with N-gram probabilities
P(w1,...,wn) is approximately P(w1) P(w2|w1) P(w3|w2) ... P(wn|wn-1), using bigrams, or
P(w1,...,wn) is approximately P(w1) P(w2|w1) P(w3|w1,w2) P(w4|w2,w3) ... P(wn|wn-2,wn-1), using trigrams
(In all of the above, we mean the right thing to happen at the left margin)
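As a sketch of what the bigram approximation looks like in code, here is a toy Python example; the probability table is entirely made up for illustration, and '<s>' is just a stand-in for the left-margin handling mentioned above:
# Bigram approximation: multiply P(word | previous word) along the sequence.
# The probabilities below are invented purely for illustration.
toy_bigram_prob = {
    ('<s>', 'I'): 0.2, ('I', 'spent'): 0.01, ('spent', 'three'): 0.05,
    ('three', 'years'): 0.1, ('years', 'before'): 0.02,
    ('before', 'the'): 0.3, ('the', 'mast'): 0.001,
}

def bigram_sequence_prob(words):
    prob, prev = 1.0, '<s>'
    for w in words:
        prob *= toy_bigram_prob[(prev, w)]
        prev = w
    return prob

print(bigram_sequence_prob('I spent three years before the mast'.split()))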
The simplest estimate of probability is based directly on frequency
The maximum likelihood estimate (or MLE) of the probability of an event is its normalised frequency in a sample
The sum of the MLEs of all the items in a sample must be 1
How this works is obvious for unigrams, but what about bigrams or trigrams?
Remember that what we need is, for example, P(mast|before,the)
In practice what this means is that our sample is all trigrams whose first two words are "before the", and the frequency we care about is that of the trigram "before the mast"
With a little thought, it becomes evident that the formula for the estimate we want is P(wn|wn-2,wn-1) = C(wn-2 wn-1 wn) / C(wn-2 wn-1)
So for our example, that's just C(before the mast)/C(before the)
In the Gutenberg Moby Dick corpus, C(before the mast)/C(before the) turns out to be 0.0727: "before the" appears 55 times, and "before the mast" appears 4 times.
The unigram probability of "mast" in Moby Dick is only .00049, and its conditional probability given "the" is still only .0029.
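The counts above can be reproduced (approximately) with NLTK. The sketch below lower-cases the standard Gutenberg tokenisation, which differs from the normalisation used earlier, so the exact numbers may differ slightly from those on the slide:
# Estimate P(mast | before, the) by MLE from trigram and bigram counts.
from nltk.corpus import gutenberg
from nltk import ngrams, FreqDist

words = [w.lower() for w in gutenberg.words('melville-moby_dick.txt')]
bigram_counts = FreqDist(ngrams(words, 2))
trigram_counts = FreqDist(ngrams(words, 3))

print(trigram_counts[('before', 'the', 'mast')] /
      bigram_counts[('before', 'the')])           # roughly 0.07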
Looking at this from the positive side: the more context we condition on, the more probable the word we actually saw becomes (.00049 vs .0029 vs .0727)
Multiplying lots of (small) probabilities very quickly gets us into trouble
The answers become too small to be accurately represented even as double-precision floating point
And although hardware is beginning to support arbitrary-precision decimal arithmetic, that is not (yet) a practical solution
So instead of multiplying probabilities, we add their logarithms (log probabilities)
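A minimal Python illustration (not from the slides) of the underflow problem and the log-probability fix:
import math

p = 1e-5                          # a typical small per-word probability
probs = [p] * 100                 # pretend we have a 100-word sequence

product = 1.0
for q in probs:
    product *= q
print(product)                    # 0.0: the true value 1e-500 underflows

print(sum(math.log(q) for q in probs))   # about -1151.3, no trouble at all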
How much data is enough?
The rule of thumb for Zipf's Law distributions is "n-squared"
Given that there are 17231 distinct lower-case word types in the Gutenberg Moby Dick, the rule of thumb suggests we would need on the order of 17231 squared (roughly 300 million) words of data, far more than the novel itself provides
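The 17231 figure, and the resulting data requirement, can be checked with a short NLTK sketch; the count depends on tokenisation details, so treat it as approximate:
# Count distinct lower-cased word types in Moby Dick and apply the
# n-squared rule of thumb.
from nltk.corpus import gutenberg

types = set(w.lower() for w in gutenberg.words('melville-moby_dick.txt'))
print(len(types))          # about 17 thousand types
print(len(types) ** 2)     # roughly 3e8: the amount of data the rule suggests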
Read Chapter 4 of Jurafsky & Martin, from the beginning through section 4.4.
Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/
Do your own work yourself
Tutorial group timing finally stabilised: no excuses for not attending.
You really will need to have access to a copy of the text, Jurafsky and Martin 2nd edition.
Notices will come via the mailing list, so do register for the course ASAP even if you're not sure you will take it.
First assignment due a week from today.