1. Evaluating N-Grams: Extrinsic vs. intrinsic
So, we've estimated a lot of conditional probabilities
And we can multiply them together (or add their logs) to get a probability
for a whole sentence
How good are our estimates?
Extrinsic Evaluation: How much help are they?
- Embed the language model in some
application (e.g., MT), and quantify any
improvement in performance
- Best, but sometimes impractical or too
expensive
Intrinsic Evaluation: How accurately do they model unseen data?
- Test accuracy of the model on unseen test
data
- Cheaper, and usually correlates with extrinsic
evaluation
2. Evaluation: Entropy
There's no obvious absolute measure, but with a
relative measure we can at least compare two models
We could simply compare the (log) probability estimate for the whole
training set, or for a random sampling of sentences in it, or in a test corpus.
Traditionally this is expressed as the entropy, which is the negative
of the sum of the log (base 2) probabilities of each constituent N-gram
- That is, for a bigram model, H(w_1 ... w_n) = -Σ_i log2 P(w_i | w_{i-1}): the N-gram estimate of the cost, in bits, of the whole string
Strictly speaking, this is cross-entropy
- Using estimates derived from one set of examples
- To compute the entropy of another set
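As a concrete illustration of the computation just described, here is a minimal Python sketch of per-sentence cross-entropy; the bigram_prob function and the toy model are hypothetical, invented purely so the arithmetic is visible.

    # Cross-entropy of a sentence under a bigram model: the negative sum of
    # log2 conditional probabilities, optionally divided by the sentence length.
    from math import log2

    def cross_entropy(words, bigram_prob, per_word=False):
        total = 0.0
        prev = "<s>"                      # sentence-start marker (the "left margin")
        for w in words:
            total += -log2(bigram_prob(prev, w))
            prev = w
        return total / len(words) if per_word else total

    # Toy model: every word gets probability 1/64 given its predecessor.
    toy_model = lambda prev, w: 1 / 64
    sentence = "i spent three years before the mast".split()
    print(cross_entropy(sentence, toy_model))                 # 42.0 bits in total
    print(cross_entropy(sentence, toy_model, per_word=True))  # 6.0 bits per word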
3. Entropy example
So for our "I spent three years before the mast", the entropy using a
bigram model trained on Moby Dick is about 42 bits (being careful about the left margin this time)
In contrast, the entropy with respect to the unigram model
trained on Moby Dick is considerably higher, at around 10 bits per word
And to remind ourselves of why we're doing this in the log domain
- Converting back, the estimated probability of our sentence under the bigram model is 2 to the power of minus its entropy, already a vanishingly small number
- But under the unigram model it's far smaller still
- That's roughly 10 thousand million times less likely.
This gives us a concrete estimate of how much better the bigram model is
than the unigram model
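To see concretely why the log domain matters, here is a small sketch (with an invented per-word probability) showing that multiplying many small probabilities underflows ordinary floating point, while adding their logs does not.

    from math import log2

    p_per_word = 2 ** -10          # invented figure, roughly the unigram level above

    direct = 1.0
    log_total = 0.0
    for _ in range(200):           # a 200-word passage, say
        direct *= p_per_word
        log_total += log2(p_per_word)

    print(direct)                  # 0.0 -- underflowed to zero
    print(log_total)               # -2000.0 -- i.e. an entropy of 2000 bits, no problem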
4. Interpreting entropy
In calculating entropy, we use log (base 2) because this supports the interpretation of the
entropy as the number of bits needed to encode the string using an
information-theoretically optimal encoding, which uses short bitstrings for
common words and longer ones for less common words.
- So for our bigram-based example, the entropy of about 42 bits means we would need 42 bits to encode the whole sentence
- or 6 bits per word
- which is at least a factor of 4 better than the average of 24 bits
per word using ASCII
5. Perplexity
Perplexity is just 2^H, where H is the per-word entropy
Word-level perplexity can be understood as the average branching factor
at each point in the language
- So for our bigram example, a word-level entropy of 6 bits == a perplexity of 2^6 == 64, or approximately 64 choices at each point in the sentence.
Our unigram model has a word-level entropy of 10 bits per word, giving a perplexity of around 1000 (2^10 = 1024)
- Still much better than ASCII, where an entropy of 24 bits per word == a perplexity of around 17 million (2^24)
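A few lines of Python reproduce the entropy-to-perplexity conversions quoted above.

    # Perplexity is 2 to the per-word entropy.
    def perplexity(bits_per_word):
        return 2 ** bits_per_word

    print(perplexity(6))     # 64         (bigram example)
    print(perplexity(10))    # 1024       (unigram example, "around 1000")
    print(perplexity(24))    # 16777216   (ASCII, "around 17 million")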
"Typical perplexities yielded by n-gram
models on English text range from about 50 to almost 1000 (corresponding to cross-entropies from
about 6 to 10 bits/word), depending on the type of text."
From An Empirical Study of Smoothing Techniques for
Language Modeling, Chen and Goodman 1998, which I recommend for anyone
interested in the details of language modelling and smoothing.
6. The impact of missing data
What about the entropy of our sentence using a trigram model trained on
Moby Dick?
We have a problem
- The trigram "i spent three" doesn't occur in Moby Dick
at all
- So the probability estimate (using Moby Dick-trained trigrams) of our string is 0!
- And the entropy is infinite (the log probability is negative infinity). . .
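A tiny sketch of the failure mode, with invented per-trigram estimates standing in for the Moby Dick model: one zero probability sends the whole-sentence estimate to zero and the entropy (the total cost in bits) to infinity.

    from math import log2, inf

    # Invented values; the 0.0 stands in for P(three | "i spent"), unseen in Moby Dick.
    trigram_probs = [0.5, 0.25, 0.0, 0.125]

    sentence_prob = 1.0
    for p in trigram_probs:
        sentence_prob *= p
    print(sentence_prob)                                     # 0.0

    entropy = sum(-log2(p) if p > 0 else inf for p in trigram_probs)
    print(entropy)                                           # inf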
The reason for this is hidden in the name of our probability estimation
method: maximum likelihood estimation
- We've hit a classic problem, known as over-fitting
- By making Moby Dick's word sequences maximally
likely, we've made all other sentences minimally likely!
We need to take some probability away from the trigrams in
Moby Dick
- So there's a bit left over to assign to the trigrams that
aren't there
There is a wide range of techniques for this, which go under the name of smoothing
7. Smoothing: Just add one
One of the oldest (it dates back to Laplace in the 18th century) methods
is to just "add one"
- That is, to adjust the raw counts by assuming everything happened one
more time
- This has the desired side-effect of increasing the counts for things
that didn't happen from 0 to 1
- Jurafsky & Martin work through an example of this in careful
detail in section 4.5.1
This appears to presuppose that you know all the things that
might happen, even if you didn't count any occurrences
- The workaround in cases where you don't is to include a single
'unknown' item
- This often makes more sense anyway
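A minimal sketch of add-one smoothing for conditional probabilities, on invented toy counts; add_one_prob and the tiny vocabulary are illustrative only.

    from collections import Counter

    def add_one_prob(word, context_counts, vocab_size):
        # Pretend every word in the vocabulary followed this context once more
        # than it actually did.
        n = sum(context_counts.values())
        return (context_counts[word] + 1) / (n + vocab_size)

    after_i_spent = Counter({"in": 1})   # the only continuation actually observed
    V = 5                                # tiny illustrative vocabulary size
    print(add_one_prob("in", after_i_spent, V))     # 2/6 ~ 0.33, down from MLE 1.0
    print(add_one_prob("three", after_i_spent, V))  # 1/6 ~ 0.17, up from MLE 0.0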
8. Laplace smoothing example
Let's look at our example in detail
- There is only one trigram beginning "i spent" in
Moby Dick: "i spent in", and it only occurs once.
- So the MLE estimate P(w | "i spent") is 1 if w is "in" and 0 for all other words.
- If we added one to every possible trigram beginning "i
spent", using words known to appear in Moby Dick
- The good news: P(three | "i spent") goes up from 0 to 0.00006, because
we added "i spent three" in with a count of 1
- The bad news: P(in | "i spent") goes down from 1 to .0001
- because we increased our N from 1 to 17231
- and only increased the count for "i spent in" from 1 to 2
In general, this version of Laplace smoothing takes too much away from
the knowns, to cover all the unknowns
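The arithmetic above can be reproduced directly; the vocabulary size here is inferred from the slide's figure of 17231 (1 observed token plus one added count per word type), so treat it as approximate.

    V = 17230                      # assumed number of word types in Moby Dick
    n = 1                          # tokens observed after the context "i spent"

    p_three = (0 + 1) / (n + V)    # the unseen continuation "three"
    p_in    = (1 + 1) / (n + V)    # the one continuation we actually saw, "in"

    print(f"{p_three:.5f}")        # 0.00006 -- up from 0
    print(f"{p_in:.4f}")           # 0.0001  -- down from 1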
9. Laplace smoothing, cont'd
The alternative is to treat all unknowns as the same
- I.e. just add a single additional entry to each bigram
context, spelled, say, as UNK
- In our example, this would only reduce P(in | "i spent") from 1 to 0.67
- and say that P(UNK | "i spent") was 0.33
- So what does that give for P(three | "i spent")?
- Divide P(UNK | "i spent") by the number of words in Moby Dick
- Or the number of words in English
- Or use a unigram-frequency weighted scaling
- Or . . .
Empirical evaluation is the only real way to determine what works well and
what doesn't
Values less than one can be added -- this is known as Lidstone smoothing
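For comparison, a sketch of the single-UNK variant for the same context, using the counts from the example above.

    # Instead of one extra count per word type, add just a single UNK entry.
    counts = {"in": 1, "UNK": 0}            # observed continuations of "i spent", plus UNK
    smoothed = {w: c + 1 for w, c in counts.items()}
    total = sum(smoothed.values())          # 3

    p_in  = smoothed["in"] / total          # 2/3 ~ 0.67
    p_unk = smoothed["UNK"] / total         # 1/3 ~ 0.33, to be shared among unseen words
    print(round(p_in, 2), round(p_unk, 2))  # 0.67 0.33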
10. Better smoothing: Good-Turing
Based on a suggestion by Alan Turing, Good-Turing smoothing takes a
different, more sensible approach
It has three key aspects:
- The sample size, N, doesn't change
- The probability estimate for missing items is based on the frequency
of items which appear only once (hapaxes)
- To keep the total probability constant (at 1), this means the
probability estimates for the occurring items have to be reduced from their MLE value
- The necessary reduction in all other estimates, to provide the extra
probability for redistribution to unseen items, is based on a local ratio of counts
- That is, the number of words occurring once vs. the number appearing twice
- the number of words appearing twice vs. the number appearing three times
- etc.
11. Good-Turing in detail
Specifically, we push a bit of probability mass down to the count class below
- So we have some probability to give to the words with count of 0
Working backwards, this amounts to each count being reduced slightly
- This is slightly counter-intuitive
- It's called discounting. . .
Schematically, we get the following:

  c    N_c      p_c = c/N    c·N_c/N    c*             p* = c*/N        c*·N_c/N
  0    (N_0)    0            0          (N_1/N_0)      (N_1/(N_0·N))    N_1/N
  1    N_1      1/N          N_1/N      2·N_2/N_1      2·N_2/(N_1·N)    2·N_2/N
  2    N_2      2/N          2·N_2/N    3·N_3/N_2      3·N_3/(N_2·N)    3·N_3/N
  ...

where
- c is a count or frequency
- N_c is the number of different items with that frequency
- p_c = c/N is the maximum likelihood estimate of the probability for an item with that count
- c·N_c/N is the total probability mass for all the items with that count
- c* is the Good-Turing smoothed version of the count
- p* and c*·N_c/N are the Good-Turing smoothed versions of p_c and c·N_c/N
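A minimal sketch of these quantities in Python, for an invented frequency-of-frequencies table (the values are made up purely for illustration).

    Nc = {1: 100, 2: 40, 3: 20, 4: 12, 5: 8}    # N_c: number of types seen exactly c times
    N = sum(c * n for c, n in Nc.items())       # total number of tokens observed

    def c_star(c):
        # Good-Turing smoothed count: c* = (c+1) * N_{c+1} / N_c
        return (c + 1) * Nc[c + 1] / Nc[c]

    for c in range(1, 5):                       # the top count has no N_{c+1} here
        print(c, c_star(c), c_star(c) / N)      # c, c*, and p* = c*/N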
12. What about N_0?
Note that we only get the parenthesised bits in the 0 row if we actually
know what we're missing
- Which is the case for e.g. bigrams
- Where we can at least estimate the number
- As the total possible (V^2) less the number of distinct bigram types we've seen
- where V is the number of word types in the corpus
- But often not for unigrams
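For bigrams the estimate of N_0 is just this subtraction; both figures below are invented for illustration.

    V = 17230                     # assumed number of word types in the corpus
    seen_bigram_types = 250000    # hypothetical number of distinct bigrams observed

    N0 = V ** 2 - seen_bigram_types
    print(N0)                     # ~297 million bigram types we know we're missing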
13. Good-Turing discounts
The basic idea is to arrange the discounts so that the amount we
add to the total probability in row 0 is matched
by all the discounting in the other rows
Specifically, the Good-Turing discount depends on the adjacent counts: c* = (c+1) · N_{c+1} / N_c
This is usually stated by first defining the discounted
count c* as above and then just defining p* = c*/N
Either way, the important thing is that since the N_c values tend to go down
as c goes up, the multiplier c*/c is less than one, and we
get a discount as required.
The sum of the impact on the total probability of all the discounts is
indeed N_1/N, as required to balance out row 0
- We see this not directly, but because the sum over all rows (including row 0) of the c*·N_c/N column is evidently the same as the sum of the c·N_c/N column, namely 1
An example, based on the first few rows of the frequency of frequencies
for Moby Dick, is worked through in a Good-Turing example spreadsheet
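A quick sketch of the balancing act, reusing the invented frequency-of-frequencies table from the earlier sketch, now with a row 0 added; it is not the Moby Dick spreadsheet, just a check that the identity holds.

    Nc = {0: 500, 1: 100, 2: 40, 3: 20, 4: 12, 5: 8}   # invented; 500 unseen item types
    N = sum(c * n for c, n in Nc.items())              # 328 tokens in total

    def c_star(c):
        return (c + 1) * Nc.get(c + 1, 0) / Nc[c]      # Good-Turing smoothed count

    # The sum of c*·N_c over every row, including row 0, equals the sum of c·N_c (= N):
    print(sum(c * Nc[c] for c in Nc))                  # 328
    print(sum(c_star(c) * Nc[c] for c in Nc))          # 328.0
    # The mass pushed into row 0 is N_1/N ...
    print(c_star(0) * Nc[0] / N, Nc[1] / N)            # both 0.3048... (= 100/328)
    # ... exactly balanced by the discounts in the other rows (note the naive formula
    # discounts the very top count to zero, which is why real implementations only
    # smooth the small counts).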
14. Required reading
Jurafsky & Martin, Chapter 4, sections 4.4--4.7
15. Administrative details
Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/
Do your own work yourself
Tutorial groups now underway: please go to a tutorial.
You really will need to have access to a copy of the text, and the
edition you got for Inf2A last year will not do.
Notices will come via the mailing list.
Assignment due in just over two weeks
- Some clarifications forthcoming