1. Evaluating N-Grams: Extrinsic vs. intrinsic
So, we've estimated a lot of conditional probabilities
And we can multiply them together (or add their logs) to get a probability
for a whole sentence
How good are our estimates?
Extrinsic Evaluation: How much help are they?
- Embed the language model in some
application (e.g., MT), and quantify any
improvement in performance
- Best, but sometimes impractical or too
expensive
Intrinsic Evaluation: How accurately do they model unseen data?
- Test accuracy of the model on unseen test
data
- Cheaper, and usually correlates with extrinsic
evaluation
2. Evaluation: Entropy
There's no obvious absolute measure, but with a
relative measure we can at least compare two models
We could simply compare the (log) probability estimate for the whole
training set, or for a random sampling of sentences in it, or in a test corpus.
Traditionally this is expressed as the entropy, which is the negative
of the sum of the log (base 2) probabilities of each constituent N-gram
- That is, for a bigram model, H(w_1 ... w_n) = -Σ_i log2 P(w_i | w_{i-1}): the N-gram estimate of the cost, in bits, of the whole string
Strictly speaking, this is cross-entropy
- Using estimates derived from one set of examples
- To compute the entropy of another set
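As a concrete illustration of the computation just described, here is a minimal Python sketch of per-sentence cross-entropy; the bigram_prob function and the toy model are hypothetical, invented purely so the arithmetic is visible.

    # Cross-entropy of a sentence under a bigram model: the negative sum of
    # log2 conditional probabilities, optionally divided by the sentence length.
    from math import log2

    def cross_entropy(words, bigram_prob, per_word=False):
        total = 0.0
        prev = "<s>"                      # sentence-start marker (the "left margin")
        for w in words:
            total += -log2(bigram_prob(prev, w))
            prev = w
        return total / len(words) if per_word else total

    # Toy model: every word gets probability 1/64 given its predecessor.
    toy_model = lambda prev, w: 1 / 64
    sentence = "i spent three years before the mast".split()
    print(cross_entropy(sentence, toy_model))                 # 42.0 bits in total
    print(cross_entropy(sentence, toy_model, per_word=True))  # 6.0 bits per word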
3. Entropy example
So for our "I spent three years before the mast", the entropy using a
bigram model trained on Moby Dick is about 42 bits (being careful about the left margin this time)
In contrast, the entropy with respect to the unigram model
trained on Moby Dick is considerably higher, at around 10 bits per word
And to remind ourselves of why we're doing this in the log domain
- Converting back, the estimated probability of our sentence under the bigram model is 2 to the power of minus its entropy, already a vanishingly small number
- But under the unigram model it's far smaller still
- That's roughly 10 thousand million times less likely.
This gives us a concrete estimate of how much better the bigram model is
than the unigram model
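To see concretely why the log domain matters, here is a small sketch (with an invented per-word probability) showing that multiplying many small probabilities underflows ordinary floating point, while adding their logs does not.

    from math import log2

    p_per_word = 2 ** -10          # invented figure, roughly the unigram level above

    direct = 1.0
    log_total = 0.0
    for _ in range(200):           # a 200-word passage, say
        direct *= p_per_word
        log_total += log2(p_per_word)

    print(direct)                  # 0.0 -- underflowed to zero
    print(log_total)               # -2000.0 -- i.e. an entropy of 2000 bits, no problem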
4. Interpreting entropy
In calculating entropy, we use log (base 2) because this supports the interpretation of the
entropy as the number of bits needed to encode the string using an
information-theoretically optimal encoding, which uses short bitstrings for
common words and longer ones for less common words.
- So for our bigram-based example, the entropy of about 42 bits means we would need 42 bits to encode the whole sentence
- or 6 bits per word
- which is at least a factor of 4 better than the average of 24 bits
per word using ASCII
5. Perplexity
Perplexity is just 2^H, where H is the per-word entropy
Word-level perplexity can be understood as the average branching factor
at each point in the language
- So for our bigram example, a word-level entropy of 6 bits == a perplexity of 2^6 == 64, or approximately 64 choices at each point in the sentence.
Our unigram model has a word-level entropy of 10 bits per word, giving a perplexity of around 1000 (2^10 = 1024)
- Still much better than ASCII, where an entropy of 24 bits per word == a perplexity of around 17 million (2^24)
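A few lines of Python reproduce the entropy-to-perplexity conversions quoted above.

    # Perplexity is 2 to the per-word entropy.
    def perplexity(bits_per_word):
        return 2 ** bits_per_word

    print(perplexity(6))     # 64         (bigram example)
    print(perplexity(10))    # 1024       (unigram example, "around 1000")
    print(perplexity(24))    # 16777216   (ASCII, "around 17 million")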
"Typical perplexities yielded by n-gram
models on English text range from about 50 to almost 1000 (corresponding to cross-entropies from
about 6 to 10 bits/word), depending on the type of text."
From An Empirical Study of Smoothing Techniques for
Language Modeling, Chen and Goodman 1998, which I recommend for anyone
interested in the details of language modelling and smoothing.
6. The impact of missing data
What about the entropy of our sentence using a trigram model trained on
Moby Dick?
We have a problem
- The trigram "i spent three" doesn't occur in Moby Dick
at all
- So the probability estimate (using Moby Dick-trained trigrams) of our string is 0!
- And the entropy is infinite (the log probability is negative infinity). . .
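A tiny sketch of the failure mode, with invented per-trigram estimates standing in for the Moby Dick model: one zero probability sends the whole-sentence estimate to zero and the entropy (the total cost in bits) to infinity.

    from math import log2, inf

    # Invented values; the 0.0 stands in for P(three | "i spent"), unseen in Moby Dick.
    trigram_probs = [0.5, 0.25, 0.0, 0.125]

    sentence_prob = 1.0
    for p in trigram_probs:
        sentence_prob *= p
    print(sentence_prob)                                     # 0.0

    entropy = sum(-log2(p) if p > 0 else inf for p in trigram_probs)
    print(entropy)                                           # inf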
The reason for this is hidden in the name of our probability estimation
method: maximum likelihood estimation
- We've hit a classic problem, known as over-fitting
- By making Moby Dick's word sequences maximally
likely, we've made all other sentences minimally likely!
We need to take some probability away from the trigrams in
Moby Dick
- So there's a bit left over to assign to the trigrams that
aren't there
There is a wide range of techniques for this, which go under the name of smoothing
7. Smoothing: Just add one
One of the oldest (it dates back to Laplace in the 18th century) methods
is to just "add one"
- That is, to adjust the raw counts by assuming everything happened one
more time
- This has the desired side-effect of increasing the counts for things
that didn't happen from 0 to 1
- Jurafsky & Martin work through an example of this in careful
detail in section 4.5.1
This appears to presuppose that you know all the things that
might happen, even if you didn't count any occurrences
- The workaround in cases where you don't is to include a single
'unknown' item
- This often makes more sense anyway
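A minimal sketch of add-one smoothing for conditional probabilities, on invented toy counts; add_one_prob and the tiny vocabulary are illustrative only.

    from collections import Counter

    def add_one_prob(word, context_counts, vocab_size):
        # Pretend every word in the vocabulary followed this context once more
        # than it actually did.
        n = sum(context_counts.values())
        return (context_counts[word] + 1) / (n + vocab_size)

    after_i_spent = Counter({"in": 1})   # the only continuation actually observed
    V = 5                                # tiny illustrative vocabulary size
    print(add_one_prob("in", after_i_spent, V))     # 2/6 ~ 0.33, down from MLE 1.0
    print(add_one_prob("three", after_i_spent, V))  # 1/6 ~ 0.17, up from MLE 0.0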
8. Laplace smoothing example
Let's look at our example in detail
- There is only one trigram beginning "i spent" in
Moby Dick: "i spent in", and it only occurs once.
- So the MLE estimate P(w | "i spent") is 1 if w is "in" and 0 for all other words.
- If we added one to every possible trigram beginning "i
spent", using words known to appear in Moby Dick
- The good news: P(three | "i spent") goes up from 0 to 0.00006, because
we added "i spent three" in with a count of 1
- The bad news: P(in | "i spent") goes down from 1 to .0001
- because we increased our N from 1 to 17231
- and only increased the count for "i spent in" from 1 to 2
In general, this version of Laplace smoothing takes too much away from
the knowns, to cover all the unknowns
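The arithmetic above can be reproduced directly; the vocabulary size here is inferred from the slide's figure of 17231 (1 observed token plus one added count per word type), so treat it as approximate.

    V = 17230                      # assumed number of word types in Moby Dick
    n = 1                          # tokens observed after the context "i spent"

    p_three = (0 + 1) / (n + V)    # the unseen continuation "three"
    p_in    = (1 + 1) / (n + V)    # the one continuation we actually saw, "in"

    print(f"{p_three:.5f}")        # 0.00006 -- up from 0
    print(f"{p_in:.4f}")           # 0.0001  -- down from 1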
9. Laplace smoothing, cont'd
The alternative is to treat all unknowns as the same
- I.e. just add a single additional entry to each bigram
context, spelled, say, as UNK
- In our example, this would only reduce P(in | "i spent") from 1 to 0.67
- and say that P(UNK | "i spent") was 0.33
- So what does that give for P(three | "i spent")?
- Divide P(UNK | "i spent") by the number of words in Moby Dick
- Or the number of words in English
- Or use a unigram-frequency weighted scaling
- Or . . .
Empirical evaluation is the only real way to determine what works well and
what doesn't
Values less than one can be added -- this is known as Lidstone smoothing
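For comparison, a sketch of the single-UNK variant for the same context, using the counts from the example above.

    # Instead of one extra count per word type, add just a single UNK entry.
    counts = {"in": 1, "UNK": 0}            # observed continuations of "i spent", plus UNK
    smoothed = {w: c + 1 for w, c in counts.items()}
    total = sum(smoothed.values())          # 3

    p_in  = smoothed["in"] / total          # 2/3 ~ 0.67
    p_unk = smoothed["UNK"] / total         # 1/3 ~ 0.33, to be shared among unseen words
    print(round(p_in, 2), round(p_unk, 2))  # 0.67 0.33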
10. Better smoothing: Good-Turing
Based on a suggestion by Alan Turing, Good-Turing smoothing takes a
different, more sensible approach
It has three key aspects:
- The sample size, N, doesn't change
- The probability estimate for missing items is based on the frequency
of items which appear only once (hapaxes)
- To keep the total probability constant (at 1), this means the
probability estimates for the occurring items have to be reduced from their MLE value
- The necessary reduction in all other estimates, to provide the extra
probability for redistribution to unseen items, is based on a local ratio of counts
- That is, the number of words occurring once vs. the number appearing twice
- the number of words appearing twice vs. the number appearing three times
- etc.
11. Good-Turing in detail
Specifically, we push a bit of probability mass down to the count class below
- So we have some probability to give to the words with count of 0
Working backwards, this amounts to each count being reduced slightly
- This is slightly counter-intuitive
- It's called discounting. . .
Schematically, we get the following:

  c    N_c      p_c = c/N    c·N_c/N    c*             p* = c*/N        c*·N_c/N
  0    (N_0)    0            0          (N_1/N_0)      (N_1/(N_0·N))    N_1/N
  1    N_1      1/N          N_1/N      2·N_2/N_1      2·N_2/(N_1·N)    2·N_2/N
  2    N_2      2/N          2·N_2/N    3·N_3/N_2      3·N_3/(N_2·N)    3·N_3/N
  ...

where
- c is a count or frequency
- N_c is the number of different items with that frequency
- p_c = c/N is the maximum likelihood estimate of the probability for an item with that count
- c·N_c/N is the total probability mass for all the items with that count
- c* is the Good-Turing smoothed version of the count
- p* and c*·N_c/N are the Good-Turing smoothed versions of p_c and c·N_c/N
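A minimal sketch of these quantities in Python, for an invented frequency-of-frequencies table (the values are made up purely for illustration).

    Nc = {1: 100, 2: 40, 3: 20, 4: 12, 5: 8}    # N_c: number of types seen exactly c times
    N = sum(c * n for c, n in Nc.items())       # total number of tokens observed

    def c_star(c):
        # Good-Turing smoothed count: c* = (c+1) * N_{c+1} / N_c
        return (c + 1) * Nc[c + 1] / Nc[c]

    for c in range(1, 5):                       # the top count has no N_{c+1} here
        print(c, c_star(c), c_star(c) / N)      # c, c*, and p* = c*/N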
12. What about N_0?
Note that we only get the parenthesised bits in the 0 row if we actually
know what we're missing
- Which is the case for e.g. bigrams
- Where we can at least estimate the number
- As the total possible (V^2) less the number of distinct bigram types we've seen
- where V is the number of word types in the corpus
- But often not for unigrams
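For bigrams the estimate of N_0 is just this subtraction; both figures below are invented for illustration.

    V = 17230                     # assumed number of word types in the corpus
    seen_bigram_types = 250000    # hypothetical number of distinct bigrams observed

    N0 = V ** 2 - seen_bigram_types
    print(N0)                     # ~297 million bigram types we know we're missing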
13. Good-Turing discounts
The basic idea is to arrange the discounts so that the amount we
add to the total probability in row 0 is matched
by all the discounting in the other rows
Specifically, the Good-Turing discount depends on the adjacent counts: c* = (c+1) · N_{c+1} / N_c
This is usually stated by first defining the discounted
count c* as above and then just defining p* = c*/N
Either way, the important thing is that since the N_c values tend to go down
as c goes up, the multiplier c*/c is less than one, and we
get a discount as required.
The sum of the impact on the total probability of all the discounts is
indeed N_1/N, as required to balance out row 0
- We see this not directly, but because the sum over all rows (including row 0) of the c*·N_c/N column is evidently the same as the sum of the c·N_c/N column, namely 1
An example, based on the first few rows of the frequency of frequencies
for Moby Dick, is worked through in a Good-Turing example spreadsheet
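A quick sketch of the balancing act, reusing the invented frequency-of-frequencies table from the earlier sketch, now with a row 0 added; it is not the Moby Dick spreadsheet, just a check that the identity holds.

    Nc = {0: 500, 1: 100, 2: 40, 3: 20, 4: 12, 5: 8}   # invented; 500 unseen item types
    N = sum(c * n for c, n in Nc.items())              # 328 tokens in total

    def c_star(c):
        return (c + 1) * Nc.get(c + 1, 0) / Nc[c]      # Good-Turing smoothed count

    # The sum of c*·N_c over every row, including row 0, equals the sum of c·N_c (= N):
    print(sum(c * Nc[c] for c in Nc))                  # 328
    print(sum(c_star(c) * Nc[c] for c in Nc))          # 328.0
    # The mass pushed into row 0 is N_1/N ...
    print(c_star(0) * Nc[0] / N, Nc[1] / N)            # both 0.3048... (= 100/328)
    # ... exactly balanced by the discounts in the other rows (note the naive formula
    # discounts the very top count to zero, which is why real implementations only
    # smooth the small counts).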
14. Required reading
Jurafsky & Martin, Chapter 4, sections 4.4--4.7
15. Administrative details
Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/
Do your own work yourself
Tutorial groups now underway: please go to a tutorial.
You really will need to have access to a copy of the text, and the
edition you got for Inf2A last year will not do.
Notices will come via the mailing list.
Assignment due in just over two weeks
- Some clarifications forthcoming