FNLP 2014: Lecture 5: N-gram model evaluation, smoothing

Henry S. Thompson
28 January 2014
Creative Commons Attribution-ShareAlike

1. Evaluating N-Grams: Extrinsic vs. intrinsic

So, we've estimated a lot of conditional probabilities

And we can multiply them together (or add their logs) to get a probability for a whole sentence

How good are our estimates?

Extrinsic Evaluation: How much help are they?

Intrinsic Evaluation: How accurately do they model the training data?

2. Evaluation: Entropy

There's no obvious absolute measure, but with a relative measure we can at least compare two models

We could simply compare the (log) probability estimate for the whole training set, or for a random sampling of sentences in it, or in a test corpus.

Traditionally this is expressed as the entropy, which is the negative of the sum of the log (base 2) probabilities of each constituent N-gram

Strictly speaking, this is cross-entropy
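As a concrete sketch (not from the notes), here is how that quantity might be computed in Python from a table of bigram probabilities; both the probabilities and the sentence fragment are invented for illustration:

    import math

    # Invented bigram probabilities P(w_i | w_{i-1})
    bigram_prob = {
        ("<s>", "i"): 0.02,
        ("i", "spent"): 0.001,
        ("spent", "three"): 0.05,
    }

    def cross_entropy(sentence, probs):
        """Negative sum of log2 bigram probabilities for a sentence (in bits)."""
        tokens = ["<s>"] + sentence.lower().split()
        total = 0.0
        for prev, word in zip(tokens, tokens[1:]):
            total += -math.log2(probs[(prev, word)])
        return total

    print(cross_entropy("I spent three", bigram_prob))   # about 19.9 bits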

3. Entropy example

So for our "I spent three years before the mast", the entropy using a bigram model trained on Moby Dick is (being careful about the left margin this time)

In contrast, the entropy with respect to the unigram model trained on Moby Dick is 68.1506

And to remind ourselves of why we're doing this in the log domain: multiplying many small probabilities quickly underflows ordinary floating point, whereas adding their logs stays comfortably representable
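A minimal sketch of the underflow problem, using a made-up per-word probability:

    import math

    p = 1e-5           # a made-up, deliberately small per-word probability
    n_words = 100      # length of a longish stretch of text

    product = p ** n_words             # direct multiplication underflows to 0.0
    log_sum = n_words * math.log2(p)   # the log-domain total is perfectly representable

    print(product)     # 0.0
    print(log_sum)     # about -1660.96 bits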

This gives us a concrete estimate of how much better the bigram model is than the unigram model

4. Interpreting entropy

In calculating entropy, we use log (base 2) because this supports the interpretation of the entropy as the number of binary bits needed to encode the string using an information-theoretically optimal encoding, which uses short bitstrings for common words and longer ones for less common words.
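For instance (a worked illustration, not from the notes), a word with probability 1/1024 gets a 10-bit code under such an encoding:

    -\log_2 P(w) = -\log_2 \tfrac{1}{1024} = 10 \text{ bits}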

5. Perplexity

Perplexity is just $2^{\text{entropy}}$

Word-level perplexity can be understood as the average branching factor at each point in the language

Our unigram model's word-level entropy of about 10 bits per word gives a perplexity of around 1000 ($2^{10} = 1024$)

"Typical perplexities yielded by n-gram models on English text range from about 50 to almost 1000 (corresponding to cross-entropies from about 6 to 10 bits/word), depending on the type of text."

From An Empirical Study of Smoothing Techniques for Language Modeling, Chen and Goodman 1998, which I recommend for anyone interested in the details of language modelling and smoothing.

6. The impact of missing data

What about the entropy of our sentence using a trigram model trained on Moby Dick?

We have a problem: some of the trigrams in our sentence never occur in Moby Dick at all, so their estimated probability is 0, the whole sentence gets probability 0, and the entropy is infinite

The reason for this is hidden in the name of our probability estimation method: maximum likelihood estimation gives all the probability mass to the N-grams actually observed, and none at all to anything unseen

We need to take some probability away from the trigrams that do occur in Moby Dick, and reserve it for the trigrams that don't

There are a wide range of techniques for this, under the name of smoothing
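A minimal sketch (with invented counts, not actual Moby Dick statistics) of the failure mode:

    from collections import Counter

    # Invented counts for illustration
    trigram_counts = Counter({("three", "years", "ago"): 2})
    bigram_counts = Counter({("three", "years"): 5})

    # Maximum likelihood estimate of P(w3 | w1, w2)
    def p_mle(w1, w2, w3):
        return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

    print(p_mle("three", "years", "ago"))      # 0.4
    print(p_mle("three", "years", "before"))   # 0.0 -- unseen trigram, so log2 is undefined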

7. Smoothing: Just add one

One of the oldest (it dates back to Laplace in the 18th century) methods is to just "add one"

This appears to presuppose that you know all the things that might happen, even if you didn't count any occurrences

8. Laplace smoothing example

Let's look at our example in detail

In general, this version of Laplace smoothing takes too much away from the knowns, to cover all the unknowns
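The detailed worked example isn't reproduced in this text version, so here is a minimal stand-in sketch with invented counts, showing how much probability add-one smoothing shifts away from an observed bigram:

    def p_mle(bigram_count, context_count):
        return bigram_count / context_count

    def p_laplace(bigram_count, context_count, vocab_size):
        # Add one to every count, including the bigrams we never saw
        return (bigram_count + 1) / (context_count + vocab_size)

    # Invented numbers: context word seen 100 times, this bigram 3 times,
    # and a vocabulary of 17,000 word types
    print(p_mle(3, 100))               # 0.03
    print(p_laplace(3, 100, 17000))    # ~0.00023 -- most of the mass has gone to the unknowns
    print(p_laplace(0, 100, 17000))    # ~0.000058 -- but every unseen bigram now gets a little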

9. Laplace smoothing, cont'd

The alternative is to treat all unknowns as the same

Empirical evaluation is the only real way to determine what works well and what doesn't

Values less than one can be added -- this is known as Lidstone smoothing
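In symbols (a standard formulation rather than a quotation from the notes), with observed count $c$, context total $N$, vocabulary size $V$ and $0 < \lambda < 1$:

    P_{\mathrm{Lid}} = \frac{c + \lambda}{N + \lambda V}

so add-one (Laplace) smoothing is just the special case $\lambda = 1$.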

10. Better smoothing: Good-Turing

Based on a suggestion by Alan Turing, Good-Turing smoothing takes a different, more sensible approach

It has three key aspects:

11. Good-Turing in detail

Specifically, we push a bit of the probability total down to the count class below: the total mass of the N-grams seen once is handed down to the unseen ones, the total for those seen twice to those seen once, and so on

Working backwards, this amounts to each count being reduced slightly

Schematically, we get the following:

    c    N_c     P_c    P_c [total]    c*              P*_c               P*_c [total]
    0    (N_0)   0      0              (N_1/N_0)       (N_1/(N_0·N))      N_1/N
    1    N_1     1/N    N_1/N          2·N_2/N_1       2·N_2/(N_1·N)      2·N_2/N
    2    N_2     2/N    2·N_2/N        3·N_3/N_2       3·N_3/(N_2·N)      3·N_3/N
    3    N_3     3/N    3·N_3/N        4·N_4/N_3       4·N_4/(N_3·N)      4·N_4/N

where N_c is the number of distinct N-grams that occur exactly c times in the training data, and N is the total number of N-gram tokens

12. What about N0?

Note that we only get the parenthesised bits in the 0 row if we actually know what we're missing

13. Good-Turing discounts

The basic idea is to arrange the discounts so that the amount we add to the total probability in row 0 is matched by all the discounting in the other rows

Specifically, the Good-Turing discount depends on the adjacent counts: $P^*_c = \frac{(c+1)\,N_{c+1}}{N_c\,N}$

This is usually stated by first defining a discounted count $c^* = \frac{(c+1)\,N_{c+1}}{N_c}$ and then just defining $P^*_c = \frac{c^*}{N}$

Either way, the important thing is that since the frequencies of frequencies N_c fall off sharply as c goes up, the ratio N_{c+1}/N_c is well below one, so c* comes out smaller than c and we get a discount as required.

The sum of the impact on P* of all the discounts is indeed $N_1/N$, as required to balance out row 0
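A quick check (not spelled out in the notes) that the discounted rows do leave exactly $N_1/N$ behind, using $N = \sum_{c \ge 1} c\,N_c$:

    \sum_{c \ge 1} N_c \, P^*_c
      = \sum_{c \ge 1} \frac{(c+1)\,N_{c+1}}{N}
      = \sum_{c \ge 2} \frac{c\,N_c}{N}
      = \frac{N - N_1}{N}
      = 1 - \frac{N_1}{N}

so the seen N-grams keep total probability $1 - N_1/N$, and the missing $N_1/N$ is exactly what row 0 receives.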

An example, based on the first few rows of the frequency of frequencies for Moby Dick, is worked through in a Good-Turing example spreadsheet
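Along the same lines, a minimal Python sketch (with invented frequency-of-frequencies values, not the ones from the spreadsheet) of the discounted counts and probabilities:

    # Invented frequency of frequencies: Nc[c] = number of N-gram types seen exactly c times
    Nc = {1: 30000, 2: 11000, 3: 5000, 4: 3000, 5: 2000}
    N = sum(c * n for c, n in Nc.items())   # total number of N-gram tokens counted here

    def good_turing_c_star(c, Nc):
        """Discounted count c* = (c+1) * N_{c+1} / N_c."""
        return (c + 1) * Nc[c + 1] / Nc[c]

    for c in range(1, 5):
        c_star = good_turing_c_star(c, Nc)
        print(c, c_star, c_star / N)        # original count, discounted count, P*_c

    # Probability mass reserved for unseen N-grams is N_1 / N
    print("unseen mass:", Nc[1] / N)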

14. Required reading

Jurafsky & Martin, Chapter 4, sections 4.4--4.7

15. Administrative details

Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/

Do your own work yourself

Tutorial groups now underway: please go to a tutorial.

You really will need to have access to a copy of the text, and the edition you got for Inf2A last year will not do.

Notices will come via the mailing list.

Assignment due in just over two weeks