1. From generation to analysis: Bayes' theorem
It's not accidental that we've been working from the source
to the observations
- It's much easier to estimate the source and channel probabilities in
that direction
- And to combine them to give the probability of an observation
How can we get back to analysis?
- That is, getting from observations back to the most
probable source?
More formally, we've been thinking about P(observation|source), when what we want is P(source|observation)
Those formulae for conditional probability have the answer.
We defined joint probability this way: P(A, B) = P(A|B) P(B)
So, it follows that P(A|B) P(B) = P(B|A) P(A), since P(A, B) = P(B, A)
And so, dividing through: P(A|B) = P(B|A) P(A) / P(B)
And indeed that offers us hope, as the conditional probability in one
direction has been expressed in terms of the one in the other direction
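As a quick check that the algebra works, here is a minimal Python sketch (with invented numbers, not anything estimated from data) that computes P(A|B) directly from a toy joint distribution and again via Bayes' rule, and confirms the two agree:

```python
# A toy joint distribution over two binary events A and B (invented numbers).
joint = {
    (True, True): 0.12, (True, False): 0.28,
    (False, True): 0.18, (False, False): 0.42,
}

# Marginals and conditionals read directly off the joint distribution.
p_a = sum(p for (a, b), p in joint.items() if a)   # P(A)
p_b = sum(p for (a, b), p in joint.items() if b)   # P(B)
p_a_given_b = joint[(True, True)] / p_b            # P(A|B), computed directly
p_b_given_a = joint[(True, True)] / p_a            # P(B|A)

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
via_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, via_bayes)   # the two values agree (up to floating-point rounding)
```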
2. Bayes' theorem to the rescue
We need to find the source which maximises P(source|observation)
It's not obvious how to compute P(source|observation) for even some particular source and observation
- Much less for all possible sources, which is what a literal
interpretation of the maximisation would require us to look at.
But if we apply Bayes' rule, that is P(source|observation) = P(observation|source) P(source) / P(observation), we get something we can work with:
Why is this an improvement? Because each of the three probabilities in
the new formula is something we can do something with:
- The denominator, P(observation), doesn't depend on the source (the thing we are
varying), so it can be ignored in looking for the maximum;
- P(observation|source), known as the likelihood, just needs a model of the
effect of the channel, and we've just seen an example of how to get that via experiment;
- P(source), known as the prior, just needs a model of the
source, that is, just the kind of language model we've been looking at this week.
In summary, we have the following, simpler, maximisation to undertake: find the source which maximises P(observation|source) P(source)
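In code, that simpler maximisation is just an argmax of likelihood times prior over whatever candidate sources we consider. A minimal sketch follows; the names candidate_sources, prior and likelihood are placeholders for models we would have to supply, not part of any real system:

```python
def decode(observation, candidate_sources, prior, likelihood):
    """Return the candidate source maximising P(observation|source) * P(source).

    The denominator P(observation) is the same for every candidate,
    so it is simply left out of the comparison.
    """
    return max(candidate_sources,
               key=lambda source: likelihood(observation, source) * prior(source))
```

The whole difficulty, of course, is hidden in where candidate_sources, prior and likelihood come from, which is what the examples below are about.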
3. Noisy word decoding
So let's apply Bayes' theorem to get from a noisy letter
string (say we think we see "VHAJE") back to a single-word source
(what was really there behind the screen)
We'll use a bi-letter-gram language model for the prior
And the confusion matrix channel model for the likelihood
To ask, for example, if "WHALE" is the most likely source for "VHAJE" as observation
The number we need is the product of the prior (for "WHALE" as a word)
with the likelihood (for "W" appearing as "V", "H" as "H" and so on). That's:
- prior: the product of six bigram probabilities
- P(W| ) P(H|W) P(A|H) P(L|A) P(E|L) P( |E)
- likelihood: the product of five confusion probabilities
- P(V|W) P(H|H) P(A|A) P(J|L) P(E|E)
The total product is approximately 4/100,000,000,000
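A sketch of that calculation in Python, with invented probabilities standing in for the actual Moby Dick bigram estimates and the experimentally estimated confusion matrix ("_" is used here for the word boundary written as a space above):

```python
# Invented, purely illustrative probabilities.
bigram = {("_", "W"): 0.02, ("W", "H"): 0.30, ("H", "A"): 0.20,
          ("A", "L"): 0.08, ("L", "E"): 0.15, ("E", "_"): 0.25}   # P(right | left)
confusion = {("W", "V"): 0.10, ("H", "H"): 0.80, ("A", "A"): 0.80,
             ("L", "J"): 0.05, ("E", "E"): 0.80}                  # P(observed | source)

source, observed = "WHALE", "VHAJE"

# Prior: the product of the six bigram probabilities, boundaries included.
prior = 1.0
for left, right in zip("_" + source, source + "_"):
    prior *= bigram[(left, right)]

# Likelihood: the product of the five per-letter confusion probabilities.
likelihood = 1.0
for s, o in zip(source, observed):
    likelihood *= confusion[(s, o)]

print(prior * likelihood)   # the analogue of the ~4/100,000,000,000 figure above
```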
4. Noisy word decoding, cont'd
0.00000000004 may not seem like a very high probability :-)
- But it's the result of multiplying 11 small numbers together
- If we normalise for word length
- Useful, as we'll be introducing the possibility of different-length
answers when we get to segmentation
- We'll use the geometric mean, in this case the sixth root
- The normalised value is only .018 (checked in the small sketch below)
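Checking the arithmetic of that normalisation (a tiny Python check, using the rounded total quoted above):

```python
total = 4e-11                   # the product of the 11 probabilities above
per_symbol = total ** (1 / 6)   # geometric mean, i.e. the sixth root
print(round(per_symbol, 3))     # 0.018
```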
What's more important is that this is the highest
probability over all 5-letter sources.
- It's higher than that for "VHAJE" itself: .0014
- Or for "WHALF": .01
The technique for finding the highest probability source using a bigram
language model (prior) and an independent channel model (likelihood) is called
Viterbi decoding
- It's very efficient
- Much more efficient than trying every possible source string
- It and variants are used all over the place
- You may learn about it in detail next year, or in 3rd year
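Here is a minimal sketch of Viterbi decoding for this single-word setting, assuming a bigram table bigram[(a, b)] = P(b|a) for the prior and a confusion table confusion[(s, o)] = P(o|s) for the likelihood; both tables would have to be filled in from the Moby Dick counts and the confusion experiment, and the function and argument names are just this sketch's, not any standard library's:

```python
def viterbi(observed, letters, bigram, confusion, boundary="_"):
    """Most probable source string for one observed letter string.

    letters           : the candidate source alphabet to consider at each position
    bigram[(a, b)]    : P(b | a), the prior's letter-transition probability
    confusion[(s, o)] : P(o | s), the channel's chance of showing o for source s
    Missing table entries are treated as probability 0.
    """
    # best[s] = (probability, source string so far) over paths ending in letter s
    best = {s: (bigram.get((boundary, s), 0.0) * confusion.get((s, observed[0]), 0.0), s)
            for s in letters}

    for o in observed[1:]:
        new_best = {}
        for s in letters:
            emit = confusion.get((s, o), 0.0)
            # extend the best path ending in each previous letter p with s
            prob, path = max((best[p][0] * bigram.get((p, s), 0.0) * emit,
                              best[p][1] + s)
                             for p in letters)
            new_best[s] = (prob, path)
        best = new_best

    # close off with the end-of-word boundary and pick the overall winner
    prob, path = max((best[s][0] * bigram.get((s, boundary), 0.0), best[s][1])
                     for s in letters)
    return path, prob
```

The key point is that the amount of work grows only linearly with the length of the observation (times the square of the alphabet size), rather than exponentially, as trying every possible source string would.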
5. Noisy word segmentation
To get closer to a simple (simplistic) model of speech recognition
- We need to extend to multiple-word sequences
- Without spaces
For example, to ask if "IS HE" is the most likely source for "JSHC"
Similarly to our previous example, we need the prior (for the two-word
string "IS JE") and the likelihood (of "I" appearing as "J" and so on):
- prior: the product of six bigram probabilities
- P(I| ) P(S|I) P( |S) P(H| ) P(E|H) P( |E)
- likelihood: the product of four confusion probabilities
- P(J|I) P(S|S) P(H|H) P(C|E)
The probability (per word) for this is 0.02
- Which, again, is the highest value for any segmentation
of any source sequence
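Here is the same kind of scoring sketch as before, extended to a two-word candidate: word boundaries contribute bigram factors to the prior but, since nothing is observed for them, no confusion factor to the likelihood. Again the probability tables are invented placeholders, not the real estimates:

```python
bigram = {("_", "I"): 0.05, ("I", "S"): 0.30, ("S", "_"): 0.25,
          ("_", "H"): 0.06, ("H", "E"): 0.30, ("E", "_"): 0.25}   # P(right | left)
confusion = {("I", "J"): 0.10, ("S", "S"): 0.80,
             ("H", "H"): 0.80, ("E", "C"): 0.05}                  # P(observed | source)

def score(candidate, observed):
    """P(candidate) * P(observed | candidate) for a candidate containing spaces."""
    letters = "_" + candidate.replace(" ", "_") + "_"
    prior = 1.0                       # six bigram factors for "IS HE"
    for left, right in zip(letters, letters[1:]):
        prior *= bigram.get((left, right), 0.0)
    likelihood = 1.0                  # four confusion factors: boundaries emit nothing
    for s, o in zip(candidate.replace(" ", ""), observed):
        likelihood *= confusion.get((s, o), 0.0)
    return prior * likelihood

print(score("IS HE", "JSHC"))
```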
We can try longer strings
- The most probable source for "QFTAEWHAJE"
- But the most probable source for "SEESHIPS" is "SEES HIPS"
- Which isn't bad, but it's not right either
Proper evaluation of the quality of models such as this one is itself a
complex process
- Which we won't go into in this course
6. The story so far
We've seen how a probabilistic generative system
- Combining sequence probabilities
- with confusion probabilities
Can be interpreted as a simple model of language as
perceived via a noisy channel
And that we can use such a model to decode observed sequences
- Courtesy of Bayes' theorem
- And Viterbi decoding
We illustrated this using
- bi-letter-gram probabilities estimated from Moby Dick
- And confusion probabilities estimated from experimental data
7. Full disclosure
Although the simple word recogniser I described first uses
Viterbi decoding
- In a very similar way to how e.g. practical predictive texting
systems work on smart-phones today
The system I used to compute the most probable segmentation wasn't so simple
- It actually had to take every possible segmentation
- And compare the Viterbi-based probability of the most-probable source
for each
- To find the overall best analysis
For long strings, that would take a long time!
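A sketch of why: an n-letter string has 2^(n-1) possible segmentations, one choice (boundary or not) for each gap between letters, and the brute-force plan scores every one of them. In the sketch below, viterbi_prob is a placeholder for a function returning the probability of a chunk's most probable source, for example the prob returned by the viterbi sketch earlier:

```python
from itertools import combinations

def segmentations(letters):
    """Yield every way of cutting a letter string into chunks: each of the
    len(letters) - 1 gaps is either a boundary or not, giving
    2 ** (len(letters) - 1) segmentations in all."""
    gaps = range(1, len(letters))
    for k in range(len(letters)):
        for cuts in combinations(gaps, k):
            bounds = [0, *cuts, len(letters)]
            yield [letters[i:j] for i, j in zip(bounds, bounds[1:])]

def best_analysis(observed, viterbi_prob):
    """Brute force: decode each chunk of each segmentation separately and keep
    the segmentation whose chunk probabilities multiply out highest."""
    def total(segmentation):
        p = 1.0
        for chunk in segmentation:
            p *= viterbi_prob(chunk)
        return p
    return max(segmentations(observed), key=total)

print(sum(1 for _ in segmentations("SEESHIPS")))     # 128 = 2**7
print(sum(1 for _ in segmentations("QFTAEWHAJE")))   # 512 = 2**9
```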
8. What's wrong with this picture?
What parts of the story so far make it unsuitable as a real
psychological model?
It's not just that the time to find the best segmentation goes up too
fast as the strings get longer
- But the fact that it assumes we already know the answer!
That is, we built the sequence model from bigrams
- Which came from a corpus which already had word boundaries!
But an infant trying to learn where the words are in the
speech s/he hears
- Wouldn't have this information
All the examples of machine 'learning' we've seen so far use some form of
what is called supervised learning
- That is, learning (or statistics) based on examples of questions
and the right answers
- Or, equivalently, of labelled data
But infant language learning, especially early on, can't be like that
- It must be some kind of what is called unsupervised learning
9. They have to know something!
The challenge then is to find a starting point for learning segmentation which
- Is plausibly present in the infant 'natively'
- Is sufficient to accomplish the task
Sharon Goldwater works in this area
- She's a colleague here
- What I'm going to present is taken from her PhD work at Brown in the USA
She used the noisy channel model approach to explore just how simple a
set of principles could suffice for the segmentation task
- Basically by encoding them in the prior
- After all, the prior is essentially what a model 'knows' about the
source a priori, that is, "from the beginning"