INF1-CG 2013 Lecture 24: Bayes' theorem, the noisy channel and word segmentation

Henry S. Thompson
15 March 2013
Creative Commons Attribution-ShareAlike

1. From generation to analysis: Bayes' theorem

It's not accidental that we've been working from the source to the observations

How can we get back to analysis?

More formally, we've been thinking about P(o_1^n | s_1^n) when what we want is P(s_1^n | o_1^n)

Those formulae for conditional probability have the answer.

We defined joint probability this way: P(x, y) = P(x) P(y|x)

So, it follows that (since P(x, y) = P(y, x)): P(x) P(y|x) = P(y) P(x|y)

And so, dividing through by P(x): P(y|x) = P(y) P(x|y) / P(x)

And indeed that offers us hope, as the conditional probability in one direction has been expressed in terms of one in the other direction
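
To make the algebra concrete, here is a small Python sketch (not from the lecture) that builds a toy joint distribution and checks that P(y|x) computed directly agrees with P(y) P(x|y) / P(x). The variables and numbers are invented purely for illustration:

    # A toy check of Bayes' rule (invented numbers, two made-up variables).
    joint = {                       # P(x, y)
        ("rain", "wet"): 0.30,
        ("rain", "dry"): 0.10,
        ("sun",  "wet"): 0.05,
        ("sun",  "dry"): 0.55,
    }

    def p_x(x):                     # marginal P(x)
        return sum(p for (xx, _), p in joint.items() if xx == x)

    def p_y(y):                     # marginal P(y)
        return sum(p for (_, yy), p in joint.items() if yy == y)

    x, y = "rain", "wet"
    direct = joint[(x, y)] / p_x(x)                            # P(y|x) from the definition
    via_bayes = p_y(y) * (joint[(x, y)] / p_y(y)) / p_x(x)     # P(y) P(x|y) / P(x)
    print(direct, via_bayes)        # the two values agree (up to floating point)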

2. Bayes' theorem to the rescue

We need to find argmax_{s_1^n} P(s_1^n | o_1^n)

It's not obvious how to compute P(s_1^n | o_1^n) even for some particular s_1^n and o_1^n

But if we apply Bayes' rule, that is P(y|x) = P(y) P(x|y) / P(x), we get something we can work with:

argmax_{s_1^n} P(o_1^n | s_1^n) P(s_1^n) / P(o_1^n)

Why is this an improvement? Because each of the three probabilities in the new formula is something we can do something with:

The likelihood P(o_1^n | s_1^n) comes from the channel model, the prior P(s_1^n) comes from the language model, and the denominator P(o_1^n) is the same whichever source s_1^n we consider, so it drops out of the maximisation

In summary, we have the following, simpler, maximisation to undertake:

argmax_{s_1^n} P(o_1^n | s_1^n) P(s_1^n), where P(o_1^n | s_1^n) is the likelihood and P(s_1^n) is the prior
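
As a sketch of what that maximisation looks like in code: each candidate source is scored by likelihood × prior, and the evidence term P(o_1^n) is never needed. The candidate words and probabilities below are invented placeholders, not values from the lecture:

    # Pick the source with the highest likelihood * prior (invented numbers).
    candidates = {
        # source: (P(observation | source), P(source))
        "WHALE": (2e-6, 3e-5),
        "WHILE": (1e-8, 9e-4),
        "SHALE": (1e-9, 2e-6),
    }

    def score(source):
        likelihood, prior = candidates[source]
        return likelihood * prior      # P(o|s) * P(s); P(o) is the same for all, so ignored

    best = max(candidates, key=score)
    print(best, score(best))           # WHALE wins with these placeholder numbers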

3. Noisy word decoding

So let's apply Bayes' theorem to get from a noisy letter string (say we think we see "VHAJE") back to a single-word source (what was really there behind the screen)

We'll use a bi-letter-gram language model for the prior

And the confusion matrix channel model for the likelihood

To ask, for example, if "WHALE" is the most likely source for "VHAJE" as observation

The number we need is the product of the prior (for "WHALE" as a word) with the likelihood (for "W" appearing as "V", "H" as "H", and so on)

The total product is approximately 4/100,000,000,000
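
Here is a minimal Python sketch of that scoring step, assuming placeholder bigram and confusion-matrix probabilities rather than the values used in the lecture:

    # Score one candidate source word against a noisy observation:
    # bigram-letter prior for the source times per-letter confusion likelihood.
    # All probabilities below are placeholders, not the lecture's values.

    bigram = {                        # P(letter | previous letter); '#' marks word start
        ("#", "W"): 0.02, ("W", "H"): 0.30, ("H", "A"): 0.20,
        ("A", "L"): 0.10, ("L", "E"): 0.25,
    }
    confusion = {                     # P(observed letter | source letter)
        ("W", "V"): 0.10, ("H", "H"): 0.80, ("A", "A"): 0.80,
        ("L", "J"): 0.05, ("E", "E"): 0.80,
    }

    def score(source, observed):
        prior, prev = 1.0, "#"
        for c in source:                              # letter-bigram prior
            prior *= bigram.get((prev, c), 1e-6)
            prev = c
        likelihood = 1.0
        for s, o in zip(source, observed):            # channel model, letter by letter
            likelihood *= confusion.get((s, o), 1e-6)
        return prior * likelihood

    print(score("WHALE", "VHAJE"))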

4. Noisy word decoding, cont'd

0.00000000004 may not seem like a very high probability :-)

What's more important is that this is the highest probability over all 5-letter sources.

The technique for finding the highest probability source using a bigram language model (prior) and an independent channel model (likelihood) is called Viterbi decoding
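
Below is a compact Python sketch of Viterbi decoding over letter positions, under toy assumptions (a flat letter-bigram prior and a simple match/mismatch channel model). The point is the dynamic-programming shape: at each position we keep, for every possible source letter, the best-scoring source prefix ending in that letter:

    # Viterbi decoding: most probable source string for an observed string,
    # given a letter-bigram prior and an independent per-letter channel model.
    # The alphabet and both models below are toy assumptions.

    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

    def bigram(prev, cur):            # placeholder P(cur | prev); '#' marks word start
        return 1.0 / 26.0             # uniform; a real model would use corpus counts

    def channel(source, observed):    # placeholder P(observed | source)
        return 0.5 if source == observed else 0.02

    def viterbi(observed):
        # best[c] = (score, prefix): best-scoring source prefix ending in letter c
        best = {c: (bigram("#", c) * channel(c, observed[0]), c) for c in ALPHABET}
        for o in observed[1:]:
            new_best = {}
            for c in ALPHABET:
                score, prefix = max(
                    (best[p][0] * bigram(p, c) * channel(c, o), best[p][1])
                    for p in ALPHABET
                )
                new_best[c] = (score, prefix + c)
            best = new_best
        return max(best.values())     # (score, most probable source string)

    print(viterbi("VHAJE"))

With these flat placeholders the decoder just echoes the observation back; it's a real letter-bigram prior that makes implausible sequences like "VH" at the start of a word expensive, and so pulls the answer towards sources like "WHALE".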

5. Noisy word segmentation

To get closer to a simple (simplistic) model of speech recognition, we now let the source be a sequence of words, so the decoder also has to work out where the word boundaries are

For example, to ask if "IS HE" is the most likely source for "JSHC"

Similarly to our previous example, we need the prior (for the two-word string "IS HE") and the likelihood (of "I" appearing as "J" and so on)

The probability (per word) for this is 0.02
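
A brute-force Python sketch of this segmentation-by-decoding idea follows. The word priors and the channel model are made-up placeholders, and the exhaustive enumeration of splits is only workable for short strings:

    # Brute-force noisy word segmentation: try every way of splitting the
    # observation into chunks, treat each chunk as a noisy word, keep the best.
    # Word priors and the channel model are made-up placeholders.

    WORD_PRIOR = {"IS": 0.010, "HE": 0.020, "ISH": 0.0001, "E": 0.0005}

    def channel(source_letter, observed_letter):       # P(observed | source)
        return 0.5 if source_letter == observed_letter else 0.02

    def word_score(word, chunk):                       # prior * likelihood for one word
        if len(word) != len(chunk):
            return 0.0
        likelihood = 1.0
        for s, o in zip(word, chunk):
            likelihood *= channel(s, o)
        return WORD_PRIOR[word] * likelihood

    def best_segmentation(observed):
        if not observed:
            return 1.0, []
        best_score, best_words = 0.0, []
        for i in range(1, len(observed) + 1):          # first chunk = observed[:i]
            for word in WORD_PRIOR:
                here = word_score(word, observed[:i])
                if here == 0.0:
                    continue
                rest_score, rest_words = best_segmentation(observed[i:])
                if here * rest_score > best_score:
                    best_score, best_words = here * rest_score, [word] + rest_words
        return best_score, best_words

    print(best_segmentation("JSHC"))    # ['IS', 'HE'] wins with these placeholders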

We can try longer strings

Proper evaluation of the quality of models such as this one is itself a complex process

6. The story so far

We've seen how a probabilistic generative system

Can be interpreted as a simple model of language as perceived via a noisy channel

And that we can use such a model to decode observed sequences

We illustrated this using a letter-bigram prior and a confusion-matrix channel model, first to decode a single noisy word and then to segment a noisy string into words

7. Full disclosure

Although the simple word recogniser I described first uses Viterbi decoding

The system I used to compute the most probable segmentation wasn't so simple

For long strings, that would take a long time!

8. What's wrong with this picture?

What parts of the story so far make it unsuitable as a real psychological model?

It's not just that the time to find the best segmentation goes up too fast as the strings get longer

That is, we built the sequence model from bigrams counted over data in which the words were already identified

But an infant trying to learn where the words are in the speech s/he hears has no such labelled data to start from

All the examples of machine 'learning' we've seen so far use some form of what is called supervised learning

But infant language learning, especially early on, can't be like that

9. They have to know something!

The challenge then is to find a starting point for learning segmentation which presupposes as little built-in knowledge as possible

Sharon Goldwater works in this area

She used the noisy channel model approach to explore just how simple a set of principles could suffice for the segmentation task