1. From generation to analysis: Bayes' theorem
It's not accidental that we've been working from the source
to the observations
- It's much easier to estimate the source and channel probabilities in
that direction
- And to combine them to give the probability of an observation
How can we get back to analysis?
- That is, getting from observations back to the most
probable source?
More formally, we've been thinking about P(observation|source), when what we want is P(source|observation)
Those formulae for conditional probability have the answer.
We defined joint probability this way: P(A, B) = P(A|B) P(B)
So, it follows that P(A|B) P(B) = P(B|A) P(A), since P(A, B) = P(B, A)
And so, dividing through: P(A|B) = P(B|A) P(A) / P(B)
And indeed that offers us hope, as the conditional probability in one
direction has been expressed in terms of the one in the other direction
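As a quick check that the algebra works, here is a minimal Python sketch (with invented numbers, not anything estimated from data) that computes P(A|B) directly from a toy joint distribution and again via Bayes' rule, and confirms the two agree:

```python
# A toy joint distribution over two binary events A and B (invented numbers).
joint = {
    (True, True): 0.12, (True, False): 0.28,
    (False, True): 0.18, (False, False): 0.42,
}

# Marginals and conditionals read directly off the joint distribution.
p_a = sum(p for (a, b), p in joint.items() if a)   # P(A)
p_b = sum(p for (a, b), p in joint.items() if b)   # P(B)
p_a_given_b = joint[(True, True)] / p_b            # P(A|B), computed directly
p_b_given_a = joint[(True, True)] / p_a            # P(B|A)

# Bayes' rule: P(A|B) = P(B|A) P(A) / P(B)
via_bayes = p_b_given_a * p_a / p_b

print(p_a_given_b, via_bayes)   # the two values agree (up to floating-point rounding)
```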
2. Bayes' theorem to the rescue
We need to find the source which maximises P(source|observation)
It's not obvious how to compute P(source|observation) for even some particular source and observation
- Much less for all possible sources, which is what a literal
interpretation of the maximisation would require us to look at.
But if we apply Bayes' rule, that is P(source|observation) = P(observation|source) P(source) / P(observation), we get something we can work with:
Why is this an improvement? Because each of the three probabilities in
the new formula is something we can do something with:
- The denominator, P(observation), doesn't depend on the source (the thing we are
varying), so it can be ignored in looking for the maximum;
- P(observation|source), known as the likelihood, just needs a model of the
effect of the channel, and we've just seen an example of how to get that via experiment;
- P(source), known as the prior, just needs a model of the
source, that is, just the kind of language model we've been looking at this week.
In summary, we have the following, simpler, maximisation to undertake: find the source which maximises P(observation|source) P(source)
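In code, that simpler maximisation is just an argmax of likelihood times prior over whatever candidate sources we consider. A minimal sketch follows; the names candidate_sources, prior and likelihood are placeholders for models we would have to supply, not part of any real system:

```python
def decode(observation, candidate_sources, prior, likelihood):
    """Return the candidate source maximising P(observation|source) * P(source).

    The denominator P(observation) is the same for every candidate,
    so it is simply left out of the comparison.
    """
    return max(candidate_sources,
               key=lambda source: likelihood(observation, source) * prior(source))
```

The whole difficulty, of course, is hidden in where candidate_sources, prior and likelihood come from, which is what the examples below are about.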
3. Noisy word decoding
So let's apply Bayes' theorem to get from a noisy letter
string (say we think we see "VHAJE") back to a single-word source
(what was really there behind the screen)
We'll use a bi-letter-gram language model for the prior
And the confusion matrix channel model for the likelihood
To ask, for example, if "WHALE" is the most likely source for "VHAJE" as observation
The number we need is the product of the prior (for "WHALE" as a word)
with the likelihood (for "W" appearing as "V", "H" as "H" and so on). That's:
- prior: the product of six bigram probabilities
- P(W| ) P(H|W) P(A|H) P(L|A) P(E|L) P( |E)
- likelihood: the product of five confusion probabilities
- P(V|W) P(H|H) P(A|A) P(J|L) P(E|E)
The total product is approximately 4/100,000,000,000
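A sketch of that calculation in Python, with invented probabilities standing in for the actual Moby Dick bigram estimates and the experimentally estimated confusion matrix ("_" is used here for the word boundary written as a space above):

```python
# Invented, purely illustrative probabilities.
bigram = {("_", "W"): 0.02, ("W", "H"): 0.30, ("H", "A"): 0.20,
          ("A", "L"): 0.08, ("L", "E"): 0.15, ("E", "_"): 0.25}   # P(right | left)
confusion = {("W", "V"): 0.10, ("H", "H"): 0.80, ("A", "A"): 0.80,
             ("L", "J"): 0.05, ("E", "E"): 0.80}                  # P(observed | source)

source, observed = "WHALE", "VHAJE"

# Prior: the product of the six bigram probabilities, boundaries included.
prior = 1.0
for left, right in zip("_" + source, source + "_"):
    prior *= bigram[(left, right)]

# Likelihood: the product of the five per-letter confusion probabilities.
likelihood = 1.0
for s, o in zip(source, observed):
    likelihood *= confusion[(s, o)]

print(prior * likelihood)   # the analogue of the ~4/100,000,000,000 figure above
```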
4. Noisy word decoding, cont'd
0.00000000004 may not seem like a very high probability :-)
- But it's the result of multiplying 11 small numbers together
- If we normalise for word length
- Useful, as we'll be introducing the possibility of different-length
answers when we get to segmentation
- We'll use the geometric mean, in this case the sixth root
- The normalised value is only .018 (checked in the small sketch below)
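Checking the arithmetic of that normalisation (a tiny Python check, using the rounded total quoted above):

```python
total = 4e-11                   # the product of the 11 probabilities above
per_symbol = total ** (1 / 6)   # geometric mean, i.e. the sixth root
print(round(per_symbol, 3))     # 0.018
```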
What's more important is that this is the highest
probability over all 5-letter sources.
- It's higher than that for "VHAJE" itself: .0014
- Or for "WHALF": .01
The technique for finding the highest probability source using a bigram
language model (prior) and an independent channel model (likelihood) is called
Viterbi decoding
- It's very efficient
- Much more efficient than trying every possible source string
- It and variants are used all over the place
- You may learn about it in detail next year, or in 3rd year
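Here is a minimal sketch of Viterbi decoding for this single-word setting, assuming a bigram table bigram[(a, b)] = P(b|a) for the prior and a confusion table confusion[(s, o)] = P(o|s) for the likelihood; both tables would have to be filled in from the Moby Dick counts and the confusion experiment, and the function and argument names are just this sketch's, not any standard library's:

```python
def viterbi(observed, letters, bigram, confusion, boundary="_"):
    """Most probable source string for one observed letter string.

    letters           : the candidate source alphabet to consider at each position
    bigram[(a, b)]    : P(b | a), the prior's letter-transition probability
    confusion[(s, o)] : P(o | s), the channel's chance of showing o for source s
    Missing table entries are treated as probability 0.
    """
    # best[s] = (probability, source string so far) over paths ending in letter s
    best = {s: (bigram.get((boundary, s), 0.0) * confusion.get((s, observed[0]), 0.0), s)
            for s in letters}

    for o in observed[1:]:
        new_best = {}
        for s in letters:
            emit = confusion.get((s, o), 0.0)
            # extend the best path ending in each previous letter p with s
            prob, path = max((best[p][0] * bigram.get((p, s), 0.0) * emit,
                              best[p][1] + s)
                             for p in letters)
            new_best[s] = (prob, path)
        best = new_best

    # close off with the end-of-word boundary and pick the overall winner
    prob, path = max((best[s][0] * bigram.get((s, boundary), 0.0), best[s][1])
                     for s in letters)
    return path, prob
```

The key point is that the amount of work grows only linearly with the length of the observation (times the square of the alphabet size), rather than exponentially, as trying every possible source string would.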
5. Noisy word segmentation
To get closer to a simple (simplistic) model of speech recognition
- We need to extend to multiple-word sequences
- Without spaces
For example, to ask if "IS HE" is the most likely source for "JSHC"
Similarly to our previous example, we need the prior (for the two-word
string "IS JE") and the likelihood (of "I" appearing as "J" and so on):
- prior: the product of six bigram probabilities
- P(I| ) P(S|I) P( |S) P(H| ) P(E|H) P( |E)
- likelihood: the product of four confusion probabilities
- P(J|I) P(S|S) P(H|H) P(C|E)
The probability (per word) for this is 0.02
- Which, again, is the highest value for any segmentation
of any source sequence
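Here is the same kind of scoring sketch as before, extended to a two-word candidate: word boundaries contribute bigram factors to the prior but, since nothing is observed for them, no confusion factor to the likelihood. Again the probability tables are invented placeholders, not the real estimates:

```python
bigram = {("_", "I"): 0.05, ("I", "S"): 0.30, ("S", "_"): 0.25,
          ("_", "H"): 0.06, ("H", "E"): 0.30, ("E", "_"): 0.25}   # P(right | left)
confusion = {("I", "J"): 0.10, ("S", "S"): 0.80,
             ("H", "H"): 0.80, ("E", "C"): 0.05}                  # P(observed | source)

def score(candidate, observed):
    """P(candidate) * P(observed | candidate) for a candidate containing spaces."""
    letters = "_" + candidate.replace(" ", "_") + "_"
    prior = 1.0                       # six bigram factors for "IS HE"
    for left, right in zip(letters, letters[1:]):
        prior *= bigram.get((left, right), 0.0)
    likelihood = 1.0                  # four confusion factors: boundaries emit nothing
    for s, o in zip(candidate.replace(" ", ""), observed):
        likelihood *= confusion.get((s, o), 0.0)
    return prior * likelihood

print(score("IS HE", "JSHC"))
```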
We can try longer strings
- The most probable source for "QFTAEWHAJE"
- But the most probable source for "SEESHIPS" is "SEES HIPS"
- Which isn't bad, but it's not right either
Proper evaluation of the quality of models such as this one is itself a
complex process
- Which we won't go into in this course
6. The story so far
We've seen how a probabilistic generative system
- Combining sequence probabilities
- with confusion probabilities
Can be interpreted as a simple model of language as
perceived via a noisy channel
And that we can use such a model to decode observed sequences
- Courtesy of Bayes' theorem
- And Viterbi decoding
We illustrated this using
- bi-letter-gram probabilities estimated from Moby Dick
- And confusion probabilities estimated from experimental data
7. Full disclosure
Although the simple word recogniser I described first uses
Viterbi decoding
- In a very similar way to how e.g. practical predictive texting
systems work on smart-phones today
The system I used to compute the most probable segmentation wasn't so simple
- It actually had to take every possible segmentation
- And compare the Viterbi-based probability of the most-probable source
for each
- To find the overall best analysis
For long strings, that would take a long time!
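A sketch of why: an n-letter string has 2^(n-1) possible segmentations, one choice (boundary or not) for each gap between letters, and the brute-force plan scores every one of them. In the sketch below, viterbi_prob is a placeholder for a function returning the probability of a chunk's most probable source, for example the prob returned by the viterbi sketch earlier:

```python
from itertools import combinations

def segmentations(letters):
    """Yield every way of cutting a letter string into chunks: each of the
    len(letters) - 1 gaps is either a boundary or not, giving
    2 ** (len(letters) - 1) segmentations in all."""
    gaps = range(1, len(letters))
    for k in range(len(letters)):
        for cuts in combinations(gaps, k):
            bounds = [0, *cuts, len(letters)]
            yield [letters[i:j] for i, j in zip(bounds, bounds[1:])]

def best_analysis(observed, viterbi_prob):
    """Brute force: decode each chunk of each segmentation separately and keep
    the segmentation whose chunk probabilities multiply out highest."""
    def total(segmentation):
        p = 1.0
        for chunk in segmentation:
            p *= viterbi_prob(chunk)
        return p
    return max(segmentations(observed), key=total)

print(sum(1 for _ in segmentations("SEESHIPS")))     # 128 = 2**7
print(sum(1 for _ in segmentations("QFTAEWHAJE")))   # 512 = 2**9
```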
8. What's wrong with this picture?
What parts of the story so far make it unsuitable as a real
psychological model?
It's not just that the time to find the best segmentation goes up too
fast as the strings get longer
- But the fact that it assumes we already know the answer!
That is, we built the sequence model from bigrams
- Which came from a corpus which already had word boundaries!
But an infant trying to learn where the words are in the
speech s/he hears
- Wouldn't have this information
All the examples of machine 'learning' we've seen so far use some form of
what is called supervised learning
- That is, learning (or statistics) based on examples of questions
and the right answers
- Or, equivalently, of labelled data
But infant language learning, especially early on, can't be like that
- It must be some kind of what is called unsupervised learning
9. They have to know something!
The challenge then is to find a starting point for learning segmentation which
- Is plausibly present in the infant 'natively'
- Is sufficient to accomplish the task
Sharon Goldwater works in this area
- She's a colleague here
- What I'm going to present is taken from her PhD work at Brown in the USA
She used the noisy channel model approach to explore just how simple a
set of principles could suffice for the segmentation task
- Basically by encoding them in the prior
- After all, the prior is essentially what a model 'knows' about the
source a priori, that is, "from the beginning"