For any noisy channel decoding task, we're trying to find the most probable source given the observations:

$\hat{s} = \operatorname{argmax}_{s} P(s \mid o) = \operatorname{argmax}_{s} P(o \mid s)\,P(s)$
In the POS tagging case, the source is tags and the observations are words, so we have

$\hat{t}_1^n = \operatorname{argmax}_{t_1^n} P(t_1^n \mid w_1^n) = \operatorname{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)$
We make our two simplifying assumptions (independence of likelihoods and bigram modelling for the priors), and get

$\hat{t}_1^n \approx \operatorname{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$
We can use dynamic programming to find the most likely path through an HMM given a sequence of observations
Our dynamic programming table has one column per observation (word) and one row per state (tag); each cell holds the probability of the best path ending there, plus a backpointer to the preceding state on that path
See J&M 6.4 for a full worked example.
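The recurrence is easiest to see in code. Here is a minimal sketch in Python (my own illustration, not the course code); the dictionaries init, trans, and emit are hypothetical stand-ins for the starting-tag probabilities, bigram tag-transition probabilities, and word likelihoods.

```python
def viterbi(words, tags, init, trans, emit):
    """Most likely tag sequence for `words` under a bigram HMM.

    init[t]     = P(t | start)   -- probability of starting with tag t
    trans[s][t] = P(t | s)       -- bigram tag transition probability
    emit[t][w]  = P(w | t)       -- word likelihood (0.0 if unseen)
    """
    # First column: prior * likelihood for each tag.
    v = [{t: init[t] * emit[t].get(words[0], 0.0) for t in tags}]
    backptr = [{}]

    # Remaining columns: maximise over each possible preceding state.
    for w in words[1:]:
        col, bp = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda s: v[-1][s] * trans[s][t])
            col[t] = v[-1][best_prev] * trans[best_prev][t] * emit[t].get(w, 0.0)
            bp[t] = best_prev
        v.append(col)
        backptr.append(bp)

    # Follow the backpointers from the best final cell.
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for bp in reversed(backptr[1:]):
        path.append(bp[path[-1]])
    return list(reversed(path))
```

In practice, implementations usually work with log probabilities ("costs") and add rather than multiply, to avoid underflow; that is exactly the costs version of the example below.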
We'll work the first two steps of an example modified from J&M section 5.5.3, namely the question "want race?", using their priors and likelihoods.
First, the starting point, then the first cell, then the rest of the first column:
     | want | race
NN   |      |
TO   |      |
VB   |      |
PPSS |      |
     | want                                         | race
NN   | P(NN|.)*P(want|NN) = .041*.000054 = .0000022 |
TO   |                                              |
VB   |                                              |
PPSS |                                              |
     | want                                           | race
NN   | P(NN|.)*P(want|NN) = .041*.000054 = .0000022   |
TO   | P(TO|.)*P(want|TO) = .0043*0 = 0               |
VB   | P(VB|.)*P(want|VB) = .019*.0093 = .00018       |
PPSS | P(PPSS|.)*P(want|PPSS) = .067*0 = 0            |
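A quick arithmetic check of this first column (a sketch; the dictionaries simply hold the numbers from the table above):

```python
# Numbers from the first column of the lattice: P(tag|.) and P(want|tag).
priors = {'NN': .041, 'TO': .0043, 'VB': .019, 'PPSS': .067}
likelihoods = {'NN': .000054, 'TO': 0.0, 'VB': .0093, 'PPSS': 0.0}

for tag in priors:
    print(tag, f"{priors[tag] * likelihoods[tag]:.2g}")
# NN 2.2e-06
# TO 0
# VB 0.00018
# PPSS 0
```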
Now the second column, where we have to maximise over each possible preceding state:
     | want     | race
NN   | .0000022 | max_prev [v(prev,want)*P(NN|prev)] * P(race|NN)
TO   | 0        | max_prev [v(prev,want)*P(TO|prev)] * P(race|TO)
VB   | .00018   | max_prev [v(prev,want)*P(VB|prev)] * P(race|VB)
PPSS | 0        | max_prev [v(prev,want)*P(PPSS|prev)] * P(race|PPSS)
So if we stopped here, with no exit probabilities, we would pick NN for 'race', and follow the backpointer from that cell to VB for 'want', giving

VB NN

as our answer.

This is easier to follow if we use costs (and minimise).
     | want | race
NN   |      |
TO   |      |
VB   |      |
PPSS |      |
     | want                                    | race
NN   | c(NN|.)+c(want|NN) = 4.61+14.18 = 18.79 |
TO   |                                         |
VB   |                                         |
PPSS |                                         |
     | want                                    | race
NN   | c(NN|.)+c(want|NN) = 4.61+14.18 = 18.79 |
TO   | c(TO|.)+c(want|TO) = 7.86+∞ = ∞         |
VB   | c(VB|.)+c(want|VB) = 5.72+6.75 = 12.47  |
PPSS | c(PPSS|.)+c(want|PPSS) = 3.90+∞ = ∞     |
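These costs behave like negative log probabilities, apparently in base 2, so a zero probability becomes an infinite cost; a quick check of the column above (a sketch, assuming base-2 logs):

```python
from math import log2

def cost(p):
    """Cost of a probability: -log2(p), with infinite cost for p = 0."""
    return float('inf') if p == 0 else -log2(p)

# First-column values from the probability lattice above.
priors = {'NN': .041, 'TO': .0043, 'VB': .019, 'PPSS': .067}
likelihoods = {'NN': .000054, 'TO': 0.0, 'VB': .0093, 'PPSS': 0.0}

for tag in priors:
    a = round(cost(priors[tag]), 2)
    b = round(cost(likelihoods[tag]), 2)
    print(f"{tag}: {a} + {b} = {round(a + b, 2)}")
# NN: 4.61 + 14.18 = 18.79
# TO: 7.86 + inf = inf
# VB: 5.72 + 6.75 = 12.47
# PPSS: 3.9 + inf = inf
```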
Now the second column, where we have to minimise over each possible preceding state:
     | want  | race
NN   | 18.79 | min_prev [v(prev,want)+c(NN|prev)] + c(race|NN)
TO   | ∞     | min_prev [v(prev,want)+c(TO|prev)] + c(race|TO)
VB   | 12.47 | min_prev [v(prev,want)+c(VB|prev)] + c(race|VB)
PPSS | ∞     | min_prev [v(prev,want)+c(PPSS|prev)] + c(race|PPSS)
So if we stopped here, with no exit probabilities, we would again pick NN for 'race', and follow the backpointer from that cell to VB for 'want', giving

VB NN

as our answer.

A fully-connected, state-labelled FSM for three simple POS tags (D, N, V):
A bigram-probability transition matrix gives us a Markov chain:
from \ to | D    | N    | V    | e
s         | 0.81 | 0.15 | 0.00 | 0.03
D         | 0.00 | 1.00 | 0.00 | 0.00
N         | 0.09 | 0.31 | 0.06 | 0.54
V         | 0.70 | 0.17 | 0.01 | 0.12

(s = start state, e = end state; each row gives the probabilities of the next tag)
And the output probabilities:
  | the  | a      | sleep  | man   | sleeps
D | 0.70 | 0.23   | 0.0    | 0.0   | 0.0
N | 0.0  | 0.0003 | 0.0002 | 0.008 | 0.0
V | 0.0  | 0.0    | 0.01   | 0.0   | 0.0001
So the probability of getting a man sleeps from the transition sequence s D N V e is

P(D|s)*P(a|D) * P(N|D)*P(man|N) * P(V|N)*P(sleeps|V) * P(e|V)
= 0.81 * 0.23 * 1.00 * 0.008 * 0.06 * 0.0001 * 0.12 ≈ 1.07 × 10⁻⁹
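A small sketch of this computation in Python, with the two tables above entered as nested dictionaries (the variable and function names are my own):

```python
# Toy HMM from the tables above: transition probabilities (rows = from-state)
# and output probabilities; zero entries are simply omitted.
trans = {
    's': {'D': 0.81, 'N': 0.15, 'V': 0.00, 'e': 0.03},
    'D': {'D': 0.00, 'N': 1.00, 'V': 0.00, 'e': 0.00},
    'N': {'D': 0.09, 'N': 0.31, 'V': 0.06, 'e': 0.54},
    'V': {'D': 0.70, 'N': 0.17, 'V': 0.01, 'e': 0.12},
}
emit = {
    'D': {'the': 0.70, 'a': 0.23},
    'N': {'a': 0.0003, 'sleep': 0.0002, 'man': 0.008},
    'V': {'sleep': 0.01, 'sleeps': 0.0001},
}

def path_probability(states, words):
    """P(words, state path) for a path s ... e with one word per inner state."""
    p = 1.0
    for prev, cur, word in zip(states, states[1:], words):
        p *= trans[prev][cur] * emit[cur].get(word, 0.0)
    p *= trans[states[-2]][states[-1]]   # final transition into the end state
    return p

print(path_probability(['s', 'D', 'N', 'V', 'e'], ['a', 'man', 'sleeps']))
# ≈ 1.07e-09
print(path_probability(['s', 'N', 'N', 'V', 'e'], ['a', 'man', 'sleeps']))
# ≈ 8.0e-14 -- an alternative path that generates the same words, much less probably
```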
To compute the total probability that a given observation sequence will occur, we have to look at more than just the best path
Think about generating at random from our example
We'll get a man sleeps from the
transition sequence s D N V e
But we'll also get it from the sequence s N N V e
This looks more plausible if we look at the "want race" example
Earlier we used Viterbi decoding to find the most likely path
want_VB race_NN
But there are three other ways to get "want race" with non-zero probability (TO and PPSS are ruled out for 'want', since P(want|TO) = P(want|PPSS) = 0):

want_NN race_NN
want_VB race_VB
want_NN race_VB

So the total probability of "want race" is at least the sum of the probabilities of these four tagged paths
In practice, this just means replacing 'max P...' with 'sum P...'
     | want     | race
VB   | .00018   | sum_prev [v(prev,want)*P(VB|prev)] * P(race|VB)
TO   | 0        | sum_prev [v(prev,want)*P(TO|prev)] * P(race|TO)
NN   | .0000022 | sum_prev [v(prev,want)*P(NN|prev)] * P(race|NN)
PPSS | 0        | sum_prev [v(prev,want)*P(PPSS|prev)] * P(race|PPSS)
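A minimal sketch of the resulting computation, reusing the same hypothetical init/trans/emit dictionaries as in the Viterbi sketch earlier; the only change is that the max over preceding states becomes a sum:

```python
def forward(words, tags, init, trans, emit):
    """Total probability of `words` under a bigram HMM (no exit probabilities)."""
    # First column: identical to Viterbi.
    alpha = {t: init[t] * emit[t].get(words[0], 0.0) for t in tags}

    # Remaining columns: sum over preceding states instead of taking the max.
    for w in words[1:]:
        alpha = {t: sum(alpha[s] * trans[s][t] for s in tags) * emit[t].get(w, 0.0)
                 for t in tags}

    return sum(alpha.values())
```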
It's worth noting that in each cell (except in the first column) of both lattices, we work with the pairwise product of two vectors: the probabilities from the previous column of the lattice and the transition probabilities into the state corresponding to the cell, for example

(.00018, 0, .0000022, 0)

and

(.035, 0, .016, .00079)

(see the table of transition probabilities from Figure 5.15 on p. 180 of J&M).

The dynamic programming approach to total probability calculation is called the forward algorithm
The cell for state $i$ in column $t$ (because observations are considered to come sequentially in time) is usually notated $\alpha_t(i)$: the probability of the observations up to time $t$, ending in state $i$
The dual of the forward algorithm is the backward algorithm, which computes $\beta_t(j)$, the probability of seeing the observations from time $t+1$ to the end, starting from state $j$ at time $t$:

$\beta_t(j) = P(o_{t+1}, o_{t+2}, \ldots, o_T \mid q_t = j)$
I find it easier to think of this as the onward contribution
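In symbols, using the standard HMM notation (not spelled out on the slides): with $a_{ij}$ the transition probability from state $i$ to state $j$ and $b_j(o_t)$ the probability of state $j$ emitting observation $o_t$, the forward and backward recurrences are

$\alpha_t(j) = \sum_i \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$

$\beta_t(i) = \sum_j a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$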
The forward-backward algorithm is used to train an HMM
At its heart is an expression for the joint probability, given a particular HMM, of being in state $i$ at time $t$, being in state $j$ at time $t+1$, and seeing the full observation sequence
We get this by combining the forward and backward probabilities with the probability of the relevant transition from $i$ to $j$ and the relevant output probability at $j$:

$\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$
The basic operation of the forward-backward algorithm is to repeatedly re-estimate the transitional and likelihood probabilities
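A sketch of the standard re-estimation step (the usual Baum-Welch update, spelled out here rather than taken from the slides): dividing the expression above by the total probability of the observation sequence gives

$\xi_t(i,j) = \dfrac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O)}$

the probability of the $i \to j$ transition at time $t$ given the whole observation sequence, and the transition probabilities are then re-estimated from expected counts,

$\hat{a}_{ij} = \dfrac{\sum_t \xi_t(i,j)}{\sum_t \sum_k \xi_t(i,k)}$

with an analogous expected-count update for the output probabilities; iterating this is an instance of EM.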
Jurafsky & Martin, Chapter 5, sections 5.5.3, 5.5.4; Chapter 6, sections 6.3–6.5
Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/
Do your own work yourself
The assignment is due on Thursday at 1600