FNLP 2014: Lecture 9: HMM decoding, forward, backward

Henry S. Thompson
11 February 2014
Creative Commons Attribution-ShareAlike

1. Recap: POS tagging as noisy channel decoding

For any noisy channel decoding task, we're trying to find

$$\operatorname*{argmax}_{s_1^n}\ \underbrace{P(o_1^n \mid s_1^n)}_{\text{likelihood}}\ \underbrace{P(s_1^n)}_{\text{prior}}$$

In the POS tagging case, the source is tags and the observations are words, so we have

$$\operatorname*{argmax}_{t_1^n}\ \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\ \underbrace{P(t_1^n)}_{\text{prior}}$$

We make our two simplifying assumptions (independence of likelihoods and bigram modelling for the priors), and get

$$\operatorname*{argmax}_{t_1^n}\ \prod_{i=1}^{n} P(w_i \mid t_i)\ \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

2. Viterbi search for decoding

We can use dynamic programming to find the most likely path through an HMM given a sequence of observations

Our dynamic programming table has a column for each observation and a row for each state; each cell holds the probability of the best path that ends in that state after that observation, together with a backpointer to the best preceding state

See J&M 6.4 for a full worked example.
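To make the recursion concrete, here is a minimal Python sketch of it under the simplifying assumptions above. The function name viterbi, the dictionaries trans and emit, and the '<s>' start marker are illustrative choices of mine, not notation from J&M; missing dictionary entries are treated as zero probability, and exit probabilities are ignored (as in the worked example below).

```python
def viterbi(obs, tags, trans, emit):
    """Most likely tag sequence for the word list obs.

    trans[(prev, tag)] and emit[(tag, word)] are probabilities;
    '<s>' marks the start state; no exit probabilities are used.
    """
    # First column: start transition times emission for the first word
    col = {t: trans.get(('<s>', t), 0.0) * emit.get((t, obs[0]), 0.0)
           for t in tags}
    backptrs = []

    # Remaining columns: maximise over each possible preceding tag
    for word in obs[1:]:
        new_col, bp = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: col[p] * trans.get((p, t), 0.0))
            new_col[t] = (col[best_prev] * trans.get((best_prev, t), 0.0)
                          * emit.get((t, word), 0.0))
            bp[t] = best_prev
        col, backptrs = new_col, backptrs + [bp]

    # Follow backpointers from the best final cell
    best = max(tags, key=lambda t: col[t])
    path = [best]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return list(reversed(path)), col[best]
```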

3. Viterbi decoding for tagging: Partial worked example

We'll work the first two steps of an example modified from J&M section 5.5.3, namely the question "want race?", using their priors and likelihoods.

[language and channel model numbers from J&M]

First, the starting point, then the first cell, then the rest of the first column:

Starting point (empty lattice; rows are tags, columns are the words):

            want        race
   NN
   TO
   VB
   PPSS

First cell:

   NN:   P(NN|.)*P(want|NN) = .041*.000054 = .0000022

Rest of the first column:

   NN:   P(NN|.)*P(want|NN) = .041*.000054 = .0000022
   TO:   P(TO|.)*P(want|TO) = .0043*0      = 0
   VB:   P(VB|.)*P(want|VB) = .019*.0093   = .00018
   PPSS: P(PS|.)*P(want|PS) = .067*0       = 0
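These first-column values are easy to check mechanically. A few lines of Python, using the J&M numbers quoted in the cells above (the dictionary names are mine):

```python
# First-column Viterbi values for 'want', using the J&M numbers above
start  = {'NN': .041,    'TO': .0043, 'VB': .019,  'PPSS': .067}  # P(tag|.)
p_want = {'NN': .000054, 'TO': 0.0,   'VB': .0093, 'PPSS': 0.0}   # P(want|tag)

for tag in ['NN', 'TO', 'VB', 'PPSS']:
    print(tag, start[tag] * p_want[tag])
# NN ~ .0000022, TO 0, VB ~ .00018, PPSS 0 (to two significant figures)
```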

4. Viterbi decoding for tagging: Partial worked example, cont'd

Now the second column, where we have to maximise over each possible preceding state:

[language and channel model numbers from J&M]
(Here P(NN), P(TO), P(VB) and P(PS) denote the first-column Viterbi values, and the tag at the end of each cell is the backpointer to the best predecessor.)

   NN  (want: .0000022)
        P(NN)*P(NN|NN)  = .0000022*.087    = .00000019
        P(TO)*P(NN|TO)  = 0*.00047         = 0
        P(VB)*P(NN|VB)  = .00018*.047      = .0000085
        P(PS)*P(NN|PS)  = 0*.0012          = 0
        max*P(race|NN)  = .0000085*.00057  = .0000000048    backpointer: VB

   TO  (want: 0)
        P(NN)*P(TO|NN)  = .0000022*.016    = .000000035
        P(TO)*P(TO|TO)  = 0*0              = 0
        P(VB)*P(TO|VB)  = .00018*.035      = .0000063
        P(PS)*P(TO|PS)  = 0*.00079         = 0
        max*P(race|TO)  = .0000063*0       = 0              backpointer: VB

   VB  (want: .00018)
        P(NN)*P(VB|NN)  = .0000022*.0040   = .0000000088
        P(TO)*P(VB|TO)  = 0*.83            = 0
        P(VB)*P(VB|VB)  = .00018*.0038     = .00000068
        P(PS)*P(VB|PS)  = 0*.23            = 0
        max*P(race|VB)  = .00000068*.00012 = .000000000082  backpointer: VB

   PPSS (want: 0)
        P(NN)*P(PS|NN)  = .0000022*.0045   = .0000000099
        P(TO)*P(PS|TO)  = 0*0              = 0
        P(VB)*P(PS|VB)  = .00018*.0070     = .0000013
        P(PS)*P(PS|PS)  = 0*.00014         = 0
        max*P(race|PS)  = .0000013*0       = 0              backpointer: VB

So if we stopped here, with no exit probabilities, we would pick NN for 'race', and follow the backpointer from that cell to VB for 'want'
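As a check on the maximisation step, here is the 'race'/NN cell redone in Python, with the first-column values and the J&M transition and likelihood numbers copied from the table above (variable names are mine):

```python
# Second-column ('race') cell for NN, maximising over the preceding tag.
prev        = {'NN': .0000022, 'TO': 0.0,    'VB': .00018, 'PPSS': 0.0}
trans_to_NN = {'NN': .087,     'TO': .00047, 'VB': .047,   'PPSS': .0012}
p_race_NN   = .00057

scores = {t: prev[t] * trans_to_NN[t] for t in prev}
best = max(scores, key=scores.get)
print(best, scores[best] * p_race_NN)   # VB, about 4.8e-09 (the .0000000048 above)
```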

5. Viterbi decoding for tagging: Partial worked example using costs

This is easier to follow if we use costs, i.e. negative log (base 2) probabilities, and minimise instead of maximising.
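A quick check that these costs really are the negative base-2 logs of the section-3 probabilities (a small Python sketch; the helper name cost is mine):

```python
import math

def cost(p):
    """Negative base-2 log probability; zero probability becomes infinite cost."""
    return float('inf') if p == 0 else -math.log2(p)

print(cost(.041), cost(.000054))   # about 4.61 and 14.18: the NN cell
print(cost(.019), cost(.0093))     # about 5.72 and 6.75: the VB cell
```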

Starting point (empty lattice):

            want        race
   NN
   TO
   VB
   PPSS

First cell:

   NN:   c(NN|.)+c(want|NN) = 4.61+14.18 = 18.79

Rest of the first column:

   NN:   c(NN|.)+c(want|NN) = 4.61+14.18 = 18.79
   TO:   c(TO|.)+c(want|TO) = 7.86+∞     = ∞
   VB:   c(VB|.)+c(want|VB) = 5.72+6.75  = 12.47
   PPSS: c(PS|.)+c(want|PS) = 3.90+∞     = ∞

6. Viterbi decoding: Partial worked example using costs, cont'd

Now the second column, where we have to minimise over each possible preceding state:

   NN  (want: 18.79)
        c(NN)+c(NN|NN)  = 18.79+3.52  = 22.31
        c(TO)+c(NN|TO)  = ∞+11.06     = ∞
        c(VB)+c(NN|VB)  = 12.47+4.41  = 16.85
        c(PS)+c(NN|PS)  = ∞+9.70      = ∞
        min+c(race|NN)  = 16.85+10.78 = 27.63   backpointer: VB

   TO  (want: ∞)
        c(NN)+c(TO|NN)  = 18.79+5.97  = 24.76
        c(TO)+c(TO|TO)  = ∞+∞         = ∞
        c(VB)+c(TO|VB)  = 12.47+4.84  = 17.31
        c(PS)+c(TO|PS)  = ∞+10.31     = ∞
        min+c(race|TO)  = 17.31+∞     = ∞       backpointer: VB

   VB  (want: 12.47)
        c(NN)+c(VB|NN)  = 18.79+7.97  = 26.76
        c(TO)+c(VB|TO)  = ∞+0.27      = ∞
        c(VB)+c(VB|VB)  = 12.47+8.04  = 20.51
        c(PS)+c(VB|PS)  = ∞+2.12      = ∞
        min+c(race|VB)  = 20.51+13.02 = 33.53   backpointer: VB

   PPSS (want: ∞)
        c(NN)+c(PS|NN)  = 18.79+7.80  = 26.59
        c(TO)+c(PS|TO)  = ∞+∞         = ∞
        c(VB)+c(PS|VB)  = 12.47+7.16  = 19.63
        c(PS)+c(PS|PS)  = ∞+12.80     = ∞
        min+c(race|PS)  = 19.63+∞     = ∞       backpointer: VB

So if we stopped here, with no exit probabilities, we would pick NN for 'race', and follow the backpointer from that cell to VB for 'want'
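A small sanity check on the duality between the two lattices: the winning cost here is just the negative base-2 log of the winning probability from section 4, up to rounding.

```python
import math

# 27.63 in the cost lattice corresponds to the .0000000048 found in
# the probability lattice of section 4 (differences are rounding).
print(-math.log2(.0000000048))   # about 27.6
```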

7. HMM for POS: trivial example

A fully-connected, state-labelled FSM for three simple POS tags:

FSM with start, end and three fully-connected states labelled D, N and V

A bigram-probability transition matrix gives us a Markov chain:

        D      N      V      e
  s    0.81   0.15   0.00   0.03
  D    0.00   1.00   0.00   0.00
  N    0.09   0.31   0.06   0.54
  V    0.70   0.17   0.01   0.12

And the output probabilities:

        the     a        sleep    man     sleeps
  D    0.70    0.23     0.0      0.0     0.0
  N    0.0     0.0003   0.0002   0.008   0.0
  V    0.0     0.0      0.01     0.0     0.0001

So the probability of getting "a man sleeps" from the transition sequence s D N V e is

$$(0.81 \times 1.0 \times 0.06 \times 0.12) \times (0.23 \times 0.008 \times 0.0001) = 1.07 \times 10^{-9}$$
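Reproducing that product in Python from the two matrices above (the dictionary layout is an illustrative choice of mine):

```python
# Probability of "a man sleeps" via the state sequence s D N V e,
# using the transition and output matrices above.
trans = {('s','D'): 0.81, ('D','N'): 1.00, ('N','V'): 0.06, ('V','e'): 0.12}
emit  = {('D','a'): 0.23, ('N','man'): 0.008, ('V','sleeps'): 0.0001}

p = (trans[('s','D')] * trans[('D','N')] * trans[('N','V')] * trans[('V','e')]
     * emit[('D','a')] * emit[('N','man')] * emit[('V','sleeps')])
print(p)   # about 1.07e-09
```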

8. Total probability

Computing the total probability that a given observation sequence will occur requires us to look at more than just the best path

Think about generating at random from our example

FSM with start, end and three fully-connected states labelled D, N and V

We'll get "a man sleeps" from the transition sequence s D N V e

But we'll also get it from the sequence s N N V e
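A short sketch that computes both path probabilities and their sum, using the matrices from section 7 (the helper path_prob and the dictionary layout are mine):

```python
trans = {('s','D'): 0.81, ('s','N'): 0.15, ('D','N'): 1.00,
         ('N','N'): 0.31, ('N','V'): 0.06, ('V','e'): 0.12}
emit  = {('D','a'): 0.23, ('N','a'): 0.0003,
         ('N','man'): 0.008, ('V','sleeps'): 0.0001}

def path_prob(states, words):
    """Probability of emitting words along the state path s ... e."""
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    for state, word in zip(states[1:-1], words):
        p *= emit[(state, word)]
    return p

p1 = path_prob(['s', 'D', 'N', 'V', 'e'], ['a', 'man', 'sleeps'])
p2 = path_prob(['s', 'N', 'N', 'V', 'e'], ['a', 'man', 'sleeps'])
print(p1, p2, p1 + p2)   # about 1.07e-09 and 8e-14, so the total is a bit above 1.07e-09
```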

9. Another example

This looks more plausible if we look at the "want race" example

[transition and likelihood matrices from J&M for 'want race']

Earlier we used Viterbi decoding to find the most likely path

But there are three other ways to get "want race"

So the total probability is at least $4.84 \times 10^{-9}$
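Enumerating those paths explicitly: only NN and VB have non-zero probability for either word, so there are four tag sequences to sum over. The numbers below are the first-column values and the J&M transition and likelihood numbers from sections 3 and 4; the variable names are mine.

```python
# The four non-zero-probability paths for "want race".
want  = {'NN': .0000022, 'VB': .00018}           # non-zero first-column values
trans = {('VB','NN'): .047,  ('NN','NN'): .087,  # P(tag2|tag1)
         ('VB','VB'): .0038, ('NN','VB'): .0040}
race  = {'NN': .00057, 'VB': .00012}             # P(race|tag2)

total = 0.0
for (t1, t2), p in trans.items():
    prob = want[t1] * p * race[t2]
    print(t1, t2, prob)
    total += prob
print('total', total)   # about 5e-09, consistent with 'at least 4.84e-09' above
```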

10. Dynamic programming for total probability

In practice, this just means replacing 'max P...' with 'sum P...'

[transition and likelihood matrices from J&M for 'want race']
   VB  (want: .00018)
        P(VB)*P(VB|VB)  = .00018*.0038    = .00000068
        P(TO)*P(VB|TO)  = 0*.83           = 0
        P(NN)*P(VB|NN)  = .0000022*.0040  = .0000000088
        P(PS)*P(VB|PS)  = 0*.23           = 0
        sum*P(race|VB)  = (.0000000088+.00000068)*.00012 = .000000000083

   TO  (want: 0)
        P(VB)*P(TO|VB)  = .00018*.035     = .0000063
        P(TO)*P(TO|TO)  = 0*0             = 0
        P(NN)*P(TO|NN)  = .0000022*.016   = .000000035
        P(PS)*P(TO|PS)  = 0*.00079        = 0
        sum*P(race|TO)  = (.000000035+.0000063)*0 = 0

   NN  (want: .0000022)
        P(VB)*P(NN|VB)  = .00018*.047     = .0000085
        P(TO)*P(NN|TO)  = 0*.00047        = 0
        P(NN)*P(NN|NN)  = .0000022*.087   = .00000019
        P(PS)*P(NN|PS)  = 0*.0012         = 0
        sum*P(race|NN)  = (.00000019+.0000085)*.00057 = .0000000050

   PPSS (want: 0)
        P(VB)*P(PS|VB)  = .00018*.0070    = .0000013
        P(TO)*P(PS|TO)  = 0*0             = 0
        P(NN)*P(PS|NN)  = .0000022*.0045  = .0000000099
        P(PS)*P(PS|PS)  = 0*.00014        = 0
        sum*P(race|PS)  = (.0000000099+.0000013)*0 = 0

It's worth noting that in each cell of both lattices (except those in the first column) we work with the pairwise product of two vectors: the probabilities from the previous column of the lattice, and the transition probabilities into the state corresponding to the cell
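In vector terms, one step of each lattice might be sketched like this with numpy, assuming A[i, j] = P(state j | state i) and b[j] = P(current word | state j); the function and variable names are mine.

```python
import numpy as np

# alpha_prev, A and b are numpy arrays:
#   alpha_prev - the previous column of the lattice
#   A[i, j]    - transition probability from state i to state j
#   b[j]       - output probability of the current word from state j

def forward_step(alpha_prev, A, b):
    # Sum over predecessors, then multiply in the output probabilities
    return (alpha_prev @ A) * b

def viterbi_step(alpha_prev, A, b):
    # Same pairwise products, but maximise over predecessors instead of summing
    return (alpha_prev[:, None] * A).max(axis=0) * b
```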

11. Forward and onward (backward)

The dynamic programming approach to total probability calculation is called the forward algorithm

The cell for state i in column t (because observations are considered to come sequentially in time) is usually notated $\alpha_t(i)$
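Written out, each cell is computed from the previous column by the usual recurrence, where $a_{ij}$ is the transition probability from state $i$ to state $j$ and $b_j(o_t)$ the probability of emitting observation $o_t$ from state $j$:

$$\alpha_t(j) \;=\; \sum_{i} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$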

The dual of the forward algorithm is the backward algorithm, which computes $\beta_t(j)$, the probability of seeing the observations from time t+1 to the end, starting from state j:
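$$\beta_t(i) \;=\; \sum_{j} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

(the standard recurrence, in the same $a_{ij}$, $b_j$ notation as for the forward probabilities)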

I find it easier to think of this as the onward contribution

12. Forward-backward combined

The forward-backward algorithm is used to train an HMM

At its heart is an expression for the joint probability, given a particular HMM, of being in state i at time t and in state j at time t+1 while generating the whole observation sequence

We get this by combining the forward and backward probabilities with the probability of the relevant transition from i to j and the relevant output probability at j:
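$$P(q_t = i,\ q_{t+1} = j,\ o_1^T \mid \lambda) \;=\; \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

(in the usual notation: $q_t$ is the state occupied at time $t$, $o_1^T$ the whole observation sequence and $\lambda$ the model; normalising by $P(o_1^T \mid \lambda)$ gives the quantity usually written $\xi_t(i,j)$)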

The basic operation of the forward-backward algorithm is to repeatedly re-estimate the transition and likelihood probabilities
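In outline, the re-estimates are the standard expected-count ratios, with $\gamma_t(i) = \sum_j \xi_t(i,j)$ the probability of being in state $i$ at time $t$ given the observations:

$$\hat{a}_{ij} \;=\; \frac{\sum_t \xi_t(i,j)}{\sum_t \gamma_t(i)}, \qquad \hat{b}_j(v) \;=\; \frac{\sum_{t:\,o_t = v} \gamma_t(j)}{\sum_t \gamma_t(j)}$$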

13. Required reading

Jurafsky & Martin, Chapter 5, sections 5.5.3, 5.5.4; Chapter 6, sections 6.3–6.5

14. Administrative details

Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/

Do your own work yourself

The assignment is due on Thursday at 1600