FNLP 2014: Lecture 9: HMM decoding, forward, backward

Henry S. Thompson
11 February 2014
Creative Commons Attribution-ShareAlike

1. Recap: POS tagging as noisy channel decoding

For any noisy channel decoding task, we're trying to find

$$\operatorname*{argmax}_{s_1^n}\ \underbrace{P(o_1^n \mid s_1^n)}_{\text{likelihood}}\ \underbrace{P(s_1^n)}_{\text{prior}}$$

In the POS tagging case, the source is tags and the observations are words, so we have

$$\operatorname*{argmax}_{t_1^n}\ \underbrace{P(w_1^n \mid t_1^n)}_{\text{likelihood}}\ \underbrace{P(t_1^n)}_{\text{prior}}$$

We make our two simplifying assumptions (independence of likelihoods and bigram modelling for the priors), and get

$$\operatorname*{argmax}_{t_1^n}\ \prod_{i=1}^{n} P(w_i \mid t_i)\ \prod_{i=1}^{n} P(t_i \mid t_{i-1})$$

2. Viterbi search for decoding

We can use dynamic programming to find the most likely path through an HMM given a sequence of observations

Our dynamic programming table has a column for each observation and a row for each state; each cell holds the probability of the best path that ends in that state after that observation, together with a backpointer to the best preceding state

See J&M 6.4 for a full worked example.
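To make the recursion concrete, here is a minimal Python sketch of it under the simplifying assumptions above. The function name viterbi, the dictionaries trans and emit, and the '<s>' start marker are illustrative choices of mine, not notation from J&M; missing dictionary entries are treated as zero probability, and exit probabilities are ignored (as in the worked example below).

```python
def viterbi(obs, tags, trans, emit):
    """Most likely tag sequence for the word list obs.

    trans[(prev, tag)] and emit[(tag, word)] are probabilities;
    '<s>' marks the start state; no exit probabilities are used.
    """
    # First column: start transition times emission for the first word
    col = {t: trans.get(('<s>', t), 0.0) * emit.get((t, obs[0]), 0.0)
           for t in tags}
    backptrs = []

    # Remaining columns: maximise over each possible preceding tag
    for word in obs[1:]:
        new_col, bp = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: col[p] * trans.get((p, t), 0.0))
            new_col[t] = (col[best_prev] * trans.get((best_prev, t), 0.0)
                          * emit.get((t, word), 0.0))
            bp[t] = best_prev
        col, backptrs = new_col, backptrs + [bp]

    # Follow backpointers from the best final cell
    best = max(tags, key=lambda t: col[t])
    path = [best]
    for bp in reversed(backptrs):
        path.append(bp[path[-1]])
    return list(reversed(path)), col[best]
```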

3. Viterbi decoding for tagging: Partial worked example

We'll work the first two steps of an example modified from J&M section 5.5.3, namely the question "want race?", using their priors and likelihoods.

[language and channel model numbers from J&M]

First, the starting point, then the first cell, then the rest of the first column:

Starting point (empty lattice; rows are tags, columns are the words):

            want        race
   NN
   TO
   VB
   PPSS

First cell:

   NN:   P(NN|.)*P(want|NN) = .041*.000054 = .0000022

Rest of the first column:

   NN:   P(NN|.)*P(want|NN) = .041*.000054 = .0000022
   TO:   P(TO|.)*P(want|TO) = .0043*0      = 0
   VB:   P(VB|.)*P(want|VB) = .019*.0093   = .00018
   PPSS: P(PS|.)*P(want|PS) = .067*0       = 0
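These first-column values are easy to check mechanically. A few lines of Python, using the J&M numbers quoted in the cells above (the dictionary names are mine):

```python
# First-column Viterbi values for 'want', using the J&M numbers above
start  = {'NN': .041,    'TO': .0043, 'VB': .019,  'PPSS': .067}  # P(tag|.)
p_want = {'NN': .000054, 'TO': 0.0,   'VB': .0093, 'PPSS': 0.0}   # P(want|tag)

for tag in ['NN', 'TO', 'VB', 'PPSS']:
    print(tag, start[tag] * p_want[tag])
# NN ~ .0000022, TO 0, VB ~ .00018, PPSS 0 (to two significant figures)
```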

4. Viterbi decoding for tagging: Partial worked example, cont'd

Now the second column, where we have to maximise over each possible preceding state:

[language and channel model numbers from J&M]
(Here P(NN), P(TO), P(VB) and P(PS) denote the first-column Viterbi values, and the tag at the end of each cell is the backpointer to the best predecessor.)

   NN  (want: .0000022)
        P(NN)*P(NN|NN)  = .0000022*.087    = .00000019
        P(TO)*P(NN|TO)  = 0*.00047         = 0
        P(VB)*P(NN|VB)  = .00018*.047      = .0000085
        P(PS)*P(NN|PS)  = 0*.0012          = 0
        max*P(race|NN)  = .0000085*.00057  = .0000000048    backpointer: VB

   TO  (want: 0)
        P(NN)*P(TO|NN)  = .0000022*.016    = .000000035
        P(TO)*P(TO|TO)  = 0*0              = 0
        P(VB)*P(TO|VB)  = .00018*.035      = .0000063
        P(PS)*P(TO|PS)  = 0*.00079         = 0
        max*P(race|TO)  = .0000063*0       = 0              backpointer: VB

   VB  (want: .00018)
        P(NN)*P(VB|NN)  = .0000022*.0040   = .0000000088
        P(TO)*P(VB|TO)  = 0*.83            = 0
        P(VB)*P(VB|VB)  = .00018*.0038     = .00000068
        P(PS)*P(VB|PS)  = 0*.23            = 0
        max*P(race|VB)  = .00000068*.00012 = .000000000082  backpointer: VB

   PPSS (want: 0)
        P(NN)*P(PS|NN)  = .0000022*.0045   = .0000000099
        P(TO)*P(PS|TO)  = 0*0              = 0
        P(VB)*P(PS|VB)  = .00018*.0070     = .0000013
        P(PS)*P(PS|PS)  = 0*.00014         = 0
        max*P(race|PS)  = .0000013*0       = 0              backpointer: VB

So if we stopped here, with no exit probabilities, we would pick NN for 'race', and follow the backpointer from that cell to VB for 'want'
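As a check on the maximisation step, here is the 'race'/NN cell redone in Python, with the first-column values and the J&M transition and likelihood numbers copied from the table above (variable names are mine):

```python
# Second-column ('race') cell for NN, maximising over the preceding tag.
prev        = {'NN': .0000022, 'TO': 0.0,    'VB': .00018, 'PPSS': 0.0}
trans_to_NN = {'NN': .087,     'TO': .00047, 'VB': .047,   'PPSS': .0012}
p_race_NN   = .00057

scores = {t: prev[t] * trans_to_NN[t] for t in prev}
best = max(scores, key=scores.get)
print(best, scores[best] * p_race_NN)   # VB, about 4.8e-09 (the .0000000048 above)
```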

5. Viterbi decoding for tagging: Partial worked example using costs

This is easier to follow if we use costs, i.e. negative log (base 2) probabilities, and minimise instead of maximising.
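A quick check that these costs really are the negative base-2 logs of the section-3 probabilities (a small Python sketch; the helper name cost is mine):

```python
import math

def cost(p):
    """Negative base-2 log probability; zero probability becomes infinite cost."""
    return float('inf') if p == 0 else -math.log2(p)

print(cost(.041), cost(.000054))   # about 4.61 and 14.18: the NN cell
print(cost(.019), cost(.0093))     # about 5.72 and 6.75: the VB cell
```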

Starting point (empty lattice):

            want        race
   NN
   TO
   VB
   PPSS

First cell:

   NN:   c(NN|.)+c(want|NN) = 4.61+14.18 = 18.79

Rest of the first column:

   NN:   c(NN|.)+c(want|NN) = 4.61+14.18 = 18.79
   TO:   c(TO|.)+c(want|TO) = 7.86+∞     = ∞
   VB:   c(VB|.)+c(want|VB) = 5.72+6.75  = 12.47
   PPSS: c(PS|.)+c(want|PS) = 3.90+∞     = ∞

6. Viterbi decoding: Partial worked example using costs, cont'd

Now the second column, where we have to minimise over each possible preceding state:

   NN  (want: 18.79)
        c(NN)+c(NN|NN)  = 18.79+3.52  = 22.31
        c(TO)+c(NN|TO)  = ∞+11.06     = ∞
        c(VB)+c(NN|VB)  = 12.47+4.41  = 16.85
        c(PS)+c(NN|PS)  = ∞+9.70      = ∞
        min+c(race|NN)  = 16.85+10.78 = 27.63   backpointer: VB

   TO  (want: ∞)
        c(NN)+c(TO|NN)  = 18.79+5.97  = 24.76
        c(TO)+c(TO|TO)  = ∞+∞         = ∞
        c(VB)+c(TO|VB)  = 12.47+4.84  = 17.31
        c(PS)+c(TO|PS)  = ∞+10.31     = ∞
        min+c(race|TO)  = 17.31+∞     = ∞       backpointer: VB

   VB  (want: 12.47)
        c(NN)+c(VB|NN)  = 18.79+7.97  = 26.76
        c(TO)+c(VB|TO)  = ∞+0.27      = ∞
        c(VB)+c(VB|VB)  = 12.47+8.04  = 20.51
        c(PS)+c(VB|PS)  = ∞+2.12      = ∞
        min+c(race|VB)  = 20.51+13.02 = 33.53   backpointer: VB

   PPSS (want: ∞)
        c(NN)+c(PS|NN)  = 18.79+7.80  = 26.59
        c(TO)+c(PS|TO)  = ∞+∞         = ∞
        c(VB)+c(PS|VB)  = 12.47+7.16  = 19.63
        c(PS)+c(PS|PS)  = ∞+12.80     = ∞
        min+c(race|PS)  = 19.63+∞     = ∞       backpointer: VB

So if we stopped here, with no exit probabilities, we would pick NN for 'race', and follow the backpointer from that cell to VB for 'want'
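A small sanity check on the duality between the two lattices: the winning cost here is just the negative base-2 log of the winning probability from section 4, up to rounding.

```python
import math

# 27.63 in the cost lattice corresponds to the .0000000048 found in
# the probability lattice of section 4 (differences are rounding).
print(-math.log2(.0000000048))   # about 27.6
```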

7. HMM for POS: trivial example

A fully-connected, state-labelled FSM for three simple POS tags:

FSM with start, end and three fully-connected states labelled D, N and V

A bigram-probability transition matrix gives us a Markov chain:

        D      N      V      e
  s    0.81   0.15   0.00   0.03
  D    0.00   1.00   0.00   0.00
  N    0.09   0.31   0.06   0.54
  V    0.70   0.17   0.01   0.12

And the output probabilities:

        the     a        sleep    man     sleeps
  D    0.70    0.23     0.0      0.0     0.0
  N    0.0     0.0003   0.0002   0.008   0.0
  V    0.0     0.0      0.01     0.0     0.0001

So the probability of getting "a man sleeps" from the transition sequence s D N V e is

$$(0.81 \times 1.0 \times 0.06 \times 0.12) \times (0.23 \times 0.008 \times 0.0001) = 1.07 \times 10^{-9}$$
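Reproducing that product in Python from the two matrices above (the dictionary layout is an illustrative choice of mine):

```python
# Probability of "a man sleeps" via the state sequence s D N V e,
# using the transition and output matrices above.
trans = {('s','D'): 0.81, ('D','N'): 1.00, ('N','V'): 0.06, ('V','e'): 0.12}
emit  = {('D','a'): 0.23, ('N','man'): 0.008, ('V','sleeps'): 0.0001}

p = (trans[('s','D')] * trans[('D','N')] * trans[('N','V')] * trans[('V','e')]
     * emit[('D','a')] * emit[('N','man')] * emit[('V','sleeps')])
print(p)   # about 1.07e-09
```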

8. Total probability

Computing the total probability that a given observation sequence will occur requires us to look at more than just the best path

Think about generating at random from our example

FSM with start, end and three fully-connected states labelled D, N and V

We'll get "a man sleeps" from the transition sequence s D N V e

But we'll also get it from the sequence s N N V e
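A short sketch that computes both path probabilities and their sum, using the matrices from section 7 (the helper path_prob and the dictionary layout are mine):

```python
trans = {('s','D'): 0.81, ('s','N'): 0.15, ('D','N'): 1.00,
         ('N','N'): 0.31, ('N','V'): 0.06, ('V','e'): 0.12}
emit  = {('D','a'): 0.23, ('N','a'): 0.0003,
         ('N','man'): 0.008, ('V','sleeps'): 0.0001}

def path_prob(states, words):
    """Probability of emitting words along the state path s ... e."""
    p = 1.0
    for prev, cur in zip(states, states[1:]):
        p *= trans[(prev, cur)]
    for state, word in zip(states[1:-1], words):
        p *= emit[(state, word)]
    return p

p1 = path_prob(['s', 'D', 'N', 'V', 'e'], ['a', 'man', 'sleeps'])
p2 = path_prob(['s', 'N', 'N', 'V', 'e'], ['a', 'man', 'sleeps'])
print(p1, p2, p1 + p2)   # about 1.07e-09 and 8e-14, so the total is a bit above 1.07e-09
```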

9. Another example

This looks more plausible if we look at the "want race" example

[transition and likelihood matrices from J&M for 'want race']

Earlier we used Viterbi decoding to find the most likely path

But there are three other ways to get "want race"

So the total probability is at least $4.84 \times 10^{-9}$
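Enumerating those paths explicitly: only NN and VB have non-zero probability for either word, so there are four tag sequences to sum over. The numbers below are the first-column values and the J&M transition and likelihood numbers from sections 3 and 4; the variable names are mine.

```python
# The four non-zero-probability paths for "want race".
want  = {'NN': .0000022, 'VB': .00018}           # non-zero first-column values
trans = {('VB','NN'): .047,  ('NN','NN'): .087,  # P(tag2|tag1)
         ('VB','VB'): .0038, ('NN','VB'): .0040}
race  = {'NN': .00057, 'VB': .00012}             # P(race|tag2)

total = 0.0
for (t1, t2), p in trans.items():
    prob = want[t1] * p * race[t2]
    print(t1, t2, prob)
    total += prob
print('total', total)   # about 5e-09, consistent with 'at least 4.84e-09' above
```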

10. Dynamic programming for total probability

In practice, this just means replacing 'max P...' with 'sum P...'

[transition and likelihood matrices from J&M for 'want race']
   VB  (want: .00018)
        P(VB)*P(VB|VB)  = .00018*.0038    = .00000068
        P(TO)*P(VB|TO)  = 0*.83           = 0
        P(NN)*P(VB|NN)  = .0000022*.0040  = .0000000088
        P(PS)*P(VB|PS)  = 0*.23           = 0
        sum*P(race|VB)  = (.0000000088+.00000068)*.00012 = .000000000083

   TO  (want: 0)
        P(VB)*P(TO|VB)  = .00018*.035     = .0000063
        P(TO)*P(TO|TO)  = 0*0             = 0
        P(NN)*P(TO|NN)  = .0000022*.016   = .000000035
        P(PS)*P(TO|PS)  = 0*.00079        = 0
        sum*P(race|TO)  = (.000000035+.0000063)*0 = 0

   NN  (want: .0000022)
        P(VB)*P(NN|VB)  = .00018*.047     = .0000085
        P(TO)*P(NN|TO)  = 0*.00047        = 0
        P(NN)*P(NN|NN)  = .0000022*.087   = .00000019
        P(PS)*P(NN|PS)  = 0*.0012         = 0
        sum*P(race|NN)  = (.00000019+.0000085)*.00057 = .0000000050

   PPSS (want: 0)
        P(VB)*P(PS|VB)  = .00018*.0070    = .0000013
        P(TO)*P(PS|TO)  = 0*0             = 0
        P(NN)*P(PS|NN)  = .0000022*.0045  = .0000000099
        P(PS)*P(PS|PS)  = 0*.00014        = 0
        sum*P(race|PS)  = (.0000000099+.0000013)*0 = 0

It's worth noting that in each cell of both lattices (except those in the first column) we work with the pairwise product of two vectors: the probabilities from the previous column of the lattice, and the transition probabilities into the state corresponding to the cell
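In vector terms, one step of each lattice might be sketched like this with numpy, assuming A[i, j] = P(state j | state i) and b[j] = P(current word | state j); the function and variable names are mine.

```python
import numpy as np

# alpha_prev, A and b are numpy arrays:
#   alpha_prev - the previous column of the lattice
#   A[i, j]    - transition probability from state i to state j
#   b[j]       - output probability of the current word from state j

def forward_step(alpha_prev, A, b):
    # Sum over predecessors, then multiply in the output probabilities
    return (alpha_prev @ A) * b

def viterbi_step(alpha_prev, A, b):
    # Same pairwise products, but maximise over predecessors instead of summing
    return (alpha_prev[:, None] * A).max(axis=0) * b
```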

11. Forward and onward (backward)

The dynamic programming approach to total probability calculation is called the forward algorithm

The cell for state i in column t (because observations are considered to come sequentially in time) is usually notated $\alpha_t(i)$
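Written out, each cell is computed from the previous column by the usual recurrence, where $a_{ij}$ is the transition probability from state $i$ to state $j$ and $b_j(o_t)$ the probability of emitting observation $o_t$ from state $j$:

$$\alpha_t(j) \;=\; \sum_{i} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t)$$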

The dual of the forward algorithm is the backward algorithm, which computes $\beta_t(j)$, the probability of seeing the observations from time t+1 to the end, starting from state j:
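$$\beta_t(i) \;=\; \sum_{j} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

(the standard recurrence, in the same $a_{ij}$, $b_j$ notation as for the forward probabilities)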

I find it easier to think of this as the onward contribution

12. Forward-backward combined

The forward-backward algorithm is used to train an HMM

At its heart is an expression for the joint probability, given a particular HMM, of being in state i at time t and in state j at time t+1 while generating the whole observation sequence

We get this by combining the forward and backward probabilities with the probability of the relevant transition from i to j and the relevant output probability at j:
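$$P(q_t = i,\ q_{t+1} = j,\ o_1^T \mid \lambda) \;=\; \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$$

(in the usual notation: $q_t$ is the state occupied at time $t$, $o_1^T$ the whole observation sequence and $\lambda$ the model; normalising by $P(o_1^T \mid \lambda)$ gives the quantity usually written $\xi_t(i,j)$)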

The basic operation of the forward-backward algorithm is to repeatedly re-estimate the transition and likelihood probabilities
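In outline, the re-estimates are the standard expected-count ratios, with $\gamma_t(i) = \sum_j \xi_t(i,j)$ the probability of being in state $i$ at time $t$ given the observations:

$$\hat{a}_{ij} \;=\; \frac{\sum_t \xi_t(i,j)}{\sum_t \gamma_t(i)}, \qquad \hat{b}_j(v) \;=\; \frac{\sum_{t:\,o_t = v} \gamma_t(j)}{\sum_t \gamma_t(j)}$$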

13. Required reading

Jurafsky & Martin, Chapter 5, sections 5.5.3, 5.5.4; Chapter 6, sections 6.3–6.5

14. Administrative details

Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/

Do your own work yourself

The assignment is due on Thursday at 1600