FNLP 2017: Lecture 17: Machine translation

Henry S. Thompson
21 March 2017
Creative Commons Attribution-Share Alike

1. Words, and other words

A brief introduction to machine translation

Machine Translation covers a wide range of goals

The contrast between hyped-up promises of success and poor actual performance led to

2. From knowledge-rich to machine-learning

MT has followed the same trajectory as many other aspects of speech and language technology

3. Before his time: Warren Weaver

Stimulated by the success of the codebreakers at Bletchley Park (including Alan Turing), Weaver had a surprisingly prescient idea:

[...] knowing nothing official about, but having guessed and inferred considerable about, powerful new mechanized methods in cryptography . . . one naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.” Have you ever thought about this? As a linguist and expert on computers, do you think it is worth thinking about?
from a letter from Warren Weaver to Norbert Wiener, dated April 30, 1947

4. A very noisy channel

Applying the noisy channel model to translation requires us to stand normal terminology on its head

But from the perspective of the noisy channel model

5. Priors and likelihood for MT

Remember the basic story (using e for English and r for Russian):

$$\hat{e}_1^n = \operatorname*{argmax}_{e_1^n}\ \underbrace{P(r_1^n \mid e_1^n)}_{\text{likelihood}}\ \underbrace{P(e_1^n)}_{\text{prior}}$$

The prior is just our old friend, some form of language model

But the channel model needs to be articulated a bit for translation, in several ways

So we need a channel model that takes all of these into account
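As a concrete (if brutally simplified) illustration of the argmax above, here is a minimal sketch. It assumes we already have a language_model and a translation_model that return probabilities, plus some externally supplied set of candidate English sentences; all of these names are hypothetical, and a real decoder searches this space rather than enumerating it.

```python
# Minimal noisy-channel decoding sketch. language_model(e) is assumed to
# return P(e) and translation_model(r, e) to return P(r|e); both, and the
# candidate set, are hypothetical stand-ins.
def decode(r, candidates, language_model, translation_model):
    """Return the English candidate maximising P(r|e) * P(e),
    i.e. likelihood times prior."""
    return max(candidates,
               key=lambda e: translation_model(r, e) * language_model(e))
```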

6. Translation modeling for MT

J&M 2nd ed. Chapter 25 takes you through a step-by-step motivation for the first successful attempt at doing things this way

All their approaches start with a formal notion of alignment

7. Translation modelling, cont'd

Their simplest model then has three conceptual steps:

  1. Choose a length for the Russian, given the length of the English
    • Remember, from the model's perspective we are generating the Russian observation, starting from the English source
    • Think of the POS-tagging HMM
    • Which 'generates' English words (the observations) from a sequence of POS tags (the source)
  2. Choose an alignment from the words in the source (English) to the words in the observations (Russian)
  3. For each position in the Russian, choose a translation of the English word which aligns to it

Making simplifying assumptions of the usual Markov kind, we end up with

$$P(r_1^J, a_1^J \mid e_1^I) = P(J \mid I) \times \prod_{j=1}^{J} P(a_j \mid a_{j-1}, I)\, P(r_j \mid e_{a_j})$$

where a_1^J is the alignment, I and J are the lengths of the English and Russian sentences respectively, and e_{a_j} is the English word that position j of the Russian aligns to

This is the translation model on its own, that is, before we do the Bayes rule switch and the argmax
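A small sketch of that generative story, assuming toy parameter tables length_model, align_model and trans_model (all hypothetical dictionaries, keyed as shown in the comments); it just multiplies out the three factors in the formula for one candidate alignment.

```python
def channel_prob(r, a, e, length_model, align_model, trans_model):
    """P(r_1^J, a_1^J | e_1^I) for one candidate alignment a:
    choose a Russian length, then for each Russian position choose an
    alignment (conditioned on the previous one) and a translation of
    the English word it aligns to."""
    I, J = len(e), len(r)
    p = length_model[(J, I)]               # P(J | I)
    prev = 0                               # assumed initial alignment state
    for j in range(J):
        p *= align_model[(a[j], prev, I)]  # P(a_j | a_{j-1}, I)
        p *= trans_model[(r[j], e[a[j]])]  # P(r_j | e_{a_j})
        prev = a[j]
    return p
```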

8. Contemporary MT

We've barely scratched the surface

But state-of-the-art MT systems today all derive from essentially this starting point

9. Getting started: The role of data

Broadly speaking, we have two models to learn:

[Figure: data flow for SMT: bilingual data feeds the translation model, monolingual data feeds the language model]

10. Getting started: segmentation and sentence alignment

Just as with other corpora, we need to pre-process the raw materials

These will vary in difficulty given the form of the raw data

But for the translation model, with respect to the bilingual data, we need more

11. Sentence alignment details: Gale and Church (1993)

Assumptions:

Paragraph by paragraph, the algorithm matches source sentences to zero, one or two target sentences

12. Gale and Church, cont'd

Start with some empirical observations:

What does a hand-aligned corpus tell us about sentence alignment?

[Table: frequency of sentence alignment types, with 1-1 alignments the vast majority. Table 5 from Gale, William A. and Kenneth W. Church, "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics 19(1): 75-102, 1993]

What about relative length?

From this kind of data G&C get what they need to feed into a dynamic programming search for the optimal combination of local alignments within a paragraph
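Here is a sketch of that dynamic programming search, heavily simplified: sentences are represented only by their character lengths, the allowed match types are 1-1, 1-0, 0-1, 2-1 and 1-2, and the cost function (a squared, variance-scaled length mismatch plus a fixed penalty for non-1-1 matches) only approximates Gale and Church's probabilistic model. The variance constant 6.8 is theirs; the penalty value is an assumption.

```python
import math

NON_1_1_PENALTY = 4.0   # assumed penalty for anything other than a 1-1 match

def match_cost(src_len, tgt_len, one_to_one):
    """How implausible is matching src_len source chars to tgt_len target chars?"""
    if src_len + tgt_len == 0:
        return 0.0
    mean = (src_len + tgt_len) / 2
    delta = (tgt_len - src_len) / math.sqrt(6.8 * mean)  # 6.8: G&C's variance estimate
    return delta * delta + (0.0 if one_to_one else NON_1_1_PENALTY)

def align(src, tgt):
    """src, tgt: lists of sentence lengths (in characters) for one paragraph.
    Returns the minimum-cost sequence of (n_src, n_tgt) match types."""
    moves = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    best = {(0, 0): (0.0, None, None)}   # (cost, previous state, move taken)
    for i in range(len(src) + 1):
        for j in range(len(tgt) + 1):
            if (i, j) == (0, 0):
                continue
            options = []
            for di, dj in moves:
                prev = (i - di, j - dj)
                if prev not in best:
                    continue
                cost = match_cost(sum(src[i - di:i]), sum(tgt[j - dj:j]),
                                  one_to_one=(di == 1 and dj == 1))
                options.append((best[prev][0] + cost, prev, (di, dj)))
            if options:
                best[(i, j)] = min(options)
    path, state = [], (len(src), len(tgt))
    while best[state][1] is not None:     # trace back the best path
        path.append(best[state][2])
        state = best[state][1]
    return list(reversed(path))

# e.g. three source sentences vs. two target sentences, lengths in characters:
print(align([120, 60, 65], [118, 130]))   # [(1, 1), (2, 1)]
```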

13. Evaluation-driven development

From 2006–2014, an annual competition was held

Shared task, many language pairs

14. Evaluation

How can we evaluate systems?

As with other similar tasks, in one of two ways:

Extrinsic evaluation
Measure the utility of the result with respect to some (real) use
  • As a first step in high-quality production: how much post-editing is required?
  • Comprehension tests
  • As basis for search or information retrieval: measure quality of that result
Intrinsic evaluation
Measure the quality of the result against some (more-or-less explicit) standard
  • Human quality assessment
  • Automatic comparison to gold standard

Any measure involving humans is

Any automatic measure is

15. Human evaluation

One or more judges, working from

Different dimensions for judgement

Typically judged on a numeric scale

[Figure: snapshot of the WMT evaluation tool]

As mentioned before, agreement can be a problem

[Figure: varying histograms of judges' ratings from WMT]

16. Automatic evaluation

There was no accepted automatic evaluation measure for MT for a long time

The advent of the BLEU methodology (BiLingual Evaluation Understudy) around 2000 helped a lot

It correlates surprisingly well with human judgements

17. A digression about headroom

When you need numerical scores to facilitate hillclimbing

If your system is doing pretty well already

But if you're doing pretty badly

So we can ask "How much headroom do we have?"

18. BLEU: overview

BLEU starts from the observation that just getting the 'right' words counts for a lot

So BLEU counts not just word overlap, but n-gram overlap (typically for n up to 4), between the candidate output sentence and the reference translation(s)

Two parts to the evaluation of each sentence: modified n-gram precision, and a brevity penalty

Combined over paragraphs or documents

19. A digression about precision and recall

To test the overlap between two sets

We need to answer two questions

What proportion of the answers are right?

What proportion of the truth was found?

You need both
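A tiny illustration in Python, treating the system's answers and the gold standard as sets (the words and the resulting numbers are made up for the example):

```python
def precision_recall(predicted, gold):
    """Set-overlap precision and recall."""
    tp = len(predicted & gold)                 # answers we got right
    return tp / len(predicted), tp / len(gold)

predicted = {"the", "cat", "sat", "mat"}
gold = {"the", "cat", "is", "on", "mat"}
print(precision_recall(predicted, gold))       # (0.75, 0.6): right vs. found
```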

20. BLEU: Three versions of the formula

As described, the BLEU formula is a product

The 4th root (a geometric mean over the n-gram precisions) is usually expressed via the log domain

$$\text{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

The whole thing is usually then moved into the log domain

$$\log \text{BLEU} = \min\!\left(1 - \frac{r}{c},\, 0\right) + \sum_{n=1}^{N} w_n \log p_n$$

See J&M 2nd ed. 25.9 for details and a worked example
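A minimal sketch of single-segment BLEU following the formulas above: modified (clipped) n-gram precision combined in the log domain, times a brevity penalty. There is no smoothing here, and real implementations differ in detail; the toy sentences are made up.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, references, max_n=4):
    """Single-segment BLEU: brevity penalty times the weighted geometric
    mean of modified n-gram precisions, computed in the log domain."""
    weights = [1.0 / max_n] * max_n
    log_precisions = 0.0
    for n, w in enumerate(weights, start=1):
        cand_counts = Counter(ngrams(candidate, n))
        # clip each n-gram count by its maximum count in any one reference
        max_ref = Counter()
        for ref in references:
            for g, c in Counter(ngrams(ref, n)).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0            # unsmoothed BLEU is zero if any p_n is zero
        log_precisions += w * math.log(clipped / total)
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = math.exp(min(1 - r / c, 0))          # brevity penalty
    return bp * math.exp(log_precisions)

cand = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(round(bleu(cand, refs, max_n=2), 3))    # 0.707 with bigrams only
```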

21. Getting something to evaluate: Back to translation modelling

How do we learn even the simple model we sketched last time?

We saw how to align sentence (groups) between parallel corpora

From the 20,000-foot level, this is similar to the HMM-learning problem for tagging

That is, we start out with nothing

Then, very much as in the forward-backward algorithm

And use those counts to re-estimate all the probabilities

22. Expectation maximisation

What's common to the approach just described and the forward-backward algorithm is that they're both examples of expectation maximisation

How can this possibly work?

[Figure: three French/English sentence fragments with all possible word alignments; la/the is present in all three]

As we iterate over the data, we will 'count' 'la' being realised as 'the' more often than anything else

What happened with 'fleur' is called the pigeonhole principle

23. But this is all changing

Over the last two years, there's been a huge shift in emphasis

Google announced a few months ago that Google Translate had made such a shift for some of the most common language pairs

24. Expectation maximisation, cont'd

By aggregating across all sentence pairs, we can count how much probability attaches to e.g. the bleu/blue pair, compared to all the bleu/... pairs, to get a new ML estimate

In the simplest IBM model, so-called IBM Model 1, they started with the assumption that all alignments were equally likely

The overall shape of their EM process was as follows

Step 1: Expectation
For every sentence pair
  • For every possible word alignment
    • Use the word-word translation probabilities to assign a total probability
Step 2: Maximisation
  • Suppose the assigned values to be true
  • Collect probability-weighted counts for all word translation pairs
  • Re-estimate the probabilities for every pair

Iterate until convergence

In IBM's first efforts (IBM Model 1)

This approach performed well enough to launch a revolution

25. Expectation maximisation: simplified example

[Modelled on J&M section 25.6.1 (q.v.) (in turn based on "Knight, K. (1999b). A statistical MT tutorial workbook. Manuscript prepared for the 1999 JHU Summer Workshop.")]

Assume one-to-one word alignments only

So we have

P(A,F|E)=j=1Jt(fj|eaj)

where J is the length of the French sentence, f_j is its j-th word, and e_{a_j} is the English word that f_j aligns to

And just two pairs of sentences: the cat ↔ le chat and black cat ↔ chat noir

Giving the following vocabularies: English = {black, cat, the}, French = {chat, le, noir}

26. EM Example, cont'd

We start with uniform word translation probabilities

t(chat|black) = 1/3    t(chat|cat) = 1/3    t(chat|the) = 1/3
t(le|black)   = 1/3    t(le|cat)   = 1/3    t(le|the)   = 1/3
t(noir|black) = 1/3    t(noir|cat) = 1/3    t(noir|the) = 1/3

Do the Expectation step: first compute the probability of each possible alignment of each sentence:

[Figure: uniform P(a,f|e) for all possible word alignments of the two 2-word sentence pairs, computed from the uniform word translation probabilities]

Normalise P(a,f|e) to get P(a|f,e) for each pair by dividing by the sum of all possible alignments for that pair

[Figure: expected (fractional) counts for each possible word alignment, all equal]

Finally sum the fractional 'counts' for each pair and each source

tcount(chat|black) = 1/2    tcount(chat|cat) = 1/2 + 1/2    tcount(chat|the) = 1/2
tcount(le|black)   = 0      tcount(le|cat)   = 1/2          tcount(le|the)   = 1/2
tcount(noir|black) = 1/2    tcount(noir|cat) = 1/2          tcount(noir|the) = 0
total(black)       = 1      total(cat)       = 2            total(the)       = 1

The maximisation step: normalise the counts to give ML estimates

t(chat|black) = 1/2 / 1 = 1/2    t(chat|cat) = (1/2 + 1/2) / 2 = 1/2    t(chat|the) = 1/2 / 1 = 1/2
t(le|black)   = 0 / 1   = 0      t(le|cat)   = 1/2 / 2         = 1/4    t(le|the)   = 1/2 / 1 = 1/2
t(noir|black) = 1/2 / 1 = 1/2    t(noir|cat) = 1/2 / 2         = 1/4    t(noir|the) = 0 / 1   = 0

All the correct mappings have increased, and some of the incorrect ones have decreased!

Feeding the new probabilities back in, what we now see for each alignment is

[Figure: revised P(a,f|e) for each alignment, now preferring the right answers]

And the right answers have pulled ahead.
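Finally, a short Python sketch that reproduces this worked example: Model-1-style EM restricted, as above, to one-to-one alignments, starting from uniform translation probabilities over the two toy sentence pairs (which are assumed, from the counts above, to be the cat ↔ le chat and black cat ↔ chat noir).

```python
from itertools import permutations
from collections import defaultdict

# The two toy sentence pairs (assumed from the counts in the example)
corpus = [("the cat".split(), "le chat".split()),
          ("black cat".split(), "chat noir".split())]

e_vocab = {e for es, _ in corpus for e in es}
f_vocab = {f for _, fs in corpus for f in fs}

def alignment_prob(t, es, fs, a):
    """P(a, f | e) = product over French positions j of t(f_j | e_{a_j})."""
    p = 1.0
    for j, i in enumerate(a):
        p *= t[(fs[j], es[i])]
    return p

def em_iteration(t):
    """One EM iteration under the one-to-one alignment assumption."""
    counts = defaultdict(float)    # fractional counts for (f, e) pairs
    totals = defaultdict(float)    # fractional counts for each e
    for es, fs in corpus:
        # E-step: score every one-to-one alignment (a permutation of the
        # English positions), then normalise to get P(a | e, f)
        alignments = list(permutations(range(len(es))))
        scores = [alignment_prob(t, es, fs, a) for a in alignments]
        z = sum(scores)
        for a, score in zip(alignments, scores):
            weight = score / z
            for j, i in enumerate(a):
                counts[(fs[j], es[i])] += weight
                totals[es[i]] += weight
    # M-step: re-estimate t(f | e) from the fractional counts
    return {(f, e): counts[(f, e)] / totals[e]
            for f in f_vocab for e in e_vocab}

# Start from uniform t(f | e), as in the example, and do one iteration
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}
t = em_iteration(t)
print(t[("chat", "cat")], t[("le", "cat")], t[("le", "the")])   # 0.5 0.25 0.5
```

Running further iterations drives the correct pairs' probabilities higher still, exactly as the final figure suggests.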