FNLP 2014: Lab Session 7: Word/phrase alignment for Machine Translation

Alex Lascarides
Week of 17th March, 2014

1. Building a bilingual (phrasal) lexicon

In this tutorial our aim is to explore one approach (described in Zhang et al. 2003 "Integrated Phrase Segmentation and Alignment Algorithm for Statistical Machine Translation") to using a sentence-aligned bilingual corpus to automatically detect word- and phrase-level translation pairs, based on Mutual Information.

2. The Corpus

We will use as our data a small part of the Europarl corpus, as provided by NLTK:

"The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: . . . French, . . . English, . . . German.

"The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. . . We sentence aligned the data using a tool based on the Church and Gale algorithm."

NLTK makes available as the comtrans corpus a subset of Europarl's sentence-aligned English/French, German/English and German/French data, approximately 33,000 sentence pairs in each case.

3. Reminder: Mutual Information

Last week we looked at MI between words in Melville's novel Moby Dick. Recall that MI is a measure of the affinity between two words, computed by comparing their joint probability with the product of their individual probabilities:

MI(X, Y) = log2( P(X, Y) / (P(X) P(Y)) )

Last week we looked at the MI between adjacent words in a monolingual text, using unigram and bigram frequencies from that text.

This week we use three different sources of data:

  1. Unigram frequencies from a 'source' language (labeled e)
  2. Unigram frequencies from a 'target' language (labeled f)
  3. Bigram frequencies based on pairs of words (e,f) that occur in aligned pairs of sentences

This gives us the following version of MI: MI(e, f) = log2( P(e, f) / (P(e) P(f)) )

This will enable us to detect at least word pairs which are likely candidates for being translations of one another.
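To make the formula concrete, here is a tiny worked example with made-up counts (the numbers are purely illustrative, not taken from comtrans):

```python
from math import log2

# Made-up illustrative counts (NOT from comtrans): 'the' in 700 of 1000
# English tokens, 'le' in 300 of 1000 French tokens, and the pair
# ('the', 'le') in 250 of 1000 aligned-pair observations.
p_e = 700 / 1000    # P(e)
p_f = 300 / 1000    # P(f)
p_ef = 250 / 1000   # P(e, f)

mi = log2(p_ef / (p_e * p_f))
print(round(mi, 3))   # → 0.252
```

The value is positive, meaning the pair co-occurs more often than chance would predict; a pair that co-occurs less often than chance gives a negative value.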

4. Getting Started: Calculate three lots of MI

Before launching Python, we need the usual PYTHONPATH setting:

export PYTHONPATH=/group/ltg/projects/fnlp
python

Set up the python context as follows:

from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import AlignedCorpusReader
from isa import ISA

comtrans = LazyCorpusLoader(
    'comtrans', AlignedCorpusReader, r'(?!\.).*\.txt',
     encoding='iso-8859-1')

Now read in the English/French data and have a look around:

fe=comtrans.aligned_sents('alignment-en-fr.txt')
len(fe)
fe[0]
fe[0].words
fe[0].mots
help(fe[0])

Next, we tabulate the unigram frequencies we need:

from nltk import FreqDist, MLEProbDist
eu=FreqDist(w.lower() for s in fe for w in s.words if w[0].isalpha())
eu
eu.items()[:10]
eup=MLEProbDist(eu)
eup
fu=FreqDist(w.lower() for s in fe for w in s.mots if w[0].isalpha())
fu
fu.items()[:10]
fup=MLEProbDist(fu)
fup

And, a bigger task, a brute-force (is this really right?) creation of the bilingual bigrams:

efb = FreqDist()
for s in fe:
    for ew in s.words:
        if ew[0].isalpha():
            for fw in s.mots:
                if fw[0].isalpha():
                    efb.inc((ew.lower(), fw.lower()))

efb
efb.items()[:10]
efbp=MLEProbDist(efb)

(Notice the [0].isalpha() check -- the French tokeniser produces e.g. c' est for c'est, and we don't want to lose the c'.)
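To see why the check is on the first character rather than the whole token, consider this small illustration (the token list is invented for the example):

```python
# Invented token list for illustration: the French tokeniser splits
# c'est into c' est, and the clitic token c' is not alphabetic as a
# whole string, but its first character is.
tokens = ["c'", "est", ",", "123", "vrai"]

print("c'".isalpha())                         # False: a whole-token test would drop it
kept = [t for t in tokens if t[0].isalpha()]  # first-character test keeps it
print(kept)                                   # punctuation and numbers still filtered
```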

This will take a while -- while you're waiting, bring up /group/ltg/projects/fnlp/isa.py and look at the definition of the pmi method, compared to the equation above.

Let's see what we've got:

eup.logprob('the')
fup.logprob('le')
efbp.logprob(('the','le'))

def pmi(e,f,ep,fp,bp):
 return bp.logprob((e,f))-(ep.logprob(e)+fp.logprob(f))

pmi('the','le',eup,fup,efbp)
pmi('we','nous',eup,fup,efbp)
pmi('book','livre',eup,fup,efbp)
pmi('book','nous',eup,fup,efbp)

We've defined a local copy of pmi so we can easily use it from the command line.

5. Finding phrases

If we imagine an array with a source sentence along the bottom and its aligned target along the left

Zhang's ISA algorithm starts with such a matrix, with each cell containing the relevant pair's MI:

fe[0].words
fe[0].mots
sa=ISA(eup,fup,efbp,fe[0])
sa.show()

It then iterates the following steps:

  1. Find the 'live' cell with the largest MI (call this the seed)
  2. Grow rectangular blocks outwards from that cell, stopping if you hit
    1. Any cell with a negative MI
    2. Any cell which
      1. has a 'significantly' smaller MI than some non-included cell in the same row or column
      2. isn't in the same row or column as the seed
  3. Mark all the cells in the same rows and columns as the 'best' block you just found as 'dead'
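The steps above can be sketched in plain Python. This is my own simplified reading of them, not Zhang et al.'s actual ISA code: blocks grow one row or column at a time while every newly added cell has positive MI, and the 'significantly smaller' test of step 2 is omitted for brevity.

```python
def tile(mi):
    n, m = len(mi), len(mi[0])
    live = [[True] * m for _ in range(n)]
    blocks = []
    while True:
        # Step 1: the live cell with the largest MI is the seed.
        cells = [(i, j) for i in range(n) for j in range(m) if live[i][j]]
        if not cells:
            break
        i0, j0 = max(cells, key=lambda c: mi[c[0]][c[1]])
        if mi[i0][j0] <= 0:
            break
        top, bot, left, right = i0, i0, j0, j0
        # Step 2: grow the block outwards, one row/column at a time,
        # as long as every newly included cell has positive MI.
        grown = True
        while grown:
            grown = False
            if bot + 1 < n and all(mi[bot + 1][j] > 0 for j in range(left, right + 1)):
                bot, grown = bot + 1, True
            if top > 0 and all(mi[top - 1][j] > 0 for j in range(left, right + 1)):
                top, grown = top - 1, True
            if right + 1 < m and all(mi[i][right + 1] > 0 for i in range(top, bot + 1)):
                right, grown = right + 1, True
            if left > 0 and all(mi[i][left - 1] > 0 for i in range(top, bot + 1)):
                left, grown = left - 1, True
        blocks.append((top, bot, left, right))
        # Step 3: mark every cell sharing a row or column with the block dead.
        for i in range(n):
            for j in range(m):
                if top <= i <= bot or left <= j <= right:
                    live[i][j] = False
    return blocks

# A toy 3x3 MI matrix: one block should tile the bottom-right cell,
# another the first column of the top two rows.
toy = [[2.0, -1.0, -1.0],
       [1.5, -0.5, -1.0],
       [-1.0, -1.0, 3.0]]
print(tile(toy))   # → [(2, 2, 2, 2), (0, 1, 0, 0)]
```

Each block is returned as (top, bottom, left, right) row/column indices.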

We'll try this on fe[0]. Before we start, see if you can locate the first seed, and where you think the largest rectangle satisfying the above constraints will be found:

sa.tile()

A display of the status of each cell is shown as each block is found.

Why didn't block 0 grow rightwards? Why not upwards?

Let's try a larger sentence pair, fe[19]:

sa.load(fe[19])
sa.show()
sa.tile()

This time we get some rectangles, and at least block 0 looks right, including the two-to-one of "thank you" and "merci".

This dataset is really too small to give very good results; I've been picking the examples carefully so far. fe[5] is less good:

sa.load(fe[5])
sa.show()
sa.tile()

Block 0 looks good, and 4 isn't bad, but 1 clearly misses: it should have included 2 as well. . .

Full disclosure: as is irritatingly often the case, although at first glance the ISA paper seemed to have all the details necessary to enable me to reimplement it, as I wrote the code I discovered this wasn't so. I may therefore not have actually reconstructed the algorithm correctly. . .

6. Further exploration

It's interesting to see what happens with the other language pairs -- have a go at using the other two, the German/English or German/French pairs, i.e.

e=comtrans.aligned_sents('alignment-de-en.txt')

or

e=comtrans.aligned_sents('alignment-de-fr.txt')

Followed by the unigram and bigram tabulations above, etc.
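If you'll be repeating those tabulations for several language pairs, it may help to package them in a small helper. The sketch below is my own (the function name is invented, not from the lab code) and uses plain Counters as a pure-Python stand-in for the FreqDist steps above:

```python
from collections import Counter

# Hypothetical helper: given sentence pairs as
# (english_tokens, foreign_tokens) tuples, tabulate the unigram and
# bilingual-bigram counts, applying the same first-character
# alphabetic filter and lowercasing as in the session above.
def tabulate(pairs):
    eu, fu, efb = Counter(), Counter(), Counter()
    for ews, fws in pairs:
        ews = [w.lower() for w in ews if w[0].isalpha()]
        fws = [w.lower() for w in fws if w[0].isalpha()]
        eu.update(ews)
        fu.update(fws)
        # every (e, f) pair from an aligned sentence pair counts once
        efb.update((e, f) for e in ews for f in fws)
    return eu, fu, efb

# Tiny worked example with one invented sentence pair:
pairs = [(["We", "agree", "."], ["Nous", "sommes", "d'accord", "."])]
eu, fu, efb = tabulate(pairs)
print(efb[("we", "nous")])   # → 1
```

With a comtrans corpus loaded you could then call something like tabulate((s.words, s.mots) for s in comtrans.aligned_sents('alignment-de-en.txt')) and feed the resulting counts to MLEProbDist as before.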

