In this tutorial our aim is to explore one approach (described in Zhang et al. 2003 "Integrated Phrase Segmentation and Alignment Algorithm for Statistical Machine Translation") to using a sentence-aligned bilingual corpus to automatically detect word- and phrase-level translation pairs, based on Mutual Information.
We will use a small part of the Europarl corpus, as provided by NLTK, as the data for our work:
"The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 21 European languages: . . . French, . . . English, . . . German.
"The goal of the extraction and processing was to generate sentence aligned text for statistical machine translation systems. . . We sentence aligned the data using a tool based on the Church and Gale algorithm."
NLTK makes available as the comtrans corpus a subset of Europarl's sentence-aligned English/French, German/English and German/French data, approximately 33,000 sentence pairs in each case.
Last week we looked at MI between words in Melville's novel Moby Dick. Recall that MI is a measure of the affinity between two words, computed by comparing their joint probability with the product of their individual probabilities:

$\mathrm{MI}(x,y) = \log_2 \frac{P(x,y)}{P(x)\,P(y)}$
Last week we looked at the MI between adjacent words in a monolingual text, using unigram and bigram frequencies from that text. This week we use three different sources of data: unigram frequencies for the English words, unigram frequencies for the French words, and "bigram" frequencies for English/French word pairs, counted whenever the two words co-occur in an aligned sentence pair. This gives us the following version of MI:

$\mathrm{MI}(e,f) = \log_2 \frac{P(e,f)}{P(e)\,P(f)}$

where each probability is a maximum-likelihood estimate from the corresponding frequency table. This will enable us to detect at least word pairs which are likely candidates for being translations of one another.
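For example, with made-up numbers: if P(e) = P(f) = 0.01 and P(e,f) = 0.001, then $\mathrm{MI}(e,f) = \log_2(0.001/0.0001) = \log_2 10 \approx 3.32$ bits -- the pair co-occurs ten times more often than independence would predict.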
Before launching Python, we need the usual PYTHONPATH setting:
export PYTHONPATH=/group/ltg/projects/fnlp
python
Set up the Python context as follows:
from nltk.corpus.util import LazyCorpusLoader
from nltk.corpus.reader import AlignedCorpusReader
from isa import ISA
comtrans = LazyCorpusLoader(
    'comtrans', AlignedCorpusReader, r'(?!\.).*\.txt',
    encoding='iso-8859-1')
Now read in the English/French data and have a look around:
fe=comtrans.aligned_sents('alignment-en-fr.txt')
len(fe)        # number of sentence pairs
fe[0]          # the first AlignedSent
fe[0].words    # its English tokens
fe[0].mots     # its French tokens
help(fe[0])    # what else can an AlignedSent do?
Next, we tabulate the unigram frequencies we need:
from nltk import FreqDist, MLEProbDist
eu=FreqDist(w.lower() for s in fe for w in s.words if w[0].isalpha())   # English unigram counts
eu
eu.items()[:10]
eup=MLEProbDist(eu)   # maximum-likelihood estimate of P(e)
eup
fu=FreqDist(w.lower() for s in fe for w in s.mots if w[0].isalpha())    # French unigram counts
fu
fu.items()[:10]
fup=MLEProbDist(fu)   # maximum-likelihood estimate of P(f)
fup
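(A quick sanity check -- these two commands are illustrative additions, not from the original session. Under a maximum-likelihood estimate the probabilities of all the observed words should sum to one:)
eup.prob('the')                 # relative frequency of 'the' on the English side
sum(eup.prob(w) for w in eu)    # should come out as (essentially) 1.0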
And, a bigger task: a brute-force (is this really right?) creation of the bilingual bigrams:
efb=FreqDist()
for s in fe:
    for ew in s.words:
        if ew[0].isalpha():
            for fw in s.mots:
                if fw[0].isalpha():
                    efb.inc((ew.lower(),fw.lower()))   # count every English/French pair in this sentence pair
efb
efb.items()[:10]
efbp=MLEProbDist(efb)   # maximum-likelihood estimate of P(e,f)
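(An aside, in case you are running a current NLTK under Python 3 rather than the version assumed here: FreqDist.inc() has been removed -- FreqDist is now a subclass of Counter -- and items() returns a view rather than a list, so the equivalents would look something like this:)
efb = FreqDist()
for s in fe:
    for ew in s.words:
        if ew[0].isalpha():
            for fw in s.mots:
                if fw[0].isalpha():
                    efb[(ew.lower(), fw.lower())] += 1   # replaces efb.inc(...)
list(efb.items())[:10]                                   # items() is now a view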
(Notice the [0].isalpha() check -- the French tokeniser produces e.g. c' est for c'est, and we don't want to lose the c'.)
This will take a while -- while you're waiting, bring up /group/ltg/projects/fnlp/isa.py and look at the definition of the pmi method, compared to the equation above.
Let's see what we've got:
eup.logprob('the')            # log2 P('the')
fup.logprob('le')             # log2 P('le')
efbp.logprob(('the','le'))    # log2 P('the','le')
def pmi(e,f,ep,fp,bp):
    # log2 P(e,f) - (log2 P(e) + log2 P(f)): pointwise MI in bits
    return bp.logprob((e,f))-(ep.logprob(e)+fp.logprob(f))
pmi('the','le',eup,fup,efbp)
pmi('we','nous',eup,fup,efbp)
pmi('book','livre',eup,fup,efbp)
pmi('book','nous',eup,fup,efbp)
We've defined a local copy of pmi so we can easily use it from the command line.
If we imagine an array with the source sentence along the bottom and its aligned target along the left, Zhang's ISA algorithm starts with such a matrix, with each cell containing the relevant pair's MI:
fe[0].words
fe[0].mots
sa=ISA(eup,fup,efbp,fe[0])   # build the MI matrix for this sentence pair
sa.show()                    # display it
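(If you'd like to see how such a matrix might be built by hand, here is a minimal sketch using the pmi function we defined above; pmi_matrix is my name, not part of isa.py, and the ISA class keeps its own internal bookkeeping, so treat this as illustrative only:)
def pmi_matrix(asent, ep, fp, bp):
    # One row per French word, one column per English word, each cell
    # holding that pair's MI.  Pairs involving tokens we filtered out of
    # the counts (e.g. punctuation) come out as -inf under these MLE estimates.
    return [[pmi(ew.lower(), fw.lower(), ep, fp, bp)
             for ew in asent.words]
            for fw in asent.mots]
m = pmi_matrix(fe[0], eup, fup, efbp)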
It then iterates (roughly) the following steps; a toy sketch of the loop follows this list:
1. Take the cell with the highest MI not yet covered by any block as a seed.
2. Grow the seed into the largest rectangle such that every cell added has an MI close to the seed's, and which overlaps no existing block.
3. Mark that rectangle as a block, and repeat until no usable seed remains.
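(To make the loop concrete, here is a toy seed-and-grow tiler over a plain MI matrix, such as the one built by pmi_matrix above. It is a sketch of the general flavour only, not Zhang et al.'s algorithm: the tolerance test and the whole-row/column locking are simplifications invented for illustration.)
def tile(m, tol=1.0):
    # m: 2-D list of MI values; returns a list of (rows, cols) blocks.
    nrows, ncols = len(m), len(m[0])
    free_r, free_c = set(range(nrows)), set(range(ncols))
    blocks = []
    while free_r and free_c:
        # 1. Seed: the highest-MI cell among uncovered rows and columns.
        best, si, sj = max((m[i][j], i, j) for i in free_r for j in free_c)
        if best <= 0:
            break                        # nothing promising left
        rows, cols = {si}, {sj}
        # 2. Grow: add an adjacent free row or column all of whose cells
        #    inside the block have MI close to the seed's; repeat to a fixpoint.
        grown = True
        while grown:
            grown = False
            for i in (min(rows) - 1, max(rows) + 1):
                if i in free_r and all(abs(m[i][j] - best) <= tol for j in cols):
                    rows.add(i)
                    grown = True
            for j in (min(cols) - 1, max(cols) + 1):
                if j in free_c and all(abs(m[i][j] - best) <= tol for i in rows):
                    cols.add(j)
                    grown = True
        # 3. Mark the block and take its rows and columns out of play.
        blocks.append((sorted(rows), sorted(cols)))
        free_r -= rows
        free_c -= cols
    return blocks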
We'll try this on fe[0]. Before we start, see if you can locate the first seed, and where you think the largest rectangle satisfying the above constraints will be found:
sa.tile()
The status of each cell is displayed as each block is found.
Why didn't the 0 block grow rightwards? Why not upwards?
Let's try a larger sentence pair, fe[19]:
sa.load(fe[19])
sa.show()
sa.tile()
This time we get some rectangles, and at least block 0 looks right, including the two-to-one alignment of "thank you" and "merci".
This dataset is really too small to give very good results; I've been picking the examples carefully so far. fe[5] is less good:
sa.load(fe[5])
sa.show()
sa.tile()
Block 0 looks good, and 4 isn't bad, but 1 clearly misses: it should have included 2 as well...
Full disclosure: as is irritatingly often the case, although at first glance the ISA paper seemed to have all the details necessary to enable me to reimplement it, as I wrote the code I discovered that this wasn't the case. I may therefore not have actually re-constructed the algorithm correctly...
It's interesting to see what happens with the other language pairs -- have a go at using the other two, the German/English or German/French pairs, i.e.
e=comtrans.aligned_sents('alignment-de-en.txt')
or
e=comtrans.aligned_sents('alignment-de-fr.txt')
and then repeat the unigram and bigram tabulations above, etc.
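For instance, for German/English (a sketch: by analogy with the en-fr file, .words should here be the German side and .mots the English side, but check e[0] before relying on that):
e = comtrans.aligned_sents('alignment-de-en.txt')
gu  = FreqDist(w.lower() for s in e for w in s.words if w[0].isalpha())   # German unigrams
gup = MLEProbDist(gu)
eu2  = FreqDist(w.lower() for s in e for w in s.mots if w[0].isalpha())   # English unigrams
eup2 = MLEProbDist(eu2)
and so on through the bilingual bigrams and the ISA tiling.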