Accelerated Natural Language Processing 2016


Lecture 17: Words in PCFGs, collocations and mutual information

Henry S. Thompson
7 November 2016
Creative Commons Attribution-ShareAlike

1. Another big problem with simple PCFGs

For a given structural ambiguity, say PP attachment, a simple PCFG will always prefer the same structure, whatever the actual words involved

Consider the two alternative parses we would get from the Treebank grammar for Mr Vinken is chairman of Elsevier:

2. PCFG example, cont'd

How did we get those two analyses?

[Diagram: the two alternative parse trees, one attaching the PP 'of Elsevier' inside the NP headed by 'chairman', the other attaching it to the VP headed by 'is']

So the only difference is the probabilities of the rules highlighted in red

And those will always give the same answer

We really need to pay attention to the probability of those words being tightly or loosely connected
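
To see why, here is a toy illustration (all the numbers are invented, not estimated from the Treebank): the rules the two parses share cancel out, so the comparison comes down to the two attachment rules alone, and the words never enter into it.

# Illustrative (invented) rule probabilities; only the two attachment rules
# differ between the two parses, so everything else cancels in the comparison.
shared = 0.001            # product of the rules the two parses have in common
p_np_attachment = 0.30    # e.g. P(NP -> NP PP)
p_vp_attachment = 0.20    # e.g. P(VP -> VBZ NP PP)

p_parse_np = shared * p_np_attachment
p_parse_vp = shared * p_vp_attachment

# The comparison never looks at the words, so NP attachment 'wins' for
# every sentence with this shape, whether or not that is the right answer
print(p_parse_np > p_parse_vp)   # True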

3. Paying attention to words

Improving our approach to probabilistic grammar requires paying more attention to individual words

Back to bigrams, but in a little more detail

Here are the top 10 bigrams from Herman Melville's famous American novel Moby Dick:

Aside from 'the whale', these are all made up of very high-frequency closed-class items

The highest bigram of open-class items doesn't come until position 27: sperm whale, with frequency 182

Nonetheless it feels as if there's something particularly interesting about that one...
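
A minimal sketch of how such counts might be made with the standard library, assuming tokens is a list of lower-cased word tokens from the novel (the tiny token list here just stands in for the full text):

from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))

# A tiny stand-in for the full token list of the novel
tokens = "the whale and the sperm whale and the white whale".split()
f = bigram_counts(tokens)
print(f.most_common(3))   # the two pairs with count 2 come out on top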

4. Collocations

"You shall know a word by the company it keeps" (J. R. Firth)

One of the things we evidently know about our language is what words go with what

Choosing the right word from among a set of synonyms is a common problem for second-language learners

Or consider the old (linguists') joke:

5. What measure to find collocations?

The name for an 'interesting' pair is collocation

How can we separate the interesting pairs from the dull ones?

We could try just throwing out the 'little words'

Some of these feel special (right whale, moby dick), but others (old man in particular) just seem ordinary
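
One way to 'throw out the little words' is to drop any bigram containing a word from a stop list; a minimal sketch, reusing the bigram idea above (the stop list here is just a small illustrative set, not a serious inventory of closed-class items):

from collections import Counter

STOP = {'the', 'of', 'and', 'a', 'to', 'in', 'that', 'his', 'it', 'i'}

def open_class_bigrams(tokens):
    """Count only those bigrams in which neither word is on the stop list."""
    pairs = zip(tokens, tokens[1:])
    return Counter(p for p in pairs
                   if p[0] not in STOP and p[1] not in STOP)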

6. Normalising by expectation: mutual information

What we want is some way of factoring in frequency more generally

Conditional and joint probability are the answer

One way of getting at our intuition might be to say we're looking for cases where the two probabilities are not independent

Now the bigram frequency gives us an MLE of the joint probability directly

So the ratio of that probability, to what it would be if they were independent, would be illuminating:

Pointwise mutual information: log2( P(X,Y) / (P(X) P(Y)) )

Terminology note: Strictly speaking we should distinguish between pointwise mutual information and mutual information as such. The latter is a measure over distributions, as opposed to individuals.

7. Mutual information example

Let's compare the most frequent bigram (of the) with the first interesting one we saw, sperm whale

>>> from math import log   # for the log(..., 2) calls below
>>> f[('of', 'the')]
1879
>>> u['of']
6609
>>> u['the']
14431
>>> 1879.0/218360
0.0086050558710386513
>>> (6609.0 * 14431)/(218361*218361)
0.0020002396390988637
>>> log((1879.0/218360)/((6609.0 * 14431)/(218361*218361)),2)
2.105011706733956

>>> f[('sperm','whale')]
182
>>> u['sperm']
244
>>> u['whale']
1226
>>> 182.0/218361
0.00083348216943501818
>>> (244.0*1226)/(218361*218361)
6.2737924534150313e-06
>>> log((182.0/218361)/((244.0*1226)/(218361*218361)),2)
7.0536697225202696

Simply put, the mutual information between sperm and whale is 5 binary orders of magnitude greater than that between of and the

Why are we using log base 2?
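
Wrapped up as a function, the same calculation looks like this (a minimal sketch, assuming the bigram counts f, unigram counts u and token total from the session above):

from math import log

def pmi(pair, f, u, n):
    """Pointwise mutual information of a word pair.

    f: bigram counts, u: unigram counts, n: total number of tokens."""
    p_xy = f[pair] / float(n)        # MLE of the joint probability
    p_x = u[pair[0]] / float(n)      # MLEs of the two unigram probabilities
    p_y = u[pair[1]] / float(n)
    return log(p_xy / (p_x * p_y), 2)

# e.g. pmi(('sperm', 'whale'), f, u, 218361) comes out at roughly 7.05, as above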

8. Collocations and machine translation

We can use the same approach to build a translation lexicon

Instead of bigrams within a single text, we count word pairs across aligned sentence pairs in a parallel corpus, here French and English (a sketch of the counting step appears after the list below):

[((u'commission', u'commission'), 113),
 ((u'rapport', u'report'), 84),
 ((u'régions', u'regions'), 71),
 ((u'parlement', u'parliament'), 66),
 ((u'politique', u'policy'), 62),
 ((u'voudrais', u'like'), 58),
 ((u'président', u'president'), 57),
 ((u'fonds', u'funds'), 52),
 ((u'monsieur', u'president'), 50),
 ((u'union', u'union'), 48),
 ((u'états', u'states'), 46),
 ((u'membres', u'member'), 46),
 ((u'états', u'member'), 46),
 ((u'développement', u'development'), 44),
 ((u'membres', u'states'), 43),
 ((u'également', u'also'), 42),
 ((u'structurels', u'structural'), 41),
 ((u'fonds', u'structural'), 41),
 ((u'structurels', u'funds'), 40),
 ((u'cohésion', u'cohesion'), 38),
 ((u'voudrais', u'would'), 38),
 ((u'européenne', u'european'), 37),
 ((u'orientations', u'guidelines'), 37),
 ((u'commission', u'would'), 36),
 ((u'madame', u'president'), 34),
 ((u'groupe', u'group'), 33),
 ((u'commissaire', u'commissioner'), 33),
 ((u'présidente', u'president'), 32),
 ((u'sécurité', u'safety'), 32),
 ((u'transports', u'transport'), 30)]

Pretty good

And it would be better if we had done monolingual collocation detection first, so that pairs like fonds structurels / structural funds could be matched up as single units!
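
Here is a minimal sketch of the counting step behind a table like this, assuming pairs is a list of sentence-aligned (French, English) pairs, each side already tokenized and lower-cased; in this sketch each word pair is counted once per aligned sentence pair, and very frequent function words are assumed to have been filtered out beforehand:

from collections import Counter
from itertools import product

def cooccurrence_counts(pairs):
    """Count (French word, English word) pairs occurring in aligned sentences."""
    c = Counter()
    for fr_sent, en_sent in pairs:
        # every French word co-occurs with every English word
        # in the same aligned sentence pair, counted once each
        c.update(product(set(fr_sent), set(en_sent)))
    return c

# cooccurrence_counts(pairs).most_common(30) gives a list like the one above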

9. Words, heads and grammar

We mentioned the idea of the head of a constituent in earlier lectures

Approaches to grammar which focus on heads are called dependency grammars

The standard form of diagram shows where this name comes from:

'Mr Vinken is chairman of Elsevier' with arcs showing e.g. 'Vinken' and 'chairman' depending on 'is'

(Green for the preferred attachment, red for the less likely one)

10. Dependency grammar

Dependency grammars don't have rules in the way that a CFG does

A given approach to dependency grammar will also involve an inventory of relations

Dependency graphs are not required to avoid crossing arcs (such graphs are called non-projective)
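
For concreteness, here is one way the preferred analysis of the example sentence might be written down, as (head, relation, dependent) triples; the relation labels are purely illustrative, not drawn from any particular inventory:

# 'Mr Vinken is chairman of Elsevier', preferred (green) attachment;
# relation labels are made up for illustration
deps = [
    ('is',       'subj', 'Vinken'),
    ('Vinken',   'mod',  'Mr'),
    ('is',       'comp', 'chairman'),
    ('chairman', 'mod',  'of'),       # the PP attaches to 'chairman', not to 'is'
    ('of',       'obj',  'Elsevier'),
]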

11. Lexicalised PCFG

It's possible to add some of the benefits of dependency grammar to PCFGs, by annotating each non-terminal with the head word of its constituent, e.g. NP(chairman) or VP(is)

We can't do statistics directly on these augmented categories: there are far too many of them for reliable counts (see the rough illustration below)

But a range of techniques has been developed in the last few years to work around this
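
A back-of-the-envelope illustration of the problem, with invented sizes (neither number is taken from the Treebank): annotating every non-terminal with a head word multiplies the number of categories by the size of the vocabulary.

# Rough illustration of the blow-up from lexicalisation (invented sizes)
n_nonterminals = 30        # e.g. NP, VP, PP, ...
vocabulary = 40000         # distinct word types in a treebank-sized corpus
print(n_nonterminals * vocabulary)   # 1,200,000 head-annotated categories,
                                     # far too many for reliable counts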

12. Putting words first: Categorial grammar

Categorial grammar represents a different approach to putting words at the centre of things

In the simplest form of CG, all we need is a lexicon like this:

N: duck, cat, ...
NP/N: the, a, ...
S\NP: ran, slept, ...
(S\NP)/NP: saw, liked

Where we read e.g. NP/N as the category for things which combine with an N to their right to produce an NP

In the obvious way this gives us the following derivation for the cat saw a duck:

derivation diagram from the above grammar, with two interior NP nodes and one S\NP

The arrows next to the derivation steps identify which of the (meta-)rules was used for that step:

forward combination (>)
X/Y Y → X
backward combination (<)
Y X\Y → X

These are the only two rules, or rule schemata, needed for the simplest categorial grammars
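
The whole thing is small enough to run: here is a minimal sketch of the two schemata plus the lexicon above, with categories written as nested tuples rather than slash notation, and a simple CKY-style chart to do the search (an illustration of the combination rules only, not a full categorial-grammar parser):

# Categories: atomic ones are strings, complex ones are (result, slash, argument)
LEXICON = {
    'duck': ['N'], 'cat': ['N'],
    'the': [('NP', '/', 'N')], 'a': [('NP', '/', 'N')],
    'ran': [('S', '\\', 'NP')], 'slept': [('S', '\\', 'NP')],
    'saw': [(('S', '\\', 'NP'), '/', 'NP')], 'liked': [(('S', '\\', 'NP'), '/', 'NP')],
}

def combine(left, right):
    """The two schemata: forward (>) X/Y Y -> X, backward (<) Y X\\Y -> X."""
    out = []
    if isinstance(left, tuple) and left[1] == '/' and left[2] == right:
        out.append(left[0])              # forward combination
    if isinstance(right, tuple) and right[1] == '\\' and right[2] == left:
        out.append(right[0])             # backward combination
    return out

def parse(words):
    """CKY over categories: chart[i][j] holds the categories spanning words[i..j]."""
    n = len(words)
    chart = [[set() for _ in range(n)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i].update(LEXICON[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for x in chart[i][k]:
                    for y in chart[k + 1][j]:
                        chart[i][j].update(combine(x, y))
    return chart[0][n - 1]

print(parse('the cat saw a duck'.split()))   # {'S'}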