1. Parsing: What and why
Parsing means determining how a grammar accepts
a sentence
- The details will depend on the nature of the grammar
- For many formalisms, a parse tree is a good way of
demonstrating a parse
What practical reasons are there for parsing?
At least three:
- To support a compositional semantics
- Not just in NLP: consider expressions in a programming language
- By parsing
x = 3+2/4
- We get an obvious scaffold on which to erect the evaluation
- one which embodies the precedence rules of the language, as the sketch below illustrates
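As an illustration (ours, not from the lecture), here is that scaffold in miniature: the parse of "3+2/4" as a tree, with the division attached lower than the addition, so that bottom-up evaluation applies the precedence rules automatically.

# Toy expression tree for "3+2/4" (our illustration, not the lecture's).
# "/" sits lower in the tree than "+", so evaluating bottom-up
# embodies the precedence rules of the language.

def evaluate(node):
    """Evaluate a tree given as ('op', left, right), or a bare number."""
    if isinstance(node, tuple):
        op, left, right = node
        a, b = evaluate(left), evaluate(right)
        return a + b if op == '+' else a / b
    return node

# "3+2/4" parses as 3 + (2/4), not (3+2)/4
tree = ('+', 3, ('/', 2, 4))
print(evaluate(tree))  # 3.5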
2. Why parse, cont'd
More reasons to parse:
- To eliminate (or at least reduce) ambiguity
- For example, in speech recognition, consider "we just [eh ii t]"
- Bigrams won't do much good at distinguishing "just ate" from "just eight"
- But a parser will rule out "We just eight"
- To identify phrase boundaries for text-to-speech
- To get the intonation (timing, pitch) right
3. Why is parsing hard?
Ambiguity makes parsing (potentially) hard (or, at least, expensive)
Some (local) ambiguity is eliminated as the parse proceeds
- But other (global) ambiguity remains in the finished parses
What makes a parsing algorithm good is how well it contains
the cost of dealing with ambiguity
We will consider two main types of ambiguity
- Part-of-speech ambiguity
- "break/vb the bank" vs. "take a break/nn"
- Structural ambiguity
- "young (men and women)" vs. "(empty cups) and (fag ends)"
These can of course combine, as in the famous "He saw her duck" or "I'm
interested in growing plants"
4. The impact of ambiguity
Examples such as "I'm interested in growing plants" are typical, in that humans often fail to spot the
ambiguity at first
- Our context-driven expectations and/or common sense hide one meaning or another
But machines typically have no way to avoid enumerating all the possible readings
The early so-called "broad coverage" grammars found
thousands of parses for many sentences in the original Wall Street
Journal corpus from the ACL/DCI.
Tagging doesn't help:
- The best machine taggers are somewhere around 95% accurate
- The average sentence length in the WSJ corpus is 25 words
- .95 raised to the 25th power is .28
5. Tagging doesn't help enough, cont'd
Taggers make errors, say 1 time in 20
- In other words, the probability of having a completely accurate tag
sequence for the average WSJ sentence of 25 words is less than 30%
- Even if we could get 99% word tagging accuracy, we'd still be looking
at 78% accuracy for the average sentence (both figures are checked in the sketch below)
- And working in the other direction, Mary Dalrymple has shown that even with perfect tagging, 30% of sentences would see no reduction in ambiguity, because all of their parses share the same tag sequence
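The figures above are easy to check, assuming (as these back-of-envelope numbers do) that per-word tagging errors are independent:

# Probability of a completely correct tag sequence for a 25-word sentence,
# assuming each word is tagged independently with the given accuracy
for accuracy in (0.95, 0.99):
    print(accuracy, round(accuracy ** 25, 2))
# 0.95 -> 0.28, 0.99 -> 0.78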
We'll come back to this when we talk about probabilistic parsing
6. The Chomsky Hierarchy
A reminder of something you looked hard at in INF 2A
- Regular languages
- Regular expressions; Finite-state Automata
- Context-free languages
- (CF) Phrase structure grammars; Pushdown Automata (aⁿbⁿ is not regular)
- Context-sensitive languages
- (CS) PSG; Linear-bounded automata (aⁿbⁿcⁿ is not context-free)
- Recursively enumerable languages
- General rewriting rules; Turing machines
The current consensus is that natural languages are just a
bit more complex than context-free.
- There's a point between CF and CS, sometimes referred to as the indexed languages, which may be the sweet spot
- ww, a reduplication language (i.e. every sentence is a pair of identical
strings over some alphabet), is an indexed language, as the sketch below makes concrete
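To make the example languages tangible, here are membership checks for aⁿbⁿ and ww (our sketch; note that checking membership says nothing about which grammar class is needed to generate these languages):

# Membership checks for the two example languages:
# aⁿbⁿ requires counting; ww requires comparing two halves.

def is_anbn(s):
    """True iff s consists of n a's followed by n b's, for some n >= 0."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s[:n] == 'a' * n and s[n:] == 'b' * n

def is_ww(s):
    """True iff s is some string w repeated twice (reduplication)."""
    n = len(s) // 2
    return len(s) % 2 == 0 and s[:n] == s[n:]

assert is_anbn('aaabbb') and not is_anbn('aabbb')
assert is_ww('abcabc') and not is_ww('abcabd')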
We'll start with context-free grammars, and their parsers, as they cover
almost all the grammatical phenomena of natural languages
7. Context-free phrase structure grammars
We'll skip the formalities
- And just use the standard rule notation
- J&M Chapter 12 section 12.2.1 has the full story
We'll always use S for the start symbol
Capitalised or camel-case for non-terminals and pre-terminals
- pre-terminals are a common convention for natural
language CF-PSGs: they are the lexical categories
- That is, their expansions are always a single terminal
And lower-case for terminals
And we'll use some obvious abbreviations using vertical bar:
- NT → ... | ... | ...
- PT → term1 | term2 | term3
8. CF-PSG example
Here's enough to get both readings of "he saw her duck"
- S → NP VP
- NP → D N | Pro | PropN
- D → PosPro | Art | NP 's
- VP → Vi | Vt NP | Vp NP VP
- Pro → i | we | you | he | she | him | her
- PosPro → my | our | your | his | her
- PropN → Robin | Jo
- Art → a | an | the
- N → cat | dog | duck | park | telescope | bench
- Vi → sleep | run | duck
- Vt → eat | break | see | saw
- Vp → see | saw | heard
Parse 1 (saw as Vt, "her duck" as an NP):
[S [NP [Pro he]] [VP [Vt saw] [NP [D [PosPro her]] [N duck]]]]
Parse 2 (saw as Vp, "her" as a pronoun NP, "duck" as a VP):
[S [NP [Pro he]] [VP [Vp saw] [NP [Pro her]] [VP [Vi duck]]]]
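For later reference, here is one way this grammar might be encoded in Python; the dictionary representation is our own choice, not anything from the lecture.

# The example grammar as a Python dict. Keys are non-terminals; each
# value lists the possible right-hand sides. Strings that are not keys
# of the dict are terminals (words, plus the possessive clitic 's).
GRAMMAR = {
    'S':      [['NP', 'VP']],
    'NP':     [['D', 'N'], ['Pro'], ['PropN']],
    'D':      [['PosPro'], ['Art'], ['NP', "'s"]],
    'VP':     [['Vi'], ['Vt', 'NP'], ['Vp', 'NP', 'VP']],
    'Pro':    [['i'], ['we'], ['you'], ['he'], ['she'], ['him'], ['her']],
    'PosPro': [['my'], ['our'], ['your'], ['his'], ['her']],
    'PropN':  [['Robin'], ['Jo']],
    'Art':    [['a'], ['an'], ['the']],
    'N':      [['cat'], ['dog'], ['duck'], ['park'], ['telescope'], ['bench']],
    'Vi':     [['sleep'], ['run'], ['duck']],
    'Vt':     [['eat'], ['break'], ['see'], ['saw']],
    'Vp':     [['see'], ['saw'], ['heard']],
}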

9. Parsing as search: top-down and bottom-up
We can think of parsing as a search problem:
- Search the space of possible parse trees in an orderly fashion
- Until the right answer is found
- Or the first answer
- Or all answers
- Or the best answer
The top-down search space for our grammar starts at S and branches on every rule that could expand the current non-terminal
We can search the top-down space breadth-first or depth-first:
- Breadth-first, we can abandon a hypothesis once it grows longer than the input
- Depth-first, we abandon a hypothesis as soon as a predicted word fails to match the input
Loops in the grammar cause problems! Here NP → D N and D → NP 's make NP indirectly left-recursive, so a naive top-down parser can keep expanding without ever consuming input
The bottom-up search space can also be searched either breadth-first or 'height'-first
10. Brute-force parsing: recursive descent
Recursive descent parsing explores the search space top-down and,
usually, depth-first
It's trivial to implement, as the sketch below shows
- But very slow to find the answer, since the same constituents get re-derived again and again
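A minimal recursive descent parser in Python (our sketch, not the lecture's code): it searches top-down and depth-first, and yields every complete parse. To keep it terminating, the grammar below is a subset of the one on slide 8 that omits the left-recursive rule D → NP 's; with that rule included, the depth-first search would recurse forever.

# Recursive descent: top-down, depth-first search over the rules.
# The left-recursive rule D -> NP 's is omitted, or the parser would loop.
GRAMMAR = {
    'S':      [['NP', 'VP']],
    'NP':     [['D', 'N'], ['Pro']],
    'D':      [['PosPro']],
    'VP':     [['Vi'], ['Vt', 'NP'], ['Vp', 'NP', 'VP']],
    'Pro':    [['he'], ['she'], ['him'], ['her']],
    'PosPro': [['his'], ['her']],
    'N':      [['duck'], ['cat']],
    'Vi':     [['duck'], ['run']],
    'Vt':     [['saw'], ['see']],
    'Vp':     [['saw'], ['heard']],
}

def parse(cat, words, i):
    """Yield (tree, next_position) for every way cat can cover words[i:]."""
    for rhs in GRAMMAR.get(cat, []):
        for children, j in match(rhs, words, i):
            yield [cat] + children, j

def match(symbols, words, i):
    """Yield (child_trees, next_position) for every way symbols match words[i:]."""
    if not symbols:
        yield [], i
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:                        # non-terminal: recurse
        for tree, j in parse(first, words, i):
            for trees, k in match(rest, words, j):
                yield [tree] + trees, k
    elif i < len(words) and words[i] == first:  # terminal: must match input
        for trees, j in match(rest, words, i + 1):
            yield [first] + trees, j

sentence = 'he saw her duck'.split()
for tree, end in parse('S', sentence, 0):
    if end == len(sentence):                    # keep complete parses only
        print(tree)
# Prints both readings: saw as Vt with NP "her duck",
# and saw as Vp with NP "her" and VP "duck"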
11. Required reading
Jurafsky & Martin, second edition, Chapter 12 sections
12.1–12.3, 12.6; Chapter 13 sections 13.1–13.2