FNLP 2014: Lecture 11: Grammar and parsing

Henry S. Thompson
25 February 2014
Creative Commons Attribution-ShareAlike

1. Parsing: What and why

Parsing means determining whether, and how, a grammar accepts a sentence

What practical reasons are there for parsing?

At least three:

  1. To support a compositional semantics
    • Not just in NLP: consider expressions in a programming language
    • By parsing x = 3+2/4
      • [Parse tree for x = 3+2/4, showing the effect of precedence]
    • We get an obvious scaffold on which to erect the evaluation
    • which embodies the precedence rules of the language (see the sketch below)
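
As a quick, hedged illustration (using Python's standard-library ast module, not anything from the lecture), we can see the precedence-shaped scaffold directly: the division groups before the addition does.

    import ast

    # Parse the assignment and dump the tree for its right-hand side.
    # Division binds tighter than addition, so 2/4 groups first.
    tree = ast.parse('x = 3+2/4')
    print(ast.dump(tree.body[0].value))
    # Prints (roughly):
    # BinOp(left=Constant(value=3), op=Add(),
    #       right=BinOp(left=Constant(value=2), op=Div(),
    #                   right=Constant(value=4)))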

2. Why parse, cont'd

More reasons to parse:

  2. To eliminate (or at least reduce) ambiguity
    • For example, in speech recognition, consider "we just [eh ii t]"
    • Bigrams won't do much good at distinguishing "just ate" from "just eight"
    • But a parser will rule out "We just eight", which has no verb
  3. To identify phrase boundaries for text-to-speech
    • To get the intonation (timing, pitch) right

3. Why is parsing hard?

Ambiguity makes parsing (potentially) hard (or, at least, expensive)

Some (local) ambiguity is eliminated by subsequent context, but global ambiguity remains

What makes a parsing algorithm good is how well it contains the cost of dealing with ambiguity

We will consider two main types of ambiguity: lexical (part-of-speech) ambiguity and structural (attachment) ambiguity

These can of course combine, as in the famous "He saw her duck" or "I'm interested in growing plants"

4. The impact of ambiguity

Examples such as "I'm interested in growing plants" are typical, in that humans often fail to spot the ambiguity at first

But machines typically cannot avoid enumerating all the possible readings

The early so-called "broad coverage" grammars found thousands of parses for many sentences in the original Wall Street Journal corpus from the ACL/DCI.

Tagging doesn't help: even with correct part-of-speech tags, structural ambiguity remains

5. Tagging doesn't help enough, cont'd

Taggers make errors, say 1 time in 20: at 95% per-word accuracy, the chance of a 20-word sentence being tagged entirely correctly is only 0.95²⁰ ≈ 0.36

We'll come back to this when we talk about probabilistic parsing

6. The Chomsky Hierarchy

A reminder of something you looked hard at in INF 2A

Regular languages
  Regular expressions; Finite-state Automata
Context-free languages
  (CF) Phrase structure grammars; Pushdown Automata (aⁿbⁿ is not regular; see the sketch after this list)
Context-sensitive languages
  (CS) PSG; Linear-bounded automata (aⁿbⁿcⁿ is not context-free)
Recursively enumerable languages
  General rewriting rules; Turing machine
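
As a minimal sketch of the first boundary: recognising aⁿbⁿ needs unbounded memory (here a stack, as in a pushdown automaton), which no finite-state automaton has. The function below is illustrative only, not from the lecture.

    def is_anbn(s):
        """Recognise the context-free language a^n b^n (n >= 0)."""
        stack = []
        i = 0
        while i < len(s) and s[i] == 'a':   # push one symbol per 'a'
            stack.append('a')
            i += 1
        while i < len(s) and s[i] == 'b':   # pop one symbol per 'b'
            if not stack:
                return False                # more b's than a's
            stack.pop()
            i += 1
        return i == len(s) and not stack    # input consumed, counts match

    print(is_anbn('aaabbb'))  # True
    print(is_anbn('aabbb'))   # False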

The current consensus is that the natural languages are just a bit more complex than context-free.

We'll start with context-free grammars, and their parsers, as they cover almost all the grammatical phenomena of natural languages

7. Context-free phrase structure grammars

We'll skip the formalities

We'll always use S for the start symbol

Capitalised or camel-case for non-terminals and pre-terminals

And lower-case for terminals

And we'll use some obvious abbreviations using vertical bar: for example, NP → Det N | Pro collapses the two rules NP → Det N and NP → Pro into one line

8. CF-PSG example

Here's enough to get both readings of "he saw her duck"

Parse 1

[Parse tree with 'her duck' as an NP]

Parse 2

[Parse tree with 'her duck' as an NP plus complement VP]
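
The slide's rule set isn't reproduced above, so here is a minimal sketch of a grammar that yields both parses, written with NLTK (which this course uses elsewhere); the particular rules are my assumption, not necessarily the slide's.

    import nltk

    # A toy grammar with the lexical ambiguity of 'her' (Pro or Det)
    # and 'duck' (N or V) built in; the rules are illustrative.
    grammar = nltk.CFG.fromstring("""
        S -> NP VP
        NP -> Pro | Det N
        VP -> V | V NP | V NP VP
        Pro -> 'he' | 'her'
        Det -> 'her'
        N -> 'duck'
        V -> 'saw' | 'duck'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse('he saw her duck'.split()):
        print(tree)
    # Two trees, one per reading:
    # (S (NP (Pro he)) (VP (V saw) (NP (Det her) (N duck))))
    # (S (NP (Pro he)) (VP (V saw) (NP (Pro her)) (VP (V duck))))

Note the vertical-bar abbreviations from the previous section in action.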

9. Parsing as search: top-down and bottom-up

We can think of parsing as a search problem: the grammar defines the space of possible trees, and we search it for trees whose leaves match the input

The top-down search space looks like this for our grammar:

[Four layers of the top-down search space, depths 1–4, the last only partial]

We can search the top-down space breadth-first:

[The same figure, outlining each layer]

Or depth-first:

[The same figure, outlining the first tree in each layer]

We stop expanding a hypothesis breadth-first when its yield gets longer than the input

We stop expanding a hypothesis depth-first when its predicted words mismatch the input

Loops in the grammar cause problems! (A left-recursive rule such as NP → NP PP sends a top-down depth-first parser into an infinite loop.)

The bottom-up search space can also be searched either breadth-first or 'height'-first
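
One way the bottom-up direction might look in code: a greedy shift-reduce recogniser. The rules below are illustrative assumptions, and greediness is a real limitation: a genuinely ambiguous grammar needs backtracking or dynamic programming, which this sketch omits.

    # Bottom-up: shift words onto a stack, reduce whenever the top of
    # the stack matches the right-hand side of a rule.
    RULES = [
        ('S',   ['NP', 'VP']),
        ('NP',  ['Pro']),
        ('NP',  ['Det', 'N']),
        ('VP',  ['V', 'NP']),
        ('Pro', ['he']),
        ('Det', ['her']),
        ('N',   ['duck']),
        ('V',   ['saw']),
    ]

    def shift_reduce(words):
        stack, buffer = [], list(words)
        while True:
            for lhs, rhs in RULES:          # reduce if any RHS matches
                if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                    stack[-len(rhs):] = [lhs]
                    break
            else:
                if buffer:
                    stack.append(buffer.pop(0))   # otherwise shift a word
                else:
                    return stack == ['S']   # success iff we built an S

    print(shift_reduce('he saw her duck'.split()))  # True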

10. Brute-force parsing: recursive descent

Recursive descent parsing explores the search space top-down and, usually, depth-first

It's trivial to implement, as the sketch below suggests
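
Here's roughly what that might look like: a top-down, depth-first recogniser over a toy grammar (the dictionary is an assumption standing in for the lecture's rules), with generators supplying the backtracking.

    # Each nonterminal is expanded recursively, depth-first; terminals
    # must match the input. Generators backtrack over alternatives.
    GRAMMAR = {
        'S':   [['NP', 'VP']],
        'NP':  [['Pro'], ['Det', 'N']],
        'VP':  [['V', 'NP'], ['V', 'NP', 'VP'], ['V']],
        'Pro': [['he'], ['her']],
        'Det': [['her']],
        'N':   [['duck']],
        'V':   [['saw'], ['duck']],
    }

    def expand(symbol, words, i):
        """Yield every position reachable by deriving `symbol` from i."""
        if symbol not in GRAMMAR:               # terminal: must match input
            if i < len(words) and words[i] == symbol:
                yield i + 1
            return
        for rhs in GRAMMAR[symbol]:             # try each rule in turn
            yield from expand_seq(rhs, words, i)

    def expand_seq(symbols, words, i):
        if not symbols:
            yield i
            return
        for j in expand(symbols[0], words, i):  # first symbol, then the rest
            yield from expand_seq(symbols[1:], words, j)

    words = 'he saw her duck'.split()
    print(sum(j == len(words) for j in expand('S', words, 0)))  # 2 analyses

Note that a left-recursive rule such as NP → NP PP would send expand into an infinite loop: the problem flagged above.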

11. Required reading

Jurafsky & Martin, second edition, Chapter 12 sections 12.1–12.3, 12.6; Chapter 13 sections 13.1–13.2

12. Administrative details

Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/

Do your own work yourself

Notices will come via the mailing list

2nd assignment is out