FNLP 2014: Lecture 2: Introduction to working with corpora

Henry S. Thompson
17 January 2014
Creative Commons Attribution-ShareAlike

1. Working with corpora: preliminaries

Platform and format are pervasive, irritating issues:

Most early academic work was done with line-oriented data on UN*X

This had the advantage of being fast (~500ms for ~120000 words on a 2GHz machine).

But wasn't very portable

And was very vulnerable to variations in corpus format

[Who uses what platform?]

2. Using NLTK for corpus work

NLTK is an Open Source effort (see http://www.nltk.org/), backed up with a book, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, by Bird, Klein and Loper.

We'll mostly use NLTK to get around the platform and format issues:

It's not as fast as custom C-code, but fast enough for our purposes.

Let's look at our Jane Austen example using NLTK:

from nltk.book import *                  # load the example texts; text2 is Sense and Sensibility
len(text2)                               # number of word tokens
text2[:60]                               # the first 60 tokens
f=FreqDist(text2)                        # frequency distribution over the raw tokens
f
len(f)                                   # number of distinct token types
f.items()[:50]                           # the 50 most frequent types (in NLTK 3, use f.most_common(50))
g=FreqDist(x.lower() for x in text2)     # case-fold before counting
g
len(g)
g.items()[:50]
g.plot(50)                               # plot the frequencies of the top 50 types
h=FreqDist(x.lower() for x in text2 if x[0].lower().isalpha())   # drop tokens starting with punctuation or digits
len(h)
h.items()[:50]
h.plot(50)

This is the first of many examples we'll see of Zipf's Law: the frequency of items in natural language falls off roughly as the inverse of their rank (a power law), so a few items are very common and most are very rare.
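A quick check (a minimal sketch, reusing the FreqDist h from the session above): under Zipf's law, rank times frequency should stay roughly constant.

freqs = sorted(h.values(), reverse=True)                  # frequencies, most frequent first
for rank in (1, 10, 100, 1000):
    print(rank, freqs[rank - 1], rank * freqs[rank - 1])  # rank * frequency stays in the same ballpark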

3. One concrete benefit of NLTK

Note that NLTK and the simple command line tools disagree about the word counts

This is because NLTK does a better job of tokenisation

We can get some idea of where UN*X is going wrong...

egrep -i '[^ a-zA-Z]the ' austen-sense.txt | less   # finds 'the' preceded by punctuation, e.g. a quotation mark

And then improve it:

# split on space, double quote and hyphen; lower-case; then count and sort by frequency
tr -s ' "-' '\012' < austen-sense.txt | tr A-Z a-z | sort \
  | uniq -c | sort -nr | less

That gets us close for 'the', but still way out for 'to'...
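To see the difference concretely, here is a minimal sketch (assuming the NLTK Gutenberg data is installed) comparing counts from a naive whitespace split with counts over NLTK's tokenised version of the same text:

from nltk.corpus import gutenberg

raw = gutenberg.raw('austen-sense.txt')                              # the plain text
naive = [w.lower() for w in raw.split()]                             # whitespace-only "tokens"
tokens = [w.lower() for w in gutenberg.words('austen-sense.txt')]    # NLTK's tokenisation
for w in ('the', 'to'):
    print(w, naive.count(w), tokens.count(w))                        # the naive count misses 'to,', 'to;', '"the', etc.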

4. A very brief history of corpus

The word corpus is just Latin for 'body'.

So a corpus is just a body of text (or speech or ...).

We'll use the word to mean a collection of (possibly annotated) language in machine-processable form.

Virtually all corpora are structured.

That is, they use some conventions to make manifest properties of the language they contain.

Those properties can be separated into those really present in the original and those added by analysis or annotation.

We'll look at a series of examples to see how the representation of corpora has evolved.

5. The earliest corpora: punched cards!

The first corpora were created using punched cards,

An 80-column punchcard

and distributed, if at all, using magnetic tape.

A 1/2 inch mag. tape reel

The 80-column width and lack of lower-case letters meant a lot of workarounds were necessary just to reproduce ordinary typed text, to say nothing of printed material.

6. Early corpora: The Brown Corpus

Produced initially in 1964 by Henry Kucera and Nelson Francis of the Linguistics Department of Brown University:

It's composed of 500 texts of roughly 2000 words each.

Part-of-speech information was added in 1979.

7. Brown corpus example

Here are a few punched card lines from the original Brown corpus (text cf03):


TELEVISION IMPULSES, SOUND WAVES, ULTRA-VIOLET RAYS, ETC**., THAT MAY   1020ElF03
OC*                                                                     1025ElF03
CUPY THE VERY SAME SPACE, EACH SOLITARY UPON ITS OWN FREQUENCY, IS INF  1030ElF03
INITE. *SO WE MAY CONCEIVE THE COEXISENCE OF THE INFINITE NUMBER OF U   1040ElF03
NIVERSAL, APPARENTLY MOMENTARY STATES OF MATTER, SUCCESSIVE ONE AFTER   1050ElF03

This represents the following text:

television impulses, sound waves, ultra-violet rays, etc., that may occupy the very same space, each solitary upon its own frequency, is infinite. So we may conceive the coexistence of the infinite number of universal, apparently momentary states of matter, successive one after another

* and ** are used as escape characters, with a quite complex set of rules for reconstructing the original based on their use.

Note that whereas *SO represents a difference really present in the original (the capital letter of 'So'), the **. coding represents analytical information (the judgement that the full stop belongs to the abbreviation 'etc' rather than ending a sentence).
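Purely as an illustration, here is a minimal sketch that decodes just the two escapes visible above (a single * marking a capital letter, ** marking an abbreviation point); the real Brown coding has many more rules:

import re

def decode_fragment(card_text):
    # illustrative only: handles just the two escapes discussed above
    text = card_text.lower()
    text = re.sub(r'\*\*(?=[.,;:])', '', text)                        # '**.': drop the analytical marker
    text = re.sub(r'\*([a-z])', lambda m: m.group(1).upper(), text)   # '*s': a capital in the original
    return text

print(decode_fragment('ETC**., THAT MAY'))     # -> etc., that may
print(decode_fragment('*SO WE MAY CONCEIVE'))  # -> So we may conceive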

8. Digression: Three phases of corpus origin: Phase 1

In the beginning, all corpus material originated outside computers.

So the computer-based representation came second:

Decisions therefore had to be made about every aspect of the original which could be represented:

  1. Whether to represent it or not;
  2. If so, how to represent it.

Note, for instance, that neither line breaks nor the number of spaces is preserved in the Brown corpus.

The cost of transcription necessarily limited the scale of corpora created in this way.

9. Phases 2 and 3 of corpus origin

The real breakthrough came when computers began to be used to create text for printing

Corpus creation then became a matter of:

  1. Getting access to the designed-for-print representation;
  2. Getting the necessary license to work with and distribute it;
  3. Converting it from a print-appropriate form to one suitable for NLP.

By this time, (reproduction for) distribution was a bit easier

The third and most recent phase came when texts began to be created for online distribution.

Access is no longer a problem, but issues of rights and conversion still arise, just in different forms.

Distribution is by DVD or online.

10. The breakthrough: The ACL/DCI and the Wall Street Journal

The arrival of the second phase was signalled by the release of the ACL/DCI corpus in 1989

The core of the corpus was material derived from the printers' tapes for the Wall Street Journal newspaper.

With over 30 million words, it was a major increase in size over the Brown corpus.

Its widespread availability launched a new way of approaching language processing, called by some at the time "data-intensive".

Started by Mark Liberman and colleagues at Bell Labs

11. Beyond American English: The ECI

Susan Armstrong (Geneva) and Henry S. Thompson (Edinburgh) produced a CD-ROM of non-American English data in 1994

Some parallel multilingual data was included.

The European Corpus Initiative went on to produce a more substantial corpus

12. The British National Corpus

The Brown corpus excepted, efforts up to this point had been largely opportunistic:

The BNC effort, jointly funded by the British government and a consortium of dictionary publishers, changed all that.

The BNC was designed to be a representative corpus of contemporary British English

Originally published in 1995

The spoken part was particularly ambitious

13. The institutions: LDC and ELRA

In the US, out of the ACL/DCI came the Linguistic Data Consortium, funded by NSF and DARPA.

In Europe, the European Union launched the European Language Resources Association in 1995.

Both of these bodies have produced a great deal of material.

14. We have data

Above and beyond what NLTK has

You should explore a bit from the root directory, which is /group/corpora/public

Some examples

15. The revolution in attitude: There's no data like more data

From

My data is my scholarly identity
Our data is our corporate advantage

to

By giving away my/our data, we all win

Bob Mercer, then at IBM, is widely credited with the tag line "There's no data like more data."

This attitude spread first at the level of recordings of spoken material

16. Corpus content: Data vs. metadata

A corpus is more than just the words it consists of.

For many purposes, you need to know a lot more: who produced the language, when, where it was published, in what genre, and all the other things you would expect in a bibliographic record.

And because what you get in the corpus is almost never exactly what was originally published, you need information about how the original, whether electronic or not, was processed to produce the corpus:

We use the name metadata for all this kind of information

17. Markup

Before we can even look at the details of metadata or data, we need some agreed way of writing them down: markup.

Early corpora, e.g. the tagged Brown corpus, used a simple record/field approach, one word per record with fields for the word, its part-of-speech tag and a location reference:

SAID                          VBD       A01001006E1
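A minimal sketch (assuming whitespace-separated fields laid out as in the record above, stored in a hypothetical file brown-tagged.txt) of how such records might be read:

records = []
with open('brown-tagged.txt') as f:          # hypothetical file, one record per line
    for line in f:
        word, tag, location = line.split()   # word, part-of-speech tag, location key
        records.append((word.lower(), tag))
print(records[:5])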

Over the last twenty years, standardised markup languages have slowly taken over:

The form in which corpora are processed is not always the same as their interchange or archival form

But even those processing systems almost always expect to import, export and archive their data in XML

18. What are SGML/XML?

Markup languages used for annotating text / information

Concerned with logical structure

Not concerned with appearance

19. Markup: a simple example

Basic SGML/XML markup is just labelled brackets: a start tag and an end tag carrying the label, with the text content between them (see the sketch below).

An SGML or XML document is underlyingly a tree
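A minimal sketch (with made-up element and attribute names, not drawn from any particular corpus) showing labelled brackets and the tree behind them, using Python's ElementTree:

import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<s n="1"><w pos="PPSS">we</w> <w pos="MD">may</w> <w pos="VB">conceive</w></s>'
)
print(doc.tag, doc.attrib)              # the root element: s {'n': '1'}
for w in doc:                           # its children, in document order
    print(' ', w.tag, w.get('pos'), w.text)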

20. Why SGML/XML?

Use structurally annotated text

21. Administrative details

Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/

Do your own work yourself

Labs will typically involve hands-on use of NLTK, and are as follows:

Monday: 1510–1600, AT 5.04
Wednesday: 1110–1200, AT 5.04

If you need to swap, please try to arrange this privately before letting me know.

There will be two pieces of coursework, due dates as per the timetable.

You really will need to have access to a copy of the text; it must be the 2nd edition.

Notices will come via the mailing list, so do register for the course ASAP even if you're not sure you will take it.