Platform and format are pervasive, irritating issues:
Most early academic work was done with line-oriented data on UN*X
tr -s ' ' '\012' < austen-sense.txt | sort | uniq -c | sort -nr                        # word frequency list, most frequent first
tr -s ' ' '\012' < austen-sense.txt | tr A-Z a-z | sort | uniq -c | sort -nr | less    # the same, case-folded
tr -s ' ' '\012' < austen-sense.txt | sort | uniq | wc                                 # vocabulary size (distinct words, case-sensitive)
tr -s ' ' '\012' < austen-sense.txt | tr A-Z a-z | sort | uniq | wc                    # vocabulary size, case-folded
time tr -s ' ' '\012' < austen-sense.txt | tr A-Z a-z | sort | uniq -c | sort -nr >/dev/null   # how long does it take?
wc austen-sense.txt                                                                    # lines, words, characters in the file
This had the advantage of being fast (~500ms for ~120000 words on a 2GHz machine).
But wasn't very portable
And was very vulnerable to variations in corpus format
[Who uses what platform?]
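For comparison, here is a minimal pure-Python sketch of the same frequency count (assuming a local copy of austen-sense.txt, and making the same naive whitespace-tokenisation choice the pipeline makes); not NLTK, just an illustration that the computation itself is portable:

# illustrative only: a portable equivalent of the tr | sort | uniq pipeline
from collections import Counter

with open('austen-sense.txt', encoding='utf-8') as f:
    tokens = f.read().lower().split()       # naive whitespace tokenisation, case-folded

counts = Counter(tokens)
print(len(counts))                          # vocabulary size
for word, n in counts.most_common(50):      # the 50 most frequent words
    print(n, word)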
NLTK is an Open Source effort (see http://www.nltk.org/), backed up by the book Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, by Bird, Klein and Loper.
We'll mostly use NLTK to get around the platform and format issues:
It's not as fast as custom C-code, but fast enough for our purposes.
Let's look at our Jane Austen example using NLTK:
from nltk.book import *        # load the NLTK example texts; text2 is Sense and Sensibility
len(text2)                     # number of tokens
text2[:60]                     # the first 60 tokens
f=FreqDist(text2)              # frequency distribution over the tokens
f
len(f)                         # vocabulary size
f.most_common(50)              # the 50 most frequent tokens (f.items()[:50] in older NLTK)
g=FreqDist(x.lower() for x in text2)   # the same, case-folded
g
len(g)
g.most_common(50)
g.plot(50)                     # plot the top-50 frequencies
h=FreqDist(x.lower() for x in text2 if x[0].lower().isalpha())   # keep only tokens starting with a letter
len(h)
h.most_common(50)
h.plot(50)
This is the first of many examples we'll see of Zipf's Law: the frequency of items in natural language falls off very rapidly, roughly in inverse proportion to their frequency rank.
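As a quick, purely illustrative check (continuing the session above), rank times frequency should stay within roughly the same order of magnitude if the fall-off follows Zipf's Law:

# illustrative only: Zipf's Law predicts rank * frequency stays roughly constant
for rank, (word, freq) in enumerate(h.most_common(1000), start=1):
    if rank in (1, 10, 100, 1000):
        print(rank, word, freq, rank * freq)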
Note that NLTK and the simple command line tools disagree about the word counts
This is because NLTK does a better job of tokenisation
We can get some idea of where UN*X is going wrong...
egrep -i '[^ a-zA-Z]the ' austen-sense.txt | less
And then improve it:
tr -s ' "-' '\012' < austen-sense.txt | tr A-Z a-z | sort | uniq -c | sort -nr | less
That gets us close for 'the', but still way out for 'to'...
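The remaining gap comes down to tokenisation: the pipeline only splits on the characters we explicitly list, while NLTK also splits punctuation off into tokens of its own. A small illustrative comparison, using a made-up sentence (word_tokenize may first need nltk.download('punkt')):

# illustrative only: whitespace splitting vs. NLTK tokenisation
from nltk import word_tokenize
s = "She wanted to go to town; so he agreed to go too, reluctantly."
print(s.lower().split())          # punctuation stays attached: 'town;' 'too,' 'reluctantly.'
print(word_tokenize(s.lower()))   # punctuation becomes separate tokens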
The word corpus is just Latin for 'body'.
So a corpus is just a body of text (or speech or ...).
We'll use the word to mean a collection of (possibly annotated) language in machine-processable form.
Virtually all corpora are structured.
That is, they use some conventions to make manifest properties of the language they contain.
Those properties can be separated into
We'll look at a series of examples to see how the representation of corpora has evolved.
The first corpora were created using punched cards,
and distributed, if at all, using magnetic tape.
The 80-column width and lack of lower-case letters meant a lot of workarounds were necessary just to reproduce ordinary typed text,
to say nothing of printed material.
The Brown Corpus was produced initially in 1964 by Henry Kučera and Nelson Francis of the Linguistics Department of Brown University:
It's composed of 500 texts of roughly 2000 words each.
Part-of-speech information was added in 1979.
Here are a few punched card lines from the original Brown corpus (text cf03):
TELEVISION IMPULSES, SOUND WAVES, ULTRA-VIOLET RAYS, ETC**., THAT MAY 1020ElF03
OC* 1025ElF03
CUPY THE VERY SAME SPACE, EACH SOLITARY UPON ITS OWN FREQUENCY, IS INF 1030ElF03
INITE. *SO WE MAY CONCEIVE THE COEXISENCE OF THE INFINITE NUMBER OF U 1040ElF03
NIVERSAL, APPARENTLY MOMENTARY STATES OF MATTER, SUCCESSIVE ONE AFTER 1050ElF03
This represents the following text:
television impulses, sound waves, ultra-violet rays, etc., that may occupy the very same space, each solitary upon its own frequency, is infinite. So we may conceive the coexistence of the infinite number of universal, apparently momentary states of matter, successive one after another
* and ** are used as escape characters, with a quite complex set of rules for reconstructing the original based on their use.
Note that whereas *SO represents a difference really present in the original (the capital 'S' of 'So'), the **. coding, on the other hand, represents analytical information (here, about the full stop after 'etc').
In the beginning, all corpus material originated outside computers.
So the computer-based representation came second:
So decisions had to be made about every aspect of the original which could be represented:
Note for instance that neither linebreaks nor number of spaces is preserved in the Brown corpus.
The cost of transcription necessarily limited the scale of corpora created in this way.
The real breakthrough came when computers began to be used to create text for printing
Corpus creation then became a matter of:
By this time, (reproduction for) distribution was a bit easier
The third and most recent phase came when texts began to be created for online distribution.
Access is no longer a problem, but issues of rights and conversion still arise, just in different forms.
Distribution is by DVD or online.
The arrival of the second phase was signalled by the release of the ACL/DCI corpus in 1989
The core of the corpus was material derived from the printers' tapes for the Wall Street Journal newspaper.
With over 30 million words, it was a major increase in size over the Brown corpus.
Its widespread availability launched a new way of approaching language processing, called by some at the time "data-intensive".
Started by Mark Liberman and colleagues at Bell Labs
Susan Armstrong (Geneva) and Henry S. Thompson (Edinburgh) produced a CD-ROM of non-American English data in 1994
Some parallel multilingual data was included.
The European Corpus Initiative went on to produce a more substantial corpus
The Brown corpus excepted, efforts up to this point had been largely opportunistic:
The BNC effort, jointly funded by the British government and a consortium of dictionary publishers, changed all that.
The BNC was designed to be a representative corpus of contemporary British English
Originally published in 1995
The spoken part was particularly ambitious
In the US, out of the ACL/DCI came the Linguistic Data Consortium, funded by NSF and DARPA.
In Europe, the European Union launched the European Language Resources Association in 1995.
Both of these bodies have produced a great deal of material.
Above and beyond what NLTK has
You should explore a bit from the root directory, which is /group/corpora/public
Some examples
From
"My data is my scholarly identity"
"Our data is our corporate advantage"
to
"By giving away my/our data, we all win"
Bob Mercer, then at IBM, is widely credited with the tag line "There's no data like more data."
This attitude spread first at the level of recordings of spoken material
A corpus is more than just the words it consists of.
For many purposes, you need to know a lot more:
and all the other things you would expect in a bibliographic record.
And because what you get in the corpus is almost never exactly what was originally published, you need information about how the original, whether electronic or not, was processed to produce the corpus:
We use the name metadata for all this kind of information
Before we can even look at the details of metadata or data, we have to consider how they are represented
Early corpora, e.g. the tagged Brown corpus, used a simple record/field approach
SAID VBD A01001006E1
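Processing such records is just a matter of splitting on whitespace; here is a minimal sketch, assuming the three fields are the word, its part-of-speech tag and a location code (the field names are my gloss, not part of the corpus):

# illustrative only: one record/field line in the style shown above
record = "SAID VBD A01001006E1"
word, tag, location = record.split()   # assumed layout: word, POS tag, location code
print(word, tag, location)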
Over the last twenty years, standardised markup languages have slowly taken over:
The form in which a corpus is processed is not always the same as its interchange or archival form
But even these almost always expect to import, export and archive their data in XML
Markup languages used for annotating text / information
Concerned with logical structure
Not concerned with appearance
Basic SGML/XML markup is just labelled brackets:
The text content is in bold here
An SGML or XML document is underlyingly a tree
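A minimal sketch, with invented element names, of labelled-bracket markup for the example sentence above, using Python's standard xml.etree module (just one convenient way) to show that the document is underlyingly a tree:

# illustrative only: labelled-bracket markup and the tree underneath it
import xml.etree.ElementTree as ET

snippet = '<s>The text content is <emph>in bold</emph> here</s>'
root = ET.fromstring(snippet)

print(root.tag)                # 's' -- the root element
print(repr(root.text))         # 'The text content is ' -- text before the first child
for child in root:             # child elements, in document order
    print(child.tag, repr(child.text), repr(child.tail))   # 'emph' 'in bold' ' here'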
Use structurally annotated text
Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/
Do your own work yourself
Labs will typically involve hands-on use of NLTK, and are as follows:
If you need to swap, please try to arrange this privately before letting me know.
There will be two pieces of coursework, due dates as per the timetable.
You really will need to have access to a copy of the text; it must be the 2nd edition.
Notices will come via the mailing list, so do register for the course ASAP even if you're not sure you will take it.