Platform and format are pervasive, irritating issues:
Most early academic work was done with line-oriented data on UN*X
tr -s ' ' '\012' < austen-sense.txt | sort | uniq -c | sort -nr                        # word frequency list, most frequent first
tr -s ' ' '\012' < austen-sense.txt | tr A-Z a-z | sort | uniq -c | sort -nr | less    # the same, case-folded
tr -s ' ' '\012' < austen-sense.txt | sort | uniq | wc                                 # vocabulary size (distinct words, case-sensitive)
tr -s ' ' '\012' < austen-sense.txt | tr A-Z a-z | sort | uniq | wc                    # vocabulary size, case-folded
time tr -s ' ' '\012' < austen-sense.txt | tr A-Z a-z | sort | uniq -c | sort -nr >/dev/null   # how long does it take?
wc austen-sense.txt                                                                    # lines, words, characters in the file
This had the advantage of being fast (~500ms for ~120000 words on a 2GHz machine).
But wasn't very portable
And was very vulnerable to variations in corpus format
[Who uses what platform?]
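For comparison, here is a minimal pure-Python sketch of the same frequency count (assuming a local copy of austen-sense.txt, and making the same naive whitespace-tokenisation choice the pipeline makes); not NLTK, just an illustration that the computation itself is portable:

# illustrative only: a portable equivalent of the tr | sort | uniq pipeline
from collections import Counter

with open('austen-sense.txt', encoding='utf-8') as f:
    tokens = f.read().lower().split()       # naive whitespace tokenisation, case-folded

counts = Counter(tokens)
print(len(counts))                          # vocabulary size
for word, n in counts.most_common(50):      # the 50 most frequent words
    print(n, word)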
NLTK is an Open Source effort (see http://www.nltk.org/), backed up by the book Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit, by Bird, Klein and Loper.
We'll mostly use NLTK to get around the platform and format issues:
It's not as fast as custom C-code, but fast enough for our purposes.
Let's look at our Jane Austen example using NLTK:
from nltk.book import *        # load the NLTK example texts; text2 is Sense and Sensibility
len(text2)                     # number of tokens
text2[:60]                     # the first 60 tokens
f=FreqDist(text2)              # frequency distribution over the tokens
f
len(f)                         # vocabulary size
f.most_common(50)              # the 50 most frequent tokens (f.items()[:50] in older NLTK)
g=FreqDist(x.lower() for x in text2)   # the same, case-folded
g
len(g)
g.most_common(50)
g.plot(50)                     # plot the top-50 frequencies
h=FreqDist(x.lower() for x in text2 if x[0].lower().isalpha())   # keep only tokens starting with a letter
len(h)
h.most_common(50)
h.plot(50)
This is the first of many examples we'll see of Zipf's Law: the frequency of items in natural language falls off very rapidly, roughly in inverse proportion to their frequency rank.
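As a quick, purely illustrative check (continuing the session above), rank times frequency should stay within roughly the same order of magnitude if the fall-off follows Zipf's Law:

# illustrative only: Zipf's Law predicts rank * frequency stays roughly constant
for rank, (word, freq) in enumerate(h.most_common(1000), start=1):
    if rank in (1, 10, 100, 1000):
        print(rank, word, freq, rank * freq)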
Note that NLTK and the simple command line tools disagree about the word counts
This is because NLTK does a better job of tokenisation
We can get some idea of where UN*X is going wrong...
egrep -i '[^ a-zA-Z]the ' austen-sense.txt | less
And then improve it:
tr -s ' "-' '\012' < austen-sense.txt | tr A-Z a-z | sort | uniq -c | sort -nr | less
That gets us close for 'the', but still way out for 'to'...
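The remaining gap comes down to tokenisation: the pipeline only splits on the characters we explicitly list, while NLTK also splits punctuation off into tokens of its own. A small illustrative comparison, using a made-up sentence (word_tokenize may first need nltk.download('punkt')):

# illustrative only: whitespace splitting vs. NLTK tokenisation
from nltk import word_tokenize
s = "She wanted to go to town; so he agreed to go too, reluctantly."
print(s.lower().split())          # punctuation stays attached: 'town;' 'too,' 'reluctantly.'
print(word_tokenize(s.lower()))   # punctuation becomes separate tokens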
The word corpus is just Latin for 'body'.
So a corpus is just a body of text (or speech or ...).
We'll use the word to mean a collection of (possibly annotated) language in machine-processable form.
Virtually all corpora are structured.
That is, they use some conventions to make manifest properties of the language they contain.
Those properties can be separated into
We'll look at a series of examples to see how the representation of corpora has evolved.
The first corpora were created using punched cards,
and distributed, if at all, using magnetic tape.
The 80-column width and lack of lower-case letters meant a lot of workarounds were necessary just to reproduce ordinary typed text,
to say nothing of printed material.
The Brown Corpus was produced initially in 1964 by Henry Kučera and Nelson Francis of the Linguistics Department of Brown University:
It's composed of 500 texts of roughly 2000 words each.
Part-of-speech information was added in 1979.
Here are a few punched card lines from the original Brown corpus (text cf03):
TELEVISION IMPULSES, SOUND WAVES, ULTRA-VIOLET RAYS, ETC**., THAT MAY 1020ElF03
OC* 1025ElF03
CUPY THE VERY SAME SPACE, EACH SOLITARY UPON ITS OWN FREQUENCY, IS INF 1030ElF03
INITE. *SO WE MAY CONCEIVE THE COEXISENCE OF THE INFINITE NUMBER OF U 1040ElF03
NIVERSAL, APPARENTLY MOMENTARY STATES OF MATTER, SUCCESSIVE ONE AFTER 1050ElF03
This represents the following text:
television impulses, sound waves, ultra-violet rays, etc., that may occupy the very same space, each solitary upon its own frequency, is infinite. So we may conceive the coexistence of the infinite number of universal, apparently momentary states of matter, successive one after another
* and ** are used as escape characters, with a quite complex set of rules for reconstructing the original based on their use.
Note that whereas *SO represents a difference really present in the original (the capital 'S' of 'So'), the **. coding, on the other hand, represents analytical information (here, about the full stop after 'etc').
In the beginning, all corpus material originated outside computers.
So the computer-based representation came second:
So decisions had to be made about every aspect of the original which could be represented:
Note for instance that neither linebreaks nor number of spaces is preserved in the Brown corpus.
The cost of transcription necessarily limited the scale of corpora created in this way.
The real breakthrough came when computers began to be used to create text for printing
Corpus creation then became a matter of:
By this time, (reproduction for) distribution was a bit easier
The third and most recent phase came when texts began to be created for online distribution.
Access is no longer a problem, but issues of rights and conversion still arise, just in different forms.
Distribution is by DVD or online.
The arrival of the second phase was signalled by the release of the ACL/DCI corpus in 1989
The core of the corpus was material derived from the printers' tapes for the Wall Street Journal newspaper.
With over 30 million words, it was a major increase in size over the Brown corpus.
Its widespread availability launched a new way of approaching language processing, called by some at the time "data-intensive".
Started by Mark Liberman and colleagues at Bell Labs
Susan Armstrong (Geneva) and Henry S. Thompson (Edinburgh) produced a CD-ROM of non-American English data in 1994
Some parallel multilingual data was included.
The European Corpus Initiative went on to produce a more substantial corpus
The Brown corpus excepted, efforts up to this point had been largely opportunistic:
The BNC effort, jointly funded by the British government and a consortium of dictionary publishers, changed all that.
The BNC was designed to be a representative corpus of contemporary British English
Originally published in 1995
The spoken part was particularly ambitious
In the US, out of the ACL/DCI came the Linguistic Data Consortium, funded by NSF and DARPA.
In Europe, the European Union launched the European Language Resources Association in 1995.
Both of these bodies have produced a great deal of material.
Above and beyond what NLTK has
You should explore a bit from the root directory, which is /group/corpora/public
Some examples
From
"My data is my scholarly identity"
"Our data is our corporate advantage"
to
"By giving away my/our data, we all win"
Bob Mercer, then at IBM, is widely credited with the tag line "There's no data like more data."
This attitude spread first at the level of recordings of spoken material
A corpus is more than just the words it consists of.
For many purposes, you need to know a lot more:
and all the other things you would expect in a bibliographic record.
And because what you get in the corpus is almost never exactly what was originally published, you need information about how the original, whether electronic or not, was processed to produce the corpus:
We use the name metadata for all this kind of information
Before we can even look at the details of metadata or data, we have to consider how they are represented
Early corpora, e.g. the tagged Brown corpus, used a simple record/field approach
SAID VBD A01001006E1
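Processing such records is just a matter of splitting on whitespace; here is a minimal sketch, assuming the three fields are the word, its part-of-speech tag and a location code (the field names are my gloss, not part of the corpus):

# illustrative only: one record/field line in the style shown above
record = "SAID VBD A01001006E1"
word, tag, location = record.split()   # assumed layout: word, POS tag, location code
print(word, tag, location)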
Over the last twenty years, standardised markup languages have slowly taken over:
The form in which a corpus is processed is not always the same as its interchange or archival form
But even these almost always expect to import, export and archive their data in XML
Markup languages used for annotating text / information
Concerned with logical structure
Not concerned with appearance
Basic SGML/XML markup is just labelled brackets:
The text content is in bold here
An SGML or XML document is underlyingly a tree
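A minimal sketch, with invented element names, of labelled-bracket markup for the example sentence above, using Python's standard xml.etree module (just one convenient way) to show that the document is underlyingly a tree:

# illustrative only: labelled-bracket markup and the tree underneath it
import xml.etree.ElementTree as ET

snippet = '<s>The text content is <emph>in bold</emph> here</s>'
root = ET.fromstring(snippet)

print(root.tag)                # 's' -- the root element
print(repr(root.text))         # 'The text content is ' -- text before the first child
for child in root:             # child elements, in document order
    print(child.tag, repr(child.text), repr(child.tail))   # 'emph' 'in bold' ' here'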
Use structurally annotated text
Course home page is at http://www.inf.ed.ac.uk/teaching/courses/fnlp/
Do your own work yourself
Labs will typically involve hands-on use of NLTK, and are as follows:
If you need to swap, please try to arrange this privately before letting me know.
There will be two pieces of coursework, due dates as per the timetable.
You really will need to have access to a copy of the text; it must be the 2nd edition.
Notices will come via the mailing list, so do register for the course ASAP even if you're not sure you will take it.