The main aim of the assignment is to compare and evaluate two POS taggers on two data sets. The POS taggers are both trained on Wall Street Journal (WSJ) text; one of the data sets is well matched to the training data (also from the WSJ), the other less so. You should compare the performance of the taggers with each other, and between the data sets. The deliverable of this assessment will be a four-page report.
This assignment will use two taggers:
Both taggers were trained on the same data - about 1 million words of text from the Wall Street Journal newspaper. The POS tags in the training set were hand-annotated during the construction of the Penn Treebank.
The NLTK backoff tagger is a combined tagger that was trained as follows:
defTagger = tag.Regexp([(r'^[0-9]+(\.[0-9]+)?$', 'CD'), (r'.*', 'NN')])
boTagger = tag.Unigram(backoff=defTagger)
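The default tagger above assigns CD to number-like tokens and NN to everything else. That fallback rule can be sketched in plain Python using the standard re module (the function name here is illustrative, not part of NLTK):

```python
import re

# Patterns mirroring the default tagger above: numbers -> CD, anything else -> NN
_DEFAULT_PATTERNS = [
    (re.compile(r'^[0-9]+(\.[0-9]+)?$'), 'CD'),  # integers and decimals
    (re.compile(r'.*'), 'NN'),                   # catch-all: tag as noun
]

def default_tag(word):
    """Return the tag of the first pattern that matches the word."""
    for pattern, tag in _DEFAULT_PATTERNS:
        if pattern.match(word):
            return tag

print(default_tag('3.14'))  # CD
print(default_tag('dog'))   # NN
```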
The complete tagger has been pickled from Python using the
cPickle module. Pickling is Python's way of
saving the contents of an object to a file. The backoff tagger
created above was saved to file
using the following Python code:
tagfil = open("/home/srenals/public/icl/boTagger.pic", 'w')
p = cPickle.Pickler(tagfil)
p.dump(boTagger)
Your program can read this tagger object using the following segment of code:
tagfil = open("/home/srenals/public/icl/boTagger.pic", 'r')
p = cPickle.Unpickler(tagfil)
boTagger = p.load()
boTagger now contains the trained
tagger, which can be used as normal, eg:
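With the old NLTK interface this would be something like boTagger.tag(tokens). Conceptually, a unigram tagger with a backoff behaves like the following plain-Python sketch (the toy lexicon and function name are illustrative, not the actual NLTK classes):

```python
# Toy sketch of a unigram tagger with a backoff (illustrative, not NLTK):
unigram_table = {'the': 'DT', 'dog': 'NN', 'barks': 'VBZ'}  # toy lexicon

def backoff_tag(words, table, default='NN'):
    """Tag each word from the unigram table; unknown words get the default tag."""
    return [(w, table.get(w, default)) for w in words]

print(backoff_tag(['the', 'dog', 'barks', 'loudly'], unigram_table))
# [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ'), ('loudly', 'NN')]
```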
tnt was trained on the same data as the NLTK
backoff tagger, and its parameter files (wsj.lex and
wsj.123) are to be found in the directory
/home/srenals/public/icl.
tnt expects files in a simple format (that can be
output from a tokenizer), one word per line, eg:
If a test file (eg test.t) is in
this format, then you can run tnt from the Linux
command line as tnt [model] [data] - note
that a model argument foo will pick up both
foo.lex and foo.123:
tnt /home/srenals/public/icl/wsj test.t > test.tagged
The tagged text is output to standard output in the following format:
You can call tnt from within Python by using the
os.system() call, whose argument is a string specifying
the command to run. For example, you can call the tagger from the
Python interpreter as:
import os

tnt_executable = 'tnt'
tnt_model = '/home/srenals/public/icl/wsj'

def run_tnt(input='tokenised.txt', output='tagged.txt'):
    "call tnt from Linux"
    tnt_command = '%s %s %s > %s' % (tnt_executable, tnt_model, input, output)
    os.system(tnt_command)
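If you want to check the command string before shelling out, the formatting step can be isolated (a sketch; build_tnt_command is an illustrative helper, not part of the assignment code):

```python
tnt_executable = 'tnt'
tnt_model = '/home/srenals/public/icl/wsj'

def build_tnt_command(input='tokenised.txt', output='tagged.txt'):
    """Build the shell command string that would be passed to os.system()."""
    return '%s %s %s > %s' % (tnt_executable, tnt_model, input, output)

print(build_tnt_command('test.t', 'test.tagged'))
# tnt /home/srenals/public/icl/wsj test.t > test.tagged
```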
Two sources of test data will be used for this experiment: Wall Street Journal newspaper text, and transcripts of radio news broadcasts from the Boston University Radio News Corpus. Since the taggers were trained on data from the Wall Street Journal, one would expect them to perform better on that data than on the Radio News Corpus.
The text from the Wall Street Journal on which you should
evaluate the taggers is to be found in the file:
This text has already been tokenized in the same way as the
training data used to train the taggers, so a
WhitespaceTokenizer should be adequate in this case.
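For example, splitting a pre-tokenized line on whitespace (plain Python; this is effectively what a WhitespaceTokenizer does):

```python
# The WSJ test text is already tokenized, so whitespace splitting suffices
line = "The index rose 3.5 points ."
tokens = line.split()  # split on any run of whitespace
print(tokens)          # ['The', 'index', 'rose', '3.5', 'points', '.']
```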
The text from the Boston Radio News Corpus is to be found in the file:
It requires some preprocessing and tokenization. This text comes from a transcription of the speech in news broadcasts, and also contains some timing information. The data is split into segments (corresponding to 20-40 seconds of speech, typically) and each segment has a header, eg:
and a footer:
The file contains 66 such segments. The text to be tagged is between the header and the footer:
0.150000 76 In
0.220000 76 the
0.500000 76 twenty
0.950000 76 years
...
30.610001 76 continuing
30.769999 76 his
31.039999 76 drug
31.370001 76 habit
The first column is the start time of the word and the second column comes from the transcription software (it always takes value 76). Both these can be ignored. You need to extract the third column (the words to be tagged). Some tokenization is also necessary, since (for example) "I'm" appears as a single word. There is no punctuation in this data. Ignore upper/lower case distinctions.
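One possible extraction sketch, assuming the header and footer lines have already been stripped (the function name and the three-field test are illustrative choices, not prescribed by the assignment):

```python
def extract_words(segment_lines):
    """Pull the word (third column) out of each 'time 76 word' line,
    lower-casing as the assignment suggests."""
    words = []
    for line in segment_lines:
        fields = line.split()
        if len(fields) == 3:  # start-time, channel code, word
            words.append(fields[2].lower())
    return words

segment = ['0.150000 76 In', '0.220000 76 the', '0.500000 76 twenty']
print(extract_words(segment))  # ['in', 'the', 'twenty']
```

Contractions such as "I'm" would still need to be split into separate tokens afterwards.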
The Boston University Radio News Corpus (includes the list of POS tags used).
To evaluate the performance of the taggers on the test data you
require gold standard data, tagged by humans. This is supplied
in the files:
You can perform the evaluation by using the program
tnt-diff, or by using the NLTK function
tag.accuracy that is discussed in the NLTK
tutorial on tagging. In either case you will need to make sure the
.pos files are tokenized correctly.
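The accuracy computation itself is just the fraction of matching tags; here is a plain-Python sketch of that calculation (the function name is illustrative, not the NLTK implementation):

```python
def tag_accuracy(gold, predicted):
    """Fraction of positions where the predicted tag matches the gold tag.
    Both arguments are lists of (word, tag) pairs in the same order."""
    assert len(gold) == len(predicted)
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / float(len(gold))

gold = [('the', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
pred = [('the', 'DT'), ('dog', 'NN'), ('barks', 'NN')]
print(tag_accuracy(gold, pred))  # 2 of 3 tags correct
```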
The files you need to use are in the directory
/home/srenals/public/icl. Here is the way I
suggest you approach this assignment:
Please submit your assignment, in pdf format, by running the following command under Unix on a DICE machine (don't forget the "1" in this command!):
submit ai3 icl 1 filename-of-your-submission
(NB: even if you are an MSc student, please use the
ai3 code with the submit command.)
This report should be no longer than 4 pages conforming to the formatting guidelines published for the Interspeech conference. These guidelines include a template for Word users and a LaTeX style file and template. The important points are that your paper should be 4 pages maximum, 2 columns per page and no smaller than 10 point text.
Your paper should contain the following sections:
You will not be assessed on the quality of your Python code. However please do the following:
HANDIN DEADLINE: 16:00, Friday 3 November 2006