Assignment 1
This assignment is about part of speech (POS) tagging. Make sure that you have done the NLTK-Lite Tagging tutorial and the first four ICL lab sessions before you start work on the assignment.
The main aim of the assignment is to compare and evaluate two POS taggers on two data sets. The POS taggers are both trained on Wall Street Journal (WSJ) text; one of the data sets is well matched to the training data (also from the WSJ), the other less so. You should compare the performance of the taggers with each other, and between the data sets. The deliverable of this assessment will be a four-page report.
This assignment will use two taggers:
Both taggers were trained on the same data - about 1 million words of text from the Wall Street Journal newspaper. The POS tags in the training set were hand-annotated during the construction of the Penn Treebank.
The NLTK backoff tagger is a combined tagger, trained as follows:
from nltk_lite import tag
from nltk_lite.corpora import treebank

defTagger = tag.Regexp([(r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
                        (r'.*', 'NN')])                   # default: common noun
boTagger = tag.Unigram(backoff=defTagger)
boTagger.train(treebank.tagged())
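To make the unigram-plus-backoff idea concrete, here is a hypothetical pure-Python sketch (not the NLTK-Lite implementation): a table mapping each word to its most frequent training tag, with unknown words falling back to the regular-expression rules above.

```python
import re
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """For each word, remember the tag it received most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def backoff_tag(words, unigram_table):
    """Unigram lookup first; unknown words fall back to the regexp rules."""
    tags = []
    for word in words:
        if word in unigram_table:
            tags.append(unigram_table[word])
        elif re.match(r'^[0-9]+(\.[0-9]+)?$', word):
            tags.append('CD')   # numbers
        else:
            tags.append('NN')   # default: common noun
    return list(zip(words, tags))

# Toy training data, for illustration only
train = [[('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]]
table = train_unigram(train)
backoff_tag(['the', 'dog', '42'], table)
# 'the' is looked up; 'dog' and '42' fall back to NN and CD
```

The real NLTK-Lite tagger follows the same logic, but is trained on the full treebank sample.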
The complete tagger has been pickled from Python using the cPickle module. Pickling is Python's way of saving the contents of an object to a file. The backoff tagger created above was saved to the file /home/srenals/public/icl/boTagger.pic using the following Python code:
import cPickle
tagfil = open("/home/srenals/public/icl/boTagger.pic", 'wb')
p = cPickle.Pickler(tagfil)
p.dump(boTagger)
tagfil.close()
Your program can read this tagger object back using the following segment of code:
import cPickle
tagfil = open("/home/srenals/public/icl/boTagger.pic", 'rb')
p = cPickle.Unpickler(tagfil)
boTagger = p.load()
tagfil.close()
The object boTagger now contains the trained tagger, which can be used as normal, e.g.:
boTagger.tag(test_token)
tnt was trained on the same data as the NLTK backoff tagger, and its parameter files (wsj.lex and wsj.123) are to be found in the directory /home/srenals/public/icl.
tnt expects files in a simple format (which can be output from a tokenizer), one word per line, e.g.:
No
,
it
was
n't
Black
Monday
.
But
while
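A list of tokens (for example, the output of a tokenizer) can be written in this one-word-per-line format with a few lines of Python; a sketch, using test.t as the output filename:

```python
# Example tokens; in practice these come from your tokenizer
tokens = ['No', ',', 'it', 'was', "n't", 'Black', 'Monday', '.']

# tnt wants exactly one token per line
one_per_line = '\n'.join(tokens) + '\n'
with open('test.t', 'w') as f:
    f.write(one_per_line)
```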
If a test file (e.g. test.t) is in this format, then you can run tnt from the Linux command line as tnt [model] [data]. Note that a model argument foo will pick up both foo.123 and foo.lex, e.g.:
tnt /home/srenals/public/icl/wsj test.t > test.tagged
The tagged text is output to standard output in the following format:
No RB
, ,
it PRP
was VBD
n't RB
Black NNP
Monday NNP
. .
But CC
while IN
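A hypothetical helper for reading this two-column output back into Python, assuming whitespace-separated columns and possibly blank lines between sentences:

```python
def read_tagged(lines):
    """Parse tnt's two-column output into (word, tag) pairs.
    Blank lines (sentence breaks) are skipped."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs

output = ["No RB", ", ,", "it PRP", "", "was VBD"]
read_tagged(output)
# [('No', 'RB'), (',', ','), ('it', 'PRP'), ('was', 'VBD')]
```

In your program you would iterate over the lines of test.tagged rather than a list of strings.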
You can call tnt from within Python by using the os.system() call, whose argument is a string specifying the command to run, for example:
import os

tnt_executable = 'tnt'
tnt_model = '/home/srenals/public/icl/wsj'

def run_tnt(input='tokenised.txt', output='tagged.txt'):
    "call tnt from the Linux command line"
    tnt_command = '%s %s %s > %s' % (tnt_executable, tnt_model, input, output)
    os.system(tnt_command)
You can then call the tagger from the Python interpreter as:
>>> run_tnt('file.in', 'file.out')
Two sources of test data will be used for this experiment: Wall Street Journal newspaper text, and transcripts of radio news broadcasts from the Boston University Radio News Corpus. Since the taggers were trained on Wall Street Journal data, one would expect them to perform better on this data than on the Radio News Corpus.
The Wall Street Journal text on which you should evaluate the taggers is to be found in the file /home/srenals/public/icl/test23.txt. This text has already been tokenized in the same way as the taggers' training data, so a WhitespaceTokenizer should be adequate in this case.
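In plain Python, whitespace tokenization amounts to splitting each line on runs of whitespace; a minimal sketch (the sample sentence is illustrative, not taken from test23.txt):

```python
# Whitespace tokenization: split on runs of spaces/tabs.
line = "Pierre Vinken , 61 years old , will join the board ."
tokens = line.split()
# yields 12 tokens, from 'Pierre' to '.'
```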
Part of Speech Tagging Guidelines for the Penn Treebank Project
The text from the Boston Radio News Corpus is to be found in the file /home/srenals/public/icl/bu-m4b.wrd. It requires some preprocessing and tokenization. This text comes from a transcription of the speech in news broadcasts, and also contains some timing information.
The data is split into segments (typically corresponding to 20-40 seconds of speech), and each segment has a header, e.g.:
##
s10/m4bs10p1
#
and a footer:
##end##
The file contains 66 such segments. The text to be tagged is between the header and the footer:
0.150000 76 In
0.220000 76 the
0.500000 76 twenty
0.950000 76 years
[...]
30.610001 76 continuing
30.769999 76 his
31.039999 76 drug
31.370001 76 habit
The first column is the start time of the word, and the second column comes from the transcription software (it always takes the value 76). Both can be ignored. You need to extract the third column (the words to be tagged). Some tokenization is also necessary, since (for example) "I'm" appears as a single word. There is no punctuation in this data. Ignore upper/lower case distinctions.
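One possible preprocessing sketch is given below. The function name extract_words is hypothetical, and the clitic-splitting rule is a guess at Penn-style tokenization ("I'm" into "i" plus "'m", "wasn't" into "was" plus "n't"); check its output against the gold standard in bu-m4b.pos.

```python
import re

def extract_words(lines):
    """Pull the word column out of one .wrd segment, lowercased,
    with clitics split off as separate tokens.
    Lines starting with '#' (header/footer markers) and lines with
    fewer than three columns (e.g. segment ids) are skipped."""
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue                      # e.g. the s10/m4bs10p1 id line
        word = fields[2].lower()
        # split a trailing clitic: i'm -> i + 'm, wasn't -> was + n't
        m = re.match(r"(.+?)(n't|'[a-z]+)$", word)
        if m:
            words.extend([m.group(1), m.group(2)])
        else:
            words.append(word)
    return words

segment = ["##", "s10/m4bs10p1", "#",
           "0.150000 76 In", "0.220000 76 the", "1.0 76 I'm",
           "##end##"]
extract_words(segment)
# ['in', 'the', 'i', "'m"]
```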
The Boston University Radio News Corpus (includes the list of POS tags used).
To evaluate the performance of the taggers on the test data you require gold-standard data, tagged by humans. This is supplied in the files:
/home/srenals/public/icl/test23.pos
/home/srenals/public/icl/bu-m4b.pos
You can perform the evaluation by using the program tnt-diff (or by using the NLTK function tag.accuracy that is discussed in the NLTK tutorial on tagging; however, this will only work with your backoff tagger). In either case you will need to make sure that the .pos files are tokenized correctly.
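Both tools report tag-level accuracy. The following hypothetical helper shows the underlying calculation, which may be useful if you want to score tnt's file-based output directly:

```python
def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag.
    Both arguments are aligned lists of (word, tag) pairs."""
    assert len(gold) == len(predicted), "tokenization mismatch"
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / float(len(gold))

gold = [('No', 'RB'), (',', ','), ('it', 'PRP'), ('was', 'VBD')]
pred = [('No', 'DT'), (',', ','), ('it', 'PRP'), ('was', 'VBD')]
accuracy(gold, pred)   # 3 of 4 tags match: 0.75
```

Note that the assertion will fail unless the gold and predicted files are tokenized identically, which is exactly the point made above about the .pos files.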
The files you need to use are in the directory /home/srenals/public/icl. Here is the way I suggest you approach this assignment:
Please submit your assignment, in PDF format, with the following command under Unix on a DICE machine (don't forget the "1" in this command!):
submit ai3 icl 1 filename-of-your-submission
(NB: even if you are an MSc student, please use the ai3 code with the submit command.)
This report should be no longer than four pages, conforming to the formatting guidelines published for the Interspeech conference. These guidelines include a template for Word users and a LaTeX style file and template. The important points are that your paper should be four pages maximum, set in two columns per page, with text no smaller than 10 point.
Your paper should contain the following sections:
You will not be assessed on the quality of your Python code. However please do the following:
HANDIN DEADLINE: 16:00, Friday 3 November 2006