Assignment 1
This assignment is about part of speech (POS) tagging. Make sure that you have done the NLTK-Lite Tagging tutorial and the first four ICL lab sessions before you start work on the assignment.
The main aim of the assignment is to compare and evaluate two POS taggers on two data sets. The POS taggers are both trained on Wall Street Journal (WSJ) text; one of the data sets is well matched to the training data (also from the WSJ), the other less so. You should compare the performance of the taggers with each other, and between the data sets. The deliverable of this assessment will be a four-page report.
This assignment will use two taggers:
Both taggers were trained on the same data - about 1 million words of text from the Wall Street Journal newspaper. The POS tags in the training set were hand-annotated during the construction of the Penn Treebank.
The NLTK backoff tagger is a combined tagger, trained as follows:
from nltk_lite import tag
from nltk_lite.corpora import treebank

defTagger = tag.Regexp([(r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers
                        (r'.*', 'NN')])                   # default: common noun
boTagger = tag.Unigram(backoff=defTagger)
boTagger.train(treebank.tagged())
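To make the unigram-plus-backoff idea concrete, here is a hypothetical pure-Python sketch (not the NLTK-Lite implementation): a table mapping each word to its most frequent training tag, with unknown words falling back to the regular-expression rules above.

```python
import re
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """For each word, remember the tag it received most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def backoff_tag(words, unigram_table):
    """Unigram lookup first; unknown words fall back to the regexp rules."""
    tags = []
    for word in words:
        if word in unigram_table:
            tags.append(unigram_table[word])
        elif re.match(r'^[0-9]+(\.[0-9]+)?$', word):
            tags.append('CD')   # numbers
        else:
            tags.append('NN')   # default: common noun
    return list(zip(words, tags))

# Toy training data, for illustration only
train = [[('the', 'DT'), ('cat', 'NN'), ('sat', 'VBD')]]
table = train_unigram(train)
backoff_tag(['the', 'dog', '42'], table)
# 'the' is looked up; 'dog' and '42' fall back to NN and CD
```

The real NLTK-Lite tagger follows the same logic, but is trained on the full treebank sample.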
The complete tagger has been pickled from Python using the cPickle module. Pickling is Python's way of saving the contents of an object to a file. The backoff tagger created above was saved to the file /home/srenals/public/icl/boTagger.pic using the following Python code:
import cPickle
tagfil = open("/home/srenals/public/icl/boTagger.pic", 'wb')
p = cPickle.Pickler(tagfil)
p.dump(boTagger)
tagfil.close()
Your program can read this tagger object back using the following segment of code:
import cPickle
tagfil = open("/home/srenals/public/icl/boTagger.pic", 'rb')
p = cPickle.Unpickler(tagfil)
boTagger = p.load()
tagfil.close()
The object boTagger now contains the trained tagger, which can be used as normal, e.g.:
boTagger.tag(test_token)
tnt was trained on the same data as the NLTK backoff tagger, and its parameter files (wsj.lex and wsj.123) are to be found in the directory /home/srenals/public/icl.
tnt expects files in a simple format (which can be output from a tokenizer), one word per line, e.g.:
No
,
it
was
n't
Black
Monday
.
But
while
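A list of tokens (for example, the output of a tokenizer) can be written in this one-word-per-line format with a few lines of Python; a sketch, using test.t as the output filename:

```python
# Example tokens; in practice these come from your tokenizer
tokens = ['No', ',', 'it', 'was', "n't", 'Black', 'Monday', '.']

# tnt wants exactly one token per line
one_per_line = '\n'.join(tokens) + '\n'
with open('test.t', 'w') as f:
    f.write(one_per_line)
```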
If a test file (e.g. test.t) is in this format, then you can run tnt from the Linux command line as tnt [model] [data]. Note that a model argument foo will pick up both foo.123 and foo.lex, e.g.:
tnt /home/srenals/public/icl/wsj test.t > test.tagged
The tagged text is output to standard output in the following format:
No RB
, ,
it PRP
was VBD
n't RB
Black NNP
Monday NNP
. .
But CC
while IN
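A hypothetical helper for reading this two-column output back into Python, assuming whitespace-separated columns and possibly blank lines between sentences:

```python
def read_tagged(lines):
    """Parse tnt's two-column output into (word, tag) pairs.
    Blank lines (sentence breaks) are skipped."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs

output = ["No RB", ", ,", "it PRP", "", "was VBD"]
read_tagged(output)
# [('No', 'RB'), (',', ','), ('it', 'PRP'), ('was', 'VBD')]
```

In your program you would iterate over the lines of test.tagged rather than a list of strings.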
You can call tnt from within Python by using the os.system() call, whose argument is a string specifying the command to run, for example:
import os

tnt_executable = 'tnt'
tnt_model = '/home/srenals/public/icl/wsj'

def run_tnt(input='tokenised.txt', output='tagged.txt'):
    "call tnt from the Linux command line"
    tnt_command = '%s %s %s > %s' % (tnt_executable, tnt_model, input, output)
    os.system(tnt_command)
You can then call the tagger from the Python interpreter as:
>>> run_tnt('file.in', 'file.out')
Two sources of test data will be used for this experiment: Wall Street Journal newspaper text, and transcripts of radio news broadcasts from the Boston University Radio News Corpus. Since the taggers were trained on Wall Street Journal data, one would expect them to perform better on this data than on the Radio News Corpus.
The Wall Street Journal text on which you should evaluate the taggers is to be found in the file /home/srenals/public/icl/test23.txt. This text has already been tokenized in the same way as the taggers' training data, so a WhitespaceTokenizer should be adequate in this case.
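In plain Python, whitespace tokenization amounts to splitting each line on runs of whitespace; a minimal sketch (the sample sentence is illustrative, not taken from test23.txt):

```python
# Whitespace tokenization: split on runs of spaces/tabs.
line = "Pierre Vinken , 61 years old , will join the board ."
tokens = line.split()
# yields 12 tokens, from 'Pierre' to '.'
```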
Part of Speech Tagging Guidelines for the Penn Treebank Project
The text from the Boston Radio News Corpus is to be found in the file /home/srenals/public/icl/bu-m4b.wrd. It requires some preprocessing and tokenization. This text comes from a transcription of the speech in news broadcasts, and also contains some timing information.
The data is split into segments (typically corresponding to 20-40 seconds of speech), and each segment has a header, e.g.:
##
s10/m4bs10p1
#
and a footer:
##end##
The file contains 66 such segments. The text to be tagged is between the header and the footer:
0.150000 76 In
0.220000 76 the
0.500000 76 twenty
0.950000 76 years
[...]
30.610001 76 continuing
30.769999 76 his
31.039999 76 drug
31.370001 76 habit
The first column is the start time of the word, and the second column comes from the transcription software (it always takes the value 76). Both can be ignored. You need to extract the third column (the words to be tagged). Some tokenization is also necessary, since (for example) "I'm" appears as a single word. There is no punctuation in this data. Ignore upper/lower case distinctions.
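One possible preprocessing sketch is given below. The function name extract_words is hypothetical, and the clitic-splitting rule is a guess at Penn-style tokenization ("I'm" into "i" plus "'m", "wasn't" into "was" plus "n't"); check its output against the gold standard in bu-m4b.pos.

```python
import re

def extract_words(lines):
    """Pull the word column out of one .wrd segment, lowercased,
    with clitics split off as separate tokens.
    Lines starting with '#' (header/footer markers) and lines with
    fewer than three columns (e.g. segment ids) are skipped."""
    words = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        fields = line.split()
        if len(fields) < 3:
            continue                      # e.g. the s10/m4bs10p1 id line
        word = fields[2].lower()
        # split a trailing clitic: i'm -> i + 'm, wasn't -> was + n't
        m = re.match(r"(.+?)(n't|'[a-z]+)$", word)
        if m:
            words.extend([m.group(1), m.group(2)])
        else:
            words.append(word)
    return words

segment = ["##", "s10/m4bs10p1", "#",
           "0.150000 76 In", "0.220000 76 the", "1.0 76 I'm",
           "##end##"]
extract_words(segment)
# ['in', 'the', 'i', "'m"]
```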
The Boston University Radio News Corpus (includes the list of POS tags used).
To evaluate the performance of the taggers on the test data you require gold-standard data, tagged by humans. This is supplied in the files:
/home/srenals/public/icl/test23.pos
/home/srenals/public/icl/bu-m4b.pos
You can perform the evaluation by using the program tnt-diff (or by using the NLTK function tag.accuracy that is discussed in the NLTK tutorial on tagging; however, this will only work with your backoff tagger). In either case you will need to make sure that the .pos files are tokenized correctly.
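Both tools report tag-level accuracy. The following hypothetical helper shows the underlying calculation, which may be useful if you want to score tnt's file-based output directly:

```python
def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag.
    Both arguments are aligned lists of (word, tag) pairs."""
    assert len(gold) == len(predicted), "tokenization mismatch"
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / float(len(gold))

gold = [('No', 'RB'), (',', ','), ('it', 'PRP'), ('was', 'VBD')]
pred = [('No', 'DT'), (',', ','), ('it', 'PRP'), ('was', 'VBD')]
accuracy(gold, pred)   # 3 of 4 tags match: 0.75
```

Note that the assertion will fail unless the gold and predicted files are tokenized identically, which is exactly the point made above about the .pos files.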
The files you need to use are in the directory /home/srenals/public/icl. Here is the way I suggest you approach this assignment:
Please submit your assignment, in PDF format, with the following command under Unix on a DICE machine (don't forget the "1" in this command!):
submit ai3 icl 1 filename-of-your-submission
(NB: even if you are an MSc student, please use the ai3 code with the submit command.)
This report should be no longer than four pages, conforming to the formatting guidelines published for the Interspeech conference. These guidelines include a template for Word users and a LaTeX style file and template. The important points are that your paper should be four pages maximum, set in two columns per page, with text no smaller than 10 point.
Your paper should contain the following sections:
You will not be assessed on the quality of your Python code. However please do the following:
HANDIN DEADLINE: 16:00, Friday 3 November 2006