Introduction to Computational Linguistics

Assignment 1 - Evaluating POS Taggers

Introduction

This assignment is about part of speech (POS) tagging. Make sure that you have done the NLTK-Lite Tagging tutorial and the first four ICL lab sessions before you start work on the assignment.

The main aim of the assignment is to compare and evaluate two POS taggers on two data sets. The POS taggers are both trained on Wall Street Journal (WSJ) text; one of the data sets is well matched to the training data (also from the WSJ), the other less so. You should compare the performance of the taggers with each other, and between the data sets. The deliverable of this assessment will be a four-page report.

The Taggers

This assignment will use two taggers: the NLTK Backoff tagger and the TnT tagger (tnt), each described below.

Both taggers were trained on the same data - about 1 million words of text from the Wall Street Journal newspaper. The POS tags in the training set were hand-annotated during the construction of the Penn Treebank.

The NLTK Backoff Tagger

The NLTK Backoff tagger is a combined tagger that was trained as follows:
  defTagger = tag.Regexp([(r'^[0-9]+(.[0-9]+)?$', 'CD'), (r'.*', 'NN')])
  boTagger = tag.Unigram(backoff=defTagger)
  boTagger.train(treebank.tagged())

The complete tagger has been pickled from Python using the cPickle module. Pickling is Python's way of saving the contents of an object to a file. The backoff tagger created above was saved to the file /home/srenals/public/icl/boTagger.pic using the following Python code:
  import cPickle
  tagfil = open("/home/srenals/public/icl/boTagger.pic", 'w')
  p = cPickle.Pickler(tagfil)
  p.dump(boTagger)
  tagfil.close()

Your program can read this tagger object using the following segment of code:
  import cPickle
  tagfil = open("/home/srenals/public/icl/boTagger.pic", 'r')
  p = cPickle.Unpickler(tagfil)
  boTagger = p.load()
  tagfil.close()

The object boTagger now contains the trained tagger, which can be used as normal, eg:
  boTagger.tag(test_token)

The TnT Tagger

The TnT tagger (tnt) was trained on the same data as the NLTK backoff tagger, and its parameter files (wsj.lex and wsj.123) are in the directory /home/srenals/public/icl.

tnt expects files in a simple format (that can be output from a tokenizer), one word per line, eg:
  No
  ,
  it
  was
  n't
  Black
  Monday
  .
  But
  while
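
If your tokens are already in a Python list, a file in this one-word-per-line layout can be produced with a few lines of code. The following is only a sketch: the function name write_tnt_input and the temporary file location are made up for illustration and are not part of NLTK or tnt.

```python
import os
import tempfile


def write_tnt_input(tokens, path):
    """Write one token per line, the input layout that tnt expects."""
    with open(path, "w") as out:
        for token in tokens:
            out.write(token + "\n")


# Demo: write the first few tokens of the example above to a scratch file.
demo_path = os.path.join(tempfile.mkdtemp(), "test.t")
write_tnt_input(["No", ",", "it", "was", "n't"], demo_path)
```

In your own program you would pass whatever path you intend to hand to tnt (eg test.t) instead of the scratch file used here.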

If a test file (eg test.t) is in this format, then you can run tnt from the Linux command line as tnt [model] [data], where a model argument foo picks up both foo.123 and foo.lex, eg:
  tnt /home/srenals/public/icl/wsj test.t > test.tagged

The tagged text is output to standard output in the following format:
  No RB
  , ,
  it PRP
  was VBD
  n't RB
  Black NNP
  Monday NNP
  . .
  But CC
  while IN
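
To compare this output with other tagged data inside Python, it helps to read it back as a list of (word, tag) pairs. The sketch below shows one way to do so; read_tagged is a hypothetical helper name. It assumes that the word and the tag are separated by whitespace, and that blank lines and lines beginning with %% (taken here to be comment lines) carry no token. Check both assumptions against the output you actually get.

```python
def read_tagged(lines):
    """Return a list of (word, tag) pairs from tnt-style output lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        # Assumption: blank lines and '%%' lines are not tokens.
        if not line or line.startswith("%%"):
            continue
        fields = line.split()
        # Skip anything that does not have at least a word and a tag.
        if len(fields) < 2:
            continue
        pairs.append((fields[0], fields[1]))
    return pairs


# Demo on the first lines of the output shown above.
sample = ["No RB", ", ,", "it PRP", "was VBD", "n't RB"]
pairs = read_tagged(sample)
```

To read a whole file, pass the open file object, eg read_tagged(open('test.tagged')).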

You can call tnt from within Python by using the os.system() call, whose argument is a string specifying the command to run, for example:
  import os
  tnt_executable = 'tnt'
  tnt_model = '/home/srenals/public/icl/wsj'

  def run_tnt(input='tokenised.txt', output='tagged.txt'):
      "call tnt from Linux"
      tnt_command = '%s %s %s > %s' % (tnt_executable, tnt_model, input, output)
      os.system(tnt_command)
 
You can then call the tagger from the Python interpreter as:
  >>> run_tnt('file.in', 'file.out')
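
An alternative, if you would like Python to report a failed run, is the standard subprocess module. The sketch below is not required for the assignment: run_tnt_checked is a hypothetical name, and the default executable and model path are simply the ones quoted above. It passes the arguments as a list rather than a shell string, redirects standard output to the output file, and raises an exception if the command exits with a non-zero status.

```python
import subprocess


def run_tnt_checked(input_path, output_path,
                    executable="tnt",
                    model="/home/srenals/public/icl/wsj"):
    """Run the tagger on input_path and write its standard output to output_path.

    Raises subprocess.CalledProcessError if the command exits with a
    non-zero status; returns 0 otherwise.
    """
    with open(output_path, "w") as out:
        return subprocess.check_call([executable, model, input_path], stdout=out)
```

It would be called in the same way as run_tnt, eg run_tnt_checked('file.in', 'file.out').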

The Test Data

Two sources of test data will be used for this experiment: Wall Street Journal newspaper text, and transcripts of radio news broadcasts from the Boston University Radio News Corpus. Since the taggers were trained on data from the Wall Street Journal, one would expect that they will perform better on this data, compared with the Radio News Corpus.

Wall Street Journal Text

The text from the Wall Street Journal on which you should evaluate the taggers is to be found in the file:
  /home/srenals/public/icl/test23.txt
This text has already been tokenized in the same way as the training data used to train the taggers, so a WhitespaceTokenizer should be adequate in this case.

Reference: Part of Speech Tagging Guidelines for the Penn Treebank Project.

Boston University Radio News Corpus

The text from the Boston Radio News Corpus is to be found in the file:
  /home/srenals/public/icl/bu-m4b.wrd

It requires some preprocessing and tokenization. This text comes from a transcription of the speech in news broadcasts, and also contains some timing information. The data is split into segments (corresponding to 20-40 seconds of speech, typically) and each segment has a header, eg:
  ##
  s10/m4bs10p1
  #
and a footer:
  ##end##

The file contains 66 such segments. The text to be tagged is between the header and the footer:
   0.150000 76 In
   0.220000 76 the
   0.500000 76 twenty
   0.950000 76 years
  [...]
   30.610001 76 continuing
   30.769999 76 his
   31.039999 76 drug
   31.370001 76 habit

The first column is the start time of the word and the second column comes from the transcription software (it always takes the value 76). Both of these can be ignored. You need to extract the third column (the words to be tagged). Some tokenization is also necessary, since (for example) "I'm" appears as a single word. There is no punctuation in this data. Ignore upper/lower case distinctions.
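
The preprocessing described above might be sketched as follows. This is one possible approach, not a reference solution. The names split_clitics and read_wrd_segments are made up for illustration; the list of clitics is an assumption based on the Penn Treebank convention visible in the tnt example (was n't), and lowercasing every word is just one reading of "ignore upper/lower case distinctions". Check your tokenization against the gold standard files before relying on it.

```python
import re

# Assumption: these endings are split off as separate tokens,
# eg "i'm" -> "i" + "'m" and "wasn't" -> "was" + "n't".
CLITIC = re.compile(r"^(.+?)(n't|'s|'m|'re|'ve|'ll|'d)$")


def split_clitics(word):
    """Split a trailing clitic off a word; return a list of one or two tokens."""
    match = CLITIC.match(word)
    if match:
        return [match.group(1), match.group(2)]
    return [word]


def is_word_line(fields):
    """True for lines of the form: start-time, 76, word."""
    if len(fields) != 3:
        return False
    try:
        float(fields[0])
    except ValueError:
        return False
    return True


def read_wrd_segments(lines):
    """Return one list of lowercased tokens per segment of a .wrd file."""
    segments, current = [], []
    for line in lines:
        if line.strip() == "##end##":
            # The footer closes the current segment.
            segments.append(current)
            current = []
            continue
        fields = line.split()
        # Header lines (##, the segment name, #) are not word lines and are skipped.
        if is_word_line(fields):
            current.extend(split_clitics(fields[2].lower()))
    return segments


# Demo on the opening of the segment shown above.
sample = [
    "##",
    "s10/m4bs10p1",
    "#",
    " 0.150000 76 In",
    " 0.220000 76 the",
    " 0.500000 76 twenty",
    " 0.950000 76 years",
    "##end##",
]
segments = read_wrd_segments(sample)
```

Keeping the segments separate, rather than flattening everything into one list, makes it easier to locate a mismatch with the gold standard if your token counts disagree.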

Reference: The Boston University Radio News Corpus (includes the list of POS tags used).

Evaluation

To evaluate the performance of the taggers on the test data you require gold standard data, tagged by humans. This is supplied in the files:
  /home/srenals/public/icl/test23.pos
  /home/srenals/public/icl/bu-m4b.pos

You can perform the evaluation using the program tnt-diff, or using the NLTK function tag.accuracy that is discussed in the NLTK tutorial on tagging (note, however, that tag.accuracy will only work with your backoff tagger). In either case you will need to make sure that the .pos files are tokenized correctly.
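
For reference, the accuracy figure in question is simply the fraction of tokens whose predicted tag matches the gold standard tag. The sketch below computes it by hand from two lists of (word, tag) pairs. It is not the implementation of tnt-diff or tag.accuracy; tag_accuracy is a hypothetical name, and the wrong tag in the demo is invented purely for illustration.

```python
def tag_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag.

    Both arguments are lists of (word, tag) pairs over the same tokens.
    """
    if len(gold) != len(predicted):
        # Differing lengths usually mean the two files were tokenized differently.
        raise ValueError("token counts differ: %d vs %d" % (len(gold), len(predicted)))
    if not gold:
        raise ValueError("no tokens to score")
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / float(len(gold))


# Demo: the gold tags are taken from the tnt example above;
# the predicted 'DT' for 'No' is an invented error.
gold = [("No", "RB"), (",", ","), ("it", "PRP"), ("was", "VBD")]
predicted = [("No", "DT"), (",", ","), ("it", "PRP"), ("was", "VBD")]
accuracy = tag_accuracy(gold, predicted)  # 3 of 4 tags agree
```

The length check is deliberate: if the tagger output and the .pos file disagree on the number of tokens, a token-by-token comparison is meaningless until the tokenization is fixed.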

Suggested Approach

The files you need to use are in the directory /home/srenals/public/icl. Here is the way I suggest you approach this assignment:

Submission and Assessment

Please submit your assignment, in PDF format, by running the following command under Unix on a DICE machine (don't forget the "1" in this command!):

submit ai3 icl 1 filename-of-your-submission

(NB: even if you are an MSc student, please use the ai3 code with the submit command.)

This report should be no longer than 4 pages conforming to the formatting guidelines published for the Interspeech conference. These guidelines include a template for Word users and a LaTeX style file and template. The important points are that your paper should be 4 pages maximum, 2 columns per page and no smaller than 10 point text.

Your paper should contain the following sections:

You will not be assessed on the quality of your Python code. However, please do the following:

HANDIN DEADLINE: 16:00, Friday 3 November 2006

