
Introduction to Computational Linguistics

Lab 4 - Taggers

This lab is based on the NLTK-Lite tagging tutorial and the first lecture on PoS tagging.

Import the POS-tagged treebank text into NLTK-Lite and answer the following questions:

  1. What is the most frequent tag?
  2. Which word has the greatest number of distinct tags?
  3. What proportion of word types are always assigned the same part-of-speech tag?
  4. Which nouns are more common in their plural form than in their singular form?
  5. Produce a list of word types, sorted alphabetically, that have been tagged as RBS (adverb, superlative).
  6. Identify those words which can be plural nouns or third person singular verbs (e.g., flies).
  7. Identify three-word prepositional phrases of the form IN + DET + NN (e.g., in the lab).
  8. For the word with the greatest number of distinct tags, print out sentences from the corpus that contain the word, choosing them so that each of its different tags is illustrated.
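
As a starting point, here is a minimal sketch in plain Python of how several of these questions might be approached. It runs on a tiny hand-made list of (word, tag) pairs rather than on the treebank itself. The sample data, the function names, and the use of the Penn Treebank tag DT for determiners are all assumptions made for illustration, and treating the data as one flat sequence ignores sentence boundaries.

```python
from collections import Counter, defaultdict

# Hypothetical toy data standing in for the tagged treebank text.
tagged_words = [
    ("the", "DT"), ("flies", "NNS"), ("buzz", "VBP"),
    ("in", "IN"), ("the", "DT"), ("lab", "NN"),
    ("she", "PRP"), ("flies", "VBZ"),
    ("in", "IN"), ("a", "DT"), ("lab", "NN"),
]


def most_frequent_tag(pairs):
    """Return the tag that occurs most often (question 1)."""
    counts = Counter(tag for _, tag in pairs)
    return counts.most_common(1)[0][0]


def tags_by_word(pairs):
    """Map each lower-cased word type to the set of tags it receives."""
    table = defaultdict(set)
    for word, tag in pairs:
        table[word.lower()].add(tag)
    return table


def most_ambiguous_word(pairs):
    """Return the word type with the most distinct tags (question 2)."""
    table = tags_by_word(pairs)
    return max(table, key=lambda word: len(table[word]))


def proportion_unambiguous(pairs):
    """Fraction of word types that always get the same tag (question 3)."""
    table = tags_by_word(pairs)
    single = sum(1 for tags in table.values() if len(tags) == 1)
    return single / len(table)


def plural_or_3sg(pairs):
    """Sorted word types seen both as NNS and as VBZ (question 6)."""
    table = tags_by_word(pairs)
    return sorted(w for w, tags in table.items() if {"NNS", "VBZ"} <= tags)


def tag_trigram_phrases(pairs, pattern=("IN", "DT", "NN")):
    """Collect three-word phrases whose tags match pattern (question 7)."""
    phrases = []
    for i in range(len(pairs) - 2):
        window = pairs[i:i + 3]
        if tuple(tag for _, tag in window) == pattern:
            phrases.append(" ".join(word for word, _ in window))
    return phrases


if __name__ == "__main__":
    print(most_frequent_tag(tagged_words))
    print(most_ambiguous_word(tagged_words))
    print(proportion_unambiguous(tagged_words))
    print(plural_or_3sg(tagged_words))
    print(tag_trigram_phrases(tagged_words))
```

To apply the same functions to the real data, the toy list would be replaced by the (word, tag) pairs read from the treebank. Results on the toy sample say nothing about the answers for the actual corpus.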

The regular expression tagger (NN_CD_tagger) defined in the notes aims to identify cardinal numbers and tags everything else as NN. The performance of this tagger could be improved by extending the regular expression used for tagging, e.g., by tagging any word that ends with s as a plural noun. Propose three new rules, in addition to the plural noun rule just mentioned, that could be used to tag unknown words based on the shape of the word (e.g., suffixes or other formal properties).
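
The idea of tagging by word shape can be sketched with Python's re module alone. This is not the NN_CD_tagger from the notes: the pattern list, the names PATTERNS, shape_tag and tag_tokens, and the three suffix rules (ing, ed, ly) are assumptions chosen for illustration, and other rules may work better. Patterns are tried in order and the first match wins, so the order of the list matters.

```python
import re

# Hypothetical (pattern, tag) rules, tried from top to bottom.
PATTERNS = [
    (re.compile(r"^-?[0-9]+(\.[0-9]+)?$"), "CD"),  # cardinal numbers
    (re.compile(r".*ing$"), "VBG"),                # gerunds / present participles
    (re.compile(r".*ed$"), "VBD"),                 # simple past verbs
    (re.compile(r".*ly$"), "RB"),                  # adverbs
    (re.compile(r".*s$"), "NNS"),                  # plural nouns
]


def shape_tag(word, default="NN"):
    """Return the tag of the first matching pattern, or the default tag."""
    for pattern, tag in PATTERNS:
        if pattern.match(word):
            return tag
    return default


def tag_tokens(tokens):
    """Tag a list of tokens using only the shape of each word."""
    return [(token, shape_tag(token)) for token in tokens]


if __name__ == "__main__":
    print(tag_tokens(["42", "dogs", "walked", "quickly", "running", "lab"]))
```

Suffix rules of this kind over-generate: with these patterns "bus" is tagged NNS and "red" is tagged VBD, so each proposed rule is best checked against the tagged corpus before it is kept.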



Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh