
Introduction to Computational Linguistics

Lab 7 — Chunking

This lab is based on the NLTK-Lite chunking tutorial.

  1. Start an interactive Python session from the command line, and enter the following statements (or put them in a file and execute them):

    >>> from nltk_lite.corpora import conll2000
    >>> from itertools import islice
    >>> from nltk_lite import parse
    >>> 
    >>> tagged = conll2000.tagged('train')   # get a tagged version of training corpus
    >>> taggedsample = list(islice(tagged,10,13))  # make a list of 3 sentences
    >>> 
    >>> rule = parse.ChunkRule('<DT>*<JJ>*<NN>+', "Chunk a sequence of DT, JJ and NN") 
    >>> chp = parse.RegexpChunk([rule], chunk_node = 'NP', top_node='S')
    >>> 
    >>> chunk_tree = chp.parse(taggedsample[0], trace=1)
    >>> print chunk_tree
    

    Now have a look at the chunked version of the data, and compare it with the output of your rule:

    >>> chunked = conll2000.chunked('train') # get a chunked version of training corpus
    >>> chunkedsample = list(islice(chunked,10,13))
    >>> print chunkedsample
    

    Try to refine the rule above, or add further rules, to improve your coverage of NP chunks.
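
To get a feel for what a tag pattern such as <DT>*<JJ>*<NN>+ matches, the following is a minimal sketch in plain Python that needs no corpus. It is not how NLTK-Lite implements ChunkRule; it assumes that each <TAG> matches exactly that tag, and the sentence and the helper name chunk_spans are made up for illustration.

```python
import re

# Hypothetical tagged sentence as (word, tag) pairs; not taken from CoNLL-2000.
tagged_sentence = [
    ("the", "DT"), ("little", "JJ"), ("dog", "NN"),
    ("chased", "VBD"), ("a", "DT"), ("cat", "NN"),
]

def chunk_spans(tagged, pattern=r"(?:<DT>)*(?:<JJ>)*(?:<NN>)+"):
    """Return (start, end) token spans whose tag sequence matches the pattern."""
    # Encode the tags as one string, e.g. "<DT><JJ><NN>", and record the
    # character offset at which each token's tag begins.
    offsets = []
    encoded = ""
    for _, tag in tagged:
        offsets.append(len(encoded))
        encoded += "<%s>" % tag
    offsets.append(len(encoded))  # end offset of the last tag

    spans = []
    for match in re.finditer(pattern, encoded):
        # Map character offsets back to token positions.
        spans.append((offsets.index(match.start()), offsets.index(match.end())))
    return spans

spans = chunk_spans(tagged_sentence)
chunks = [[word for word, _ in tagged_sentence[s:e]] for s, e in spans]
print(chunks)  # [['the', 'little', 'dog'], ['a', 'cat']]
```

Changing the pattern argument is a quick way to predict what a new rule will pick up before trying it on the corpus sample.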

  2. You can measure how well your chunker does on the first sentence of the sample by using the NLTK-Lite chunk scorer:

    >>> chunkscore = parse.ChunkScore()
    >>> correct = chunkedsample[0]
    >>> guess = chunk_tree
    >>> chunkscore.score(correct, guess)
    >>> print chunkscore
    

    Your result should look something like this:

    ChunkParse score:
        Precision:  33.3%
        Recall:     14.3%
        F-Measure:  20.0%
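
The three figures are related in a simple way. The sketch below recomputes them from sets of chunk spans, assuming the F-measure is the balanced harmonic mean of precision and recall (which is consistent with the numbers above). The spans and the function name chunk_scores are hypothetical: 7 correct chunks, 3 guessed chunks and 1 chunk in common is just one combination that reproduces these percentages, not necessarily what the scorer counted.

```python
def chunk_scores(correct, guessed):
    """Precision, recall and F-measure over sets of (start, end) chunk spans."""
    hits = len(correct & guessed)  # chunks found with exactly the right boundaries
    precision = hits / float(len(guessed)) if guessed else 0.0
    recall = hits / float(len(correct)) if correct else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical spans: 7 correct chunks, 3 guessed chunks, 1 in common.
correct = set([(0, 2), (3, 5), (6, 7), (8, 11), (12, 13), (14, 16), (17, 19)])
guessed = set([(0, 2), (3, 4), (9, 11)])

p, r, f = chunk_scores(correct, guessed)
print("Precision: %.1f%%" % (100 * p))  # Precision: 33.3%
print("Recall:    %.1f%%" % (100 * r))  # Recall:    14.3%
print("F-Measure: %.1f%%" % (100 * f))  # F-Measure: 20.0%
```

Note that a guessed chunk only counts as a hit in this sketch if both of its boundaries agree with a correct chunk.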
    
  3. To compare your chunker's output against the gold-standard chunks in the training data more systematically, we should score more of the data. We can also simplify things by using the leaves method, which strips out the tree structure (i.e., the chunks) from each chunked training sentence and returns just its tagged tokens:

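    >>> chunkscore = parse.ChunkScore()   # start again with an empty scorer
    >>> for correct in chunked:
    ...     guess = chp.parse(correct.leaves())   # re-chunk the bare tagged tokens
    ...     chunkscore.score(correct, guess)
    ... 
    >>> print chunkscore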
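
As a rough illustration of what stripping out the tree structure means, here is a minimal sketch that flattens a chunked sentence back into a plain list of tagged tokens. It does not use NLTK-Lite's tree class: the representation (an NP chunk as a ("NP", [tokens]) pair) and the function name leaves are assumptions made for this example only.

```python
# Hypothetical chunked sentence: a chunk is ("NP", [tokens]);
# tokens outside any chunk sit directly at the top level as (word, tag).
chunked_sentence = [
    ("NP", [("the", "DT"), ("dog", "NN")]),
    ("barked", "VBD"),
    ("at", "IN"),
    ("NP", [("a", "DT"), ("cat", "NN")]),
]

def leaves(tree):
    """Return the tagged tokens of a chunked sentence, dropping the chunk nodes."""
    flat = []
    for node in tree:
        _, content = node
        if isinstance(content, list):  # a chunk: keep only its tokens
            flat.extend(content)
        else:                          # an unchunked (word, tag) token
            flat.append(node)
    return flat

tokens = leaves(chunked_sentence)
print(tokens)
```

The flattened list has the same form as the tagged sentences used in step 1, which is why it can be passed straight to the chunk parser.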
    


Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh