Lab 4: POS Tagging

Authors: Henry Thompson, Bharat Ram Ambati
Date: 2014-10-01
Copyright: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License: You may re-use, redistribute, or modify this work for non-commercial purposes provided you retain attribution to any previous author(s).

POS Tagset

Distribution of sentence lengths

Distribution of tags

Distribution of tags and words

Unigram Tagger

Going Further

  1. Run the simple tagger developed above on the sentence "I bought two new books." What error do you see, and what improvement can you think of to handle it?

    Splitting the sentence on whitespace treats "books." as a single token. Since this is an unseen word, the tagger assigns it the default tag. If we instead run a tokenizer, word_tokenize(sentence), "books" and "." are treated as two lexical items, and the tagger is able to assign the correct tags (see the sketch below).
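
    A minimal sketch of the difference, assuming the NLTK Penn Treebank sample as training data and a unigram tagger with an 'NN' default backoff as a stand-in for the simple tagger built earlier in the lab:

    import nltk
    from nltk.corpus import treebank  # assumed training corpus; substitute the one used in the lab

    # Stand-in for the lab's simple tagger: a unigram tagger with an 'NN' default backoff
    simple_tagger = nltk.UnigramTagger(treebank.tagged_sents(),
                                       backoff=nltk.DefaultTagger('NN'))

    sentence = "I bought two new books."

    # Splitting on whitespace leaves "books." as a single unseen token, which gets the default tag
    print(simple_tagger.tag(sentence.split()))

    # word_tokenize separates "books" and ".", so both can be tagged correctly
    print(simple_tagger.tag(nltk.word_tokenize(sentence)))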

  2. Do you think the Penn tagset will work well for social media text such as Twitter data, which contains non-standard English?

    Not really. There has been some recent work on developing a new tagset for Twitter data. See: https://www.aclweb.org/anthology/P/P11/P11-2008.pdf

  3. NLTK has libraries for training different taggers. Using these, build unigram and Hidden Markov Model taggers and evaluate them. First, split the data into two parts (90% and 10%): use the first 90% of the data as training data and the remaining 10% as test data. Build the taggers using the training data, then run them on the test data and evaluate their performance, as sketched below.

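    The split can be done by slicing the list of tagged sentences. A minimal sketch, assuming the NLTK Penn Treebank sample (substitute whichever tagged corpus the lab loaded earlier):

    from nltk.corpus import treebank  # assumed corpus

    # 90%/10% split into training and test data
    tagged_sents = list(treebank.tagged_sents())
    cutoff = int(len(tagged_sents) * 0.9)
    train_sents = tagged_sents[:cutoff]
    test_sents = tagged_sents[cutoff:]
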
    import nltk
    from nltk.tag.hmm import HiddenMarkovModelTagger

    # Unigram tagger, backing off to a default 'NN' tag for unseen words
    unigram_tagger = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))

    # HMM tagger trained on the same training sentences
    hmm_tagger = HiddenMarkovModelTagger.train(train_sents)

    # Tagging accuracy on the held-out test data
    print(unigram_tagger.evaluate(test_sents))
    print(hmm_tagger.evaluate(test_sents))