FNLP: Lab Session 3

Hidden Markov Models - Construction and Use

In [5]:
# Import the packages used for this lab

import nltk

#import brown corpus
from nltk.corpus import brown

# module for training a Hidden Markov Model and tagging sequences
from nltk.tag.hmm import HiddenMarkovModelTagger

# module for computing a Conditional Frequency Distribution
from nltk.probability import ConditionalFreqDist

# module for computing a Conditional Probability Distribution
from nltk.probability import ConditionalProbDist

# module for computing a probability distribution with the Maximum Likelihood Estimate
from nltk.probability import MLEProbDist

import operator
import random

Corpora tagged with part-of-speech information

NLTK provides corpora annotated with part-of-speech (POS) information and some tools to access this information. The Penn Treebank tagset is commonly used for annotating English sentences. We can inspect this tagset in the following way:

In [ ]:
nltk.help.upenn_tagset()

The Brown corpus provided with NLTK is also tagged with POS information, although the tagset is slightly different than the Penn Treebank tagset. Information about the Brown corpus tagset can be found here: http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html

We can retrieve the tagged sentences in the Brown corpus by calling the tagged_sents() function and looking at an annotated sentence:

In [ ]:
tagged_sentences = brown.tagged_sents(categories= 'news')
print tagged_sentences[29]

Sometimes it is useful to use a coarser label set in order to avoid data sparsity, or to allow a mapping between the POS labels of different languages. The Universal tagset was designed to be applicable to all languages:

https://code.google.com/p/universal-pos-tags/.

There are mappings between the POS tagset of several languages and the Universal tagset. We can access the Universal tags for the Brown corpus sentences by changing the tagset argument:

In [ ]:
tagged_sentences_universal = brown.tagged_sents(categories= 'news', tagset='universal')
print tagged_sentences_universal[29]

Exercise 1:

In this exercise we will compute a Frequency Distribution over tags that appear in the Brown corpus. The template of the function that you have to implement takes two parameters: the category of the text and the tagset name. You are given the code that retrieves the list of (word, tag) tuples from the Brown corpus for the given category and tagset.

  1. Convert the list of word+tag pairs to a list of tags
  2. Use the list of tags to compute a frequency distribution over the tags, using FreqDist()
  3. Compute the total number of tags in the Frequency Distribution
  4. Retrieve the top 10 most frequent tags
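
Before filling in the template, here is a minimal sketch of the FreqDist calls you will need, run on a hand-made list of toy tags (the tags below are made up for illustration; this is not the exercise solution):

In [ ]:
# Minimal FreqDist sketch on toy data (not the exercise solution)
toy_tags = ['NN', 'DT', 'NN', 'VB', 'NN', 'DT']

toy_fd = nltk.FreqDist(toy_tags)

print toy_fd.N()             # total number of tag tokens counted: 6
print len(toy_fd)            # number of distinct tags: 3
print toy_fd.most_common(2)  # the 2 most frequent tags with their counts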
In [9]:
############# EXERCISE 1 #################
# Solution for exercise 1
# Input: genre (string), tagset (string)
# Output: number_of_tags (int), top_tags (list of string)


# get the number of tags found in the corpus
# compute the Frequency Distribution of tags

def ex1(genre,tagset):
  
    # get the tagged words from the corpus
    tagged_words = brown.tagged_words(categories= genre, tagset=tagset)
  
    # TODO: build a list of tags
    tags = 
  
    # TODO: using the above list compute the Frequency Distribution of tags in the corpus
    # hint: use nltk.FreqDist()
    tagsFDist = 
  
    # TODO: retrieve the total number of tags in the tagset
    number_of_tags = 
  
    #TODO: retrieve the top 10 most frequent tags
    top_tags = 

    return (number_of_tags,top_tags)
In [ ]:
# Test your code for exercise 1

def test_ex1():
    print "Tag FreqDist for news:"
    print ex1('news',None)
  
    print "Tag FreqDist for science_fiction:"
    print ex1('science_fiction',None)
  
    # Do the same thing for a different tagset: Universal. Observe differences
  
    print "Tag FreqDist for news with Universal tagset:"
    print ex1('news','universal')
  
    print "Tag FreqDist for science_fiction with Universal tagset:"
    print ex1('science_fiction','universal')


# Let's look at the top tags for different genres and tagsets. Observe differences
test_ex1()

Training and Evaluating an HMM Tagger

NLTK provides a module for training a Hidden Markov Model for sequence tagging.

In [ ]:
help(nltk.tag.hmm.HiddenMarkovModelTagger)

We can train the HMM for POS tagging given a labelled dataset. In Section 1 of this lab we learned how to access the labelled sentences of the Brown corpus. We will use this dataset to study the effect of the size of the training corpus on the accuracy of the tagger.

Exercise 2:

In this exercise we will train a HMM tagger on a training set and evaluate it on a test set. The template of the function that you have to implement takes two parameters: a sentence to be tagged and the size of the training corpus in number of sentences. You are given the code that creates the training and test datasets from the tagged sentences in the Brown corpus.

  1. Train a Hidden Markov Model tagger on the training dataset. Refer to help(nltk.tag.hmm.HiddenMarkovModelTagger.train) if necessary.
  2. Use the trained model to tag the sentence
  3. Use the trained model to evaluate the tagger on the test dataset
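
As a hint, the calls you will need look roughly like the sketch below, run here on tiny slices of the data (the demo_* names are ours; this is not the exercise solution):

In [ ]:
# Minimal HiddenMarkovModelTagger sketch (not the exercise solution)
demo_train = brown.tagged_sents(categories='news')[100:200]  # a tiny training set, for illustration only
demo_test = brown.tagged_sents(categories='news')[:10]       # a tiny test set

demo_tagger = HiddenMarkovModelTagger.train(demo_train)      # train() is a class method that returns a tagger
print demo_tagger.tag(['The', 'jury', 'said', '.'])          # tag a list of words: returns a list of (word, tag) tuples
print demo_tagger.evaluate(demo_test)                        # tagging accuracy against gold-standard sentences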
In [13]:
############# EXERCISE 2 #################
# Solution for exercise 2
# Input: sentence (list of string), size (int, <4600)
# Output: tagger (HiddenMarkovModelTagger), hmm_tagged_sentence (list of tuples), eres (float)

# hint: use the help on HiddenMarkovModelTagger to find out how to train, tag and evaluate using this module
def ex2(sentence, size):
  
    tagged_sentences = brown.tagged_sents(categories= 'news')
    # size should satisfy 0 < size < 4600 (the number of training sentences)
  
    # set up the training data
    train_data = tagged_sentences[-size:]
  
    # set up the test data
    test_data = tagged_sentences[:100]

    # TODO: train a HiddenMarkovModelTagger
    # hint: use the train() method
    tagger = 

    # TODO: using the hmm tagger tag the sentence
    hmm_tagged_sentence = 
  
    # TODO: using the hmm tagger evaluate on the test data
    eres = 

    return (tagger, hmm_tagged_sentence,eres)
In [ ]:
# Test your code for exercise 2
def test_ex2():
    tagged_sentences = brown.tagged_sents(categories= 'news')
    words = [tp[0] for tp in tagged_sentences[42]]
    (tagger, hmm_tagged_sentence, eres ) = ex2(words,500)
    print "Sentenced tagged with nltk.HiddenMarkovModelTagger:"
    print hmm_tagged_sentence
    print "Eval score:"
    print eres
  
    (tagger, hmm_tagged_sentence, eres ) = ex2(words,3000)
    print "Sentenced tagged with nltk.HiddenMarkovModelTagger:"
    print hmm_tagged_sentence
    print "Eval score:"
    print eres


#Look at the tagged sentence and the accuracy of the tagger. How does the size of the training set affect the accuracy?
test_ex2()

Computing the Transition and Emission Probabilities

In the previous exercise we learned how to train and evaluate an HMM tagger. We have used the HMM tagger as a black box and have seen how the training data affects the accuracy of the tagger. In order to get a better understanding of the HMM we will look at the two components of this model:

  • The transition model
  • The emission model

The transition model estimates $P (tag_{i+1} |tag_i )$, the probability of a POS tag at position $i+1$ given the previous tag (at position $i$). The emission model estimates $P (word|tag)$, the probability of the observed word given a tag.

Given the above definitions, we will need to learn a Conditional Probability Distribution for each of the models.
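
With the Maximum Likelihood Estimate used in this lab, both distributions reduce to relative frequencies counted from the training corpus: $P(tag_{i+1}|tag_i) \approx \frac{C(tag_i, tag_{i+1})}{C(tag_i)}$ and $P(word|tag) \approx \frac{C(tag, word)}{C(tag)}$, where $C(\cdot)$ counts occurrences in the training data.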

In [ ]:
help(nltk.probability.ConditionalProbDist)

Exercise 3:

In this exercise we will estimate the emission model. In order to compute the Conditional Probability Distribution of $P (word|tag)$ we first have to compute the Conditional Frequency Distribution of a word given a tag.

In [ ]:
help(nltk.probability.ConditionalFreqDist)
help(nltk.probability.ConditionalProbDist)

The constructor of the ConditionalFreqDist class takes as input a list of tuples, each tuple consisting of a condition and an observation. For the emission model, the conditions are tags and the observations are the words. The template of the function that you have to implement takes as argument the list of tagged words from the Brown corpus.

  1. Build the dataset to be passed to the ConditionalFreqDist() constructor. Words should be lowercased. Each item of data should be a tuple of tag (a condition) and word (an observation).
  2. Compute the Conditional Frequency Distribution of words given tags.
  3. Return the top 10 most frequent words given the tag NN.
  4. Compute the Conditional Probability Distribution for the above Conditional Frequency Distribution. Use the MLEProbDist estimator when calling the ConditionalProbDist constructor.
  5. Compute the probabilities:

    $P(year|NN)$

    $P(year|DT)$
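
A minimal sketch of the two classes in action, on toy (tag, word) pairs made up for illustration (this is not the exercise solution):

In [ ]:
# Minimal ConditionalFreqDist / ConditionalProbDist sketch on toy data (not the exercise solution)
toy_pairs = [('NN', 'year'), ('NN', 'time'), ('NN', 'year'), ('DT', 'the')]

toy_cfd = ConditionalFreqDist(toy_pairs)
print toy_cfd['NN'].most_common(2)  # most frequent words observed under the condition 'NN'

toy_cpd = ConditionalProbDist(toy_cfd, MLEProbDist)
print toy_cpd['NN'].prob('year')    # MLE estimate of P(year|NN) on the toy counts: 2/3
print toy_cpd['DT'].prob('year')    # 'year' never occurs with 'DT', so the MLE estimate is 0.0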

In [ ]:
############# EXERCISE 3 #################
# Solution for exercise 3
# Input: tagged_words (list of tuples)
# Output: emission_FD (ConditionalFreqDist), top_NN (list of string), emission_PD (ConditionalProbDist), p_NN (float), p_DT (float)


# in the previous labs we've seen how to build a freq dist
# we need conditional distributions to estimate the transition and emission models
# in this exercise we estimate the emission model
def ex3(tagged_words):

    # TODO: prepare the data
    # the data object should be a list of tuples of conditions and observations
    # in our case the tuples will be of the form (tag,word) where words are lowercased
    data = 

    # TODO: compute a Conditional Frequency Distribution for words given their tags using our data
    emission_FD = 

    # TODO: return the top 10 most frequent words given the tag NN
    top_NN = 

    # TODO: Compute the Conditional Probability Distribution using the above Conditional Frequency Distribution. Use MLEProbDist estimator.
    emission_PD = 

    # TODO: compute the probabilities of P(year|NN) and P(year|DT)
    p_NN = 
    p_DT = 

    return (emission_FD, top_NN, emission_PD, p_NN, p_DT)
In [ ]:
### Test your solution for exercise 3

def test_ex3():
    tagged_words = brown.tagged_words(categories='news')
    (emission_FD, top_NN, emission_PD, p_NN, p_DT) = ex3(tagged_words)
    print "Frequency of words given the tag *NN*: ", top_NN
    print "P(year|NN) = ", p_NN
    print "P(year|DT) = ", p_DT


#Look at the estimated probabilities. Why is P(year|DT) = 0 ? What are the problems with having 0 probabilities and what can be done to avoid this?
test_ex3()

What are the problems with having zero probabilities and what can be done to avoid this?

Exercise 4:

In this exercise we will estimate the transition model. In order to compute the Conditional Probability Distribution of $P (tag_{i+1} |tag_i )$ we first have to compute the Conditional Frequency Distribution of a tag at position $i + 1$ given the previous tag.

The constructor of the ConditionalFreqDist class takes as input a list of tuples, each tuple consisting of a condition and an observation. For the transition model, the conditions are tags at position i and the observations are tags at position $i + 1$. The template of the function that you have to implement takes as argument the list of tagged sentences from the Brown corpus.

  1. Build the dataset to be passed to the ConditionalFreqDist() constructor. Each item in your data should be a pair of condition and observation: $(tag_i,tag_{i+1})$
  2. Compute the Conditional Frequency Distribution of a tag at position $i + 1$ given the previous tag.
  3. Compute the Conditional Probability Distribution for the above Conditional Frequency Distribution. Use the MLEProbDist estimator when calling the ConditionalProbDist constructor.
  4. Compute the probabilities

    $P(NN|VBD)$

    $P(NN|DT)$
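
The only genuinely new step compared to exercise 3 is building the $(tag_i, tag_{i+1})$ pairs. A minimal sketch on one toy sentence (the real exercise loops over all tagged sentences; this is not the exercise solution):

In [ ]:
# Minimal sketch: (tag_i, tag_(i+1)) pairs from one toy tagged sentence (not the exercise solution)
toy_sent = [('The', 'DT'), ('year', 'NN'), ('ended', 'VBD'), ('.', '.')]
toy_tags = [tag for (word, tag) in toy_sent]
print zip(toy_tags, toy_tags[1:])   # [('DT', 'NN'), ('NN', 'VBD'), ('VBD', '.')]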

In [ ]:
############# EXERCISE 4 #################
# Solution for exercise 4
# Input: tagged_sentences (list)
# Output: transition_FD (ConditionalFreqDist), transition_PD (ConditionalProbDist), p_VBD_NN (float), p_DT_NN (float)

# compute the transition probabilities
# the probabilities of a tag at position i+1 given the tag at position i
def ex4(tagged_sentences):

    # TODO: prepare the data
    # the data object should be an array of tuples of conditions and observations
    # in our case the tuples will be of the form (tag_(i),tag_(i+1))
    data = 

    # TODO: compute the Conditional Frequency Distribution for a tag given the previous tag
    # hint: use the ConditionalFreqDist()
    transition_FD =

    # TODO: compute the transition probability P(tag_(i+1)|tag_(i)) using the MLEProbDist to estimate the probabilities
    # hint: use ConditionalProbDist()
    transition_PD =

    # TODO: compute the probabilities of P(NN|VBD) and P(NN|DT)
    p_VBD_NN = 
    p_DT_NN = 

    return (transition_FD, transition_PD,p_VBD_NN, p_DT_NN )
In [ ]:
### Test your solution for exercise 4

def test_ex4():
    tagged_sentences = brown.tagged_sents(categories= 'news')
    (transition_FD, transition_PD,p_VBD_NN, p_DT_NN ) = ex4(tagged_sentences)
    print "P(NN|VBD) = ", p_VBD_NN
    print "P(NN|DT) = ", p_DT_NN

    
# Are the results what you would expect? The sequence DT NN seems very probable. How will this affect the sequence tagging?
test_ex4()

Going further

Modify your code for exercise 3 to use a different estimator, to introduce some smoothing, and compare the results with the original. In exercise 4 we didn't do anything about the boundaries. Modify your code for exercise 4 to use <s> at the beginning of every sentence and </s> at the end.
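
For the smoothing part, one possible choice (our suggestion, not prescribed by the lab) is Lidstone smoothing via nltk.probability.LidstoneProbDist, which adds a small pseudo-count gamma to every event so that unseen (tag, word) pairs no longer get zero probability:

In [ ]:
# Sketch: swapping MLEProbDist for a smoothed estimator (one possible choice, not the only one)
from nltk.probability import LidstoneProbDist

toy_pairs = [('NN', 'year'), ('NN', 'time'), ('NN', 'year'), ('DT', 'the')]
toy_cfd = ConditionalFreqDist(toy_pairs)

# gamma=0.1 is an arbitrary illustrative value; in a real run you may also want to pass a bins argument
smoothed_cpd = ConditionalProbDist(toy_cfd, LidstoneProbDist, 0.1)
print smoothed_cpd['DT'].prob('year')  # no longer exactly 0, unlike the MLE estimate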

Explore the resulting conditional probabilities. What is the most likely tag at the beginning of a sentence? At the end?
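
For the boundary question, a minimal sketch (assuming we simply introduce '<s>' and '</s>' as extra pseudo-tags; this convention is ours, not prescribed by the lab):

In [ ]:
# Sketch: pad each sentence with boundary pseudo-tags before collecting transition pairs
def padded_tag_pairs(tagged_sentence):
    tags = ['<s>'] + [tag for (word, tag) in tagged_sentence] + ['</s>']
    return zip(tags, tags[1:])

print padded_tag_pairs([('The', 'DT'), ('year', 'NN'), ('ended', 'VBD'), ('.', '.')])
# Once these pairs go into the ConditionalFreqDist/ConditionalProbDist from exercise 4,
# the most likely tag at the start of a sentence can be read off, e.g. transition_PD['<s>'].max()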