# Import the packages used for this lab
import nltk
# import the Brown corpus
from nltk.corpus import brown
# module for training a Hidden Markov Model and tagging sequences
from nltk.tag.hmm import HiddenMarkovModelTagger
# module for computing a Conditional Frequency Distribution
from nltk.probability import ConditionalFreqDist
# module for computing a Conditional Probability Distribution
from nltk.probability import ConditionalProbDist
# module for computing a probability distribution with the Maximum Likelihood Estimate
from nltk.probability import MLEProbDist
import operator
import random
NLTK provides corpora annotated with part-of-speech (POS) information and some tools to access this information. The Penn Treebank tagset is commonly used for annotating English sentences. We can inspect this tagset in the following way:
nltk.help.upenn_tagset()
The Brown corpus provided with NLTK is also tagged with POS information, although its tagset differs slightly from the Penn Treebank tagset. Information about the Brown corpus tagset can be found here: http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html
We can retrieve the tagged sentences in the Brown corpus by calling the tagged_sents() function, and then look at one annotated sentence:
tagged_sentences = brown.tagged_sents(categories= 'news')
print tagged_sentences[29]
Sometimes it is useful to use a coarser label set in order to avoid data sparsity or to allow a mapping between the POS labels of different languages. The Universal tagset was designed to be applicable to all languages: https://code.google.com/p/universalpostags/.
There are mappings between the POS tagsets of several languages and the Universal tagset. We can access the Universal tags for the Brown corpus sentences by changing the tagset argument:
tagged_sentences_universal = brown.tagged_sents(categories= 'news', tagset='universal')
print tagged_sentences_universal[29]
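To make the idea of a coarse tagset concrete, here is a minimal sketch of what such a mapping does. The dictionary below is a small hand-picked subset of Brown-to-Universal correspondences, not the full mapping shipped with NLTK, and to_universal is a hypothetical helper introduced only for this illustration.

```python
# Hand-picked subset of Brown -> Universal correspondences (illustrative only,
# not the complete mapping that NLTK applies when tagset='universal').
BROWN_TO_UNIVERSAL = {
    "AT": "DET",    # article
    "NN": "NOUN",   # singular noun
    "NNS": "NOUN",  # plural noun
    "VBD": "VERB",  # past-tense verb
    "JJ": "ADJ",    # adjective
    "IN": "ADP",    # preposition
}


def to_universal(tagged_sentence, mapping=BROWN_TO_UNIVERSAL, unknown="X"):
    """Replace each fine-grained tag by its coarse tag.

    Tags missing from the mapping are replaced by `unknown`.
    """
    return [(word, mapping.get(tag, unknown)) for (word, tag) in tagged_sentence]


toy_sentence = [("the", "AT"), ("old", "JJ"), ("dogs", "NNS"), ("barked", "VBD")]
coarse = to_universal(toy_sentence)
```

Note that NN and NNS collapse into the same coarse label, which is exactly how a coarser tagset reduces sparsity.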
In this exercise we will compute a Frequency Distribution over the tags that appear in the Brown corpus. The template of the function that you have to implement takes two parameters: one is the category of the text and the other is the tagset name. You are given the code to retrieve the list of (word, tag) tuples from the Brown corpus corresponding to the given category and tagset. Use nltk.FreqDist() to compute the distribution.
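As a warm-up on toy data, the sketch below counts tags with collections.Counter from the standard library. This is a stand-in chosen so the example runs without NLTK; in recent NLTK versions FreqDist offers a similar most_common() interface. The toy word list and variable names are made up for illustration.

```python
from collections import Counter

# Toy (word, tag) pairs, invented for illustration.
toy_tagged_words = [
    ("the", "DET"), ("dog", "NOUN"), ("barked", "VERB"),
    ("at", "ADP"), ("the", "DET"), ("cat", "NOUN"), (".", "."),
]

# Keep only the tag of each pair.
toy_tags = [tag for (word, tag) in toy_tagged_words]

# Count how often each tag occurs.
tag_counts = Counter(toy_tags)

# Number of distinct tags observed in the data.
distinct_tags = len(tag_counts)

# The two most frequent tags (ties are possible, so order is not guaranteed here).
top_two = [tag for (tag, count) in tag_counts.most_common(2)]
```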
############# EXERCISE 1 #################
# Solution for exercise 1
# Input: genre (string), tagset (string)
# Output: number_of_tags (int), top_tags (list of string)
# get the number of tags found in the corpus
# compute the Frequency Distribution of tags
def ex1(genre, tagset):
    # get the tagged words from the corpus
    tagged_words = brown.tagged_words(categories=genre, tagset=tagset)
    # TODO: build a list of tags
    tags =
    # TODO: using the above list compute the Frequency Distribution of tags in the corpus
    # hint: use nltk.FreqDist()
    tagsFDist =
    # TODO: retrieve the total number of tags in the tagset
    number_of_tags =
    # TODO: retrieve the top 10 most frequent tags
    top_tags =
    return (number_of_tags, top_tags)
# Test your code for exercise 1
def test_ex1():
    print "Tag FreqDist for news:"
    print ex1('news', None)
    print "Tag FreqDist for science_fiction:"
    print ex1('science_fiction', None)
    # Do the same thing for a different tagset: Universal. Observe differences
    print "Tag FreqDist for news with Universal tagset:"
    print ex1('news', 'universal')
    print "Tag FreqDist for science_fiction with Universal tagset:"
    print ex1('science_fiction', 'universal')
# Let's look at the top tags for different genres and tagsets. Observe differences
test_ex1()
NLTK provides a module for training a Hidden Markov Model for sequence tagging.
help(nltk.tag.hmm.HiddenMarkovModelTagger)
We can train the HMM for POS tagging given a labelled dataset. In Section 1 of this lab we learned how to access the labelled sentences of the Brown corpus. We will use this dataset to study the effect of the size of the training corpus on the accuracy of the tagger.
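Accuracy for a tagger is usually measured per token: the fraction of tokens whose predicted tag matches the gold tag. A minimal sketch of that computation follows, assuming this is the metric of interest; token_accuracy is a hypothetical helper written for this illustration, not an NLTK function, and the sentences are toy data.

```python
def token_accuracy(gold_sentences, predicted_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = 0
    total = 0
    for gold, predicted in zip(gold_sentences, predicted_sentences):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            total += 1
            if gold_tag == predicted_tag:
                correct += 1
    # Guard against an empty test set.
    return float(correct) / total if total else 0.0


# Toy gold annotations and toy tagger output (one tag differs).
gold = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("a", "AT"), ("cat", "NN")],
]
predicted = [
    [("the", "AT"), ("dog", "NN"), ("barked", "NN")],
    [("a", "AT"), ("cat", "NN")],
]

accuracy = token_accuracy(gold, predicted)
```

Four of the five toy tokens are tagged correctly, so the accuracy is 0.8.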
In this exercise we will train an HMM tagger on a training set and evaluate it on a test set. The template of the function that you have to implement takes two parameters: a sentence to be tagged and the size of the training corpus in number of sentences. You are given the code that creates the training and test datasets from the tagged sentences in the Brown corpus.
Consult help(nltk.tag.hmm.HiddenMarkovModelTagger.train) if necessary.
############# EXERCISE 2 #################
# Solution for exercise 2
# Input: sentence (list of string), size (int, number of training sentences, <4600)
# Output: tagger (HiddenMarkovModelTagger), hmm_tagged_sentence (list of tuples), eres (evaluation score)
# hint: use the help on HiddenMarkovModelTagger to find out how to train, tag and evaluate using this module
def ex2(sentence, size):
    tagged_sentences = brown.tagged_sents(categories='news')
    # size must satisfy 0 < size < 4600
    # set up the training data
    train_data = tagged_sentences[-size:]
    # set up the test data
    test_data = tagged_sentences[:100]
    # TODO: train a HiddenMarkovModelTagger
    # hint: use the train() method
    tagger =
    # TODO: using the hmm tagger tag the sentence
    hmm_tagged_sentence =
    # TODO: using the hmm tagger evaluate on the test data
    eres =
    return (tagger, hmm_tagged_sentence, eres)
# Test your code for exercise 2
def test_ex2():
    tagged_sentences = brown.tagged_sents(categories='news')
    words = [tp[0] for tp in tagged_sentences[42]]
    (tagger, hmm_tagged_sentence, eres) = ex2(words, 500)
    print "Sentence tagged with nltk.HiddenMarkovModelTagger:"
    print hmm_tagged_sentence
    print "Eval score:"
    print eres
    (tagger, hmm_tagged_sentence, eres) = ex2(words, 3000)
    print "Sentence tagged with nltk.HiddenMarkovModelTagger:"
    print hmm_tagged_sentence
    print "Eval score:"
    print eres
# Look at the tagged sentence and the accuracy of the tagger. How does the size of the training set affect the accuracy?
test_ex2()
In the previous exercise we learned how to train and evaluate an HMM tagger. We have used the HMM tagger as a black box and have seen how the training data affects the accuracy of the tagger. In order to get a better understanding of the HMM we will look at the two components of this model:
The transition model estimates $P(tag_{i+1}|tag_i)$, the probability of a POS tag at position $i+1$ given the previous tag (at position $i$).
The emission model estimates $P(word|tag)$, the probability of the observed word given a tag.
Given the above definitions, we will need to learn a Conditional Probability Distribution for each of the models.
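To see how the two models work together, the sketch below scores one tagged sentence as the product of its transition and emission probabilities, computed in log space. All probabilities are hand-set toy values, the start symbol "<s>" is an assumption of this example, and joint_log_prob is a hypothetical helper rather than part of NLTK.

```python
import math

# Hand-set toy probabilities, for illustration only.
# Keys are (previous_tag, tag) for transitions and (tag, word) for emissions.
transition = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.7, ("NN", "VBD"): 0.3}
emission = {("DT", "the"): 0.5, ("NN", "dog"): 0.1, ("VBD", "barked"): 0.05}


def joint_log_prob(words, tags, transition, emission, start="<s>"):
    """Log of the product over i of P(tag_i | tag_{i-1}) * P(word_i | tag_i)."""
    log_p = 0.0
    previous = start
    for word, tag in zip(words, tags):
        p = transition.get((previous, tag), 0.0) * emission.get((tag, word), 0.0)
        if p == 0.0:
            # One unseen event makes the whole sequence impossible.
            return float("-inf")
        log_p += math.log(p)
        previous = tag
    return log_p


score = joint_log_prob(["the", "dog", "barked"], ["DT", "NN", "VBD"],
                       transition, emission)
impossible = joint_log_prob(["the", "cat"], ["DT", "NN"], transition, emission)
```

The second call returns negative infinity because the toy emission table has no entry for "cat", which previews why zero probabilities are a problem for sequence tagging.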
help(nltk.probability.ConditionalProbDist)
In this exercise we will estimate the emission model. In order to compute the Conditional Probability Distribution of $P(word|tag)$ we first have to compute the Conditional Frequency Distribution of a word given a tag.
help(nltk.probability.ConditionalFreqDist)
help(nltk.probability.ConditionalProbDist)
The constructor of the ConditionalFreqDist class takes as input a list of tuples, each tuple consisting of a condition and an observation. For the emission model, the conditions are tags and the observations are the words. The template of the function that you have to implement takes as argument the list of tagged words from the Brown corpus.
Build the data for the ConditionalFreqDist() constructor. Words should be lowercased. Each item of data should be a tuple of tag (a condition) and word (an observation).
Use the MLEProbDist estimator when calling the ConditionalProbDist constructor.
Compute the probabilities $P(year|NN)$ and $P(year|DT)$.
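The maximum likelihood estimate behind these steps is just a relative frequency: $P(word|tag) = count(tag, word) / count(tag)$. The sketch below computes it by hand on toy data using only the standard library, as a plain-Python illustration of what the conditional distributions represent. The toy words and the helper mle_prob are made up for this example.

```python
from collections import Counter, defaultdict

# Toy (word, tag) pairs using Brown-style tags, invented for illustration.
toy_tagged_words = [
    ("The", "AT"), ("year", "NN"), ("ended", "VBD"),
    ("the", "AT"), ("Year", "NN"), ("dog", "NN"),
]

# Conditions come first: (tag, lowercased word).
data = [(tag, word.lower()) for (word, tag) in toy_tagged_words]

# One counter of words per tag.
cond_counts = defaultdict(Counter)
for tag, word in data:
    cond_counts[tag][word] += 1


def mle_prob(cond_counts, tag, word):
    """Relative frequency of `word` among all words seen with `tag`."""
    counts = cond_counts.get(tag, Counter())
    total = sum(counts.values())
    return float(counts[word]) / total if total else 0.0


p_year_given_nn = mle_prob(cond_counts, "NN", "year")
p_year_given_at = mle_prob(cond_counts, "AT", "year")
```

Here "year" accounts for two of the three NN tokens, while it never occurs with AT, so the unsmoothed estimate for that pair is exactly zero.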
############# EXERCISE 3 #################
# Solution for exercise 3
# Input: tagged_words (list of tuples)
# Output: emission_FD (ConditionalFreqDist), top_NN (list), emission_PD (ConditionalProbDist), p_NN (float), p_DT (float)
# in the previous labs we've seen how to build a freq dist
# we need conditional distributions to estimate the transition and emission models
# in this exercise we estimate the emission model
def ex3(tagged_words):
    # TODO: prepare the data
    # the data object should be a list of tuples of conditions and observations
    # in our case the tuples will be of the form (tag,word) where words are lowercased
    data =
    # TODO: compute a Conditional Frequency Distribution for words given their tags using our data
    emission_FD =
    # TODO: return the top 10 most frequent words given the tag NN
    top_NN =
    # TODO: Compute the Conditional Probability Distribution using the above Conditional Frequency Distribution. Use MLEProbDist estimator.
    emission_PD =
    # TODO: compute the probabilities of P(year|NN) and P(year|DT)
    p_NN =
    p_DT =
    return (emission_FD, top_NN, emission_PD, p_NN, p_DT)
### Test your solution for exercise 3
def test_ex3():
    tagged_words = brown.tagged_words(categories='news')
    (emission_FD, top_NN, emission_PD, p_NN, p_DT) = ex3(tagged_words)
    print "Frequency of words given the tag *NN*: ", top_NN
    print "P(year|NN) = ", p_NN
    print "P(year|DT) = ", p_DT
# Look at the estimated probabilities. Why is P(year|DT) = 0? What are the problems with having 0 probabilities and what can be done to avoid this?
test_ex3()
What are the problems with having zero probabilities and what can be done to avoid this?
In this exercise we will estimate the transition model. In order to compute the Conditional Probability Distribution of $P(tag_{i+1}|tag_i)$ we first have to compute the Conditional Frequency Distribution of a tag at position $i+1$ given the previous tag.
The constructor of the ConditionalFreqDist class takes as input a list of tuples, each tuple consisting of a condition and an observation. For the transition model, the conditions are tags at position $i$ and the observations are tags at position $i+1$. The template of the function that you have to implement takes as argument the list of tagged sentences from the Brown corpus.
Build the data for the ConditionalFreqDist() constructor. Each item in your data should be a pair of condition and observation: $(tag_i, tag_{i+1})$.
Use the MLEProbDist estimator when calling the ConditionalProbDist constructor.
Compute the probabilities $P(NN|VBD)$ and $P(NN|DT)$.
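The pairs $(tag_i, tag_{i+1})$ are tag bigrams collected sentence by sentence, so that no pair spans two sentences. A minimal sketch on toy data follows, again in plain Python rather than with the NLTK classes; the sentences and the helper transition_prob are made up for this illustration.

```python
from collections import Counter, defaultdict

# Toy tagged sentences, invented for illustration.
toy_sentences = [
    [("the", "AT"), ("dog", "NN"), ("barked", "VBD")],
    [("the", "AT"), ("old", "JJ"), ("dog", "NN")],
]

# Collect (tag_i, tag_{i+1}) pairs within each sentence.
data = []
for sentence in toy_sentences:
    tags = [tag for (word, tag) in sentence]
    data.extend(zip(tags[:-1], tags[1:]))

# One counter of following tags per preceding tag.
transition_counts = defaultdict(Counter)
for previous, current in data:
    transition_counts[previous][current] += 1


def transition_prob(transition_counts, previous, current):
    """Relative frequency of `current` among all tags that follow `previous`."""
    counts = transition_counts.get(previous, Counter())
    total = sum(counts.values())
    return float(counts[current]) / total if total else 0.0


p_nn_given_at = transition_prob(transition_counts, "AT", "NN")
p_nn_given_jj = transition_prob(transition_counts, "JJ", "NN")
```

In the toy data AT is followed once by NN and once by JJ, so the estimate of $P(NN|AT)$ is 0.5, while JJ is always followed by NN.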
############# EXERCISE 4 #################
# Solution for exercise 4
# Input: tagged_sentences (list)
# Output: transition_FD (ConditionalFreqDist), transition_PD (ConditionalProbDist), p_VBD_NN, p_DT_NN
# compute the transition probabilities
# the probabilities of a tag at position i+1 given the tag at position i
def ex4(tagged_sentences):
    # TODO: prepare the data
    # the data object should be a list of tuples of conditions and observations
    # in our case the tuples will be of the form (tag_(i),tag_(i+1))
    data =
    # TODO: compute the Conditional Frequency Distribution for a tag given the previous tag
    # hint: use the ConditionalFreqDist()
    transition_FD =
    # TODO: compute the transition probability P(tag_(i+1)|tag_(i)) using the MLEProbDist to estimate the probabilities
    # hint: use ConditionalProbDist()
    transition_PD =
    # TODO: compute the probabilities of P(NN|VBD) and P(NN|DT)
    p_VBD_NN =
    p_DT_NN =
    return (transition_FD, transition_PD, p_VBD_NN, p_DT_NN)
### Test your solution for exercise 4
def test_ex4():
    tagged_sentences = brown.tagged_sents(categories='news')
    (transition_FD, transition_PD, p_VBD_NN, p_DT_NN) = ex4(tagged_sentences)
    print "P(NN|VBD) = ", p_VBD_NN
    print "P(NN|DT) = ", p_DT_NN
# Are the results what you would expect? The sequence DT NN seems very probable. How will this affect the sequence tagging?
test_ex4()
Modify your code for exercise 3 to use a different estimator, to introduce some smoothing, and compare the results with the original. In exercise 4 we didn’t do anything about the boundaries. Modify your code for exercise 4 to use <s> at the beginning of every sentence and </s> at the end.
Explore the resulting conditional probabilities. What is the most likely tag at the beginning of a sentence? At the end?
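One possible way to approach both modifications is sketched below on toy data: tag sequences are padded with boundary symbols, and add-one (Laplace) smoothing replaces the maximum likelihood estimate. This is only one smoothing choice among several; NLTK also provides estimators such as LaplaceProbDist and LidstoneProbDist. The helpers add_boundaries and laplace_prob are hypothetical names used for this illustration.

```python
from collections import Counter, defaultdict


def add_boundaries(tags, start="<s>", end="</s>"):
    """Pad a tag sequence with sentence-boundary symbols."""
    return [start] + list(tags) + [end]


def laplace_prob(counts, outcome, num_bins):
    """Add-one estimate: (count + 1) / (total + number of possible outcomes)."""
    total = sum(counts.values())
    return (counts.get(outcome, 0) + 1.0) / (total + num_bins)


# Toy tag sequences, invented for illustration.
toy_tag_sequences = [["AT", "NN", "VBD"], ["AT", "JJ", "NN"]]

# Count transitions, including those from <s> and into </s>.
transition_counts = defaultdict(Counter)
for tags in toy_tag_sequences:
    padded = add_boundaries(tags)
    for previous, current in zip(padded[:-1], padded[1:]):
        transition_counts[previous][current] += 1

# Possible outcomes of a transition: every tag plus the end symbol.
outcomes = set(t for tags in toy_tag_sequences for t in tags) | {"</s>"}

p_at_first = laplace_prob(transition_counts["<s>"], "AT", len(outcomes))
p_vbd_first = laplace_prob(transition_counts["<s>"], "VBD", len(outcomes))
```

With smoothing, a tag that never starts a toy sentence (VBD) still receives a small non-zero probability, while the smoothed probabilities over all outcomes continue to sum to one.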