FNLP 2017: Lab Session 5: Word Sense Disambiguation

Word Sense Disambiguation: Recap

In this tutorial we will be exploring the lexical sample task. This is a task where you use a corpus to learn how to disambiguate a small set of target words using supervised learning. The aim is to build a classifier that maps each occurrence of a target word in a corpus to its sense.

We will use a Naive Bayes classifier. In other words, where the context of an occurrence of a target word in the corpus is represented as a feature vector, the classifier estimates the word sense s on the basis of its context as shown below.

Alt Text

The corpus

We will use the senseval-2 corpus for our training and test data. This corpus consists of text from a mixture of places, including the British National Corpus and the Penn Treebank portion of the Wall Street Journal. Each word in the corpus is tagged with its part of speech, and the senses of the following target words are also manually annotated: the nouns interest, line; the verb serve and the adjective hard. You can find out more about the task from here.

The set of senses that are used to annotate each target word come from WordNet (more on that later).

Getting started: Run the code

Look at the code below, and try to understand how it works (don't worry if you don't understand some of it, it's not necessary for doing this task). Rememmber, help(...) is your friend:

  • help([class name]) for classes and all their methods and instance variables
  • help([any object]) likewise
  • help([function]) or help([class].[method]) for functions / methods

This code allows you to do several things. You can now run, train and evaluate a range of Naive Bayes classifiers over the corpus to acquire a model of WSD for a given target word: the adjective hard, the nouns interest or line, and the verb serve. We'll learn later how you do this. First, we're going to explore the nature of the corpus itself.

In [ ]:
# %load lab5.py
from __future__ import division
import nltk
import random
from nltk.corpus import senseval
from nltk.classify import accuracy, NaiveBayesClassifier, MaxentClassifier
from collections import defaultdict

# The following shows how the senseval corpus consists of instances, where each instance
# consists of a target word (and its tag), it position in the sentence it appeared in
# within the corpus (that position being word position, minus punctuation), and the context,
# which is the words in the sentence plus their tags.
#
# senseval.instances()[:1]
# [SensevalInstance(word='hard-a', position=20, context=[('``', '``'), ('he', 'PRP'),
# ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'),
# (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'),
# ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'),
# ('and', 'CC'), ('that', 'DT'), ("'s", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'),
# ('.', '.'), ("''", "''")], senses=('HARD1',))]

def senses(word):
    """
    This takes a target word from senseval-2 (find out what the possible
    are by running senseval.fileides()), and it returns the list of possible 
    senses for the word
    """
    return list(set(i.senses[0] for i in senseval.instances(word)))

# Both above and below, we depend on the (non-obvious?) fact that although the field is
#  called 'senses', there is always only 1, i.e. there is no residual ambiguity in the
#  data as we have it

def sense_instances(instances, sense):
    """
    This returns the list of instances in instances that have the sense sense
    """
    return [instance for instance in instances if instance.senses[0]==sense]

# >>> sense3 = sense_instances(senseval.instances('hard.pos'), 'HARD3')
# >>> sense3[:2]
# [SensevalInstance(word='hard-a', position=15,
#  context=[('my', 'PRP$'), ('companion', 'NN'), ('enjoyed', 'VBD'), ('a', 'DT'), ('healthy', 'JJ'), ('slice', 'NN'), ('of', 'IN'), ('the', 'DT'), ('chocolate', 'NN'), ('mousse', 'NN'), ('cake', 'NN'), (',', ','), ('made', 'VBN'), ('with', 'IN'), ('a', 'DT'), ('hard', 'JJ'), ('chocolate', 'NN'), ('crust', 'NN'), (',', ','), ('topping', 'VBG'), ('a', 'DT'), ('sponge', 'NN'), ('cake', 'NN'), ('with', 'IN'), ('either', 'DT'), ('strawberry', 'NN'), ('or', 'CC'), ('raspberry', 'JJ'), ('on', 'IN'), ('the', 'DT'), ('bottom', 'NN'), ('.', '.')],
#  senses=('HARD3',)),
#  SensevalInstance(word='hard-a', position=5,
#  context=[('``', '``'), ('i', 'PRP'), ('feel', 'VBP'), ('that', 'IN'), ('the', 'DT'), ('hard', 'JJ'), ('court', 'NN'), ('is', 'VBZ'), ('my', 'PRP$'), ('best', 'JJS'), ('surface', 'NN'), ('overall', 'JJ'), (',', ','), ('"', '"'), ('courier', 'NNP'), ('said', 'VBD'), ('.', '.')],
# senses=('HARD3',))]


_inst_cache = {}

STOPWORDS = ['.', ',', '?', '"', '``', "''", "'", '--', '-', ':', ';', '(',
             ')', '$', '000', '1', '2', '10,' 'I', 'i', 'a', 'about', 'after', 'all', 'also', 'an', 'any',
             'are', 'as', 'at', 'and', 'be', 'being', 'because', 'been', 'but', 'by',
             'can', "'d", 'did', 'do', "don'", 'don', 'for', 'from', 'had','has', 'have', 'he',
             'her','him', 'his', 'how', 'if', 'is', 'in', 'it', 'its', "'ll", "'m", 'me',
             'more', 'my', 'n', 'no', 'not', 'of', 'on', 'one', 'or', "'re", "'s", "s",
             'said', 'say', 'says', 'she', 'so', 'some', 'such', "'t", 'than', 'that', 'the',
             'them', 'they', 'their', 'there', 'this', 'to', 'up', 'us', "'ve", 'was', 'we', 'were',
             'what', 'when', 'where', 'which', 'who', 'will', 'with', 'years', 'you',
             'your']

STOPWORDS_SET=set(STOPWORDS)

NO_STOPWORDS = []

def wsd_context_features(instance, vocab, dist=3):
    features = {}
    ind = instance.position
    con = instance.context
    for i in range(max(0, ind-dist), ind):
        j = ind-i
        features['left-context-word-%s(%s)' % (j, con[i][0])] = True

    for i in range(ind+1, min(ind+dist+1, len(con))):
        j = i-ind
        features['right-context-word-%s(%s)' % (j, con[i][0])] = True

 
    features['word'] = instance.word
    features['pos'] = con[1][1]
    return features



def wsd_word_features(instance, vocab, dist=3):
    """
    Create a featureset where every key returns False unless it occurs in the
    instance's context
    """
    features = defaultdict(lambda:False)
    features['alwayson'] = True
    #cur_words = [w for (w, pos) in i.context]
    try:
        for(w, pos) in instance.context:
            if w in vocab:
                features[w] = True
    except ValueError:
        pass
    return features

def extract_vocab_frequency(instances, stopwords=STOPWORDS_SET, n=300):
    """
    Given a list of senseval instances, return a list of the n most frequent words that
    appears in its context (i.e., the sentence with the target word in), output is in order
    of frequency and includes also the number of instances in which that key appears in the
    context of instances.
    """
    fd = nltk.FreqDist()
    for i in instances:
        (target, suffix) = i.word.split('-')
        words = (c[0] for c in i.context if not c[0] == target)
        for word in set(words) - set(stopwords):
            fd[word] += 1
            #for sense in i.senses:
                #cfd[sense][word] += 1
    return fd.most_common()[:n+1]
        
def extract_vocab(instances, stopwords=STOPWORDS_SET, n=300):
    return [w for w,f in extract_vocab_frequency(instances,stopwords,n)]
    
##def wst_classifier(trainer, word, features,number=300):
##    print "Reading data..."
##    global _inst_cache
##    if word not in _inst_cache:
##        _inst_cache[word] = [(i, i.senses[0]) for i in senseval.instances(word)]
##    events = _inst_cache[word][:]
##    senses = list(set(l for (i, l) in events))
##    instances = [i for (i, l) in events]
##    vocab = extract_vocab(instances, n=number)
##    print ' Senses: ' + ' '.join(senses)
##
##    # Split the instances into a training and test set,
##    #if n > len(events): n = len(events)
##    n = len(events)
##    random.seed(5444522)
##    random.shuffle(events)
##    training_data = events[:int(0.8 * n)]
##    test_data = events[int(0.8 * n):n]
##    # Train classifier
##    print 'Training classifier...'
##    classifier = trainer([(features(i, vocab), label) for (i, label) in training_data])
##    # Test classifier
##    print 'Testing classifier...'
##    acc = accuracy(classifier, [(features(i, vocab), label) for (i, label) in test_data] )
##    print 'Accuracy: %6.4f' % acc

    
def wst_classifier(trainer, word, features, stopwords_list = STOPWORDS_SET, number=300, log=False, distance=3, confusion_matrix=False):
    """
    This function takes as arguments:
        a trainer (e.g., NaiveBayesClassifier.train);
        a target word from senseval2 (you can find these out with senseval.fileids(),
            and they are 'hard.pos', 'interest.pos', 'line.pos' and 'serve.pos');
        a feature set (this can be wsd_context_features or wsd_word_features);
        a number (defaults to 300), which determines for wsd_word_features the number of
            most frequent words within the context of a given sense that you use to classify examples;
        a distance (defaults to 3) which determines the size of the window for wsd_context_features (if distance=3, then
            wsd_context_features gives 3 words and tags to the left and 3 words and tags to
            the right of the target word);
        log (defaults to false), which if set to True outputs the errors into a file errors.txt
        confusion_matrix (defaults to False), which if set to True prints a confusion matrix.

    Calling this function splits the senseval data for the word into a training set and a test set (the way it does
    this is the same for each call of this function, because the argument to random.seed is specified,
    but removing this argument would make the training and testing sets different each time you build a classifier).

    It then trains the trainer on the training set to create a classifier that performs WSD on the word,
    using features (with number or distance where relevant).

    It then tests the classifier on the test set, and prints its accuracy on that set.

    If log==True, then the errors of the classifier over the test set are written to errors.txt.
    For each error four things are recorded: (i) the example number within the test data (this is simply the index of the
    example within the list test_data); (ii) the sentence that the target word appeared in, (iii) the
    (incorrect) derived label, and (iv) the gold label.

    If confusion_matrix==True, then calling this function prints out a confusion matrix, where each cell [i,j]
    indicates how often label j was predicted when the correct label was i (so the diagonal entries indicate labels
    that were correctly predicted).
    """
    print "Reading data..."
    global _inst_cache
    if word not in _inst_cache:
        _inst_cache[word] = [(i, i.senses[0]) for i in senseval.instances(word)]
    events = _inst_cache[word][:]
    senses = list(set(l for (i, l) in events))
    instances = [i for (i, l) in events]
    vocab = extract_vocab(instances, stopwords=stopwords_list, n=number)
    print ' Senses: ' + ' '.join(senses)

    # Split the instances into a training and test set,
    #if n > len(events): n = len(events)
    n = len(events)
    random.seed(5444522)
    random.shuffle(events)
    training_data = events[:int(0.8 * n)]
    test_data = events[int(0.8 * n):n]
    # Train classifier
    print 'Training classifier...'
    classifier = trainer([(features(i, vocab, distance), label) for (i, label) in training_data])
    # Test classifier
    print 'Testing classifier...'
    acc = accuracy(classifier, [(features(i, vocab, distance), label) for (i, label) in test_data] )
    print 'Accuracy: %6.4f' % acc
    if log==True:
        #write error file
        print 'Writing errors to errors.txt'
        output_error_file = open('errors.txt', 'w')
        errors = []
        for (i, label) in test_data:
            guess = classifier.classify(features(i, vocab, distance))
            if guess != label:
                con =  i.context
                position = i.position
                item_number = str(test_data.index((i, label)))
                word_list = []
                for (word, tag) in con:
                    word_list.append(word)
                hard_highlighted = word_list[position].upper()
                word_list_highlighted = word_list[0:position] + [hard_highlighted] + word_list[position+1:]
                sentence = ' '.join(word_list_highlighted)
                errors.append([item_number, sentence, guess,label])
        error_number = len(errors)
        output_error_file.write('There are ' + str(error_number) + ' errors!' + '\n' + '----------------------------' +
                                '\n' + '\n')
        for error in errors:
            output_error_file.write(str(errors.index(error)+1) +') ' + 'example number: ' + error[0] + '\n' +
                                    '    sentence: ' + error[1] + '\n' +
                                    '    guess: ' + error[2] + ';  label: ' + error[3] + '\n' + '\n')
        output_error_file.close()
    if confusion_matrix==True:
        gold = [label for (i, label) in test_data]
        derived = [classifier.classify(features(i,vocab)) for (i,label) in test_data]
        cm = nltk.ConfusionMatrix(gold,derived)
        print cm
        return cm
        
        
    
def demo():
    print "NB, with features based on 300 most frequent context words"
    wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features)
    print
    print "NB, with features based word + pos in 6 word window"
    wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features)
    print
##    print "MaxEnt, with features based word + pos in 6 word window"
##    wst_classifier(MaxentClassifier.train, 'hard.pos', wsd_context_features)
    

#demo()

# Frequency Baseline
##hard_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('hard.pos')])
##most_frequent_hard_sense= hard_sense_fd.keys()[0]
##frequency_hard_sense_baseline = hard_sense_fd.freq(hard_sense_fd.keys()[0])

##>>> frequency_hard_sense_baseline
##0.79736902838679902

The Senseval corpus

Target words

You can find out the set of target words for the senseval-2 corpus by running:

In [ ]:
senseval.fileids()

The result doesn't tell you the syntactic category of the words, but see the description of the corpus in Section 1 or Section 4.2.

Word senses

Let's now find out the set of word senses for each target word in senseval. There is a function in lab5.py that returns this information. For example:

In [ ]:
print senses('hard.pos')

As you can see this gives you ['HARD1', 'HARD2', 'HARD3']

So there are 3 senses for the adjective hard in the corpus. You'll shortly be looking at the data to guess what these 3 senses are.

Now it's your turn:

  • What are the senses for the other target words? Find out by calling senses with appropriate arguments.
  • How many senses does each target have?
  • Let's now guess the sense definitions for HARD1, HARD2 and HARD3 by looking at the 100 most frequent open class words that occur in the context of each sense.

You can find out what these 100 words for HARD1 by running the following:

In [ ]:
instances1 = sense_instances(senseval.instances('hard.pos'), 'HARD1')
features1 = extract_vocab_frequency(instances1, n=100)

# Now lets try printing features1:
print features1

Now it's your turn:

  • Call the above functions for HARD2 and HARD3.
  • Look at the resulting lists of 100 most frequent words for each sense, and try to define what HARD1, HARD2 and HARD3 mean.
  • These senses are actually the first three senses for the adjective hard in WordNet. You can enter a word and get its list of WordNet senses from here. Do this for hard, and check whether your estimated definitions for the 3 word senses are correct.

The data structures: Senseval instances

Having extracted all instances of a given sense, you can look at what the data structures in the corpus look like:

In [ ]:
instances3[0]

So the senseval corpus is a collection of information about a set of tagged sentences, where each entry or instance consists of 4 attributes:

  • word specifies the target word together with its syntactic category (e.g., hard-a means that the word is hard and its category is 'adjective');
  • position gives its position within the sentence (ignoring punctuation);
  • context represents the sentence as a list of pairs, each pair being a word or punctuation mark and its tag; and finally
  • senses is a tuple, each item in the tuple being a sense for that target word. In the subset of the corpus we are working with, this tuple consists of only one argument. But there are a few examples elsewhere in the corpus where there is more than one, representing the fact that the annotator couldn't decide which sense to assign to the word. For simplicity, our classifiers are going to ignore any non-first arguments to the attribute senses.

Exploring different WSD classifiers

You're now going to compare the performance of different classifiers that perform word sense disambiguation. You do this by calling the function wst_classifer This function must have at least the following arguments specified by you:

  1. A trainer; e.g., NaiveBayesClassifier.train (if you want you could also try MaxentClassifier.train, but this takes longer to train).
  2. The target word that the classifier is going to learn to disambiguate: i.e., 'hard.pos', 'line.pos', 'interest.pos' or 'serve.pos'.
  3. A feature set. The code allows you to use two kinds of feature sets: #### wsd_word_features This feature set is based on the set S of the n most frequent words that occur in the same sentence as the target word w across the entire training corpus (as you'll see later, you can specify the value of n, but if you don't specify it then it defaults to 300). For each occurrence of w, wsd_word_features represents its context as the subset of those words from S that occur in the w's sentence. By default, the closed class words that are specified in STOPWORDS are excluded from the set S of most frequent words. But as we'll see later, you can also include closed class words in S, or re-define closed class words in any way you like! If you want to know what closed class words are excluded by default, then simply type STOPWORDS to the Python command. #### wsd_context_features This feature set represents the context of a word w as the sequence of m pairs (word,tag) that occur before w and the sequence of m pairs (word, tag) that occur after w. As we'll see shortly, you can specify the value of m (e.g., m=1 means the context consists of just the immediately prior and immediately subsequent word-tag pairs); otherwise, m defaults to 3.

Now let's train our first classifier

Type the following to the Python shell:

In [ ]:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features) 

In other words, the adjective hard is tagged with 3 senses in the corpus (HARD1, HARD2 and HARD3), and the Naive Bayes Classifier using the feature set based on the 300 most frequent context words yields an accuracy of 0.8362.

Now it's your turn:

Use wst_classifier to train a classifier that disambiguates hard using wsd_context_features. Build classifiers for line and serve as well, using the word features and then the context features.

  • What's more accurate for disambiguating 'hard.pos', wsd_context_features or wst_word_features?
  • Does the same hold true for line.pos and serve.pos. Why do you think that might be?
  • Why is it not fair to compare the accuracy of the classifiers across different target words?

Baseline models

Just how good is the accuracy of these WSD classifiers? To find out, we need a baseline. There are two we consider here:

  1. A model which assigns a sense at random.
  2. A model which always assigns the most frequent sense.

Now it's your turn:

  • What is the accuracy of the random baseline model for hard.pos?
  • To compute the accuracy of the frequency baseline model for hard.pos, we need to find out the Frequency Distribution of the three senses in the corpus:
In [ ]:
hard_sense_fd = nltk.FreqDist([i.senses[0] for i in
senseval.instances('hard.pos')])
hard_sense_fd.most_common()

frequency_hard_sense_baseline = hard_sense_fd.freq('HARD1')
frequency_hard_sense_baseline

In other words, the frequency baseline has an accuracy of approx. 0.793. What is the most frequent sense for 'hard.pos'? And is the frequency baseline a better model than the random model?

  • Now compute the accuracy of the frequency baseline for other target words; e.g. 'line.pos'.

Rich features vs. sparse data

In this part of the tutorial we are going to vary the feature sets and compare the results. As well as being able to choose between wsd_context_features vs. wsd_word_features, you can also vary the following:

wsd_context_features

You can vary the number of word-tag pairs before and after the target word that you include in the feature vector. You do this by specifying the argument distance to the function wst_classifier. For instance, the following creates a classifier that uses 2 words to the left and right of the target word:

wst_classifier(NaiveBayesClassifier.train, 'hard.pos', 
    wsd_context_features, distance=2)

What about distance 1?

wsd_word_features

You can vary the closed class words that are excluded from the set of most frequent words, and you can vary the size of the set of most frequent words. For instance, the following results in a model which uses the 100 most frequent words including close class words:

wst_classifier(NaiveBayesClassifier.train, 'hard.pos', 
        wsd_word_features, stopwords_list=[], number=100)


Now it's your turn:

Build several WSD models for 'hard.pos', including at least the following: for the wsd_word_features version, vary number between 100, 200 and 300, and vary the stopwords_list between [] (i.e., the null list) and STOPWORDS; for the wsd_context_features version, vary the distance between 1, 2 and 3, and vary the stopwords_list between [] and STOPWORDS.

  • Why does setting number to less than 300 seem to improve the word model? Why does making the context window before and after the target word to a number smaller than 3 improve the model?
  • Why does including closed class words in word model improve overall performance? Hint: For each sense of hard, construct a list of all its instances in the training data using the function sense_instances (see instances1 above). Then, call, for example, extract_vocab_frequency(instances1, stopwords=[], n=100) and compare this with what you get for instances2 and instances3.
  • It seems slightly odd that the word features for 'hard.pos' include harder and hardest. Try using a stopwords list which adds them to STOPWORDS: is the effect what you expected? Can you explain it?

Error analysis

The function wst_classifier allows you to explore the errors of the model it creates:

Confusion Matrix

You can output a confusion matrix as follows:

In [ ]:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
               wsd_context_features, distance=2, confusion_matrix=True)

Note that the rows in the matrix are the gold labels, and the columns are the estimated labels. Recall that the diagonal line represents the number of items that the model gets right.

Errors

You can also output each error from the test data into a file errors.txt that gets written to the same directory from where you're running Python. For example:

In [ ]:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
               wsd_context_features, distance=2, confusion_matrix=True, log=True)

Produces within the file errors.txt the following first few lines:

In [5]:
cat errors.txt
cat: errors.txt: No such file or directory

The example number is the (list) index of the error in the test_data.

Now it's your turn:

  1. Choose your best performing model from Section 7, and train the model again, but add the arguments confusion_matrix=True and log=True.
  2. Using the confusion matrix, identify which sense is the hardest one for the model to estimate.
  3. Look in errors.txt for examples where that hardest word sense is the correct label. Do you see any patterns or systematic errors? If so, can you think of a way to adapt feature vector so as to improve the model?