In this tutorial we will be exploring the lexical sample task. This is a task where you use a corpus to learn how to disambiguate a small set of target words using supervised learning. The aim is to build a classifier that maps each occurrence of a target word in a corpus to its sense.
We will use a Naive Bayes classifier. In other words, where the context of an occurrence of a target word in the corpus is represented as a feature vector (f_1, ..., f_n), the classifier estimates the word sense s on the basis of that context as shown below:

ŝ = argmax_s P(s) ∏_i P(f_i | s)
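To make the decision rule concrete, here is a toy sketch in Python. The probability tables are invented for illustration only; a real classifier (such as NLTK's NaiveBayesClassifier, used below) estimates them from training data:

# Toy Naive Bayes decision rule for WSD (invented numbers, for
# illustration only; NLTK's NaiveBayesClassifier does this properly).
toy_prior = {'HARD1': 0.8, 'HARD2': 0.1, 'HARD3': 0.1}  # P(s)
toy_cond = {('work', 'HARD1'): 0.10, ('work', 'HARD3'): 0.01,
            ('surface', 'HARD1'): 0.01, ('surface', 'HARD3'): 0.20}  # P(f|s)

def nb_sense(features, senses=('HARD1', 'HARD2', 'HARD3')):
    # argmax over senses of P(s) * product over features of P(f|s)
    def score(s):
        p = toy_prior[s]
        for f in features:
            p *= toy_cond.get((f, s), 0.001)  # small default for unseen pairs
        return p
    return max(senses, key=score)

print(nb_sense(['surface']))  # 'HARD3': the context outweighs the prior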
We will use the senseval-2 corpus for our training and test data. This corpus consists of text from a mixture of sources, including the British National Corpus and the Penn Treebank portion of the Wall Street Journal. Each word in the corpus is tagged with its part of speech, and the senses of the following target words are also manually annotated: the nouns interest and line, the verb serve, and the adjective hard. You can find out more about the task on the Senseval-2 website.
The set of senses that are used to annotate each target word come from WordNet (more on that later).
Now open wsd_code.py in Python using interactive mode (the -i option, or use ipython if you're familiar with it):

% python -i wsd_code.py
help(...) is your friend:
- help([class name]) for classes and all their methods and instance variables
- help([any object]) likewise
- help([function]) or help([class].[method]) for functions / methods

Now look at what's in the senseval corpus:

>>> senseval.fileids()
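This should list the four lexical-sample files, one for each target word:

['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']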
Let's now find out the set of word senses for
each target word in senseval. There is a function in
wsd_code.py
that returns this information. For
example:
>>> senses('hard.pos')
returns
['HARD1', 'HARD2', 'HARD3']
So there are 3 senses for the adjective hard in the corpus. You'll shortly be looking at the data to guess what these 3 senses are.
Now it's your turn: find the senses for the other target words by calling senses with appropriate arguments.
You can find out the 100 most frequent words that occur in the contexts of HARD1 by running the following:
>>> instances1 = sense_instances(senseval.instances('hard.pos'), 'HARD1')
>>> features1 = extract_vocab_frequency(instances1, n=100)
Now type features1 and you get this:
[('harder', 476), ('time', 316), ('would', 224), ('get', 223),
('find', 206), ('make', 195), ('very', 187), ('out', 174), ('people',
164), ('even', 151), ('believe', 151), ('hardest', 137), ('going',
136), ('much', 127), ('like', 126), ('just', 119), ('now', 112),
('way', 107), ('imagine', 106), ('see', 104), ('other', 104), ('may',
103), ('know', 102), ('could', 95), ('into', 93), ('many', 93),
('new', 91), ('tell', 89), ('these', 86), ('come', 85), ('too', 83),
('without', 81), ('good', 80), ('part', 79), ('take', 78), ('think',
77), ('thing', 75), ('really', 74), ('work', 74), ('two', 73), ('go',
71), ('those', 71), ('back', 71), ('makes', 70), ('things', 69),
('still', 69), ('made', 67), ('last', 66), ('having', 66), ('most',
66), ('our', 65), ('first', 64), ('here', 63), ('enough', 62),
('life', 61), ('lot', 61), ('keep', 61), ('making', 60), ('year', 58),
('over', 58), ('long', 57), ('getting', 57), ('only', 56), ('off',
55), ('san', 54), ('understand', 53), ('home', 52), ('through', 52),
('day', 52), ('why', 52), ('times', 50), ('might', 50), ('sometimes',
50), ('does', 49), ('something', 49), ('since', 47), ('game', 47),
('down', 47), ('little', 47), ('children', 47), ('kind', 47), ('put',
46), ('while', 46), ('before', 46), ('job', 45), ('especially', 44),
('become', 43), ('often', 43), ('want', 42), ('place', 42), ('right',
41), ('around', 41), ('found', 41), ('once', 41), ('anything', 40),
('doing', 40), ('although', 40), ('someone', 40), ('women', 40),
('ever', 39), ('same', 39)]
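In case you're wondering how such counts are produced, here is a rough sketch of what extract_vocab_frequency might do using nltk.FreqDist. This is an approximation, not the actual wsd_code.py implementation (for example, the real function also appears to exclude the target word itself):

# Sketch of an extract_vocab_frequency-style count (assumed details).
import nltk

def vocab_frequency_sketch(instances, stopwords, n=100):
    fd = nltk.FreqDist()
    for inst in instances:
        for pair in inst.context:
            if isinstance(pair, tuple):  # skip any bare-string entries
                word = pair[0].lower()
                if word.isalpha() and word not in stopwords:
                    fd[word] += 1
    return fd.most_common(n)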
Now it's your turn: do the same for HARD2 and HARD3, calling the instance lists instances2 and instances3. You can also inspect individual instances; for example:

>>> instances3[0]
SensevalInstance(word='hard-a', position=15, context=[('my', 'PRP$'),
('companion', 'NN'), ('enjoyed', 'VBD'), ('a', 'DT'), ('healthy',
'JJ'), ('slice', 'NN'), ('of', 'IN'), ('the', 'DT'), ('chocolate',
'NN'), ('mousse', 'NN'), ('cake', 'NN'), (',', ','), ('made', 'VBN'),
('with', 'IN'), ('a', 'DT'), ('hard', 'JJ'), ('chocolate', 'NN'),
('crust', 'NN'), (',', ','), ('topping', 'VBG'), ('a', 'DT'),
('sponge', 'NN'), ('cake', 'NN'), ('with', 'IN'), ('either', 'DT'),
('strawberry', 'NN'), ('or', 'CC'), ('raspberry', 'JJ'), ('on', 'IN'),
('the', 'DT'), ('bottom', 'NN'), ('.', '.')], senses=('HARD3',))
So the senseval corpus is a collection of information about a set of tagged sentences, where each entry or instance consists of 4 attributes:
- word specifies the target word together with its syntactic category (e.g., hard-a means that the word is hard and its category is 'adjective');
- position gives the position of the target word within the context list;
- context represents the sentence as a list of pairs, each pair being a word or punctuation mark and its part-of-speech tag; and finally
- senses is a tuple, each item in the tuple being a sense for that target word.
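For example, using the HARD3 instance shown above, you can access these attributes directly (the outputs below are read off that instance):

>>> instances3[0].word
'hard-a'
>>> instances3[0].senses
('HARD3',)
>>> instances3[0].context[instances3[0].position]
('hard', 'JJ')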
In the subset of the corpus we are working with, this senses tuple consists of only one element. But there are a few examples elsewhere in the corpus where there is more than one, representing the fact that the annotator couldn't decide which sense to assign to the word. For simplicity, our classifiers are going to ignore all but the first element of the senses attribute.
You train and test a classifier with the function wst_classifier. This function must have at least the following arguments specified by you:
- a trainer, e.g., NaiveBayesClassifier.train (if you want you could also try MaxentClassifier.train, but this takes longer to train);
- the target word that the classifier is to disambiguate: 'hard.pos', 'line.pos', 'interest.pos' or 'serve.pos';
- a feature-extraction function: wsd_word_features or wsd_context_features.

wsd_word_features is based on a set S of the most frequent words that occur in the same sentences as the target word w across the corpus (300 words by default). It represents the context of each occurrence of w as the subset of those words from S that occur in w's sentence.
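As a rough illustration, a word-features extractor of this kind might look like the sketch below. This is not the actual code in wsd_code.py; the function name and details are invented:

# Sketch of an extractor in the style of wsd_word_features (invented
# details; the real wsd_code.py implementation may differ).
def word_features_sketch(instance, vocab):
    # Collect the words of the instance's sentence.
    sentence = set()
    for pair in instance.context:
        if isinstance(pair, tuple):  # some context entries may be bare strings
            sentence.add(pair[0].lower())
    # One boolean feature per word in S, as NLTK classifiers expect.
    return dict((w, w in sentence) for w in vocab)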
By default, the closed-class words that are specified in STOPWORDS are excluded from the set S of most frequent words. But as we'll see later, you can also include closed-class words in S, or re-define the closed-class words in any way you like! If you want to know which closed-class words are excluded by default, simply type STOPWORDS at the Python prompt.

You're now ready to train your first classifier:
>>> wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features)
You should then get something that looks like this:
Reading data...
Senses: HARD1 HARD2 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8362
In other words, the adjective hard is tagged with 3 senses in the corpus (HARD1, HARD2 and HARD3), and the Naive Bayes Classifier using the feature set based on the 300 most frequent context words yields an accuracy of 0.8362.
Now it's your turn:
- Use wst_classifier to train a classifier that disambiguates hard using wsd_context_features.
- Build classifiers for line and serve as well, using the word features and then the context features.
- Which feature set gives the better model for 'hard.pos': wsd_context_features or wsd_word_features? You should find the same pattern for line.pos and serve.pos. Why do you think that might be?
Now it's your turn: what accuracy would a model that picks one of the senses at random achieve for hard.pos?

To compute the frequency baseline for hard.pos, we need to find out the frequency distribution of the three senses in the corpus:
>>> hard_sense_fd = nltk.FreqDist([i.senses[0] for i in
senseval.instances('hard.pos')])
>>> hard_sense_fd.most_common()
...
>>> frequency_hard_sense_baseline = hard_sense_fd.freq('HARD1')
>>> frequency_hard_sense_baseline
0.79736902838679902
In other words, the frequency baseline has an accuracy of approx. 0.797.
What is the most frequent sense for 'hard.pos'? And is the frequency baseline a better model than the random model? Now compute the frequency baseline for 'line.pos'.
In addition to choosing between wsd_context_features and wsd_word_features, you can also vary the following:
You can vary the number of word-tag pairs before and after the target
word that you include in the feature vector. You do this by
specifying the argument distance
to the function
wst_classifier
. For instance, the following creates a
classifier that uses 2 words to the left and right of the target word:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
wsd_context_features, distance=2)
What about distance 1?
For the wsd_word_features version, you can vary the number of frequent words in the set S via the argument number, and the excluded stop words via stopwords_list. For instance:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
wsd_word_features, stopwords_list=[], number=100)
Now it's your turn:
Build several WSD models for 'hard.pos', including at least the following (one way to organise these runs is sketched below):
- for the wsd_word_features version, vary number between 100, 200 and 300, and vary the stopwords_list between [] (i.e., the empty list) and STOPWORDS;
- for the wsd_context_features version, vary the distance between 1, 2 and 3, and vary the stopwords_list between [] and STOPWORDS.
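Here is a minimal sketch of how you might organise these runs; it only uses wst_classifier and the keyword arguments introduced above:

# Run the word-features configurations.
for number in (100, 200, 300):
    for stop in ([], STOPWORDS):
        print('word features: number=%d, stopwords=%s' % (number, bool(stop)))
        wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
                       wsd_word_features, number=number, stopwords_list=stop)

# Run the context-features configurations.
for distance in (1, 2, 3):
    for stop in ([], STOPWORDS):
        print('context features: distance=%d, stopwords=%s' % (distance, bool(stop)))
        wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
                       wsd_context_features, distance=distance, stopwords_list=stop)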
To understand the effect of the stop words, extract the instances for each sense using sense_instances (see instances1 above). Then call, for example, extract_vocab_frequency(instances1, stopwords=[], n=100) and compare this with what you get for instances2 and instances3.
Note that the most frequent context words for 'hard.pos' include harder and hardest. Try using a stopwords list which adds them to STOPWORDS: is the effect what you expected? Can you explain it?

wst_classifier also allows you to explore the errors of the model it creates:

>>> wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
wsd_context_features, distance=2, confusion_matrix=True)
Reading data...
Senses: HARD1 HARD2 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8812
| H H H |
| A A A |
| R R R |
| D D D |
| 1 2 3 |
------+-------------+
HARD1 | 636 37 19 |
HARD2 | 10 77 8 |
HARD3 | 20 9 51 |
------+-------------+
(row = reference; col = test)
Note that the rows in the matrix are the gold labels, and the columns are the estimated labels. The cells on the diagonal give the number of items that the model gets right.
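For instance, you can read off from the matrix how well each sense is recognised (its per-sense recall): divide the diagonal cell by its row total. The little computation below just re-uses the counts printed above:

# Per-sense recall, computed from the confusion matrix above:
# diagonal count divided by the row (gold label) total.
rows = {'HARD1': ((636, 37, 19), 636),
        'HARD2': ((10, 77, 8), 77),
        'HARD3': ((20, 9, 51), 51)}
for sense, (row, diag) in sorted(rows.items()):
    print(sense, diag / float(sum(row)))
# HARD1 0.919..., HARD2 0.810..., HARD3 0.637...

So the model is noticeably weaker on HARD3 than on HARD1.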
If you also pass log=True, the errors are written to a file errors.txt in the same directory from where you're running Python. For example:
>>> wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
wsd_context_features, distance=2, confusion_matrix=True, log=True)
This produces the following first few lines within the file errors.txt:
There are 103 errors!
----------------------------
1) example number: 19
sentence: the san jose museum of art auxiliary 's recent debut
fashion show luncheon will be a HARD act to follow .
guess: HARD1; label: HARD2
2) example number: 28
sentence: once your tummy is HARD enough , they 'll stop the
faucet and push on your tummy so you 'll throw up all the water . ''
guess: HARD1; label: HARD3
3) example number: 58
sentence: `` i feel that the HARD court is my best surface overall,
" courier said .
guess: HARD1; label: HARD3
The example number is the (list) index of the error in the
test_data
.
Now it's your turn:
- Train a classifier for 'hard.pos' with confusion_matrix=True and log=True.
- Judging from the confusion matrix, which sense does the model find hardest to predict?
- Search errors.txt for examples where that hardest word sense is the correct label. Do you see any patterns or systematic errors? If so, can you think of a way to adapt the feature vector so as to improve the model?