In this tutorial we will be exploring the lexical sample task. This is a task where you use a corpus to learn how to disambiguate a small set of target words using supervised learning. The aim is to build a classifier that maps each occurrence of a target word in a corpus to its sense.
We will use a Naive Bayes classifier. In other words, where the context of an occurrence of a target word in the corpus is represented as a feature vector (f_1, ..., f_n), the classifier estimates the word sense s on the basis of that context as shown below:

ŝ = argmax_s P(s) ∏_i P(f_i | s)
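To make the decision rule concrete, here is a toy sketch in Python. The probability tables are invented for illustration only; a real classifier (such as NLTK's NaiveBayesClassifier, used below) estimates them from training data:

# Toy Naive Bayes decision rule for WSD (invented numbers, for
# illustration only; NLTK's NaiveBayesClassifier does this properly).
toy_prior = {'HARD1': 0.8, 'HARD2': 0.1, 'HARD3': 0.1}  # P(s)
toy_cond = {('work', 'HARD1'): 0.10, ('work', 'HARD3'): 0.01,
            ('surface', 'HARD1'): 0.01, ('surface', 'HARD3'): 0.20}  # P(f|s)

def nb_sense(features, senses=('HARD1', 'HARD2', 'HARD3')):
    # argmax over senses of P(s) * product over features of P(f|s)
    def score(s):
        p = toy_prior[s]
        for f in features:
            p *= toy_cond.get((f, s), 0.001)  # small default for unseen pairs
        return p
    return max(senses, key=score)

print(nb_sense(['surface']))  # 'HARD3': the context outweighs the prior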
We will use the senseval-2 corpus for our training and test data. This corpus consists of text from a mixture of sources, including the British National Corpus and the Penn Treebank portion of the Wall Street Journal. Each word in the corpus is tagged with its part of speech, and the senses of the following target words are also manually annotated: the nouns interest and line, the verb serve, and the adjective hard. You can find out more about the task on the Senseval-2 website.
The set of senses that are used to annotate each target word come from WordNet (more on that later).
Now open wsd_code.py in Python using interactive mode (the -i option, or use ipython if you're familiar with it):

% python -i wsd_code.py
help(...) is your friend:
- help([class name]) for classes and all their methods and instance variables
- help([any object]) likewise
- help([function]) or help([class].[method]) for functions / methods

Now look at what's in the senseval corpus:

>>> senseval.fileids()
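This should list the four lexical-sample files, one for each target word:

['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']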
Let's now find out the set of word senses for
each target word in senseval. There is a function in
wsd_code.py
that returns this information. For
example:
>>> senses('hard.pos')
returns
['HARD1', 'HARD2', 'HARD3']
So there are 3 senses for the adjective hard in the corpus. You'll shortly be looking at the data to guess what these 3 senses are.
Now it's your turn: find the senses for the other target words by calling senses with appropriate arguments.
You can find out the 100 most frequent words that occur in the contexts of HARD1 by running the following:
>>> instances1 = sense_instances(senseval.instances('hard.pos'), 'HARD1')
>>> features1 = extract_vocab_frequency(instances1, n=100)
Now type features1 and you get this:
[('harder', 476), ('time', 316), ('would', 224), ('get', 223),
('find', 206), ('make', 195), ('very', 187), ('out', 174), ('people',
164), ('even', 151), ('believe', 151), ('hardest', 137), ('going',
136), ('much', 127), ('like', 126), ('just', 119), ('now', 112),
('way', 107), ('imagine', 106), ('see', 104), ('other', 104), ('may',
103), ('know', 102), ('could', 95), ('into', 93), ('many', 93),
('new', 91), ('tell', 89), ('these', 86), ('come', 85), ('too', 83),
('without', 81), ('good', 80), ('part', 79), ('take', 78), ('think',
77), ('thing', 75), ('really', 74), ('work', 74), ('two', 73), ('go',
71), ('those', 71), ('back', 71), ('makes', 70), ('things', 69),
('still', 69), ('made', 67), ('last', 66), ('having', 66), ('most',
66), ('our', 65), ('first', 64), ('here', 63), ('enough', 62),
('life', 61), ('lot', 61), ('keep', 61), ('making', 60), ('year', 58),
('over', 58), ('long', 57), ('getting', 57), ('only', 56), ('off',
55), ('san', 54), ('understand', 53), ('home', 52), ('through', 52),
('day', 52), ('why', 52), ('times', 50), ('might', 50), ('sometimes',
50), ('does', 49), ('something', 49), ('since', 47), ('game', 47),
('down', 47), ('little', 47), ('children', 47), ('kind', 47), ('put',
46), ('while', 46), ('before', 46), ('job', 45), ('especially', 44),
('become', 43), ('often', 43), ('want', 42), ('place', 42), ('right',
41), ('around', 41), ('found', 41), ('once', 41), ('anything', 40),
('doing', 40), ('although', 40), ('someone', 40), ('women', 40),
('ever', 39), ('same', 39)]
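In case you're wondering how such counts are produced, here is a rough sketch of what extract_vocab_frequency might do using nltk.FreqDist. This is an approximation, not the actual wsd_code.py implementation (for example, the real function also appears to exclude the target word itself):

# Sketch of an extract_vocab_frequency-style count (assumed details).
import nltk

def vocab_frequency_sketch(instances, stopwords, n=100):
    fd = nltk.FreqDist()
    for inst in instances:
        for pair in inst.context:
            if isinstance(pair, tuple):  # skip any bare-string entries
                word = pair[0].lower()
                if word.isalpha() and word not in stopwords:
                    fd[word] += 1
    return fd.most_common(n)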
Now it's your turn: do the same for HARD2 and HARD3, calling the instance lists instances2 and instances3. You can also inspect individual instances; for example:

>>> instances3[0]
SensevalInstance(word='hard-a', position=15, context=[('my', 'PRP$'),
('companion', 'NN'), ('enjoyed', 'VBD'), ('a', 'DT'), ('healthy',
'JJ'), ('slice', 'NN'), ('of', 'IN'), ('the', 'DT'), ('chocolate',
'NN'), ('mousse', 'NN'), ('cake', 'NN'), (',', ','), ('made', 'VBN'),
('with', 'IN'), ('a', 'DT'), ('hard', 'JJ'), ('chocolate', 'NN'),
('crust', 'NN'), (',', ','), ('topping', 'VBG'), ('a', 'DT'),
('sponge', 'NN'), ('cake', 'NN'), ('with', 'IN'), ('either', 'DT'),
('strawberry', 'NN'), ('or', 'CC'), ('raspberry', 'JJ'), ('on', 'IN'),
('the', 'DT'), ('bottom', 'NN'), ('.', '.')], senses=('HARD3',))
So the senseval corpus is a collection of information about a set of tagged sentences, where each entry or instance consists of 4 attributes:
- word specifies the target word together with its syntactic category (e.g., hard-a means that the word is hard and its category is 'adjective');
- position gives the position of the target word within the context list;
- context represents the sentence as a list of pairs, each pair being a word or punctuation mark and its part-of-speech tag; and finally
- senses is a tuple, each item in the tuple being a sense for that target word.
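For example, using the HARD3 instance shown above, you can access these attributes directly (the outputs below are read off that instance):

>>> instances3[0].word
'hard-a'
>>> instances3[0].senses
('HARD3',)
>>> instances3[0].context[instances3[0].position]
('hard', 'JJ')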
In the subset of the corpus we are working with, this senses tuple consists of only one element. But there are a few examples elsewhere in the corpus where there is more than one, representing the fact that the annotator couldn't decide which sense to assign to the word. For simplicity, our classifiers are going to ignore all but the first element of the senses attribute.
You train and test a classifier with the function wst_classifier. This function must have at least the following arguments specified by you:
- a trainer, e.g., NaiveBayesClassifier.train (if you want you could also try MaxentClassifier.train, but this takes longer to train);
- the target word that the classifier is to disambiguate: 'hard.pos', 'line.pos', 'interest.pos' or 'serve.pos';
- a feature-extraction function: wsd_word_features or wsd_context_features.

wsd_word_features is based on a set S of the most frequent words that occur in the same sentences as the target word w across the corpus (300 words by default). It represents the context of each occurrence of w as the subset of those words from S that occur in w's sentence.
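As a rough illustration, a word-features extractor of this kind might look like the sketch below. This is not the actual code in wsd_code.py; the function name and details are invented:

# Sketch of an extractor in the style of wsd_word_features (invented
# details; the real wsd_code.py implementation may differ).
def word_features_sketch(instance, vocab):
    # Collect the words of the instance's sentence.
    sentence = set()
    for pair in instance.context:
        if isinstance(pair, tuple):  # some context entries may be bare strings
            sentence.add(pair[0].lower())
    # One boolean feature per word in S, as NLTK classifiers expect.
    return dict((w, w in sentence) for w in vocab)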
By default, the closed-class words that are specified in STOPWORDS are excluded from the set S of most frequent words. But as we'll see later, you can also include closed-class words in S, or re-define the closed-class words in any way you like! If you want to know which closed-class words are excluded by default, simply type STOPWORDS at the Python prompt.

You're now ready to train your first classifier:
>>> wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features)
You should then get something that looks like this:
Reading data...
Senses: HARD1 HARD2 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8362
In other words, the adjective hard is tagged with 3 senses in the corpus (HARD1, HARD2 and HARD3), and the Naive Bayes Classifier using the feature set based on the 300 most frequent context words yields an accuracy of 0.8362.
Now it's your turn:
- Use wst_classifier to train a classifier that disambiguates hard using wsd_context_features.
- Build classifiers for line and serve as well, using the word features and then the context features.
- Which feature set gives the better model for 'hard.pos': wsd_context_features or wsd_word_features? You should find the same pattern for line.pos and serve.pos. Why do you think that might be?
Now it's your turn: what accuracy would a model that picks one of the senses at random achieve for hard.pos?

To compute the frequency baseline for hard.pos, we need to find out the frequency distribution of the three senses in the corpus:
>>> hard_sense_fd = nltk.FreqDist([i.senses[0] for i in
senseval.instances('hard.pos')])
>>> hard_sense_fd.most_common()
...
>>> frequency_hard_sense_baseline = hard_sense_fd.freq('HARD1')
>>> frequency_hard_sense_baseline
0.79736902838679902
In other words, the frequency baseline has an accuracy of approx. 0.797.
What is the most frequent sense for 'hard.pos'? And is the frequency baseline a better model than the random model? Now compute the frequency baseline for 'line.pos'.
In addition to choosing between wsd_context_features and wsd_word_features, you can also vary the following:
You can vary the number of word-tag pairs before and after the target
word that you include in the feature vector. You do this by
specifying the argument distance
to the function
wst_classifier
. For instance, the following creates a
classifier that uses 2 words to the left and right of the target word:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
wsd_context_features, distance=2)
What about distance 1?
For the wsd_word_features version, you can vary the number of frequent words in the set S via the argument number, and the excluded stop words via stopwords_list. For instance:
wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
wsd_word_features, stopwords_list=[], number=100)
Now it's your turn:
Build several WSD models for 'hard.pos', including at least the following (one way to organise these runs is sketched below):
- for the wsd_word_features version, vary number between 100, 200 and 300, and vary the stopwords_list between [] (i.e., the empty list) and STOPWORDS;
- for the wsd_context_features version, vary the distance between 1, 2 and 3, and vary the stopwords_list between [] and STOPWORDS.
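Here is a minimal sketch of how you might organise these runs; it only uses wst_classifier and the keyword arguments introduced above:

# Run the word-features configurations.
for number in (100, 200, 300):
    for stop in ([], STOPWORDS):
        print('word features: number=%d, stopwords=%s' % (number, bool(stop)))
        wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
                       wsd_word_features, number=number, stopwords_list=stop)

# Run the context-features configurations.
for distance in (1, 2, 3):
    for stop in ([], STOPWORDS):
        print('context features: distance=%d, stopwords=%s' % (distance, bool(stop)))
        wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
                       wsd_context_features, distance=distance, stopwords_list=stop)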
To understand the effect of the stop words, extract the instances for each sense using sense_instances (see instances1 above). Then call, for example, extract_vocab_frequency(instances1, stopwords=[], n=100) and compare this with what you get for instances2 and instances3.
Note that the most frequent context words for 'hard.pos' include harder and hardest. Try using a stopwords list which adds them to STOPWORDS: is the effect what you expected? Can you explain it?

wst_classifier also allows you to explore the errors of the model it creates:

>>> wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
wsd_context_features, distance=2, confusion_matrix=True)
Reading data...
Senses: HARD1 HARD2 HARD3
Training classifier...
Testing classifier...
Accuracy: 0.8812
| H H H |
| A A A |
| R R R |
| D D D |
| 1 2 3 |
------+-------------+
HARD1 | 636 37 19 |
HARD2 | 10 77 8 |
HARD3 | 20 9 51 |
------+-------------+
(row = reference; col = test)
Note that the rows in the matrix are the gold labels, and the columns are the estimated labels. The cells on the diagonal give the number of items that the model gets right.
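For instance, you can read off from the matrix how well each sense is recognised (its per-sense recall): divide the diagonal cell by its row total. The little computation below just re-uses the counts printed above:

# Per-sense recall, computed from the confusion matrix above:
# diagonal count divided by the row (gold label) total.
rows = {'HARD1': ((636, 37, 19), 636),
        'HARD2': ((10, 77, 8), 77),
        'HARD3': ((20, 9, 51), 51)}
for sense, (row, diag) in sorted(rows.items()):
    print(sense, diag / float(sum(row)))
# HARD1 0.919..., HARD2 0.810..., HARD3 0.637...

So the model is noticeably weaker on HARD3 than on HARD1.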
If you also pass log=True, the errors are written to a file errors.txt in the same directory from where you're running Python. For example:
>>> wst_classifier(NaiveBayesClassifier.train, 'hard.pos',
wsd_context_features, distance=2, confusion_matrix=True, log=True)
This produces the following first few lines within the file errors.txt:
There are 103 errors!
----------------------------
1) example number: 19
sentence: the san jose museum of art auxiliary 's recent debut
fashion show luncheon will be a HARD act to follow .
guess: HARD1; label: HARD2
2) example number: 28
sentence: once your tummy is HARD enough , they 'll stop the
faucet and push on your tummy so you 'll throw up all the water . ''
guess: HARD1; label: HARD3
3) example number: 58
sentence: `` i feel that the HARD court is my best surface overall,
" courier said .
guess: HARD1; label: HARD3
The example number is the (list) index of the error in the
test_data
.
Now it's your turn:
- Train a classifier for 'hard.pos' with confusion_matrix=True and log=True.
- Judging from the confusion matrix, which sense does the model find hardest to predict?
- Search errors.txt for examples where that hardest word sense is the correct label. Do you see any patterns or systematic errors? If so, can you think of a way to adapt the feature vector so as to improve the model?