{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# FNLP 2017: Lab Session 5: Word Sense Disambiguation\n", "\n", "## Word Sense Disambiguation: Recap\n", "\n", "In this tutorial we will be exploring the lexical sample task: using a corpus to learn how to disambiguate a small set of target words via supervised learning. The aim is to build a classifier that maps each occurrence of a target word in a corpus to its sense.\n", "\n", "We will use a Naive Bayes classifier. In other words, the context of each occurrence of a target word in the corpus is represented as a feature vector, and the classifier estimates the word sense s on the basis of that context, as shown below. \n", "\n", "\n", "![Alt Text](nb_maths.jpg)\n", "\n", "## The corpus\n", "\n", "We will use the [senseval-2](http://www.hipposmond.com/senseval2) corpus for our training and test data. This corpus consists of text from a mixture of places, including the British National Corpus and the Penn Treebank portion of the Wall Street Journal. Each word in the corpus is tagged with its part of speech, and the senses of the following target words are also manually annotated: the nouns interest and line, the verb serve, and the adjective hard. You can find out more about the task [here](http://www.hipposmond.com/senseval2/descriptions/english-lexsample.htm).\n", "\n", "The set of senses used to annotate each target word comes from WordNet (more on that later).\n", "\n", "## Getting started: Run the code\n", "\n", "Look at the code below, and try to understand how it works (don't worry if you don't understand some of it; it's not necessary for doing this task).\n", " Remember, `help(...)` is your friend:\n", " * `help([class name])` for classes and all their methods and instance variables\n", " * `help([any object])` likewise\n", " * `help([function])` or `help([class].[method])` for functions / methods\n", "\n", "This code allows you to do several things. 
You can now run, train and evaluate a range of Naive Bayes classifiers over the corpus to acquire a model of WSD for a given target word: the adjective hard, the nouns interest or line, and the verb serve. We'll learn later how you do this. First, we're going to explore the nature of the corpus itself. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "# %load wsd_code.py\n", "from __future__ import division\n", "import nltk\n", "import random\n", "from nltk.corpus import senseval\n", "from nltk.classify import accuracy, NaiveBayesClassifier, MaxentClassifier\n", "from collections import defaultdict\n", "\n", "# The following shows how the senseval corpus consists of instances, where each instance\n", "# consists of a target word (and its tag), its position in the sentence it appeared in\n", "# within the corpus (that position being word position, minus punctuation), and the context,\n", "# which is the words in the sentence plus their tags.\n", "#\n", "# senseval.instances()[:1]\n", "# [SensevalInstance(word='hard-a', position=20, context=[('``', '``'), ('he', 'PRP'),\n", "# ('may', 'MD'), ('lose', 'VB'), ('all', 'DT'), ('popular', 'JJ'), ('support', 'NN'),\n", "# (',', ','), ('but', 'CC'), ('someone', 'NN'), ('has', 'VBZ'), ('to', 'TO'),\n", "# ('kill', 'VB'), ('him', 'PRP'), ('to', 'TO'), ('defeat', 'VB'), ('him', 'PRP'),\n", "# ('and', 'CC'), ('that', 'DT'), (\"'s\", 'VBZ'), ('hard', 'JJ'), ('to', 'TO'), ('do', 'VB'),\n", "# ('.', '.'), (\"''\", \"''\")], senses=('HARD1',))]\n", "\n", "def senses(word):\n", " \"\"\"\n", " This takes a target word from senseval-2 (find out what the possible\n", " words are by running senseval.fileids()), and it returns the list of possible\n", " senses for the word\n", " \"\"\"\n", " return list(set(i.senses[0] for i in senseval.instances(word)))\n", "\n", "# Both above and below, we depend on the (non-obvious?) 
fact that although the field is\n", "# called 'senses', there is always only 1, i.e. there is no residual ambiguity in the\n", "# data as we have it\n", "\n", "def sense_instances(instances, sense):\n", " \"\"\"\n", " This returns the list of instances in instances that have the sense sense\n", " \"\"\"\n", " return [instance for instance in instances if instance.senses[0]==sense]\n", "\n", "# >>> sense3 = sense_instances(senseval.instances('hard.pos'), 'HARD3')\n", "# >>> sense3[:2]\n", "# [SensevalInstance(word='hard-a', position=15,\n", "# context=[('my', 'PRP$'), ('companion', 'NN'), ('enjoyed', 'VBD'), ('a', 'DT'), ('healthy', 'JJ'), ('slice', 'NN'), ('of', 'IN'), ('the', 'DT'), ('chocolate', 'NN'), ('mousse', 'NN'), ('cake', 'NN'), (',', ','), ('made', 'VBN'), ('with', 'IN'), ('a', 'DT'), ('hard', 'JJ'), ('chocolate', 'NN'), ('crust', 'NN'), (',', ','), ('topping', 'VBG'), ('a', 'DT'), ('sponge', 'NN'), ('cake', 'NN'), ('with', 'IN'), ('either', 'DT'), ('strawberry', 'NN'), ('or', 'CC'), ('raspberry', 'JJ'), ('on', 'IN'), ('the', 'DT'), ('bottom', 'NN'), ('.', '.')],\n", "# senses=('HARD3',)),\n", "# SensevalInstance(word='hard-a', position=5,\n", "# context=[('``', '``'), ('i', 'PRP'), ('feel', 'VBP'), ('that', 'IN'), ('the', 'DT'), ('hard', 'JJ'), ('court', 'NN'), ('is', 'VBZ'), ('my', 'PRP$'), ('best', 'JJS'), ('surface', 'NN'), ('overall', 'JJ'), (',', ','), ('\"', '\"'), ('courier', 'NNP'), ('said', 'VBD'), ('.', '.')],\n", "# senses=('HARD3',))]\n", "\n", "\n", "_inst_cache = {}\n", "\n", "STOPWORDS = ['.', ',', '?', '\"', '``', \"''\", \"'\", '--', '-', ':', ';', '(',\n", " ')', '$', '000', '1', '2', '10', 'I', 'i', 'a', 'about', 'after', 'all', 'also', 'an', 'any',\n", " 'are', 'as', 'at', 'and', 'be', 'being', 'because', 'been', 'but', 'by',\n", " 'can', \"'d\", 'did', 'do', \"don'\", 'don', 'for', 'from', 'had', 'has', 'have', 'he',\n", " 'her', 'him', 'his', 'how', 'if', 'is', 'in', 'it', 'its', \"'ll\", \"'m\", 'me',\n", " 'more', 'my', 'n', 'no', 
'not', 'of', 'on', 'one', 'or', \"'re\", \"'s\", \"s\",\n", " 'said', 'say', 'says', 'she', 'so', 'some', 'such', \"'t\", 'than', 'that', 'the',\n", " 'them', 'they', 'their', 'there', 'this', 'to', 'up', 'us', \"'ve\", 'was', 'we', 'were',\n", " 'what', 'when', 'where', 'which', 'who', 'will', 'with', 'years', 'you',\n", " 'your']\n", "\n", "STOPWORDS_SET=set(STOPWORDS)\n", "\n", "NO_STOPWORDS = []\n", "\n", "def wsd_context_features(instance, vocab, dist=3):\n", " \"\"\"\n", " Create a featureset from the words (and the target word's POS tag) within\n", " dist positions on either side of the target word\n", " \"\"\"\n", " features = {}\n", " ind = instance.position\n", " con = instance.context\n", " for i in range(max(0, ind-dist), ind):\n", " j = ind-i\n", " features['left-context-word-%s(%s)' % (j, con[i][0])] = True\n", "\n", " for i in range(ind+1, min(ind+dist+1, len(con))):\n", " j = i-ind\n", " features['right-context-word-%s(%s)' % (j, con[i][0])] = True\n", "\n", " \n", " features['word'] = instance.word\n", " features['pos'] = con[ind][1] # the POS tag of the target word itself\n", " return features\n", "\n", "\n", "\n", "def wsd_word_features(instance, vocab, dist=3):\n", " \"\"\"\n", " Create a featureset where every key returns False unless it occurs in the\n", " instance's context\n", " \"\"\"\n", " features = defaultdict(lambda:False)\n", " features['alwayson'] = True\n", " #cur_words = [w for (w, pos) in i.context]\n", " try:\n", " for (w, pos) in instance.context:\n", " if w in vocab:\n", " features[w] = True\n", " except ValueError:\n", " pass\n", " return features\n", "\n", "def extract_vocab_frequency(instances, stopwords=STOPWORDS_SET, n=300):\n", " \"\"\"\n", " Given a list of senseval instances, return a list of the n most frequent words that\n", " appear in their contexts (i.e., the sentences containing the target word). The output is in order\n", " of frequency, and for each word it also gives the number of instances in whose\n", " context that word appears.\n", " \"\"\"\n", " fd = nltk.FreqDist()\n", " for i in instances:\n", " (target, suffix) = i.word.split('-')\n", " words = (c[0] for c in i.context if not c[0] == target)\n", " for word in 
set(words) - set(stopwords):\n", " fd[word] += 1\n", " #for sense in i.senses:\n", " #cfd[sense][word] += 1\n", " return fd.most_common()[:n+1] # note: this slice can actually return n+1 entries\n", " \n", "def extract_vocab(instances, stopwords=STOPWORDS_SET, n=300):\n", " return [w for w,f in extract_vocab_frequency(instances,stopwords,n)]\n", " \n", "def wst_classifier(trainer, word, features, stopwords_list = STOPWORDS_SET, number=300, log=False, distance=3, confusion_matrix=False):\n", " \"\"\"\n", " This function takes as arguments:\n", " a trainer (e.g., NaiveBayesClassifier.train);\n", " a target word from senseval2 (you can find these out with senseval.fileids(),\n", " and they are 'hard.pos', 'interest.pos', 'line.pos' and 'serve.pos');\n", " a feature set (this can be wsd_context_features or wsd_word_features);\n", " a number (defaults to 300), which determines for wsd_word_features the number of\n", 
" most frequent context words (drawn from the whole training corpus) used to classify examples;\n", " a stopwords_list (defaults to STOPWORDS_SET), the words excluded when the\n", " vocabulary for wsd_word_features is built;\n", " a distance (defaults to 3) which determines the size of the window for wsd_context_features (if distance=3, then\n", " wsd_context_features gives 3 words and tags to the left and 3 words and tags to\n", " the right of the target word);\n", " log (defaults to False), which if set to True outputs the errors into a file errors.txt;\n", " confusion_matrix (defaults to False), which if set to True prints a confusion matrix.\n", "\n", " Calling this function splits the senseval data for the word into a training set and a test set (the way it does\n", " this is the same for each call of this function, because the argument to random.seed is specified,\n", " but removing this argument would make the training and testing sets different each time you build a classifier).\n", "\n", " It then trains the trainer on the training set to create a classifier that performs WSD on the word,\n", " using features (with number or distance where relevant).\n", "\n", " It then tests the classifier on the test set, and prints its accuracy on that set.\n", "\n", " If log==True, then the errors of the classifier over the test set are written to errors.txt.\n", " For each error four things are recorded: (i) the example number within the test data (this is simply the index of the\n", " example within the list test_data); (ii) the sentence that the target word appeared in; (iii) the\n", " (incorrect) derived label; and (iv) the gold label.\n", "\n", " If confusion_matrix==True, then calling this function prints out a confusion matrix, where each cell [i,j]\n", " indicates how often label j was predicted when the correct label was i (so the diagonal entries indicate labels\n", " that were correctly predicted).\n", " \"\"\"\n", " print \"Reading data...\"\n", " global _inst_cache\n", " if word not in _inst_cache:\n", " _inst_cache[word] = [(i, i.senses[0]) for i in senseval.instances(word)]\n", " 
events = _inst_cache[word][:]\n", " senses = list(set(l for (i, l) in events))\n", " instances = [i for (i, l) in events]\n", " vocab = extract_vocab(instances, stopwords=stopwords_list, n=number)\n", " print ' Senses: ' + ' '.join(senses)\n", "\n", " # Split the instances into a training and test set,\n", " #if n > len(events): n = len(events)\n", " n = len(events)\n", " random.seed(5444522)\n", " random.shuffle(events)\n", " training_data = events[:int(0.8 * n)]\n", " test_data = events[int(0.8 * n):n]\n", " # Train classifier\n", " print 'Training classifier...'\n", " classifier = trainer([(features(i, vocab, distance), label) for (i, label) in training_data])\n", " # Test classifier\n", " print 'Testing classifier...'\n", " acc = accuracy(classifier, [(features(i, vocab, distance), label) for (i, label) in test_data] )\n", " print 'Accuracy: %6.4f' % acc\n", " if log==True:\n", " #write error file\n", " print 'Writing errors to errors.txt'\n", " output_error_file = open('errors.txt', 'w')\n", " errors = []\n", " for (i, label) in test_data:\n", " guess = classifier.classify(features(i, vocab, distance))\n", " if guess != label:\n", " con = i.context\n", " position = i.position\n", " item_number = str(test_data.index((i, label)))\n", " word_list = []\n", " for (word, tag) in con:\n", " word_list.append(word)\n", " hard_highlighted = word_list[position].upper()\n", " word_list_highlighted = word_list[0:position] + [hard_highlighted] + word_list[position+1:]\n", " sentence = ' '.join(word_list_highlighted)\n", " errors.append([item_number, sentence, guess,label])\n", " error_number = len(errors)\n", " output_error_file.write('There are ' + str(error_number) + ' errors!' 
+ '\\n' + '----------------------------' +\n", " '\\n' + '\\n')\n", " for error in errors:\n", " output_error_file.write(str(errors.index(error)+1) +') ' + 'example number: ' + error[0] + '\\n' +\n", " ' sentence: ' + error[1] + '\\n' +\n", " ' guess: ' + error[2] + '; label: ' + error[3] + '\\n' + '\\n')\n", " output_error_file.close()\n", " if confusion_matrix==True:\n", " gold = [label for (i, label) in test_data]\n", " derived = [classifier.classify(features(i, vocab, distance)) for (i,label) in test_data]\n", " cm = nltk.ConfusionMatrix(gold,derived)\n", " print cm\n", " return cm\n", " \n", " \n", " \n", "def demo():\n", " print \"NB, with features based on 300 most frequent context words\"\n", " wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features)\n", " print\n", " print \"NB, with features based on word + pos in 6 word window\"\n", " wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features)\n", " print\n", "## print \"MaxEnt, with features based on word + pos in 6 word window\"\n", "## wst_classifier(MaxentClassifier.train, 'hard.pos', wsd_context_features)\n", " \n", "\n", "#demo()\n", "\n", "# Frequency Baseline\n", "##hard_sense_fd = nltk.FreqDist([i.senses[0] for i in senseval.instances('hard.pos')])\n", "##most_frequent_hard_sense = hard_sense_fd.max()\n", "##frequency_hard_sense_baseline = hard_sense_fd.freq(hard_sense_fd.max())\n", "\n", "##>>> frequency_hard_sense_baseline\n", "##0.79736902838679902\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Senseval corpus\n", "## Target words\n", "\n", "You can find out the set of target words for the senseval-2 corpus by running:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[u'hard.pos', u'interest.pos', u'line.pos', u'serve.pos']" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "senseval.fileids()" ] }, { "cell_type": "markdown", "metadata": {}, 
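"source": [ "Each fileid is just the target word plus a literal `.pos` suffix; the target's syntactic category is recorded separately on each instance, in its `word` field (e.g. `hard-a` for the adjective hard). A small illustrative sketch in plain Python (no corpus needed):\n", "\n", "```python\n", "# Recover the bare target words from the fileids shown above.\n", "fileids = ['hard.pos', 'interest.pos', 'line.pos', 'serve.pos']\n", "targets = [f.split('.')[0] for f in fileids]\n", "print(targets)\n", "```" ] }, { "cell_type": "markdown", "metadata": {},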
"source": [ "\n", "The result doesn't tell you the syntactic category of the words, but see the description of the corpus in Section 1 or Section 4.2. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Word senses\n", "\n", "Let's now find out the set of word senses for each target word in senseval. There is a function in wsd_code.py that returns this information. For example:\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['HARD1', 'HARD2', 'HARD3']\n" ] } ], "source": [ "print senses('hard.pos')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see this gives you `['HARD1', 'HARD2', 'HARD3']`\n", "\n", "So there are 3 senses for the adjective hard in the corpus. You'll shortly be looking at the data to guess what these 3 senses are.\n", "\n", "Now it's your turn:\n", "\n", "* What are the senses for the other target words? Find out by calling senses with appropriate arguments.\n", "* How many senses does each target have?\n", "* Let's now guess the sense definitions for HARD1, HARD2 and HARD3 by looking at the 100 most frequent open class words that occur in the context of each sense. 
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can find out what these 100 words are for HARD1 by running the following:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('harder', 475), ('time', 316), ('get', 224), ('would', 224), ('find', 206), ('make', 195), ('very', 187), ('out', 175), ('people', 165), ('believe', 154), ('even', 152), ('hardest', 139), ('going', 136), ('much', 130), ('like', 127), ('just', 121), ('now', 111), ('way', 107), ('imagine', 105), ('see', 104), ('may', 103), ('know', 103), ('other', 103), ('could', 96), ('into', 93), ('many', 92), ('new', 91), ('tell', 88), ('these', 85), ('come', 85), ('too', 83), ('without', 82), ('good', 81), ('part', 79), ('take', 78), ('think', 77), ('work', 76), ('thing', 75), ('really', 74), ('go', 73), ('two', 73), ('those', 71), ('things', 70), ('makes', 70), ('back', 70), ('still', 69), ('last', 67), ('made', 67), ('having', 66), ('first', 66), ('most', 64), ('our', 64), ('here', 63), ('life', 62), ('keep', 62), ('enough', 62), ('making', 61), ('lot', 60), ('year', 58), ('over', 58), ('long', 58), ('getting', 57), ('san', 55), ('off', 55), ('only', 55), ('home', 52), ('understand', 52), ('day', 52), ('through', 51), ('why', 51), ('might', 51), ('times', 50), ('sometimes', 50), ('little', 49), ('something', 49), ('does', 48), ('since', 48), ('game', 48), ('kind', 47), ('put', 46), ('while', 46), ('before', 45), ('down', 45), ('children', 45), ('job', 45), ('want', 43), ('become', 43), ('especially', 43), ('often', 43), ('right', 42), ('place', 42), ('around', 41), ('found', 41), ('once', 41), ('anything', 40), ('doing', 40), ('ever', 40), ('women', 40), ('man', 39), ('same', 39), ('although', 39)]\n" ] } ], "source": [ "instances1 = sense_instances(senseval.instances('hard.pos'), 'HARD1')\n", "features1 = extract_vocab_frequency(instances1, n=100)\n", "\n", "# Now let's try printing features1:\n", "print 
features1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Now it's your turn:\n", "\n", "* Call the above functions for HARD2 and HARD3.\n", "* Look at the resulting lists of 100 most frequent words for each sense, and try to define what HARD1, HARD2 and HARD3 mean.\n", "* These senses are actually the first three senses for the adjective hard in WordNet. You can enter a word and get its list of WordNet senses from [here](http://wordnetweb.princeton.edu/perl/webwn). Do this for hard, and check whether your estimated definitions for the 3 word senses are correct. " ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('work', 163), ('look', 85), ('take', 55), ('feelings', 32), ('long', 27), ('people', 23), ('our', 23), ('time', 22), ('good', 20), ('way', 19), ('should', 19), ('would', 18), ('just', 18), ('get', 18), ('very', 18), ('taking', 18), ('harder', 18), ('other', 18), ('line', 17), ('through', 16), ('now', 15), ('only', 15), ('even', 14), ('new', 14), ('little', 14), ('into', 14), ('day', 14), ('business', 13), ('like', 13), ('president', 12), ('those', 12), ('lot', 12), ('out', 12), ('over', 11), ('year', 11), ('members', 11), ('most', 11), ('last', 11), ('bush', 11), ('man', 11), ('does', 11), ('took', 10), ('much', 10), ('first', 10), ('san', 10), ('between', 10), ('money', 10), ('two', 10), ('really', 10), ('often', 10), ('next', 10), ('could', 10), ('state', 10), ('off', 10), ('well', 10), ('companies', 9), ('before', 9), ('make', 9), ('right', 9), ('market', 9), ('eyes', 9), ('show', 9), ('many', 9), ('evidence', 9), ('then', 9), ('both', 9), ('government', 9), ('freedom', 9), ('every', 8), ('each', 8), ('down', 8), ('may', 8), ('still', 8), ('going', 8), ('among', 8), ('these', 8), ('without', 8), ('company', 8), ('always', 8), ('see', 8), ('while', 8), ('fast', 8), ('women', 8), ('six', 8), ('own', 8), ('home', 8), ('made', 8), 
('school', 7), ('never', 7), ('want', 7), ('needed', 7), ('came', 7), ('men', 7), ('jose', 7), ('working', 7), ('come', 7), ('pay', 7), ('real', 7), ('industry', 7), ('learned', 7), ('american', 7)]\n", "[('rock', 33), ('place', 24), ('between', 24), ('surface', 24), ('soft', 21), ('cover', 17), ('out', 17), ('other', 17), ('into', 16), ('like', 16), ('made', 15), ('red', 13), ('plastic', 13), ('time', 12), ('water', 12), ('first', 12), ('harder', 12), ('good', 12), ('mr', 11), ('too', 11), ('box', 11), ('may', 11), ('company', 11), ('put', 11), ('off', 11), ('would', 10), ('make', 10), ('winter', 10), ('wheat', 10), ('two', 10), ('while', 10), ('should', 10), ('over', 9), ('until', 9), ('only', 9), ('through', 9), ('firm', 9), ('material', 9), ('even', 8), ('new', 8), ('green', 8), ('board', 8), ('little', 8), ('caught', 8), ('ground', 8), ('surfaces', 8), ('shell', 8), ('small', 8), ('last', 8), ('used', 8), ('edge', 8), ('spring', 8), ('packed', 8), ('dry', 7), ('before', 7), ('best', 7), ('much', 7), ('just', 7), ('now', 7), ('year', 7), ('million', 7), ('way', 7), ('these', 7), ('very', 7), ('found', 7), ('look', 7), ('use', 7), ('including', 7), ('day', 7), ('north', 7), ('book', 7), ('whose', 6), ('paperback', 6), ('work', 6), ('machine', 6), ('better', 6), ('against', 6), ('ice', 6), ('down', 6), ('form', 6), ('covers', 6), ('skin', 6), ('lrb', 6), ('along', 6), ('goods', 6), ('cold', 6), ('still', 6), ('state', 6), ('cheese', 6), ('long', 6), ('white', 6), ('enough', 6), ('wood', 6), ('across', 6), ('develop', 6), ('4', 6), ('cream', 6), ('rubber', 6), ('flat', 6), ('since', 6), ('dirt', 6)]\n" ] } ], "source": [ "instances2 = sense_instances(senseval.instances('hard.pos'), 'HARD2')\n", "features2 = extract_vocab_frequency(instances2, n=100)\n", "\n", "instances3 = sense_instances(senseval.instances('hard.pos'), 'HARD3')\n", "features3 = extract_vocab_frequency(instances3, n=100)\n", "print(features2)\n", "print(features3)" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "Hard has 3 senses \n", "Interest has 6 senses \n", "Line has 6 senses \n", "Serve has 4 senses \n", "\n", "Sense meanings:\n", " * HARD1: difficult\n", " * HARD2: objective/dispassionate\n", " * HARD3: the opposite of malleable; resistant to force." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The data structures: Senseval instances\n", "Having extracted all instances of a given sense, you can look at what the data structures in the corpus look like: " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "instances3[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " So the senseval corpus is a collection of information about a set of tagged sentences, where each entry or instance consists of 4 attributes:\n", "\n", "* word specifies the target word together with its syntactic category (e.g., hard-a means that the word is hard and its category is 'adjective');\n", "* position gives its position within the sentence (ignoring punctuation);\n", "* context represents the sentence as a list of pairs, each pair being a word or punctuation mark and its tag; and finally\n", "* senses is a tuple, each item in the tuple being a sense for that target word. In the subset of the corpus we are working with, this tuple consists of only one argument. But there are a few examples elsewhere in the corpus where there is more than one, representing the fact that the annotator couldn't decide which sense to assign to the word. For simplicity, our classifiers are going to ignore any non-first arguments to the attribute senses. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exploring different WSD classifiers\n", "You're now going to compare the performance of different classifiers that perform word sense disambiguation. You do this by calling the function wst_classifier. This function must have at least the following arguments specified by you:\n", "\n", " 1. 
A trainer; e.g., NaiveBayesClassifier.train (if you want you could also try MaxentClassifier.train, but this takes longer to train).\n", " 2. The target word that the classifier is going to learn to disambiguate: i.e., 'hard.pos', 'line.pos', 'interest.pos' or 'serve.pos'.\n", " 3. A feature set. The code allows you to use two kinds of feature sets:\n", "#### wsd_word_features\n", "This feature set is based on the set S of the n most frequent words that occur in the same sentence as the target word w across the entire training corpus (as you'll see later, you can specify the value of n, but if you don't specify it then it defaults to 300). For each occurrence of w, wsd_word_features represents its context as the subset of those words from S that occur in w's sentence. By default, the closed class words that are specified in STOPWORDS are excluded from the set S of most frequent words. But as we'll see later, you can also include closed class words in S, or re-define closed class words in any way you like! If you want to know what closed class words are excluded by default, then simply type STOPWORDS at the Python prompt. \n", "#### wsd_context_features\n", "This feature set represents the context of a word w as the sequence of m pairs (word, tag) that occur before w and the sequence of m pairs (word, tag) that occur after w. As we'll see shortly, you can specify the value of m (e.g., m=1 means the context consists of just the immediately prior and immediately subsequent word-tag pairs); otherwise, m defaults to 3. 
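\n", "\n", "The window logic behind the context features can be illustrated on a toy tagged sentence, without loading the corpus at all (an illustrative sketch mirroring wsd_context_features; the sentence and indices are invented for the example):\n", "\n", "```python\n", "# Toy stand-in for wsd_context_features: word features within dist\n", "# positions on either side of the target at index ind.\n", "sent = [('the', 'DT'), ('very', 'RB'), ('hard', 'JJ'), ('chocolate', 'NN'), ('crust', 'NN')]\n", "ind, dist = 2, 2\n", "features = {}\n", "for i in range(max(0, ind - dist), ind):\n", "    features['left-context-word-%s(%s)' % (ind - i, sent[i][0])] = True\n", "for i in range(ind + 1, min(ind + dist + 1, len(sent))):\n", "    features['right-context-word-%s(%s)' % (i - ind, sent[i][0])] = True\n", "print(sorted(features))\n", "```\n", "\n", "This yields the four keys left-context-word-1(very), left-context-word-2(the), right-context-word-1(chocolate) and right-context-word-2(crust).\n", "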
\n", " \n", " \n", "## Now let's train our first classifier\n", "Type the following to the Python shell: " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8362\n" ] } ], "source": [ "wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features) " ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8697\n" ] } ], "source": [ "wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features) " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading data...\n", " Senses: cord division product text phone formation\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.7133\n", "Reading data...\n", " Senses: cord division product text phone formation\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.7470\n" ] } ], "source": [ "wst_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_word_features) \n", "wst_classifier(NaiveBayesClassifier.train, 'line.pos', wsd_context_features) " ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading data...\n", " Senses: SERVE6 SERVE10 SERVE12 SERVE2\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.7534\n", "Reading data...\n", " Senses: SERVE6 SERVE10 SERVE12 SERVE2\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8470\n" ] } ], "source": [ "wst_classifier(NaiveBayesClassifier.train, 'serve.pos', wsd_word_features) \n", 
"wst_classifier(NaiveBayesClassifier.train, 'serve.pos', wsd_context_features) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In other words, the adjective hard is tagged with 3 senses in the corpus (HARD1, HARD2 and HARD3), and the Naive Bayes classifier using the feature set based on the 300 most frequent context words yields an accuracy of 0.8362. \n", "\n", "#### Now it's your turn:\n", "\n", "Use wst_classifier to train a classifier that disambiguates hard using wsd_context_features. Build classifiers for line and serve as well, using the word features and then the context features.\n", "\n", "* What's more accurate for disambiguating 'hard.pos', wsd_context_features or wsd_word_features?\n", "* Does the same hold true for line.pos and serve.pos? Why do you think that might be?\n", "* Why is it not fair to compare the accuracy of the classifiers across different target words? \n", "\n", "The context features perform better than the word features on all the\n", "target words. This could be for several reasons. One is that the data\n", "is too sparse on the word features (especially given that this is\n", "training on the 300 most frequent words, and it might even do better on a\n", "smaller set of frequent words). Another reason could be that the\n", "senseval corpus contains material from pretty much the same genre\n", "across all training data, and word features tend to perform better\n", "when the different word senses come from different text genres\n", "within the training corpus. These are just two possible reasons;\n", "discuss with your classmates what other valid reasons there might be.\n", "\n", "It does not make sense to compare accuracies, as there are different numbers of senses for each word. Since there are 3 senses for hard, the random classifier (see below) would get 1/3 accuracy, while serve has 4 possibilities and thus 1/4 accuracy for the random classifier; by this measure serve is harder to get right than hard. 
Disambiguating different words may also be intrinsically harder or easier, depending on how distinct the senses are. \n", " \n", "# Baseline models\n", "Just how good is the accuracy of these WSD classifiers? To find out, we need a baseline. There are two we consider here:\n", "\n", "1. A model which assigns a sense at random.\n", "2. A model which always assigns the most frequent sense. \n", "\n", "### Now it's your turn:\n", "\n", "* What is the accuracy of the random baseline model for hard.pos?\n", "* To compute the accuracy of the frequency baseline model for hard.pos, we need to find out the frequency distribution of the three senses in the corpus: \n", "\n", "The random baseline for \"hard\" would have an accuracy of 0.3333.\n", "The frequency baseline is computed below.\n", "The most frequent sense for \"hard\" is HARD1.\n", "Obviously, the frequency baseline is a better model than the random\n", "one." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('HARD1', 3455), ('HARD2', 502), ('HARD3', 376)]\n", "0.797369028387\n" ] } ], "source": [ "hard_sense_fd = nltk.FreqDist([i.senses[0] for i in\n", "senseval.instances('hard.pos')])\n", "print(hard_sense_fd.most_common())\n", "\n", "frequency_hard_sense_baseline = hard_sense_fd.freq('HARD1')\n", "print(frequency_hard_sense_baseline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " In other words, the frequency baseline has an accuracy of approx. 0.797. What is the most frequent sense for 'hard.pos'? And is the frequency baseline a better model than the random model?\n", "* Now compute the accuracy of the frequency baseline for other target words; e.g. 'line.pos'. 
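\n", "\n", "The most-frequent-sense baseline can also be sketched without NLTK at all; the toy labels below stand in for what [i.senses[0] for i in senseval.instances(...)] would give you:\n", "\n", "```python\n", "# Most-frequent-sense baseline from a list of gold labels (toy data).\n", "labels = ['product'] * 6 + ['phone'] * 3 + ['cord']\n", "counts = {}\n", "for s in labels:\n", "    counts[s] = counts.get(s, 0) + 1\n", "mfs = max(counts, key=counts.get)\n", "baseline = counts[mfs] / float(len(labels))\n", "print('%s %.3f' % (mfs, baseline))\n", "```\n", "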
" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[('product', 2217), ('phone', 429), ('text', 404), ('division', 374), ('cord', 373), ('formation', 349)]\n", "0.534732272069\n" ] } ], "source": [ "line_sense_fd = nltk.FreqDist([i.senses[0] for i in\n", "senseval.instances('line.pos')])\n", "print(line_sense_fd.most_common())\n", "\n", "frequency_line_sense_baseline = line_sense_fd.freq('product')\n", "print(frequency_line_sense_baseline)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Rich features vs. sparse data\n", "In this part of the tutorial we are going to vary the feature sets and compare the results. As well as being able to choose between wsd_context_features and wsd_word_features, you can also vary the following:\n", "\n", "#### wsd_context_features\n", "\n", "You can vary the number of word-tag pairs before and after the target word that you include in the feature vector. You do this by specifying the argument distance to the function wst_classifier. For instance, the following creates a classifier that uses 2 words to the left and right of the target word: \n", " \n", " wst_classifier(NaiveBayesClassifier.train, 'hard.pos', \n", "\t\twsd_context_features, distance=2)\n", "\n", "What about distance 1?\n", "#### wsd_word_features\n", "You can vary the closed class words that are excluded from the set of most frequent words, and you can vary the size of the set of most frequent words. 
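Before turning those knobs, it may help to see roughly what a wsd_word_features-style extractor does with them. The sketch below uses hypothetical names (`build_vocab`, `word_features`) for illustration only — the real code in wsd_code.py may differ in detail:

```python
from collections import Counter

def build_vocab(contexts, stopwords_list, number):
    """The `number` most frequent context words, with stopwords removed."""
    counts = Counter(w for ctx in contexts for w in ctx
                     if w not in stopwords_list)
    return [w for w, _ in counts.most_common(number)]

def word_features(context, vocab):
    """One boolean feature per vocabulary word: does it occur here?"""
    present = set(context)
    return {w: (w in present) for w in vocab}

# Tiny worked example: two contexts of the target word "hard".
contexts = [['the', 'test', 'was', 'very', 'hard'],
            ['the', 'rock', 'face', 'was', 'hard']]
vocab = build_vocab(contexts, stopwords_list={'the', 'was'}, number=3)
features = word_features(contexts[0], vocab)
```

Each such feature dictionary, paired with the instance's sense label, is the kind of (features, label) pair that NaiveBayesClassifier.train consumes.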
For instance, the following results in a model which uses the 100 most frequent words including closed class words:\n", "\n", " wst_classifier(NaiveBayesClassifier.train, 'hard.pos', \n", " \t\twsd_word_features, stopwords_list=[], number=100)\n", " \n", " \n", "#### Now it's your turn:\n", "\n", "Build several WSD models for 'hard.pos', including at least the following: for the wsd_word_features version, vary number between 100, 200 and 300, and vary the stopwords_list between [] (i.e., the empty list) and STOPWORDS; for the wsd_context_features version, vary the distance between 1, 2 and 3, and vary the stopwords_list between [] and STOPWORDS.\n", "\n", "* Why does setting number to less than 300 seem to improve the word model? Why does making the context window before and after the target word smaller than 3 improve the model?\n", "* Why does including closed class words in the word model improve overall performance? **Hint**: For each sense of hard, construct a list of all its instances in the training data using the function sense_instances (see instances1 above). Then, call, for example, extract_vocab_frequency(instances1, stopwords=[], n=100) and compare this with what you get for instances2 and instances3.\n", "\n", "\n", "**I was not able to replicate the results that were found when this lab was created with regard to an increase in accuracy when the number of words decreases. If anyone has some code replicating these results, please let me know. It is possible that something may have changed since this lab was originally written. The following solutions answer the questions as stated.**\n", "\n", "Setting the number of most frequent words used in the\n", "wsd_word_features set to less than 300 confirms the earlier\n", "suspicion from section 5, that the data is too sparse to cope with a\n", "feature set as rich as the 300 most frequent words. 
\n", "Similarly for the smaller context windows.\n", "\n", "Including closed class words improves performance. One can see from\n", "the distinct lists of closed class words that are constructed for each\n", "sense of \"hard\" that the distributions of closed class words with respect to word sense\n", "are quite distinct and therefore informative. Furthermore, by\n", "including closed class words within the context window one *excludes*\n", "open class words that may be, say, 5 or 6 words away from the target\n", "word and are hence less informative clues for the target word sense.\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "word features with number: 100 and no stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8281\n", "word features with number: 100 and stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8293\n", "word features with number: 200 and no stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8351\n", "word features with number: 200 and stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8362\n", "word features with number: 300 and no stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8431\n", "word features with number: 300 and stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8362\n", "context features with number: 1 and no stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.9123\n", 
"context features with number: 1 and stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.9123\n", "context features with number: 2 and no stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8812\n", "context features with number: 2 and stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8812\n", "context features with number: 3 and no stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8697\n", "context features with number: 3 and stopwords\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8697\n" ] } ], "source": [ "for n in [100, 200, 300]:\n", " for stopwords in [[], STOPWORDS]:\n", " stop = 'stopwords' if stopwords else 'no stopwords'\n", " print('word features with number: {} and {}'.format(n, stop))\n", " wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number=n, stopwords_list=stopwords) \n", "\n", "for n in [1, 2, 3]:\n", " for stopwords in [[], STOPWORDS]:\n", " stop = 'stopwords' if stopwords else 'no stopwords'\n", " print('context features with number: {} and {}'.format(n, stop))\n", " wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features,stopwords_list=stopwords, distance=n) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* It seems slightly odd that the word features for 'hard.pos' include harder and hardest. Try using a stopwords list which adds them to STOPWORDS: is the effect what you expected? Can you explain it?\n", "\n", "The accuracy goes down. This might be expected if a particular word sense would be more likely to appear together with harder and hardest. 
This means that removing the two words would remove relevant information which would be replaced by some very infrequent words. " ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8362\n", "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8328\n" ] } ], "source": [ "wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number=300, stopwords_list=STOPWORDS)\n", "wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_word_features, number=300, stopwords_list=STOPWORDS+['harder', 'hardest'])\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Error analysis\n", "The function wst_classifier allows you to explore the errors of the model it creates:\n", "\n", "#### Confusion Matrix\n", "\n", "You can output a confusion matrix as follows: " ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8812\n", " | H H H |\n", " | A A A |\n", " | R R R |\n", " | D D D |\n", " | 1 2 3 |\n", "------+-------------+\n", "HARD1 |<636> 37 19 |\n", "HARD2 | 10 <77> 8 |\n", "HARD3 | 20 9 <51>|\n", "------+-------------+\n", "(row = reference; col = test)\n", "\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wst_classifier(NaiveBayesClassifier.train, 'hard.pos',\n", " wsd_context_features, distance=2, confusion_matrix=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Note that the 
rows in the matrix are the gold labels, and the columns are the estimated labels. Recall that the diagonal represents the number of items that the model gets right. \n", "#### Errors\n", "\n", "You can also output each error from the test data into a file errors.txt that gets written to the same directory from which you're running Python. For example:\n", "\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.8812\n", "Writing errors to errors.txt\n", " | H H H |\n", " | A A A |\n", " | R R R |\n", " | D D D |\n", " | 1 2 3 |\n", "------+-------------+\n", "HARD1 |<636> 37 19 |\n", "HARD2 | 10 <77> 8 |\n", "HARD3 | 20 9 <51>|\n", "------+-------------+\n", "(row = reference; col = test)\n", "\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wst_classifier(NaiveBayesClassifier.train, 'hard.pos',\n", " wsd_context_features, distance=2, confusion_matrix=True, log=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This produces the following first few lines in errors.txt: " ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 103 errors!\r\n", "----------------------------\r\n", "\r\n", "1) example number: 19\r\n", " sentence: the san jose museum of art auxiliary 's recent debut fashion show luncheon will be a HARD act to follow .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "2) example number: 28\r\n", " sentence: once your tummy is HARD enough , they 'll stop the faucet and push on your tummy so you 'll throw up all the water . 
''\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "3) example number: 58\r\n", " sentence: `` i feel that the HARD court is my best surface overall , \" courier said .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "4) example number: 72\r\n", " sentence: one admires the inventive interplay of HARD , tusky forms and vulnerable belly without being in the least moved by the torture .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "5) example number: 74\r\n", " sentence: ( hbox ) ; some matzo balls are HARD , some fall apart , but marcie 's mom , aileen gugenheim , guarantees this recipe , which has been handed down from generation to generation .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "6) example number: 77\r\n", " sentence: `` i want to work very HARD for the next four , five years and be able around 65 to say to myself , 'ok , you should slow down . '\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "7) example number: 82\r\n", " sentence: in common with many locks of early date , no HARD floor was provided between the mass concrete lock walls , although timber baulks of substantial section were set in the earth floor to act as struts between the bases of the walls .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "8) example number: 97\r\n", " sentence: medea , in a towering rage about the marriage of her lover , jason , to a local princess , storms and steams around the stage , making dire noises , rending her garments and unperming her hair ( the HARD way , by pulling it out ) .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "9) example number: 103\r\n", " sentence: `` the city allowed officers to be overworked so much they became HARD and cold , \" pugh said in an interview last week .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "10) example number: 105\r\n", " sentence: wave of consolidation ; with HARDER times ahead , industry observers expect a wave of consolidation among the 12 , 000 banks and 2 , 000 thrifts serving 260 million 
americans .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "11) example number: 109\r\n", " sentence: oak , one of the HARDEST trees in california , is valued for its long burn in the hearth .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "12) example number: 114\r\n", " sentence: as a measure of the extent to which the united nations would go to prevent saddam from creating HARD to-detect biological weapons , the plan sets standards for bacteriological laboratories in iraq .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "13) example number: 118\r\n", " sentence: the HARDER the choice , the more willing the league is to wade in .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "14) example number: 119\r\n", " sentence: marc girardelli , a favorite along with tomba , fell on the first run , the victim of a HARD and unusually slick course .\r\n", " guess: HARD3; label: HARD1\r\n", "\r", "\r\n", "15) example number: 128\r\n", " sentence: corrections counselor ric hyland , who did HARD time for armed robbery 25 years ago , stood shoulder to shoulder with the 31-year-old groom , whose criminal record includes 42 misdemeanor and four felony convictions for drug and alcohol - related crimes and three terms in san quentin .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "16) example number: 132\r\n", " sentence: after a HARD day 's night in a british court , apple computer inc . 
and apple corps decided to come together and settle a 2-year-old case pitting the computer giant against the beatles .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "17) example number: 146\r\n", " sentence: it also makes the job HARD .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "18) example number: 150\r\n", " sentence: `` we learned the HARD way that you can 't celebrate until the final whistle , \" he said .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "19) example number: 175\r\n", " sentence: that makes me the last network radio comic in america , HARD though that may be to believe . ''\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "20) example number: 190\r\n", " sentence: you 're conscious of the fact that your feet hurt , that the city pavements are HARD .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "21) example number: 191\r\n", " sentence: i think probably the first year would have been the HARDEST . ''\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "22) example number: 198\r\n", " sentence: `` it was a HARD game , \" bulldogs coach jim sweeney said .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "23) example number: 199\r\n", " sentence: american catholics learned a HARD lesson when exclusively ethnic `` national '' churches common in the east and midwest closed when parishioners moved to the suburbs .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "24) example number: 209\r\n", " sentence: so HARD that when tambrands inc . advertised in the army times , the navy times and the air force times offering to send tampax directly to women soldiers in the gulf , the company received `` a couple thousand requests , \" said bruce garren , spokesman for the lake success , n . y . 
, company .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "25) example number: 212\r\n", " sentence: on his own garage door he found that the opener reversed when it struck the HARD block but crushed a cardboard box -- a material that one assumes more closely represents people .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "26) example number: 221\r\n", " sentence: he fears that publication of his name would complicate his search , HARD enough for a man of 54 .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "27) example number: 225\r\n", " sentence: the 50 paintings and 60 works on paper in the show consist largely of geometric reductions of the human form -- all HARD lines and kaleidoscopic constructions .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "28) example number: 226\r\n", " sentence: times were HARD , and money and jobs were scarce .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "29) example number: 242\r\n", " sentence: `` in the HARD life of politics it is well known that no platform nor any program advanced by either major american party has any purpose beyond expressing emotion '' .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "30) example number: 249\r\n", " sentence: as the name suggests , a green plantain has a HARD bright green skin .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "31) example number: 261\r\n", " sentence: it should feel heavy and not too HARD ; its skin should be thick and without any discolored patches ; it should smell nice but not too strongly .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "32) example number: 271\r\n", " sentence: i thought the mezz amore -- made with almonds and bittersweet chocolate -- were too HARD and dry .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "33) example number: 281\r\n", " sentence: boys will be in dark pants , tri-cornered hats and HARD soled shoes .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "34) example number: 294\r\n", " sentence: because of the HARD 
choices , schneiter recommended that the council study all the alternatives in the controversial traffic plan before them .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "35) example number: 295\r\n", " sentence: but the panel could n 't find HARD evidence that prozac or other anti-depressant drugs cause people to commit suicide or violent acts .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "36) example number: 296\r\n", " sentence: generally , a novice sandblaster can remove the old paint , but not without cutting into soft grain while leaving HARD grain virtually untouched .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "37) example number: 317\r\n", " sentence: i 'm a HARD match .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "38) example number: 320\r\n", " sentence: as HARD as it is to believe , with the exception of waits ' well-named wolfie , no one leaves much of an impression .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "39) example number: 328\r\n", " sentence: it is used for making buttons and other small , HARD objects of turnery .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "40) example number: 336\r\n", " sentence: a this is a HARDER question , because while the oil and gasoline markets are closely linked , they do not track each other exactly .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "41) example number: 339\r\n", " sentence: dent or no , finding the illicit gardens is n 't as HARD as some people would think .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "42) example number: 344\r\n", " sentence: they are convincingly non-human , exceedingly clever and even -- the HARDEST trick -- quite funny .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "43) example number: 352\r\n", " sentence: the attorney general ordered federal prosecutors to `` target the most violent offenders in each community and put them away for HARD time in federal prisons . 
''\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "44) example number: 361\r\n", " sentence: the town has a variety of rates based on flatland areas and HARD to-serve areas , but the most common service is for two cans per household , which will go from $ 10 . 30 to $ 11 . 72 per month .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "45) example number: 364\r\n", " sentence: `` david bonior has earned this position the HARD way , \" said rep . john lewis , d-ga . , `` by doing the nitty-gritty , unglamorous work each and every day . ''\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "46) example number: 386\r\n", " sentence: the standard , of course , is very different from the HARD , expensive glitter of west germany .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "47) example number: 397\r\n", " sentence: here are some tips to help you enjoy garlic : ; ( check ) buying -- cloves should be big , plump and HARD .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "48) example number: 404\r\n", " sentence: `` this ( elbow ) injury has been easier to take in some ways than the back problem , but HARDER in other ways , \" montana said last week in one of his final pre-surgery interviews .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "49) example number: 411\r\n", " sentence: next came the really HARD parts : serving a hot main course and a frozen dessert .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "50) example number: 414\r\n", " sentence: note : cinnamon red hots are small red candies slightly HARDER than jelly beans , available in the baking section of supermarkets .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "51) example number: 429\r\n", " sentence: without ever raring back and letting go with the HARD stuff , harris opened the stakes for a starting job with a fine performance before a crowd of 8 , 087 .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "52) example number: 455\r\n", " sentence: ( box ) the shell should be HARD when pinched 
.\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "53) example number: 456\r\n", " sentence: `` it 's going to be a long , HARD winter for a lot of people , \" said mayor abbie covington .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "54) example number: 471\r\n", " sentence: the fences are a lot HARDER in oakland than there were in portland , ore .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "55) example number: 482\r\n", " sentence: there are few HARD explanations for the wide disparity between men and women .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "56) example number: 483\r\n", " sentence: `` that 's one of the HARDEST things .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "57) example number: 500\r\n", " sentence: perhaps i should n 't be too HARD on myself .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "58) example number: 507\r\n", " sentence: if loan covenants are bent or broken , you may be in for a HARD time that gets progressively worse .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "59) example number: 518\r\n", " sentence: candice also learned the HARD way that time is a crucial element in some research .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "60) example number: 541\r\n", " sentence: growing numbers of poor and non-english-speaking students just make the job HARDER , he pointed out .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "61) example number: 552\r\n", " sentence: most of us would be willing to admit that forgiveness comes HARD .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "62) example number: 575\r\n", " sentence: considered the HARDEST but fastest method ( a few days to a week ) .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "63) example number: 587\r\n", " sentence: now , lamott says , canin 's gone through a couple of HARD , tense years writing `` blue river . 
''\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "64) example number: 595\r\n", " sentence: `` sometimes , a rookie has to learn the HARD way , \" padres manager greg riddoch said .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "65) example number: 597\r\n", " sentence: `` it 's been one of the HARDEST things , \" beckenhauer-heller said .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "66) example number: 610\r\n", " sentence: perhaps fmc should knock off the tank treads and put on rubber tires , replace the 5 mpg engines with more efficient ones , replace the HARD seats with soft cushions , and mark down the prices a wee bit to compete with other luxury cars .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "67) example number: 616\r\n", " sentence: the plotters of the coup were men of the HARD right , who wanted to prevent change .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "68) example number: 617\r\n", " sentence: `` it was autumn then and the ground was getting HARD because the nights were cold and leaves from the maples around the stadium blew across the practice fields in gusts of wind and the girls were beginning to put polo coats over their sweaters .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "69) example number: 622\r\n", " sentence: while employers are rewarding the service of persian gulf veterans , they might also remember the longer , HARDER , lonelier service of vietnam veterans .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "70) example number: 633\r\n", " sentence: even though we use cold , filtered water in our tea kettle , it has developed a HARD , cracked gray crust on the bottom during the few months we 've been here .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "71) example number: 634\r\n", " sentence: it was a big lesson , a HARD lesson , but we needed to learn it . 
' -- anna petrovna fedorova , 68 , former communist party member in moscow ; .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "72) example number: 662\r\n", " sentence: `` he 's real caring and giving , \" said jennine zinner , a fellow counselor at the south county center , `` but also cynical , embittered by a HARD life .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "73) example number: 663\r\n", " sentence: first interstate bancorp , HARD hit by bad real estate loans , has announced that it will eliminate 3 , 500 jobs by the end of the year and reorganize its sprawling 13-state operation in an effort to cut $ 250 million a year in costs .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "74) example number: 669\r\n", " sentence: instead of suntans and souvenirs , the indians brought back pieces of HARD rocks called monterey - banded chert , which they made into stone knives and spear tips .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "75) example number: 679\r\n", " sentence: for merrily milbert of san jose 's rose garden area , aesthetics , sturdiness , and the excellence of its cooking caused her to lay out about $ 1 , 500 for a HARD to-find 1947 wedgwood with six burners , two ovens and two broilers .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "76) example number: 680\r\n", " sentence: he added that the injection system also needs to be redesigned , using HARDER tungsten carbide materials .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "77) example number: 681\r\n", " sentence: there was a HARD , bitter edge in his voice as he accused federal prosecutors of hounding him while ignoring wrongdoing by white officials .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "78) example number: 685\r\n", " sentence: `` at least i went for the HARDEST thing , \" he said .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "79) example number: 689\r\n", " sentence: lumps of a very dark and HARD ferruginous sandstone , recalling a tropical laterite , can 
also be found with ironstained purbeck slabs in the surrounding arable fields .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "80) example number: 694\r\n", " sentence: tom fast , of scotts valley , stepped out of the crowd into a HARD embrace .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "81) example number: 695\r\n", " sentence: excavation of the floor of the lock followed up the completed HARD core drain in 15 ft sections , the trench sheeting being removed and a 2 ft thick mass concrete slab of 7 . 4 : 1 total aggregate cement ratio being emplaced to within 3 ft of each wall .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "82) example number: 698\r\n", " sentence: we 're finding out the HARD way that we can 't . ''\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "83) example number: 718\r\n", " sentence: the sky is a HARD enamel blue despite the autumn odds , and robin williams engages in some dry humor .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "84) example number: 719\r\n", " sentence: suspected porn ; san francisco art photographer jock sturges learned this lesson the HARD way in april 1990 after the manager of a photo-processing lab reported to the fbi that he had received negatives of nude young girls .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "85) example number: 724\r\n", " sentence: he remembered his parents talking of maine , where they came from , a vague and distant place girded with rocks and bound by HARD winters .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "86) example number: 732\r\n", " sentence: a well , a budget agreement , people know instinctively if not by HARD lessons learned , represents a great deal of sound and fury and little else .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "87) example number: 736\r\n", " sentence: phat nguyen , a vietnamese field representative for heritage cablevision , learned this the HARD way when he began tracking down local relatives of the refugees tran had taped in the 
philippines .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "88) example number: 740\r\n", " sentence: it was n 't so HARD before .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "89) example number: 741\r\n", " sentence: john devitt says he learned the HARD way that some large corporations will do anything for a buck .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "90) example number: 745\r\n", " sentence: i 've heard that many women spritz and spritz until even the split ends are HARD enough to cut glass .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "91) example number: 766\r\n", " sentence: but we no longer face such a HARD choice .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "92) example number: 769\r\n", " sentence: `` the one thing there was universal support for was the fact that he had been through a HARD time . ''\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "93) example number: 771\r\n", " sentence: `` doing the play before a sea of very HARD men , i felt this eerie kind of power .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "94) example number: 785\r\n", " sentence: they said the pasta was too HARD .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "95) example number: 786\r\n", " sentence: i 'm so HARD on myself that even though i 'd get good responses , though i 've been very lucky , i 've always wondered .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "96) example number: 791\r\n", " sentence: last week 's rail journey of a retired southern pacific steam locomotive from south san francisco down the peninsula on its way to sacramento , reminded me , fondly i want you to know , of trips to the city on ugly , soot-stained cars with HARD seats and drinking `` fountains '' filled with warm water .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "97) example number: 800\r\n", " sentence: i would go into the reading room , where solid silence was packed HARD and green up as far as the bowl of the dome , and walk over , always , 
to desk d-4 .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "98) example number: 820\r\n", " sentence: fleer : the players are supposed to `` leap right off '' the cards because of pastel backgrounds that appear to recede inside a HARD green border .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "99) example number: 824\r\n", " sentence: `` but sometimes it is HARD work .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "100) example number: 827\r\n", " sentence: most are easy to grow , but the task of transplanting the desired native tree from the woods to your lawn can be a HARD one .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "101) example number: 830\r\n", " sentence: usa 's `` lightning field '' is a HARDER road , particularly for viewers who 've seen all the ancient-curse movies they need for a lifetime .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "102) example number: 844\r\n", " sentence: water becomes stiff and HARD as clear stone .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "103) example number: 856\r\n", " sentence: tip : money workshop ; c onsumer credit counseling service of santa clara valley will present a free workshop on managing money during HARD economic times on saturday , april 13 , from 9 a . m . to noon at the sunnyvale senior citizen center , 820 mckinley st . , room 201 , sunnyvale .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n" ] } ], "source": [ "cat errors.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The example number is the (list) index of the error in the test_data. \n", "\n", "#### Now it's your turn:\n", "\n", "1. Choose your best performing model from Section 7, and train the model again, but add the arguments confusion_matrix=True and log=True.\n", "2. Using the confusion matrix, identify which sense is the hardest one for the model to estimate.\n", "2. Look in errors.txt for examples where that hardest word sense is the correct label. Do you see any patterns or systematic errors? 
If so, can you think of a way to adapt the feature vector so as to improve the model? " ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Reading data...\n", " Senses: HARD1 HARD2 HARD3\n", "Training classifier...\n", "Testing classifier...\n", "Accuracy: 0.9123\n", "Writing errors to errors.txt\n", " | H H H |\n", " | A A A |\n", " | R R R |\n", " | D D D |\n", " | 1 2 3 |\n", "------+-------------+\n", "HARD1 |<675> 8 9 |\n", "HARD2 | 24 <69> 2 |\n", "HARD3 | 29 4 <47>|\n", "------+-------------+\n", "(row = reference; col = test)\n", "\n" ] }, { "data": { "text/plain": [ "" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "wst_classifier(NaiveBayesClassifier.train, 'hard.pos', wsd_context_features, distance=1, confusion_matrix=True, log=True) " ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.0245664739884\n", "0.273684210526\n", "0.4125\n" ] } ], "source": [ "# Per-sense error rate (off-diagonal counts / row total) from the confusion matrix above\n", "print((8+9)/(675+8+9))\n", "print((24+2)/(69+2+24))\n", "print((29+4)/(29+4+47))" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "There are 76 errors!\r\n", "----------------------------\r\n", "\r\n", "1) example number: 19\r\n", " sentence: the san jose museum of art auxiliary 's recent debut fashion show luncheon will be a HARD act to follow .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "2) example number: 28\r\n", " sentence: once your tummy is HARD enough , they 'll stop the faucet and push on your tummy so you 'll throw up all the water . 
''\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "3) example number: 30\r\n", " sentence: its different flavor results from its firing in a HARD seasoned-wood , brick oven .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "4) example number: 32\r\n", " sentence: she would not lie relaxed and peaceful , as though she were resting , but iron HARD , as though she were still fighting .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "5) example number: 66\r\n", " sentence: the boy tried to make the age-changing voice sound HARD , and it might have sounded ludicrous had it not been for the reckless chill shimmering in cat-yellow eyes .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "6) example number: 72\r\n", " sentence: one admires the inventive interplay of HARD , tusky forms and vulnerable belly without being in the least moved by the torture .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "7) example number: 74\r\n", " sentence: ( hbox ) ; some matzo balls are HARD , some fall apart , but marcie 's mom , aileen gugenheim , guarantees this recipe , which has been handed down from generation to generation .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "8) example number: 77\r\n", " sentence: `` i want to work very HARD for the next four , five years and be able around 65 to say to myself , 'ok , you should slow down . 
'\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "9) example number: 82\r\n", " sentence: in common with many locks of early date , no HARD floor was provided between the mass concrete lock walls , although timber baulks of substantial section were set in the earth floor to act as struts between the bases of the walls .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "10) example number: 105\r\n", " sentence: wave of consolidation ; with HARDER times ahead , industry observers expect a wave of consolidation among the 12 , 000 banks and 2 , 000 thrifts serving 260 million americans .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "11) example number: 107\r\n", " sentence: arlene had a HARD voice , too , this time .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "12) example number: 109\r\n", " sentence: oak , one of the HARDEST trees in california , is valued for its long burn in the hearth .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "13) example number: 118\r\n", " sentence: the HARDER the choice , the more willing the league is to wade in .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "14) example number: 119\r\n", " sentence: marc girardelli , a favorite along with tomba , fell on the first run , the victim of a HARD and unusually slick course .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "15) example number: 123\r\n", " sentence: estonians see no hope but foreign pressure for changing gorbachev 's HARD line against baltic independence .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "16) example number: 128\r\n", " sentence: corrections counselor ric hyland , who did HARD time for armed robbery 25 years ago , stood shoulder to shoulder with the 31-year-old groom , whose criminal record includes 42 misdemeanor and four felony convictions for drug and alcohol - related crimes and three terms in san quentin .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "17) example number: 132\r\n", " sentence: after a HARD day 's night in a british 
court , apple computer inc . and apple corps decided to come together and settle a 2-year-old case pitting the computer giant against the beatles .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "18) example number: 190\r\n", " sentence: you 're conscious of the fact that your feet hurt , that the city pavements are HARD .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "19) example number: 209\r\n", " sentence: so HARD that when tambrands inc . advertised in the army times , the navy times and the air force times offering to send tampax directly to women soldiers in the gulf , the company received `` a couple thousand requests , \" said bruce garren , spokesman for the lake success , n . y . , company .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "20) example number: 212\r\n", " sentence: on his own garage door he found that the opener reversed when it struck the HARD block but crushed a cardboard box -- a material that one assumes more closely represents people .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "21) example number: 221\r\n", " sentence: he fears that publication of his name would complicate his search , HARD enough for a man of 54 .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "22) example number: 223\r\n", " sentence: in a heated house , however , more water is advisable , as if the plants are kept too dry , they tend to become so HARD that the stems are slow to `` break '' in spring .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "23) example number: 232\r\n", " sentence: it was only when additional HARD intelligence data appeared two years later that the pentagon and much of the rest of the state department were informed of the suspicious algerian reactor .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "24) example number: 249\r\n", " sentence: as the name suggests , a green plantain has a HARD bright green skin .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "25) example number: 255\r\n", " sentence: remember , this is a 
curvy road with few HARD shoulders .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "26) example number: 261\r\n", " sentence: it should feel heavy and not too HARD ; its skin should be thick and without any discolored patches ; it should smell nice but not too strongly .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "27) example number: 281\r\n", " sentence: boys will be in dark pants , tri-cornered hats and HARD soled shoes .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "28) example number: 295\r\n", " sentence: but the panel could n 't find HARD evidence that prozac or other anti-depressant drugs cause people to commit suicide or violent acts .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "29) example number: 296\r\n", " sentence: generally , a novice sandblaster can remove the old paint , but not without cutting into soft grain while leaving HARD grain virtually untouched .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "30) example number: 300\r\n", " sentence: almost everything was rendered in a HARD , fast , ham-fisted manner that was about as subtle as a kick in the gut .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "31) example number: 320\r\n", " sentence: as HARD as it is to believe , with the exception of waits ' well-named wolfie , no one leaves much of an impression .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "32) example number: 328\r\n", " sentence: it is used for making buttons and other small , HARD objects of turnery .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "33) example number: 339\r\n", " sentence: dent or no , finding the illicit gardens is n 't as HARD as some people would think .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "34) example number: 386\r\n", " sentence: the standard , of course , is very different from the HARD , expensive glitter of west germany .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "35) example number: 397\r\n", " sentence: here are some tips to help you enjoy garlic : ; ( 
check ) buying -- cloves should be big , plump and HARD .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "36) example number: 414\r\n", " sentence: note : cinnamon red hots are small red candies slightly HARDER than jelly beans , available in the baking section of supermarkets .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "37) example number: 429\r\n", " sentence: without ever raring back and letting go with the HARD stuff , harris opened the stakes for a starting job with a fine performance before a crowd of 8 , 087 .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "38) example number: 449\r\n", " sentence: jio said , looking at the men dressed in orange vests and white HARD hats , `` it kind of makes you sad when you see that thing . . . but , progress .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "39) example number: 455\r\n", " sentence: ( box ) the shell should be HARD when pinched .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "40) example number: 463\r\n", " sentence: fixing them is HARD work , but is a possible project for the ambitious remodeler .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "41) example number: 471\r\n", " sentence: the fences are a lot HARDER in oakland than there were in portland , ore .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "42) example number: 482\r\n", " sentence: there are few HARD explanations for the wide disparity between men and women .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "43) example number: 497\r\n", " sentence: most of the time the change is incremental , in the evocative words of max weber , `` a strong and slow boring of HARD boards . 
''\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "44) example number: 500\r\n", " sentence: perhaps i should n 't be too HARD on myself .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "45) example number: 521\r\n", " sentence: `` but the other brushes , with their HARD bristles and tiny heads , are only good for cleaning the small , white part of your teeth . ''\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "46) example number: 552\r\n", " sentence: most of us would be willing to admit that forgiveness comes HARD .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "47) example number: 587\r\n", " sentence: now , lamott says , canin 's gone through a couple of HARD , tense years writing `` blue river . ''\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "48) example number: 610\r\n", " sentence: perhaps fmc should knock off the tank treads and put on rubber tires , replace the 5 mpg engines with more efficient ones , replace the HARD seats with soft cushions , and mark down the prices a wee bit to compete with other luxury cars .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "49) example number: 616\r\n", " sentence: the plotters of the coup were men of the HARD right , who wanted to prevent change .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "50) example number: 617\r\n", " sentence: `` it was autumn then and the ground was getting HARD because the nights were cold and leaves from the maples around the stadium blew across the practice fields in gusts of wind and the girls were beginning to put polo coats over their sweaters .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "51) example number: 622\r\n", " sentence: while employers are rewarding the service of persian gulf veterans , they might also remember the longer , HARDER , lonelier service of vietnam veterans .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "52) example number: 633\r\n", " sentence: even though we use cold , filtered water in our tea kettle , it has developed a HARD , 
cracked gray crust on the bottom during the few months we 've been here .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "53) example number: 663\r\n", " sentence: first interstate bancorp , HARD hit by bad real estate loans , has announced that it will eliminate 3 , 500 jobs by the end of the year and reorganize its sprawling 13-state operation in an effort to cut $ 250 million a year in costs .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "54) example number: 669\r\n", " sentence: instead of suntans and souvenirs , the indians brought back pieces of HARD rocks called monterey - banded chert , which they made into stone knives and spear tips .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "55) example number: 674\r\n", " sentence: the new honey-haired somers is just as sexy , but she 's now thoughtful , articulate and self-assured without the HARD edges of the past .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "56) example number: 680\r\n", " sentence: he added that the injection system also needs to be redesigned , using HARDER tungsten carbide materials .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "57) example number: 681\r\n", " sentence: there was a HARD , bitter edge in his voice as he accused federal prosecutors of hounding him while ignoring wrongdoing by white officials .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "58) example number: 689\r\n", " sentence: lumps of a very dark and HARD ferruginous sandstone , recalling a tropical laterite , can also be found with ironstained purbeck slabs in the surrounding arable fields .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "59) example number: 694\r\n", " sentence: tom fast , of scotts valley , stepped out of the crowd into a HARD embrace .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "60) example number: 695\r\n", " sentence: excavation of the floor of the lock followed up the completed HARD core drain in 15 ft sections , the trench sheeting being removed and a 2 ft thick mass 
concrete slab of 7 . 4 : 1 total aggregate cement ratio being emplaced to within 3 ft of each wall .\r\n", " guess: HARD2; label: HARD3\r\n", "\r\n", "61) example number: 718\r\n", " sentence: the sky is a HARD enamel blue despite the autumn odds , and robin williams engages in some dry humor .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "62) example number: 724\r\n", " sentence: he remembered his parents talking of maine , where they came from , a vague and distant place girded with rocks and bound by HARD winters .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "63) example number: 732\r\n", " sentence: a well , a budget agreement , people know instinctively if not by HARD lessons learned , represents a great deal of sound and fury and little else .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n", "64) example number: 740\r\n", " sentence: it was n 't so HARD before .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "65) example number: 745\r\n", " sentence: i 've heard that many women spritz and spritz until even the split ends are HARD enough to cut glass .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "66) example number: 763\r\n", " sentence: the one leadership quality palumbis lacked , he acquired through a HARD lesson during his first three years at stanford .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "67) example number: 764\r\n", " sentence: we learned the HARD way , by trial and error , but we learned .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "68) example number: 771\r\n", " sentence: `` doing the play before a sea of very HARD men , i felt this eerie kind of power .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "69) example number: 785\r\n", " sentence: they said the pasta was too HARD .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "70) example number: 786\r\n", " sentence: i 'm so HARD on myself that even though i 'd get good responses , though i 've been very lucky , i 've always wondered .\r\n", " guess: HARD1; label: 
HARD2\r\n", "\r\n", "71) example number: 800\r\n", " sentence: i would go into the reading room , where solid silence was packed HARD and green up as far as the bowl of the dome , and walk over , always , to desk d-4 .\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "72) example number: 813\r\n", " sentence: the cutbacks , especially those involving removal of management layers , indicate that most young managers will have a HARDER and longer time getting not only salary increases but actual promotions .\r\n", " guess: HARD3; label: HARD1\r\n", "\r\n", "73) example number: 820\r\n", " sentence: fleer : the players are supposed to `` leap right off '' the cards because of pastel backgrounds that appear to recede inside a HARD green border .\r\n", " guess: HARD3; label: HARD2\r\n", "\r\n", "74) example number: 833\r\n", " sentence: if i get a little HARD on myself because i 'm not doing something right , i remind myself that i 'm there to have fun . ''\r\n", " guess: HARD1; label: HARD2\r\n", "\r\n", "75) example number: 844\r\n", " sentence: water becomes stiff and HARD as clear stone .\r\n", " guess: HARD1; label: HARD3\r\n", "\r\n", "76) example number: 856\r\n", " sentence: tip : money workshop ; c onsumer credit counseling service of santa clara valley will present a free workshop on managing money during HARD economic times on saturday , april 13 , from 9 a . m . to noon at the sunnyvale senior citizen center , 820 mckinley st . , room 201 , sunnyvale .\r\n", " guess: HARD2; label: HARD1\r\n", "\r\n" ] } ], "source": [ "cat errors.txt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "HARD3 is the most difficult sense for the classifier. There isn't one right answer for this question. It is more of a question to invite speculation and let you think about your classifier. The most obvious pattern is that HARD1 is extremely dominant in the number of examples. 
This can be seen in the classification results: the majority of errors come from HARD2 and HARD3 being misclassified as HARD1. It should also be noted that the classifier only looks at words at a distance of 1, which really isn't very much context. Due to data sparsity, I imagine a lot of the error simply comes from particular contexts not having been seen before. For example, HARD shoulders and HARD soled shoes seem like obvious examples of HARD3, but they have been classed as HARD1; most likely these expressions simply weren't found in the training data. Another thing that happens quite often is that HARD ends up next to an adverb or another adjective, such as slightly HARDER, which could appear next to any sense of the word. What might be useful is to always include information about the POS context in which the word appears, or parsing information: HARD3 generally attaches to nouns, like hard seats or hard hats, while HARD1 has more of a spread. \n", "\n", "Again, there isn't one correct answer; see what you can spot and try to come up with some reasonable suggestions." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }