{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# FNLP: Lab Session 6\n", "### Pointwise Mutual Information - Finding Collocations\n", "\n", "# Aim\n", "\n", "The aims of this lab session are to 1) familiarize the students with pointwise mutual information (PMI) 2) show\n", "how to apply PMI for the task of finding word collocations 3) identify shortcomings of this approach . By the\n", "end of this lab session, you should be able to:\n", "* Compute the PMI.\n", "* Apply PMI to find word collocations in a corpus." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction\n", "\n", "Before continuing with this lab sheet, please download a copy of the lab template (lab5.py) for this lab from\n", "the FNLP course website. This template contains code that you should use as a starting point when attempting\n", "the exercises for this lab.\n", "In this lab we consider the task of identifying word collocations in a corpus to demonstrate the use of Pointwise\n", "Mutual Information (PMI).\n", "\n", "$PMI(x_i, y_j) = log\\frac{P(X=x_i,Y=y_j)}{P(X=x i)P(Y =y_j)}$\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import operator\n", "\n", "import nltk\n", "\n", "from nltk.probability import FreqDist,MLEProbDist\n", "\n", "from nltk.corpus import gutenberg\n", "from nltk import bigrams\n", "from nltk.corpus import stopwords\n", "from math import log\n", "from pprint import pprint\n", "\n", "\n", "class FasterMLEProbDist(MLEProbDist):\n", " '''Speed up prob lookup for large sample sizes'''\n", " def __init__(self,freqdist):\n", " self._N=freqdist.N()\n", " if self._N == 0:\n", " self._empty = True\n", " else:\n", " self._empty = False\n", " self._pq=float(self._N)\n", " MLEProbDist.__init__(self,freqdist)\n", "\n", " def prob(self, sample):\n", " '''use cached quotient for division'''\n", " if self._empty:\n", " return 0\n", " else:\n", " return float(self._freqdist[sample]) / self._pq\n", "\n", "\n", "stopwords = stopwords.words('english')\n", "\n", "def Filter1(word):\n", " return word.isalpha()\n", "\n", "def Filter2(word):\n", " return (word.isalpha() and not(word.lower() in stopwords))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data set consists of bigrams extracted from the Moby Dick novel." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "sentences=gutenberg.sents('melville-moby_dick.txt')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Estimating the Joint and Marginal Probability Distributions\n", "\n", "In order to compute the PMI we need the joint probability $P(X = x, Y = y)$ and the marginal probabilities $P (X = x)$ and $P (Y = y)$. In our case $P (X = x)$ and $P (Y = y)$ will be the unigram probabilities of the two\n", "words that are considered to form a collocation, and $P (X = x, Y = y)$ will be the bigram probability.\n", "\n", "### Exercise 1:\n", "\n", "In this exercise we will compute the joint and marginal probability distributions for the word bigrams. You will\n", "have to fill in two functions to achieve this:\n", "* The function BuildData receives as parameters a list of sentences and the name of a Filter function. Two Filter functions are already defined, one eliminates just non-alphanumeric tokens, the other eliminates stop-words as well.\n", "* The helper function ex1 should return a list of bigrams and a list of unigrams extracted from the sentences.\n", "\n", "Specifically:\n", "\n", "1. Build the two data structures in the BuildData function. Lowercase the words and eliminate unigrams and bigrams that do not pass the filter.\n", " Remember, help is your friend:\n", " `help(nltk.bigrams)`\n", "2. The function ex1 receives as parameters a list of bigrams and a list of unigrams and returns the corresponding probability distributions. Construct a FreqDist for each of the two lists. Transform each FreqDist into a probability distribution using the FasterMLEProbDist estimator.\n", " Again, help is your friend:\n", " `help(nltk.probability.FasterMLEProbDist)`\n", "\n", "Now uncomment the test code. Compare the top 30 most frequent bigrams arising from the two different filters.\n", "\n", "Most of the former are made up of closed-class words. If we eliminate stopwords some interesting bigramsover content words show up." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def BuildData(sentences,Filter):\n", " unigrams_list = [w.lower() for s in sentences for w in s if Filter(w)]\n", " bigrams_list = [bigram for bigram in bigrams(unigrams_list)]\n", " return bigrams_list, unigrams_list" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO: using the data build the probability distribution over bigrams and unigrams using FasterMLEProbDist\n", "def ex1(bigrams, unigrams):\n", " #TODO build the frequency distribution over bigrams and unigrams\n", " bigramFreqDist = FreqDist(bigrams)\n", " unigramFreqDist = FreqDist(unigrams)\n", "\n", " #TODO build the probability distribuition from the above frequency distributions using the FasterMLEProbDist estimator\n", " bigramProbDist = FasterMLEProbDist(bigramFreqDist)\n", " unigramProbist = FasterMLEProbDist(unigramFreqDist)\n", "\n", " return bigramProbDist, unigramProbist\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def test1():\n", " bigrams, unigrams = BuildData(sentences,Filter1)\n", " \n", " bigramProbDist1, unigramProbist1 = ex1(bigrams, unigrams)\n", " print \"type: \",type(bigramProbDist1) # \n", " print \"type: \",type(unigramProbist1) # \n", " \n", " MLESorted = bigramProbDist1.freqdist().most_common(30)\n", " print \"Using filter 1:\",pprint(MLESorted)\n", " print \"type: \\n\",type(MLESorted) # \n", "\n", " bigrams, unigrams = BuildData(sentences,Filter2)\n", " bigramProbDist, unigramProbist = ex1(bigrams, unigrams)\n", " MLESorted = bigramProbDist.freqdist().most_common()[:30]\n", " print \"Using filter 2:\",pprint(MLESorted)\n", " print \"\\n\"\n", "\n", " return bigramProbDist1, unigramProbist1\n", " \n", "\n", "# TEST EXERCISE 1 - return values will be used for exercise 2\n", "bigramProbDist, unigramProbDist = test1()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Computing the Pointwise Mutual Information\n", "\n", "In the previous section we estimated the joint and marginal probability distributions from the data. In this\n", "section we use these distributions to compute the PMI for a given sample. In order to avoid multiplication of\n", "small floating point numbers (probabilities), we can rewrite the formula for PMI as:\n", "$P M I(x i , y j ) = log P (x i , y j ) − (log P (x i ) + log P (y j ))$\n", "\n", "### Exercise 2:\n", "\n", "The template of the function that you have to implement takes two parameters: the bigram and unigram\n", "probability distributions.\n", "1. Create a list of pairs of samples in the distribution and its PMI, using the logprob() function of the two FasterMLEProbDist instances.\n", "2. Make a dictionary from that list\n", "3. Sort the list of pairs.\n", "\n", "Uncomment the test code. Look at the PMI of some bigrams. We can see that the PMI for ”sperm whale” is 5 binary orders of magnitude greater than the PMI of ”of the”. Can you think of some reasons why the PMI for ”old man” is not as low as we would expect?\n", "\n", "What can you observe by looking at the top 10 bigrams according to the PMI? How do low counts affect the PMI?\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#TODO: for each sample in the bigramProbDist compute the PMI and add {sample,PMI} pair to the PMI dict\n", "#input: bigram and unigram distribution of type nltk.probability.FasterMLEProbDist\n", "#output: PMI dict, PMIsorted list\n", "def ComputePMI(bpd, upd):\n", " # data structure to hold the PMI score for every bigram\n", " #TODO: compute PMI for each sample in bpd and add the (sample,PMI) pair to the dict\n", " PMIs=[(sample,\n", " bpd.logprob(sample) - (upd.logprob(sample[0]) +\n", " upd.logprob(sample[1])))\n", " for sample in bpd.samples()]\n", "\n", " #list of (bigrams,PMI) sorted according to the PMI score\n", " PMIsorted = sorted(PMIs,key=operator.itemgetter(1), reverse=True)\n", " return dict(PMIs), PMIsorted\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def test2(bpd,upd):\n", " \n", " print \"type: \",type(bpd) # \n", " print \"type: \",type(upd) # \n", " \n", " PMIs, PMIsorted = ComputePMI(bpd, upd)\n", " print \"type: \", type(PMIs) # \n", " print \"type: \", type(PMIsorted) # \n", " \n", "\n", " print \"sperm whale %0.2f\" % PMIs[(\"sperm\",\"whale\")]\n", " print \"of the %0.2f\" % PMIs[(\"of\",\"the\")]\n", " print \"old man %0.2f\" % PMIs[(\"old\",\"man\")] #comment why it's not as expected close to 0 -> because not enough data\n", " print \"one side %0.2f\" % PMIs[(\"one\",\"side\")]\n", " print \"\\n\"\n", " bcount=bpd.freqdist()\n", " print \"Top 10 by PMI\"\n", " print \"%s\\t%s\\t%s\"%('PMI','n','pair')\n", " for pair in PMIsorted[:10]:\n", " print \"%0.2f\\t%d\\t%s\" % (pair[1], bcount[pair[0]], pair[0])\n", "\n", " return PMIsorted\n", "\n", "# TEST EXERCISE 2 - return values will be used for exercise 3\n", "PMIsorted = test2(bigramProbDist, unigramProbDist)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3.\n", "\n", "In the previous exercise we found that the PMI is very sensitive to data sparsity. Bigrams composed of low\n", "frequency words are ranked higher than bigrams with high frequency words according to PMI. One way to fix\n", "this issue is by putting a threshold for the frequency of words.\n", "Edit the ex3 function to do this:\n", "1. Filter the full list of bigrams and their corresponding PMI to include only bigrams with frequency greater than 30.\n", "\n", "What does a negative score say about a bigram?\n", "\n", "**Optionally** you can eliminate stop-words from the corpus by applying the second Filter function, then recom-\n", "pute the PMI and investigate the top and last bigrams." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def ex3(PMIsorted,bpd):\n", " #TODO Build a list of bigrams and their corresponding PMI for bigrams composed of words with frequency grater than 30\n", " #PMIs should look like [(bigram, PMI[bigram])] where bigram is a key in the bigramProbDist\n", " bcount = bpd.freqdist()\n", " return [pair for pair in PMIsorted if bcount[pair[0]]>30]" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def test3(PMIsorted, bpd):\n", "\n", " high_freq = ex3(PMIsorted, bpd)\n", "\n", " print \"\\nTop 20 by PMI where pair count>30\"\n", " print \"%s\\t%s\\t%s\"%('PMI','n','pair')\n", " bcount = bpd.freqdist()\n", " for pair in high_freq[:20]:\n", " print \"%0.2f\\t%d\\t%s\" % (pair[1], bcount[pair[0]], pair[0])\n", "\n", " print \"\\nBottom 20 by PMI where pair count>30\"\n", " for pair in high_freq[-20:]:\n", " print \"%s\\t%0.2f\\t%d\" % (pair[0], pair[1], bpd.freqdist()[pair[0]])\n", " \n", "# TEST EXERCISE 3\n", "test3(PMIsorted,bigramProbDist)\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.13" } }, "nbformat": 4, "nbformat_minor": 2 }