{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# FNLP: Lab Session 2\n", "\n", "### Smoothing and Authorship Identification\n" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Import libraries used for this lab\n", "\n", "import nltk\n", "import sys\n", "\n", "# Import the gutenberg corpus\n", "from nltk.corpus import gutenberg\n", "\n", "# Import NLTK's NgramModel module (for building language models)\n", "# It has been removed from standard NLTK, so we access it in a special package installation\n", "sys.path.extend(['/group/ltg/projects/fnlp', '/group/ltg/projects/fnlp/packages_2.6'])\n", "from nltkx import NgramModel\n", "\n", "# Import probability distributions\n", "from nltk.probability import LaplaceProbDist\n", "from nltk.probability import LidstoneProbDist\n", "from nltk.probability import SimpleGoodTuringProbDist\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aim\n", "\n", "The aims of this lab session are to explore 1) the Laplace, Lidstone and Good-\n", "Turing smoothing methods for language models and 2) the use of language\n", "models in authorship identification. Successful completion of this lab will help\n", "you solidify your understanding of smoothing (important not just for LMs but\n", "all over NLP), perplexity (important also for assignment 1), and one type of\n", "text classification (authorship identification). By the end of this lab session,\n", "you should be able to:\n", "\n", " * Compute smoothed bigram probabilities by hand for simple smoothing methods.\n", " * Train an nltk language model with smoothing for unseen n-grams\n", " * Make use of language models to identify the author of a text" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "In some of the exercises, you’ll use NLTK’s `NgramModel` to train language models. \n", "\n", "The initialisation method for NgramModel is:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "def __init__(self, n, train, pad_left=False, pad_right=False,\n", " estimator=None, *estimator_args, **estimator_kwargs):\n", " \"\"\"\n", " Creates an ngram language model to capture patterns in n consecutive\n", " words of training text. 
 " from the text and may allow generation of ngrams not seen during\n", " training.\n", "\n", " :param n: the order of the language model (ngram size)\n", " :type n: C{int}\n", " :param train: the training text\n", " :type train: C{iterable} of C{string} or C{iterable} of C{iterable} of C{string} \n", " :param estimator: a function for generating a probability distribution---defaults to MLEProbDist\n", " :type estimator: a function that takes a C{ConditionalFreqDist} and\n", " returns a C{ConditionalProbDist}\n", " :param pad_left: whether to pad the left of each sentence with an (n-1)-gram of padding markers\n", " :type pad_left: bool\n", " :param pad_right: whether to pad the right of each sentence with a padding marker\n", " :type pad_right: bool\n", " :param estimator_args: Extra arguments for estimator.\n", " These arguments are usually used to specify extra\n", " properties for the probability distributions of individual\n", " conditions, such as the number of bins they contain.\n", " Note: For backward-compatibility, if no arguments are specified, the\n", " number of bins in the underlying ConditionalFreqDist are passed to\n", " the estimator as an argument.\n", " :type estimator_args: (any)\n", " :param estimator_kwargs: Extra keyword arguments for the estimator\n", " :type estimator_kwargs: (any)\n", " \"\"\"\n", "\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Smoothing\n", "\n", "In the final exercise of Lab 1, you were asked to calculate the probability of\n", "a word given its context, using a bigram language model with no smoothing.\n", "For the first two word-context pairs, these bigrams had been seen in the data\n", "used to train the language model. For the third word-context pair, the bigram\n", "had not been seen in the training data, which led to an estimated probability\n", "of 0.0.\n", "\n", "Zero probabilities for unseen n-grams cause problems. Suppose, for example, you\n", "take a bigram language model and use it to score an automatically generated\n", "sentence of 10 tokens (say the output of a machine translation system). If one\n", "of the bigrams in that sentence is unseen, the probability of the sentence will\n", "be zero.\n", "\n", "Smoothing is a method of assigning probabilities to unseen n-grams. As language models are typically trained using large amounts of data, any n-gram not\n", "seen in the training data is unlikely to be seen in other (test) data. A\n", "good smoothing method is therefore one that assigns a fairly small probability\n", "to unseen n-grams.\n", "\n", "We’ll implement two different smoothing methods: Laplace (add-one) and Lidstone (add-alpha), and we will also consider the effects of backoff, which is\n", "implemented in NLTK’s NgramModel.\n", "\n", "(NLTK also includes implementations of Laplace and Lidstone smoothing in its\n", "probability module; if you wish to look at the implementations they are here:\n", "http://www.nltk.org/api/nltk.html#module-nltk.probability\n", "However, the point of this lab is to implement these simple methods yourself to\n", "make sure you understand them.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Maximum-Likelihood estimation\n", "\n", "Before implementing any smoothing, you should make sure you understand how\n", "to implement maximum likelihood estimation.\n",
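 "\n", "As a reminder, the maximum likelihood estimate for a bigram model is just the relative frequency of the bigram in the training data:\n", "\n", "$$ P_{MLE}(w \\mid c) = \\frac{C(c, w)}{C(c)} $$\n", "\n", "where $C(\\cdot)$ counts occurrences in the training text.\n", "\n",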
 "In last week’s lab, we used NLTK\n", "to do this for us by training a bigram language model with an MLE estimator.\n", "We could then use the language model to find the MLE probability of any word\n", "given its context. Here, you’ll do the same thing but without using NLTK,\n", "just to make sure you understand how. We will also compare the smoothed\n", "probabilities you compute later to these MLE probabilities.\n", "\n", "#### Exercise 0\n", "\n", "Code has been provided that extracts all the words from Jane Austen’s “Sense\n", "and Sensibility”, and then computes a list of bigram tuples by pairing up each\n", "word in the corpus with the following word. Using these unigrams and bigrams,\n", "fill in the remaining code to compute the MLE probability of a word given a\n", "single word of context. Then uncomment the test code to compute the\n", "probabilities:\n", " \n", "1. $ P_{MLE}(“end”|“the”) $\n", "2. $ P_{MLE}(“the”|“end”) $\n", "\n", "Make sure your answers match the MLE probability estimates from Exercise 5\n", "of Lab 1, where we used NLTK to compute these estimates.\n", "\n" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MLE:\n", "Probability of 'end' given 'the': 0.00584652862363\n", "Probability of 'the' given 'end': 0.0\n" ] } ], "source": [ "#################### EXERCISE 0 ####################\n", "\n", "# Solution for exercise 0\n", "# Input: word (string), context (string)\n", "# Output: p (float)\n", "# Compute the unsmoothed (MLE) probability for word given the single word context\n", "def ex0(word,context):\n", " p = 0.0\n", "\n", " austen_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]\n", " austen_bigrams = zip(austen_words[:-1], austen_words[1:]) # list of bigrams as tuples\n", " # (above doesn't include begin/end of corpus: but basically this is fine)\n", "\n", " # Compute probability of word given context\n", " p = float(austen_bigrams.count((context, word))) / (austen_words.count(context))\n", "\n", " # Return probability\n", " return p\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Test your code with these examples\n", "print \"MLE:\"\n", "result0a = ex0('end','the')\n", "print \"Probability of \\'end\\' given \\'the\\': \" + str(result0a)\n", "result0b = ex0('the','end')\n", "print \"Probability of \\'the\\' given \\'end\\': \" + str(result0b)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Laplace (add-1)\n", "\n", "Laplace smoothing adds a value of 1 to the sample count for each “bin” (possible\n", "observation, in this case each possible bigram), and then takes the maximum\n", "likelihood estimate of the resulting frequency distribution.\n", "\n", "#### Exercise 1\n", "\n", "Assume that the size of the vocabulary is just the number of different words\n", "observed in the training data (that is, we will not deal with unseen words).\n", "Add code to the template to compute Laplace smoothed probabilities, again\n", "without using NLTK.\n", "\n", "Hint: if you have trouble, study the equations and example in Lecture 4.\n", "\n", "Now uncomment the test code and look at the estimates for:\n", "\n", "1. $P_{+1} (“end”|“the”)$\n", "2. $P_{+1} (“the”|“end”)$\n", "\n", "How do these probabilities differ from the MLE estimates?"
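, "\n", "\n", "For checking your answers by hand: writing $C(c, w)$ for the training count of the bigram, $C(c)$ for the count of the context word, and $V$ for the vocabulary size, the add-one estimate computed here is\n", "\n", "$$ P_{+1}(w \\mid c) = \\frac{C(c, w) + 1}{C(c) + V} $$\n", "\n", "so every possible bigram, seen or unseen, receives a non-zero probability."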
] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LAPLACE:\n", "Probability of 'end' given 'the': 0.00237913970308\n", "Probability of 'the' given 'end': 0.000154846701765\n" ] } ], "source": [ "#################### EXERCISE 1 ####################\n", "\n", "# Solution for exercise 1\n", "# Input: word (string), context (string)\n", "# Output: p (float)\n", "# Compute the Laplace smoothed probability for word given the single word context\n", "def ex1(word,context):\n", " p = 0.0\n", "\n", " austen_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]\n", " austen_bigrams = zip(austen_words[:-1], austen_words[1:]) # list of bigrams as tuples\n", " # (above doesn't include begin/end of corpus: but basically this is fine)\n", " V = len(set(austen_words)) # vocabulary size\n", "\n", " # Compute probability of word given context\n", " p = float(austen_bigrams.count((context, word))+1) / (austen_words.count(context)+V)\n", "\n", " # Return probability\n", " return p\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Test your code with these examples\n", "print \"LAPLACE:\"\n", "result1a = ex1('end','the')\n", "print \"Probability of \\'end\\' given \\'the\\': \" + str(result1a)\n", "result1b = ex1('the','end')\n", "print \"Probability of \\'the\\' given \\'end\\': \" + str(result1b)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Lidstone (add-alpha)\n", "\n", "In practice, Laplace smoothing assigns too much mass to unseen n-grams. The\n", "Lidstone method works in a similar way, but instead of adding 1, it adds a value\n", "between 0 and 1 to the sample count for each bin (in class we called this value\n", "alpha, NLTK calls it gamma).\n", "\n", "#### Exercise 2\n", "\n", "Fill in the code to compute Lidstone smoothed probabilities, then uncomment\n", "the test code and look at the probability estimates that are computed for the\n", "same bigrams as before using various values of alpha.\n", "\n", "What do you notice about using `alpha = 0` and `alpha = 1`? (Compare to the\n", "probabilities computed by the previous methods.)\n", "\n", "What about when `alpha = 0.01`? Are the estimated probabilities more similar\n", "to MLE or Laplace smoothing in this case?" 
] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "LIDSTONE, alpha=0.01:\n", "Probability of 'end' given 'the': 0.00575913341952\n", "Probability of 'the' given 'end': 8.40124338402e-05\n", "LIDSTONE, alpha=0:\n", "Probability of 'end' given 'the': 0.00584652862363\n", "Probability of 'the' given 'end': 0.0\n", "LIDSTONE, alpha=1:\n", "Probability of 'end' given 'the': 0.00237913970308\n", "Probability of 'the' given 'end': 0.000154846701765\n" ] } ], "source": [ "#################### EXERCISE 2 ####################\n", "# Solution for exercise 2\n", "# Input: word (string), context (string), alpha (float)\n", "# Output: p (float)\n", "# Compute the Lidstone smoothed probability for word given the single word context\n", "# Alpha is the smoothing parameter, normally between 0 and 1.\n", "def ex2(word,context,alpha):\n", " p =0.0\n", "\n", " austen_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]\n", " austen_bigrams = zip(austen_words[:-1], austen_words[1:]) # list of bigrams as tuples\n", " # (above doesn't include begin/end of corpus: but basically this is fine)\n", " V = len(set(austen_words)) # vocabulary size\n", " p = float(austen_bigrams.count((context, word))+alpha) / (austen_words.count(context)+V*alpha)\n", "\n", " # Return probability\n", " return p\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Test your code with these examples\n", "print \"LIDSTONE, alpha=0.01:\"\n", "result2a = ex2('end','the',.01)\n", "print \"Probability of \\'end\\' given \\'the\\': \" + str(result2a)\n", "result2b = ex2('the','end',.01)\n", "print \"Probability of \\'the\\' given \\'end\\': \" + str(result2b)\n", "print \"LIDSTONE, alpha=0:\"\n", "result2c = ex2('end','the',0)\n", "print \"Probability of \\'end\\' given \\'the\\': \" + str(result2c)\n", "result2d = ex2('the','end',0)\n", "print \"Probability of \\'the\\' given \\'end\\': \" + str(result2d)\n", "print \"LIDSTONE, alpha=1:\"\n", "result2e = ex2('end','the',1)\n", "print \"Probability of \\'end\\' given \\'the\\': \" + str(result2e)\n", "result2f = ex2('the','end',1)\n", "print \"Probability of \\'the\\' given \\'end\\': \" + str(result2f)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4 Backoff\n", "\n", "Now we will look at the effects of incorporating backoff in addition to some of\n", "these simple smoothing methods. In a bigram language model with backoff, the\n", "probability of an unseen bigram is computed by “backing off”: that is, if a word\n", "has never been seen in a particular context, then we compute its probability by\n", "using one fewer context words. Backing off from a bigram model (one word of\n", "context) therefore means we’d get estimates based on unigram frequencies (no\n", "context).\n", "\n", "The mathematical details of backoff are a bit complex to ensure all the prob-\n", "abilities sum to 1. You needn’t understand all the details of backoff but you\n", "should understand these basic principles:\n", "\n", " * Bigram probabilities for seen bigrams will be slightly lower than MLE in order to allocate some probability mass to unseen bigrams.\n", " * The unigram probabilities inside the backoff (i.e. 
 "\n", "That is, a word with a high corpus frequency will have a higher unigram\n", "backoff probability than a word with a low corpus frequency.\n", "\n", "Look back at the initialization method for NgramModel earlier in the lab. If\n", "you pass in MLEProbDist as the estimator (which we did in the last lab), then\n", "no backoff is used. However, with any other estimator (i.e., smoothing), the\n", "NgramModel does use backoff.\n", "\n", "#### Exercise 3\n", "\n", "Use the code we have provided to train a bigram language model using Jane\n", "Austen’s “Sense and Sensibility” and Laplace smoothing. (By using Laplace\n", "as the estimator, you are also turning on backoff in NgramModel.) Use this\n", "language model to compute the probability of the same bigrams we’ve been\n", "looking at all along. Uncomment the test code to see the results.\n", "\n", " * Compare the probabilities you get for these bigrams to what you got when you computed Laplace yourself. Why are the probabilities produced by NgramModel with Laplace smoothing different from the probabilities you computed yourself?\n", " * Now look at the estimated probabilities P'(“end”|“the”) and P'(“the”|“end”) as computed by the NgramModel and by the previous smoothing methods. Which method(s) produce larger differences between those probabilities? Do all the methods agree about which bigram has higher probability? If not, what is the reason for the difference?" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "BACKOFF WITH LAPLACE\n", "Probability of 'end' given 'the': 0.0023789133124\n", "Probability of 'the' given 'end': 0.0339454865864\n" ] } ], "source": [ "#################### EXERCISE 3 ####################\n", "# Solution for exercise 3\n", "# Input: word (string), context (string)\n", "# Output: p (float)\n", "def ex3(word,context):\n", " p = 0.0\n", "\n", " # Train a bigram language model using a LAPLACE estimator AND BACKOFF\n", " austen_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]\n", " lm = NgramModel(2,austen_words,estimator=lambda f,b: LaplaceProbDist(f,b+1))\n", " # Compute probability of word given context (note lm requires a list context)\n", " p = lm.prob(word,[context])\n", "\n", " # Return probability\n", " return p\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Test your code with these examples\n", "print \"BACKOFF WITH LAPLACE\"\n", "result3a = ex3('end','the')\n", "print \"Probability of \\'end\\' given \\'the\\': \" + str(result3a)\n", "result3b = ex3('the','end')\n", "print \"Probability of \\'the\\' given \\'end\\': \" + str(result3b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Authorship Identification\n", "\n", "## Cross-entropy\n", "\n", "In language modelling, a model is trained on a set of data (i.e. the training\n", "data). The cross-entropy of this model may then be measured on a test set\n", "(i.e. another set of data that is different from the training data) to assess how\n", "accurate the model is in predicting the test data.\n", "\n", "Another way to look at this is: if we used the trained model to generate new\n", "sentences by sampling words from its probability distribution, how similar would\n", "those new sentences be to the sentences in the test data? This interpretation\n",
 "allows us to use cross-entropy for authorship detection, as described below.\n", "\n", "nltkx.NgramModel contains the following cross-entropy method:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def entropy(self, text, pad_left=False, pad_right=False,\n", " verbose=False, perItem=False):\n", " \"\"\"\n", " Calculate the approximate cross-entropy of the n-gram model for a\n", " given evaluation text.\n", " This is the total negative log probability of the items in the text\n", " (or the average per item, if perItem is True).\n", " :param text: items to use for evaluation\n", " :type text: iterable(str)\n", " :param pad_left: whether to pad the left of each text with an (n-1)-gram\\\n", " of markers\n", " :type pad_left: bool\n", " :param pad_right: whether to pad the right of each sentence with an end\\\n", " marker\n", " :type pad_right: bool\n", " :param perItem: normalise for length if True\n", " :type perItem: bool\n", " \"\"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Exercise 4\n", "\n", "We can use cross-entropy in authorship detection. For example, suppose we have\n", "a language model trained on Jane Austen’s “Sense and Sensibility” (training\n", "data) plus the texts for two other novels (test data), one by Jane Austen and\n", "one by another author, but we don’t know which is which. We can work out the\n", "cross-entropy of the model on each of the texts and, from the scores, determine\n", "which of the two test texts was more likely written by Jane Austen.\n", "\n", "Use:\n", "\n", "* A trigram language model with a Lidstone probability distribution, trained on Jane Austen’s “Sense and Sensibility” (austen-sense.txt). N.B. The “f.B()+1” argument (already provided for you in the code) means that we lump together all the unseen n-grams as a single “unknown” token.\n", "* text a: austen-emma.txt (Jane Austen’s “Emma”)\n", "* text b: chesterton-ball.txt (G.K. Chesterton’s “The Ball and the Cross”)\n", "* `NgramModel`’s entropy function: `lm.entropy(...)` \n", " \n", "Compute both the total cross-entropy and the per-word cross-entropy of each text. 
(Separate function templates are provided.)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total cross-entropy for austen-emma.txt: 2811146.28844\n", "Total cross-entropy for chesterton-ball.txt: 1678081.01156\n", "Per-word cross-entropy for austen-emma.txt: 14.6090491799\n", "Per-word cross-entropy for chesterton-ball.txt: 17.3008744001\n" ] } ], "source": [ "#################### EXERCISE 4 ####################\n", "\n", "# Solution for exercise 4 - total entropy calculation\n", "# Input: lm (NgramModel language model), doc_name (string)\n", "# Output: e (float)\n", "def ex4_tot_entropy(lm,doc_name):\n", " e = 0.0\n", "\n", " # Construct a list of lowercase words from the document (test document)\n", " doc_words = [w.lower() for w in gutenberg.words(doc_name)]\n", "\n", " # Compute the total cross entropy of the text in doc_name\n", " e = lm.entropy(doc_words, perItem=False)\n", "\n", " # Return the entropy\n", " return e\n", "\n", "# Solution for exercise 4 - per-word entropy calculation\n", "# Input: lm (NgramModel language model), doc_name (string)\n", "# Output: e (float)\n", "def ex4_perword_entropy(lm,doc_name):\n", " e = 0.0\n", "\n", " # Construct a list of lowercase words from the document (test document)\n", " doc_words = [w.lower() for w in gutenberg.words(doc_name)]\n", "\n", " # Compute the normalized cross entropy of the text in doc_name\n", " e = lm.entropy(doc_words, perItem=True)\n", "\n", " # Return the entropy\n", " return e\n", "\n", "\n", "# Solution for exercise 4 - language model training\n", "# Input: doc_name (string)\n", "# Output: l (language model)\n", "def ex4_lm(doc_name):\n", " l = None\n", "\n", " # Construct a list of lowercase words from the document (training data for lm)\n", " doc_words = [w.lower() for w in gutenberg.words(doc_name)]\n", "\n", " # Train a trigram language model using doc_words and a Lidstone probability distribution with +0.01 added to the sample count for each bin\n", " l = NgramModel(3,doc_words,estimator=lambda f,b:nltk.LidstoneProbDist(f,0.01,f.B()+1))\n", "\n", " # Return the language model\n", " return l\n", "\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Test your code with these examples\n", "lm4 = ex4_lm('austen-sense.txt')\n", "result4a = ex4_tot_entropy(lm4,'austen-emma.txt')\n", "print \"Total cross-entropy for austen-emma.txt: \" + str(result4a)\n", "result4b = ex4_tot_entropy(lm4,'chesterton-ball.txt')\n", "print \"Total cross-entropy for chesterton-ball.txt: \" + str(result4b)\n", "result4c = ex4_perword_entropy(lm4,'austen-emma.txt')\n", "print \"Per-word cross-entropy for austen-emma.txt: \" + str(result4c)\n", "result4d = ex4_perword_entropy(lm4,'chesterton-ball.txt')\n", "print \"Per-word cross-entropy for chesterton-ball.txt: \" + str(result4d)" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "# 6 Going further\n", "## 6.1 Padding\n", "\n", "Redo exercise 4 setting `pad_left` and `pad_right` to `True` both when initialising\n", "the n-gram model and when computing entropy. What difference does this\n", "make?" 
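, "\n", "\n", "If you want a starting point, here is a minimal sketch based on the `__init__` and `entropy` signatures shown earlier (the exact behaviour of the padding options in `nltkx` is for you to investigate, so treat this as a sketch rather than a reference solution):\n", "\n", "```python\n", "# Train on Sense and Sensibility with padding turned on\n", "sense_words = [w.lower() for w in gutenberg.words('austen-sense.txt')]\n", "padded_lm = NgramModel(3, sense_words, pad_left=True, pad_right=True,\n", "                       estimator=lambda f,b: nltk.LidstoneProbDist(f, 0.01, f.B()+1))\n", "\n", "# Evaluate with the same padding flags and compare with your Exercise 4 numbers\n", "emma_words = [w.lower() for w in gutenberg.words('austen-emma.txt')]\n", "print padded_lm.entropy(emma_words, pad_left=True, pad_right=True, perItem=True)\n", "```"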
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.5" } }, "nbformat": 4, "nbformat_minor": 2 }