{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# FNLP: Lab Session 1\n", "### Corpora and Language Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Aim\n", "\n", "The aims of this lab session are to 1) explore the different uses of language in\n", "different documents, authored by different people and 2) introduce the construction of language models using Python’s Natural Language Tool Kit (NLTK).\n", "Successful completion of this lab is important as the first assignment for FNLP\n", "builds on some of the concepts and methods that are introduced here. By the\n", "end of this lab session, you should be able to:\n", "\n", "* Access the corpora provided in NLTK\n", "* Compute a frequency distribution\n", "* Train a language model\n", "* Use a language model to compute bigram probabilities" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Running NLTK and Python Help" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Running NLTK\n", "\n", "This year our recommended method for running labs is through Jupyter Notebooks.\n", "\n", "See http://www.inf.ed.ac.uk/teaching/courses/fnlp/Tutorials/labs.pdf for instructions and help with Jupyter.\n", "\n", "Try it out by importing NLTK:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import nltk" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Python Help \n", "\n", "Python contains an inbuilt help module that runs in an interactive mode. To\n", "run the interactive help, type:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`help()` will run until interrupted. If a cell is running it will block any other cell from running until it has completed. You can check if a cell is still running by looking at `In [*]:` to the left of any cell. If there is a `*` inside the brackets the cell is still running. As soon as the cell has stopped running the `*` will be replaced by a number. \n", "\n", "**Before moving on** you will need to interupt `help()` (make it stop running). To interupt running cells go to **`kernel/interrupt`** at the top of the webpage. You can also hit the **big black square button** right underneath (if you hover over it it will say interrupt kernel). This is equivalent to hitting CTRL-d to interrupt a running program in the terminal or the python shell." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you know the name of the module that you want to get help on, type:\n", "`import `\n", "`help()`\n", "try looking at the help documentation for `nltk`" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(nltk)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you know the name of the module and the method that you want to get help\n", "on, type `help(.)` (note you must have imported ``" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "The FNLP lab sessions will make use of the Natural Language Tool Kit (NLTK)\n", "for Python. NLTK is a platform for writing programs to process human language\n", "data, that provides both corpora and modules. For more information on NLTK,\n", "please visit: http://www.nltk.org/.\n", "\n", "For each exercise, edit the corresponding function in the notebook (e.g. ex1\n", "for Exercise 1), then run the lines which prepare for and invoke that\n", "function.\n", "\n", "If you’re unfamiliar with developing python code, you may want to look at the\n", "second lab for ANLP, which assumes much less background experience and has\n", "a detailed step-by-step guide to using python for the first time:\n", "\n", "http://www.inf.ed.ac.uk/teaching/courses/anlp/labs/lab2.html\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Accessing Corpora\n", "\n", "NLTK provides many corpora and covers many genres of text. Some of the\n", "corpora are listed below:\n", "\n", "* Gutenberg: out of copyright books\n", "* Brown: a general corpus of texts including novels, short stories and news\n", "articles\n", "* Inaugural: U.S. Presidential inaugural speeches\n", "\n", "To see a complete list of available corpora you can run:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "print(os.listdir(nltk.data.find(\"corpora\")))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each corpus contains a number of texts. We’ll work with the inaugural corpus,\n", "and explore what the corpus contains. Make sure you have imported the nltk\n", "module first and then load the inaugural corpus by typing the following:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from nltk.corpus import inaugural" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To list all of the documents in the inaugural corpus, run:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "inaugural.fileids()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From this point on we’ll work with President Barack Obama’s inaugural speech\n", "from 2009 (2009-Obama.txt). The contents of each document (in a corpus) may\n", "be accessed via a number of corpus readers. The plaintext corpus reader provides\n", "methods to view the raw text (raw), a list of words (words) or a list of sentences:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print(inaugural.raw('2009-Obama.txt'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(inaugural.words('2009-Obama.txt'))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(inaugural.sents('2009-Obama.txt'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1\n", "\n", "* Find the total number of words (tokens) in Obama’s 2009 speech \n", "\n", "* Find the total number of distinct words (word types) in the same speech" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def ex1(doc_name):\n", " # Use the plaintext corpus reader to access a pre-tokenised list of words\n", " # for the document specified in \"doc_name\"\n", " doc_words = inaugural.words(doc_name)\n", "\n", " # Find the total number of words in the speech\n", " total_words = len(doc_words)\n", "\n", " # Find the total number of DISTINCT words in the speech\n", " total_distinct_words = len(set(w.lower() for w in doc_words))\n", "\n", " # Return the word counts\n", " return (total_words, total_distinct_words)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test your solution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "speech_name = '2009-Obama.txt'\n", "(tokens,types) = ex1(speech_name)\n", "print(\"Total words in %s: %s\"%(speech_name,tokens))\n", "print(\"Total distinct words in %s: %s\"%(speech_name,types))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2\n", "\n", "Find the average word-type length of Obama’s 2009 speech" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def ex2(doc_name):\n", " doc_words = inaugural.words(doc_name)\n", "\n", " # Construct a list that contains the word lengths for each DISTINCT word in the document\n", " distinct_word_lengths = [len(w) for w in set(v.lower() for v in doc_words)]\n", "\n", " # Find the average word type length\n", " avg_word_length = float(sum(distinct_word_lengths)) / len(distinct_word_lengths)\n", "\n", " # Return the average word type length of the document\n", " return avg_word_length\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test your solution:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "speech_name = '2009-Obama.txt'\n", "result2 = ex2(speech_name)\n", "print(\"Average word type length for %s: %.3f\"%(speech_name,result2))\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Frequency Distribution\n", "\n", "A frequency distribution records the number of times each outcome of an ex-\n", "periment has occurred. For example, a frequency distribution could be used to\n", "record the number of times each word appears in a document:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# Obtain the words from Barack Obama’s 2009 speech\n", "obama_words = inaugural.words('2009-Obama.txt')\n", "# Construct a frequency distribution over the lowercased words in the document\n", "fd_obama_words = nltk.FreqDist(w.lower() for w in obama_words)\n", "# Find the top 50 most frequently used words in the speech\n", "fd_obama_words.most_common(50)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You can easily plot the top 50 words (note `%matplotlib inline` tells jupyter that it should embed plots in the output cell after you run the code. You only need to run it once per notebook, not in every cell with a plot. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "fd_obama_words.plot(50)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Find out how many times the words peace and america were used in the speech:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "print('peace:', fd_obama_words['peace'])\n", "print('america:', fd_obama_words['america'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 3\n", "\n", "Compare the top 50 most frequent words in Barack Obama’s 2009 speech with\n", "George Washington’s 1789 speech.\n", "What can knowing word frequencies tell us about different speeches at different\n", "times in history?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def ex3(doc_name, x):\n", " doc_words = inaugural.words(doc_name)\n", " \n", " # Construct a frequency distribution over the lowercased words in the document\n", " fd_doc_words = nltk.FreqDist(w.lower() for w in doc_words)\n", "\n", " # Find the top x most frequently used words in the document\n", " top_words = fd_doc_words.most_common(x)\n", "\n", " # Return the top x most frequently used words\n", " return top_words\n", "\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Now test your code\n", "print(\"Top 50 words for Obama's 2009 speech:\")\n", "result3a = ex3('2009-Obama.txt', 50)\n", "print(result3a)\n", "print(\"Top 50 words for Washington's 1789 speech:\")\n", "result3b = ex3('1789-Washington.txt', 50)\n", "print(result3b)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Language Models\n", "\n", "A statistical language model assigns a probability to a sequence of words, using\n", "a probability distribution. Language models have many applications in Natural\n", "Language Processing. For example, in speech recognition they may be used\n", "to predict the next word that a speaker will utter. In machine translation a\n", "language model may be used to score multiple candidate translations of an\n", "input sentence in order to find the most fluent/natural translation from the set\n", "of candidates.\n", "\n", "\n", "### Building a Language Model\n", "\n", "We provide you with an NgramModel module taken from an old version of NLTK. The\n", "initialisation method looks like this:\n", "\n", " def __init__(self, n, train, pad_left=False, pad_right=False,\n", " estimator=None, *estimator_args, **estimator_kwargs):\n", "\n", "Where:\n", "\n", "* n = order of the language model. 1=unigram; 2=bigram; 3=trigram, etc.\n", "* train = the training data (supplied as a list)\n", "* pad left and pad right = sentence initial and sentence final padding\n", "* estimator = method used to construct the probability distribution. May\n", "* or may not include smoothing. Arguments to the estimator are optional.\n", "\n", "\n", "### Exercise 4\n", "\n", "Use `NgramModel` to build a language model based on the text of Sense and\n", "Sensibility by Jane Austen. The language model should be a bigram model, and\n", "you can let it use the default `nltk.MLEProbDist` estimator.\n", "\n", "**Hint**, fill in the gaps with the information already provided in the code /\n", "comments.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Import NLTK's NgramModel module (for building language models)\n", "# It has been removed from standard NLTK, so we access it from a shared space\n", "try:\n", " from nltk_model import * # See the README inside the nltk_model folder for more information\n", "except ImportError:\n", " from .nltk_model import * # Compatibility depending on how this script was run\n", "from nltk.corpus import gutenberg" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Input: doc_name (string), n (int)\n", "# Output: lm (NgramModel language model)\n", "def ex4(doc_name, n):\n", " # Construct a list of lowercase words from the document\n", " words = [w.lower() for w in gutenberg.words(doc_name)]\n", "\n", " # Build the language model using the nltk.MLEProbDist estimator \n", " lm = NgramModel(n,words)\n", " \n", " # Return the language model (we'll use it in exercise 5)\n", " return lm" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Test your code for excercise 4\n", "result4 = ex4('austen-sense.txt',2)\n", "print(\"Sense and Sensibility bigram language model built\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Computing Probabilities\n", "\n", "Using the language model, we can work out the probability of a word given its\n", "context. In the case of the bigram language model built in Exercise 4, we have\n", "only one word of context. To obtain probabilities from a language model, use\n", "**NgramModel.prob**:\n", "\n", "`lm.prob(word, [context])`\n", "\n", "Where **word** and **context** are both unigram strings when working with a bigram\n", "language model. For higher order language models, context will be a list of\n", "unigram strings of length order-1.\n", "\n", "\n", "### Exercise 5\n", "\n", "Using the bigram language model built in Exercise 4, compute the following\n", "probabilities\n", "\n", "1. reason followed by for\n", "2. the followed by end\n", "3. end followed by the\n", "\n", "Now uncomment the test code and check your results\n", "The result for c above is perhaps not what you expected. Why do you think it\n", "happened?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Input: lm (NgramModel language model, from exercise 4), word (string), context (list)\n", "# Output: p (float)\n", "def ex5(lm,word,context,verbose=False):\n", " \n", " # Compute the probability for the word given the context\n", " p = lm.prob(word,context,verbose=verbose)\n", "\n", " # Return the probability\n", " return p" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Test your code for Exercise 5\n", "result5a = ex5(result4,'for',['reason'])\n", "print(\"Probability of \\'reason\\' followed by \\'for\\': %.3f\"%result5a)\n", "result5b = ex5(result4,'end',['the'])\n", "print(\"Probability of \\'the\\' followed by \\'end\\': %.5f\"%result5b)\n", "result5c = ex5(result4,'the',['end'])\n", "print(\"Probability of \\'end\\' followed by \\'the\\': %.1f\"%result5c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 6\n", "\n", "Update your definition of the `ex5` function to include a (boolean) `verbose`\n", "argument, which is passed through to `NgramModel.prob`. Use this to see if\n", "it gives any insight on the (end, the) bigram." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### Test your code for Exercise 6\n", "result6 = ex5(result4,'the',['end'],True)" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.1" } }, "nbformat": 4, "nbformat_minor": 2 }