TTS: Lab 4 (Proximity Indexing)

In this Lab, we will build a proximity index and find out what we could do with it.

Note please that this lab can only be done on a DICE machine.

Python

  1. On a DICE machine, create a Folder called TTSLab4 and create a new file called Lab4.py
  2. Download Lab4Support.py and save it in TTSLab4. This contains some helpful functions. There is a list of the functions you can use at the bottom of this page. In Lab4.py, you need to import the module (write from Lab4Support import * in the first line).
  3. If you need help with Python commands, look here or use Google for a quick fix.

Building a Proximity Index

Before we start building an index, we need some data. In this lab, we will use a subset of the British National Corpus (BNC). This is available on DICE, so all you have to do in python is execute from Lab4Support import * and then bnc = readBNC(). This reads and cleans the texts for you. Note that for the purpose of this lab, we will treat the corpus as one big document. Tokenize the text (tokens = re.split('\s+', bnc)) and get an idea of the amount of tokens (len(tokens)) and the amount of terms (len(set(tokens))) in the data.

The aim of this lab is to find positions in the text, where two given words are within a certain number of tokens from each other. A naive way of computing this would be a sliding window over the text, checking for the two terms in each subset. This is implemented in the function window(term1, term2,n,tokens). It will return the number of two words coocurring within a sliding window of n tokens. Try a few examples for n = 5 and imagine how slow this would be on a larger corpus - such as the web.

To remedy this, we should create a proximity index. This is a dictionary, which has the terms as keys and their positions in the text (i.e. indices in the token list) as values. In Lab4.py, write a function that builds such an index. You might find the dictionary standard function setdefault(token, []) helpful for this task. The way it works is described here. Alternatively, you can implement this pseudo-code:

set position to zero
for each token in text:
  if we have not seen this term before:
    create an empty list for it and store it in a dictionary
  append position to the end of the list for this term
  increment position

Once you have the index, you can play with it a little. If everything went well, len(proximityIndex) == len(set(tokens)). You should get these values (note that the corpus was transformed to lower case earlier):

len(proximityIndex['the']) = 895.144
len(proximityIndex['university']) = 2.092
len(proximityIndex['of']) = 443.461
len(proximityIndex['edinburgh']) = 524
len(proximityIndex['school']) = 5.550
len(proximityIndex['of']) = 443.461
len(proximityIndex['informatics']) = 1

Linear Merge

Now that we have the proximity index, one thing we could do is to calculate a proximity score between two words. This is the relative amount of occurrences of two words in close proximity to each other. This means that words, which frequently occur together, will have a high score, whereas unrelated words will not.

This requires to go through the two corresponding lists of positions using linear merge. To help you with this task, the function linearMerge(proximityIndex, term1, term2, n) implements a linear merge. However, there are two conditions missing (see code comments). Your job is to "fill in the gaps".

You can now play with this. Use the function proximityScore(proximityIndex, term1, term2, n) to compute proximity scores. Note that proximityScore(proximityIndex, 'edinburgh', 'university', 10) will output the percentage of matches relative to all occurrences of 'edinburgh', whereas proximityScore(proximityIndex, 'university', 'edinburgh', 10) will be based on all the occurrences of 'university'). So you could compare how often Edinburgh and Glasgow are mentioned in the context of universities by comparing:

proximityScore(proximityIndex, 'university', 'edinburgh', 5)
proximityScore(proximityIndex, 'university', 'glasgow', 5)

Or you can check whether people talk more about the university or the festival, when they talk about Edinburgh:

proximityScore(proximityIndex, 'edinburgh', 'university', 5)
proximityScore(proximityIndex, 'edinburgh', 'festival', 5)

Extensions

If you still have time, there are some extensions you could implement. Firstly, you could change the linear merge in a way that it assigns a higher score if the two words are closer together, up to the maximal distance of n (which would receive the lowest score). Right now, any match within n will increase the matches variable by 1. Also, you could implement a function that takes a word A and finds the word B with the highest proximity score (or a top 10 maybe). You will want to get rid of very common words ('the', 'of', 'a', etc.) before you do this.

Helpful functions in Lab4Support.py

You may find these functions helpful for the tasks above. Feel free to adapt them, if you like.

readBNC()
Reads, cleans and tokenizes a subset of the British National Corpus.

window(term1, term2, n, tokens)
Takes two words and a maximum distance parameter n, as well as a list of tokens. Returns the number of coocurrences for the two words within n tokens.

proximityScore(proximityIndex, term1, term2, n)
Takes a proximity index, two words and a maximum distance parameter n. Returns the proximity score for the two words based on the index.

saveToFile(data, filename)
Takes data in the form of a string, list, set or dictionary and saves it in the specified file.


Lab prepared by Philipp Petrenz, October 2010.


Home : Teaching : Courses : Tts 

Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk
Please contact our webadmin with any comments or corrections.
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh