In this Lab, we will build a proximity index and find out what we could do with it.
Note please that this lab can only be done on a DICE machine.
Before we start building an index, we need some data. In this lab, we will use a subset of the British National Corpus (BNC). This is available on DICE, so all you have to do in python is execute from Lab4Support import * and then bnc = readBNC(). This reads and cleans the texts for you. Note that for the purpose of this lab, we will treat the corpus as one big document. Tokenize the text (tokens = re.split('\s+', bnc)) and get an idea of the amount of tokens (len(tokens)) and the amount of terms (len(set(tokens))) in the data.
The aim of this lab is to find positions in the text, where two given words are within a certain number of tokens from each other. A naive way of computing this would be a sliding window over the text, checking for the two terms in each subset. This is implemented in the function window(term1, term2,n,tokens). It will return the number of two words coocurring within a sliding window of n tokens. Try a few examples for n = 5 and imagine how slow this would be on a larger corpus - such as the web.
To remedy this, we should create a proximity index. This is a dictionary, which has the terms as keys and their positions in the text (i.e. indices in the token list) as values. In Lab4.py, write a function that builds such an index. You might find the dictionary standard function setdefault(token, []) helpful for this task. The way it works is described here. Alternatively, you can implement this pseudo-code:
set position to zero
for each token in text:
if we have not seen this term before:
create an empty list for it and store it in a dictionary
append position to the end of the list for this term
increment position
Once you have the index, you can play with it a little. If everything went well, len(proximityIndex) == len(set(tokens)). You should get these values (note that the corpus was transformed to lower case earlier):
len(proximityIndex['the']) = 895.144Now that we have the proximity index, one thing we could do is to calculate a proximity score between two words. This is the relative amount of occurrences of two words in close proximity to each other. This means that words, which frequently occur together, will have a high score, whereas unrelated words will not.
This requires to go through the two corresponding lists of positions using linear merge. To help you with this task, the function linearMerge(proximityIndex, term1, term2, n) implements a linear merge. However, there are two conditions missing (see code comments). Your job is to "fill in the gaps".
You can now play with this. Use the function proximityScore(proximityIndex, term1, term2, n) to compute proximity scores. Note that proximityScore(proximityIndex, 'edinburgh', 'university', 10) will output the percentage of matches relative to all occurrences of 'edinburgh', whereas proximityScore(proximityIndex, 'university', 'edinburgh', 10) will be based on all the occurrences of 'university'). So you could compare how often Edinburgh and Glasgow are mentioned in the context of universities by comparing:
proximityScore(proximityIndex, 'university', 'edinburgh', 5)Or you can check whether people talk more about the university or the festival, when they talk about Edinburgh:
proximityScore(proximityIndex, 'edinburgh', 'university', 5)If you still have time, there are some extensions you could implement. Firstly, you could change the linear merge in a way that it assigns a higher score if the two words are closer together, up to the maximal distance of n (which would receive the lowest score). Right now, any match within n will increase the matches variable by 1. Also, you could implement a function that takes a word A and finds the word B with the highest proximity score (or a top 10 maybe). You will want to get rid of very common words ('the', 'of', 'a', etc.) before you do this.
You may find these functions helpful for the tasks above. Feel free to adapt them, if you like.
readBNC()
|
Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk Please contact our webadmin with any comments or corrections. Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh |