# Text Technologies for Data Science: Assessment 2

### Victor Lavrenko

#### 6th October, 2014

The year is 2001, and you work for an ambitious startup called Google. While riding in an elevator with the CEO, you propose to build a system that will organise the world's news. The CEO is fascinated, you are given access to a news crawl, and set out to build a scalable architecture for organising them. Your idea is based on linking together pairs of highly-similar news stories, and then forming clusters based on linked pairs. You decide to use the following algorithm for finding the pairs:

Algorithm:
for each story A in news.txt:
find the most similar story B that came before A
if similarity(A,B) > threshold:
output pair: A,B

You also make the following initial design decisions (required for tasks 1 and 2):

1. tokenise each story by splitting on whitespace
2. lowercase the tokens, but do not stem them, and do not remove stopwords
3. use tf-idf weighted cosine for measuring the similarity of two stories
4. do not squash term frequencies: use tf instead of tf/(tf+K)
5. load idf values from news.idf, set idf=13.6332 for any word not in news.idf
6. set the threshold to 0.2 for tasks 1 and 2
7. when similarity(A,B1) = similarity(A,B2) pick the earlier story as a pair

Please familiarize yourself with the Data and Formats section, and then complete the following tasks:

1. Implement a brute-force version of the Algorithm, without any indexing.
Your code for step 2 of the Algorithm should look something like this:
B = 1
for i = 2..A-1:
if cosine(A,i) > cosine(A,B):
B = i
Your implementation (brute.py) should read the news stories line-by-line from news.txt in the current directory and write the results to pairs.out in the current directory. Stop after processing 10,000 stories.
2. Implement an indexed version of the Algorithm. Your code should maintain an in-memory index of all stories you have already seen. When you load a new story A, run it as a query against your index. Use cosine scores and term-at-a-time execution. After finding the most-similar story B, add story A to the index. Your implementation (index.py) should produce exactly the same results as you got in task 1. Stop after processing 10,000 stories.
3. Try to improve the speed of the Algorithm. You can use any methods and heuristics, and can change any of the initial design decisions, as long as 90% of the pairs remain the same as in tasks 1 and 2. Try to process as many stories as you can in 30 minutes. Save your most successful algorithm as best.py
4. Compare the running time of your algorithms. Take N=100 first stories and measure how many seconds it takes for your algorithms from tasks 1, 2 and 3. The time should include all stages of the algorithm (loading news.idf, reading the stories, tokenizing, indexing, query execution, printing of results, and anything else you do). Measure the runtime on a DICE machine using the following command:
time /usr/bin/python2.6 ./brute.py
Use the real part of the output. Repeat for N=100,200,500,1000,2000,5000,10000. Plot the running time against N (use log-scale). Your plot should contain three curves: brute, index, and best. You can use any software to produce the plot (e.g. OpenOffice).
5. Write a report summarizing what you have learned from this assignment. Give full details for everything you did in part 3. You are allowed a maximum of 2 pages for the report, but shorter reports are fine. Include the plot from task 4. Submit a paper copy of the report to the ITO. Please note:
• The report should not be a step-by-step walkthrough of your code
• Don't waste space explaining program structure, classes and functions
• Describe what you tried, why you tried it, how it improved results, etc.
• Here's an example of a good report (for previous coursework)

### Data and Formats

Each story is on a separate line. The first field is a unique story number, followed by the story content, all on the same line. For example:
1 An endowed chair has been named at a Northern Ireland university program for Thomas ...
2 Bosnian Serbs smarting under recent battlefield gains by their Muslim and Croat ...
Stories are listed in chronological order.
Use this for assigning tf-idf weights to all words.
Each line contains the idf value followed by a word, separated by a space:
0.2276 the
0.4428 a
Set idf=13.6332 for any word not in this file.
Use these to check that your algorithm is correct.
4. pairs.out: results of your best algorithm -- you will create this file
This file should contain the pairs for as many stories as you can process in 30 minutes. The format should be exactly the same as for pairs.ref:
7 5
8 2
This means that story 5 was the most similar story to story 7, and the cosine similarity was greater than 0.2. Stories 1..6 did not have a sufficiently-similar story before them.
5. best.id: algorithm name -- you will create this file.
Come up with a short and distinctive name for your best algorithm and save it in a file called best.id. The file should contain your chosen name for your algorithm on the first line. Optionally, you may provide a one-sentence description of your algorithm on the second line, for example:
JohnEngine-7
doc-at-a-time with skip-pointers and early stopping
(optional)

Please do not use your full name or your matriculation number as the identifier and don't include any non-English characters.
Your code must obey the restrictions below.

### Restrictions

You must use Python 2.6 (default on DICE) as the programming language for this assignment.
1. Your code should have no import statements. Only built-in modules are allowed (math.sqrt(x) = pow(x,0.5)).
2. Do not use parallel processing of any kind: no threading, multi-processing or vector/GPU code.
3. Do not use just-in-time compilers, bytecode optimizers or native interfaces of any kind (e.g. PyPy, Psyco, ctypes, SWIG).
4. All parts of your code should be clearly and identifiably your own work.
5. You can reuse the code you have written for TTSDS coursework 1, or the TTSDS lab exercises, as long as you clearly cite the origin.
6. If you consult external sources, please cite them (this includes talking to your friends).
7. Your code will be run as follows: /usr/bin/python2.6 best.py
8. Your code should read the stories from news.txt in the current directory.
9. Your code should write the results into pairs.out in the current directory.
10. Your code should process stories one-by-one: write out and flush() the result for story K before loading story K+1.
11. Your code should not rely on any command-line arguments, environment variables or files not in the current directory.
12. Your code should run on a standard DICE machine with 2GB RAM and no network connection
13. Your code will be terminated after 30 minutes of running. It is your responsibility to make sure the output is saved to pairs.out

### What to turn in

1. Please submit a paper copy of the report (task 5) to the ITO (Appleton Tower room 4.02).
Note that only the first two pages of the report will be assessed: any material beyond this will be ignored.
2. On a DICE machine, create a directory called tts2, and place the following files into it:
• best.id: the name of your best algorithm
• pairs.out: results of your best implementation (as many pairs as you can find in 30 minutes)
Once the files are in place, run the following DICE command: submit ttsds 2 tts2
Please make sure you name the files exactly as stated above and use the correct formats.
Please make sure your code runs correctly on a DICE machine and produces results in the correct format.
You may lose a significant fraction of the marks if you use wrong names, formats, or if we cannot run your code.

Both the paper report and the electronic submission are due 16:00 on Monday, 20th October, 2014. [late submissions]

### Marking

The assignment is worth 7.5% of your total course mark and will be scored out of 10 points as follows:
• 5 points for correct implementation of the brute-force and index-based algorithms (tasks 1 and 2)
• 3 points for a clear and detailed report with a plot (tasks 4 and 5)
• 2 points for successfully improving the speed of your search algorithm (task 3). Marks will be based on the creativity of your approach and on the number of stories it can process within 30 minutes. Your code will be tested on a different set of news stories (which are similar to news.txt in all aspects).

### Sanity checks

1. The output for tasks 1 and 2 should be identical.
2. Correct output for the first 1000 stories is here.
3. cosine(Story8,Story2) = 0.5883
4. brute.py should not take more than 30 minutes for 10,000 stories
5. index.py should be at least 4 times faster