Text Technologies Assessment 2

Victor Lavrenko and Dominik Wurzer

7th October, 2013

In this coursework you will develop a fast search engine for tweets.

What to turn in

  1. Please submit a paper copy of the report (tasks 5 and 6) to the ITO (Appleton Tower room 4.02).
    Note that only the first two pages of the report will be assessed: any material beyond this will be ignored.
  2. On a DICE machine, create a directory called tts2, and place the following files into it:
    • brute.py and brute.top -- your implementation and results for task 1
    • terms.py and terms.top -- your implementation and results for task 2
    • docs.py and docs.top -- your implementation and results for task 3
    • best.py and best.top -- your implementation and results for task 4
    • best.id -- the name of your best algorithm
    • report.pdf -- a PDF version of your report, including the plot (tasks 5 and 6)
    Once the files are in place, run the following DICE command: submit tts 2 tts2
Please make sure you name the files exactly as stated above and use the correct formats.
Please make sure your code runs correctly on a DICE machine and produces results in the correct format.
You may lose a significant fraction of the marks if you use wrong names or formats, or if we cannot run your code.

Both the paper report and the electronic submission are due 16:00 on Monday, 21st October, 2013. [late submissions]

Tasks

In this coursework, you will develop a search engine for Twitter. Search over a social network differs from web search, and your goal is to develop the fastest possible search algorithm by exploring a range of indexing and query-execution methods. To aid your development, you are provided with a training set of documents (tweets) and queries. Your algorithms will be tested on a different testing set of documents and queries (the training and testing sets are sufficiently similar). Please familiarize yourself with the Data and Formats section, and then complete the following tasks:
  1. Implement a brute-force algorithm for finding the matching documents:
       for each query Q:
         for each document D:
           if D contains all words in Q:
             add D to the matching set
         print 5 most recent docs in the matching set
    Your implementation (brute.py) should load the queries from qrys.txt and documents from docs.txt in the current directory. Tokenize on punctuation and lowercase the tokens, but do not perform any other pre-processing. For each query, print up to 5 most recent matches into a file called brute.top in the current directory.
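As an illustrative sketch of the matching loop above (the helper names and the tiny in-memory dataset are my own; your real brute.py must instead read qrys.txt and docs.txt and write brute.top), the brute-force strategy might look like this:

```python
import re

def tokenize(text):
    # Split on runs of non-alphanumeric characters, then lowercase.
    return [t.lower() for t in re.split(r'\W+', text) if t]

def brute_force(docs, queries, k=5):
    # docs, queries: lists of (id, text) pairs; docs are in
    # chronological order (the last one is the most recent).
    results = {}
    for qid, qtext in queries:
        qwords = set(tokenize(qtext))
        matches = [did for did, dtext in docs
                   if qwords <= set(tokenize(dtext))]
        results[qid] = matches[::-1][:k]  # up to k most recent first
    return results

docs = [(1, "taj hotel on fire"), (2, "paintings in the taj"),
        (3, "Fire at the Taj hotel, live video!")]
queries = [(1, "taj fire")]
print(brute_force(docs, queries))  # query 1 matches docs 3 and 1
```

The code is valid under the required Python 2.6 as well as later versions.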
  2. Implement the term-at-a-time retrieval algorithm. Your implementation (terms.py) should load and tokenize the documents and queries exactly as in task 1. Then build an inverted list for each word, and run each query against that index using the term-at-a-time execution strategy. Do not use any heuristics to speed up the algorithm. Print up to 5 most recent matches into a file called terms.top in the current directory.
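A minimal sketch of term-at-a-time execution (names and sample data are illustrative; terms.py must read the real files): postings are fetched one query term at a time, and a hit count is accumulated per document.

```python
import re
from collections import defaultdict

def tokenize(text):
    return [t.lower() for t in re.split(r'\W+', text) if t]

def build_index(docs):
    # Inverted index: word -> set of ids of documents containing it.
    index = defaultdict(set)
    for did, text in docs:
        for word in set(tokenize(text)):
            index[word].add(did)
    return index

def term_at_a_time(index, qtext, k=5):
    qwords = set(tokenize(qtext))
    counts = defaultdict(int)           # doc id -> number of query words hit
    for word in qwords:                 # one term at a time
        for did in index.get(word, ()):
            counts[did] += 1
    # A document matches only if it contains every query word.
    matches = sorted(d for d, c in counts.items() if c == len(qwords))
    return matches[::-1][:k]            # most recent (largest id) first

docs = [(1, "taj hotel on fire"), (2, "paintings in the taj"),
        (3, "fire at the taj hotel")]
print(term_at_a_time(build_index(docs), "taj fire"))  # [3, 1]
```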
  3. Implement the doc-at-a-time retrieval algorithm (docs.py). Tokenize and index the documents as in task 2. Run each query against the index using the document-at-a-time execution strategy. Use simple linear merge and do not use any heuristics to speed up the algorithm. Print up to 5 most recent matches into a file called docs.top in the current directory.
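A sketch of doc-at-a-time execution with a simple linear merge (illustrative names and data, as before): one cursor per postings list, advancing the cursor with the smallest head until all lists agree on a document.

```python
import re
from collections import defaultdict

def tokenize(text):
    return [t.lower() for t in re.split(r'\W+', text) if t]

def build_index(docs):
    # Postings lists stay sorted because docs arrive in increasing-id order.
    index = defaultdict(list)
    for did, text in docs:
        for word in set(tokenize(text)):
            index[word].append(did)
    return index

def doc_at_a_time(index, qtext, k=5):
    postings = [index.get(w, []) for w in set(tokenize(qtext))]
    if not postings or min(len(p) for p in postings) == 0:
        return []
    cursors = [0] * len(postings)       # one cursor per postings list
    matches = []
    while all(c < len(p) for c, p in zip(cursors, postings)):
        heads = [p[c] for c, p in zip(cursors, postings)]
        if len(set(heads)) == 1:        # every list points at the same doc
            matches.append(heads[0])
            cursors = [c + 1 for c in cursors]
        else:                           # advance the list with the smallest head
            cursors[heads.index(min(heads))] += 1
    return matches[::-1][:k]

docs = [(1, "taj hotel on fire"), (2, "paintings in the taj"),
        (3, "fire at the taj hotel")]
print(doc_at_a_time(build_index(docs), "taj fire"))  # [3, 1]
```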
  4. Try to improve the speed of your algorithms. You can use any methods and heuristics to improve the speed, as long as the output stays the same. Please obey the restrictions. Save your most successful algorithm as best.py. It should print 5 most recent matches into a file called best.top in the current directory.
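One common heuristic, shown purely as an illustration of the kind of change allowed (any approach that preserves the output and obeys the restrictions is acceptable), is to intersect postings sets starting from the rarest query term, so the candidate set shrinks as quickly as possible:

```python
import re
from collections import defaultdict

def tokenize(text):
    return [t.lower() for t in re.split(r'\W+', text) if t]

def build_index(docs):
    index = defaultdict(set)
    for did, text in docs:
        for word in set(tokenize(text)):
            index[word].add(did)
    return index

def smallest_first(index, qtext, k=5):
    qwords = set(tokenize(qtext))
    if not qwords or any(w not in index for w in qwords):
        return []                       # an unseen word can never match
    # Intersect from the shortest postings set outwards; stop early
    # if the intersection empties out.
    lists = sorted((index[w] for w in qwords), key=len)
    matching = set(lists[0])
    for postings in lists[1:]:
        matching &= postings
        if not matching:
            break
    return sorted(matching)[::-1][:k]   # most recent first

docs = [(1, "taj hotel on fire"), (2, "paintings in the taj"),
        (3, "fire at the taj hotel")]
print(smallest_first(build_index(docs), "taj fire"))  # [3, 1]
```

For matching queries the results are identical to the naive strategies; only the amount of work per query changes.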
  5. Compare the running time of your algorithms. Take N=100 first documents and measure how many seconds it takes your algorithms from tasks 1, 2 and 3 to process all queries. The time should include all stages of the algorithm (loading the data, tokenizing, indexing, query execution, printing of results, and anything else you do). Measure the runtime on a DICE machine (not on your personal computer) and use the following code:
       import time

       started = time.time()
       # all stages of your algorithm go here
       runtime = time.time() - started

    Repeat for N=100,200,500,1000,2000,5000,10000,20000. Plot the running time against N (use logarithmic scale). Your plot should contain four curves: brute, terms, docs and best. We suggest averaging the times over several runs for each value of N -- this will produce smooth curves. You can use any software to produce the plot (e.g. OpenOffice).
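A small harness for the suggested averaging might look like this (illustrative; the callable passed in must cover every stage, from loading the data to printing the results):

```python
import time

def average_runtime(run, repeats=3):
    # run: zero-argument callable executing all stages of the algorithm.
    total = 0.0
    for _ in range(repeats):
        started = time.time()
        run()
        total += time.time() - started
    return total / repeats
```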
  6. Write a report summarizing what you have learned from this assignment. Give full details for the methods you used in part 4. You are allowed a maximum of 2 pages for the report, but shorter reports are fine. Include the plot from task 5, and any other plots / tables if necessary (plots and tables do not count towards the 2-page limit). Submit a paper copy of the report to the ITO.

Restrictions

You must use Python 2.6 (default on DICE) as the programming language for this assignment.
  1. Please use only libraries and modules that are available on DICE.
  2. Do not use any modules or libraries that provide out-of-the-box indexing, retrieval or search functionality (this includes all database software).
  3. Do not use multi-processing of any kind (e.g. threading, OpenMP, GPU).
  4. Do not use just-in-time compilers, bytecode optimizers or native interfaces of any kind (e.g. PyPy, Psyco, ctypes, SWIG).
  5. The algorithms (building an index, matching queries to documents and any heuristics) should be clearly and identifiably your own work.
  6. You can reuse any of the code in the TTS lab exercises or coursework as long as you cite the origin.
  7. If you want to use a non-standard package -- please ask if it is allowed.
  8. If you consult external sources, please cite them (this includes talking to your friends).

Data and Formats

  1. docs.txt [download]: training documents.
    Each document is on a separate line. The first field is a unique document number, followed by the document content, all on the same line. For example:
        1 RT @Dohnanyi: Anyone who knows about Swiss nationals caught up in the terrorist attacks? pls reply to my e-mail ...
        2 @mumbai did they ever find the two cars?
    Documents are listed in chronological order (last one is the most recent).
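Parsing such a line is a single split on the first whitespace run; the helper name below is illustrative:

```python
def parse_line(line):
    # The first whitespace-separated field is the document (or query)
    # number; everything after it is the content.
    num, content = line.rstrip('\n').split(None, 1)
    return int(num), content

print(parse_line("2 @mumbai did they ever find the two cars?"))
# (2, '@mumbai did they ever find the two cars?')
```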
  2. qrys.txt [download] : training queries.
    Each query is given on a separate line. The first field is the query number, followed by the query itself, as follows:
        1 taj paintings
        50 live video stream the fire and fights taj hotel
  3. {brute,terms,docs,best}.top: search results -- you will create these files.
    Once you match the query to a set of documents, print the results in the following format:
        1 1098 1089
        2 1098 1089
        3 905 900 894
        4 14923 14915 14906 14872 14864
        5 15655 15649 15639
    This means that query 4 matched many documents and we are printing the 5 most recent ones. The other queries matched fewer than 5 documents.
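Producing one such line is just joining the query number with its matching doc numbers, most recent first (`format_result` is an illustrative helper, not a required name):

```python
def format_result(qid, doc_ids):
    # One output line: query id followed by up to 5 matching doc ids.
    return ' '.join(str(n) for n in [qid] + list(doc_ids))

print(format_result(4, [14923, 14915, 14906, 14872, 14864]))
# 4 14923 14915 14906 14872 14864
```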
  4. best.id: algorithm name -- you will create this file.
    Come up with a short and distinctive name for your best algorithm and save it in a file called best.id. The file should contain your chosen name for your algorithm on the first line. Optionally, you may provide a one-sentence description of your algorithm on the second line, for example:
        JohnEngine-7
        term-at-a-time with top-docs and priority processing

    Please do not use your full name or your matriculation number as the identifier and don't include any non-English characters.
  5. {brute,terms,docs,best}.py: your implementation of tasks 1, 2, 3, 4
    best.py:
    • should run when started with the following command: /usr/bin/python2.6 ./best.py
    • should read docs.txt and qrys.txt from the current directory
    • should write the results into best.top in the current directory
    • should not rely on any command-line arguments, environment variables or files not in the current directory
    • should run on a standard DICE machine with 2GB RAM and no network connection
    • should run for no more than 30 minutes and not spawn any background threads
    {brute,terms,docs}.py should write the results into {brute,terms,docs}.top respectively.

Marking

The assignment is worth 7.5% of your total course mark and will be scored out of 10 points.

Sanity checks

  1. The output for tasks 1, 2, 3 and 4 should be identical (for all queries).
  2. Correct output for the first 1000 queries is here.
  3. brute.py should take ~25 minutes for the full dataset.
  4. terms.py should take ~20 seconds for the full dataset.

