Text Technologies for Data Science: Assessment 1

Victor Lavrenko and Dominik Wurzer

22nd September, 2014

The goal of this assessment is to develop a search engine for scientific documents.

What to turn in

  1. Please submit a paper copy of the report (task 6) to the ITO (Appleton Tower room 4.02).
    Note that only the first two pages of the paper report will be assessed: any material beyond this will be ignored.
  2. On a DICE machine, create a directory called tts1, and place the following files into it:
    • overlap.py and overlap.top -- your implementation and results for task 1
    • tfidf.py and tfidf.top -- your implementation and results for task 2
    • best.py and best.top -- your implementation and results for task 3
    • test.top -- your results results for task 5
    • best.id -- the name of your best algorithm
    • any files that are necessary to run your implementation (10MB size limit)
    • report.pdf -- a PDF version of your report
    When you're ready to submit, run the following command: submit ttsds 1 tts1
Please make sure you name the files exactly as stated above and use the correct formats.
Please make sure your code runs correctly on a DICE machine and produces results in the correct format.
You may lose a significant fraction of the marks if you use wrong names, formats, or if we cannot run your code.
To check your submission for errors, download a1check, and run: csh a1check tts1
Please check the Q/A section below before asking questions.

Both the paper report and the electronic submission are due 16:00 on Monday, 6th October, 2014. [late submissions]


You are dealing with a collection of scientific documents. Each document consists of a title, date, authors, abstract, keywords and references, all in plain-text form [example]. Some elements may be missing in some of the documents.


Your goal is to develop a search engine for retrieving relevant documents in response to narrative queries. To aid in the development, you are provided with a set of 32 training queries, a set of relevance judgments, and software for evaluating the accuracy of your algorithms. Your algorithms will be tested on a different set of testing queries. Please familiarize yourself with the Data and Formats section, and then complete the following tasks:
  1. Implement a simple word overlap retrieval algorithm (see lecture 3, slide 6). Your implementation (overlap.py) should read the query file (qrys.txt) and the document file (docs.txt) from the current directory. Tokenize on punctuation and lowercase the tokens, but don't do anything else to improve performance. For each query Q and for each document D compute the overlap score between Q and D. Output all scores into a file called overlap.top in the current directory.
  2. Implement a tf.idf retrieval algorithm (tfidf.py), based on the weighted sum formula with tf.idf weighting (lecture 3). Set k=2. All relevant statistics should be computed from docs.txt. Tokenize exactly as in task 1. Save the results in a file called tfidf.top.
  3. Try to improve the accuracy of your retrieval algorithms. You are free to use any techniques that you think may improve retrieval effectiveness, subject to the restrictions. Use average precision (see task 4) to judge the accuracy of your approach. Save your most successful algorithm as best.py. Output the results to a file called best.top. Come up with a short distinctive name for your algorithm and save it in a file called best.id. This name will be used when we report the effectiveness of your algorithm.
  4. Evaluate the performance of your algorithms, using the provided trec_eval program. You can do this on a Linux command line by running:
         trec_eval -o -c -M1000 truth.rel overlap.top
         trec_eval -o -c -M1000 truth.rel tfidf.top
         trec_eval -o -c -M1000 truth.rel best.top
    The program will print a number of standard effectiveness measures for your algorithm (you'll learn about these in later lectures). Use average precision as your primary measure.
  5. Run your most successful algorithm (best.py) on the testing queries in test.txt. Output the results to a file called test.top. Average precision (MAP) on these testing queries will be used as the main criterion for marking this task. You cannot compute MAP yourself (we are not providing relevance judgements for the test set), but assume that MAP for the training queries will be representative of the MAP for the testing queries.
  6. Write a report outlining the decisions you made in implementing the search algorithms, as well as your comments on the relative performance of the techniques you have tried. Provide specific details on the methods and their parameters. Include formulas, tables and graphs if necessary. You are allowed a maximum of 2 pages for the report, but shorter reports are fine. Submit a paper copy of the report to the ITO. Please note:
    • The report should not be a step-by-step walkthrough of your code
    • Don't waste space explaining program structure, classes and functions
    • Spend most of your report describing task 3, not tasks 1 and 2.
    • Describe what you tried, why you tried it, how it improved results, etc.
    • Here's an example of a good report (note: it was based on a very different dataset)


You must use Python 2.6 (default on DICE) as the programming language for this assignment. If you don't know Python, please consider the following tutorials: [part1], [part2] . You can use other libraries and packages subject to the following conditions:
  1. Do not use any software packages or libraries that provide out-of-the-box indexing, or retrieval/search functionality.
  2. Do not use any network services. This includes online APIs, spell-checkers, etc.
  3. The core logic of the retrieval algorithm (matching queries to documents) should be clearly and identifiably your own work.
  4. You can reuse any of the code in the TTS lab exercises as long as you cite the origin.
  5. You are free to use any libraries or external programs that perform generic tasks such as reading / writing of files, manipulation of strings, lists, vectors, matrices, hash-tables and priority queues, as well as sorting and merging methods. Please cite the libraries you use.
  6. If you are uncertain as to whether a package qualifies as "generic" or not -- please ask.
  7. If you consult external sources, please cite them (this includes personal communication).

Data and Formats

  1. docs.txt [download]: documents.
    Each document is on a separate line. The first field is a unique document number, followed by the document content, all on the same line. For example:
        1 Chebyshev Interpolation and Quadrature Formulas of Very High Degree ...
        2 Generation of Hilbert Derived Test Matrix (Algorithm 274 [F1]) ...
    There are 3204 documents.
  2. qrys.txt [download] : training queries.
    test.txt [download] : testing queries.
    Each query is given on a separate line. The first field is the query number, followed by the query itself, as follows:
        6 SETL, Very High Level Languages
        28 Anything dealing with star height of regular languages or regular expressions or ...
    There are 32 training queries and 32 testing queries.
  3. truth.rel [download]: relevance judgments for the training queries. You do not need to process this file within your code.
    Specifies which documents are relevant to each query. First field is the query number, third is the document number, second and fourth must be set to 0 and 1 respectively. This file is used by trec_eval.
  4. {overlap,tfidf,best,test}.top: search results -- you will create these files.
    Once you compare the query to a set of documents, print the results in the following format:
        1 0 1009 0 0.1234 0
        1 0 1010 0 0.5678 0
        2 0 1003 0 0.9876 0
    This means that for query "1" you retrieved two documents with numbers "1009" and "1010". Document "1010" is a better match to the query than "1009" (similarity 0.5678 vs. 0.1234). For query "2" you retrieved one document ("1003"). Second, fourth and sixth fields should be printed as "0". Lines can be printed in any order (there is no need to sort).
  5. best.id: algorithm name -- you will create this file.
    Come up with a short and distinctive name for your best algorithm and save it in a file called best.id. The file should contain your chosen name for your algorithm on the first line. Optionally, you may provide a one-sentence description of your algorithm on the second line, for example:
        tf.idf with Porter and WordNet

    Please do not use your full name or your matriculation number as the identifier and don't include any non-English characters.
  6. {overlap,tfidf,best}.py: your implementation of tasks 1, 2, 3
    • should start using the following command: /usr/bin/python2.6 ./best.py
    • should read docs.txt and qrys.txt from the current directory
    • should write the results into best.top in the current directory
    • should not rely on any command-line arguments, environment variables or files not in the current directory
    • should run on a standard DICE machine with 2GB RAM and no network connection
    • should run for no more than 30 minutes and not spawn any background threads
    overlap.py should write the results into overlap.top; tfidf.py to tfidf.top.
  7. trec_eval [download]: evaluation program.
    This is a Linux binary that you will use to evaluate the accuracy of your system. It should run on any DICE machine. After downloading the file, make sure it's executable:
        chmod a+x trec_eval
    If you are using Windows or a Mac, you have two options: either perform evaluation on a Linux machine (e.g. in the DICE lab), or re-compile the program from the source [download]. Unfortunately we cannot help you compile trec_eval.


The assignment is worth 7.5% of your total course mark and will be scored out of 10 points as follows:

Sanity check: correct implementation of task 2 should yield mean average precision of 0.32 +/- 0.02. No other targets will be provided.

Questions and Answers

  1. Q: For task 3, can I implement any other relevant documents retrieval algorithm or use any other possible formula? Or do I have to perform some magic with given weighted sum formula with tf-idf weighting and obtain better performance?
    A: You can implement anything you want. Keep in mind that your implementation shouldn't be too specifically tailored to the training queries.
  2. Q: I get the following error: bash: trec_eval: command not found
    A: First, make the file executable: chmod a+x ./trec_eval
    Second, specify the path: ./trec_eval -o -c -M1000 truth.rel ...
  3. Q: I get this: trec_eval: input error: in trec_eval: 'Malformed qrels line' Illegal parameter value - Quit
    A: This happens if the file truth.rel does not exist, or is not in the current directory, or is corrupted. Try running the command md5sum truth.rel, you should see something like:    7dda22e1eee9c296951ee4cff513e6a4   truth.rel
  4. Q: I get this: trec_eval: input error: in trec_eval: 'malformed top results line' Illegal parameter value - Quit
    A: You produced the .top file in the wrong format. See the search results section.
  5. Q: I get no output from trec_eval
    A: The relevance judgments don't match the .top file.
  6. Q: Do we have to normalize the similarity so as to be between 0 and 1? (for example dividing by 100).
    A: No, you don't have to normalize the scores in any way. All that matters is how the documents get ranked relative to each other.
  7. Q: I get this: trec_eval: input error: in trec_eval: 'malformed top results line' Illegal parameter value - Quit
    I eyeballed by .top file and it looks ok...
    A: Try the following commands, none of them should produce any output:
       gawk 'NF != 6' FILE.top
       gawk '$2 || $4 || $6' FILE.top
       gawk '($1+0)*($3+0)*($5+0) == 0' FILE.top
  8. Q: I get trec_eval: form_tr_vec error: in trec_eval: 'Duplicate top docs docno' Illegal parameter value - Quit
    A: This means you're retrieving the same document twice for the same query. Check if some document number appears more than once in the .top file.
Ask questions about this coursework here

Home : Teaching : Courses : Tts 

Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk
Please contact our webadmin with any comments or corrections. Logging and Cookies
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh