Text Technologies for Data Science Assessment 3

Dominik Wurzer

October 20, 2014

The prestigious news agency Alpine Dale pays its field reporters based on the number of stories they submit. The company suspects that some of its reporters plagiarize their submissions in order to increase their earnings. Since the volume of news stories is high, manual plagiarism detection is too labour-intensive and consequently deemed infeasible. Your consulting company has therefore been contracted to develop a plagiarism-detection tool able to discover various types of plagiarism.

What to turn in

The assessment is due at 16:00 on Monday, 3rd November 2014 (see the course's late-submission policy).
  1. Please submit a paper copy of the report (task 2) to the ITO (Appleton Tower room 4.02).
    Note that only the first two pages of the paper report will be assessed: any material beyond this will be ignored.
  2. On a DICE machine, create a directory called tts3, and place the following files into it:
    • detector.py The source code of your plagiarism detector 
    • any files that are necessary to run your implementation (10MB size limit)
    • type1.dup, type2.dup, and type3.dup -- Plain text files with your results
    • report.pdf -- a PDF version of your report
    Once the files are in place, run the following DICE command: submit ttsds 3 tts3
Please make sure you name the files exactly as stated above and use the correct formats.
Please make sure your code runs correctly on a DICE machine and produces results in the correct format.
You may lose a significant fraction of the marks if you use wrong names, formats, or if we cannot run your code.

Tasks

  1. Have a look at the training set and identify the kinds of plagiarism Alpine Dale denotes as type 1 and type 2.
  2. Implement a detection tool able to spot plagiarism of type 1. Check the correctness of your implementation by comparing it to the provided truth (type1.truth) on the training data set.
  3. Implement or extend your detection tool to also handle plagiarism of type 2. Use the truth (type2.truth) on the training data set to tune your algorithm.
  4. Run your plagiarism detector on the test dataset (data.test) for 30 minutes and save all detected cases of plagiarism in the appropriate output files. It is your responsibility to ensure that the output is stored even if the program is terminated after 30 minutes.
  5. Adapt Finn's method (set c = 100) to extract all areas with a high density of numbers from the provided Finn data set (data.finn). Then investigate those areas for possible plagiarism of type 1 or 2. Note that any case of plagiarism will consist of at least a couple of tokens.
  6. Provide a detailed overview of the decisions you made in designing your plagiarism-detection tool. Explain your thought process and outline the conclusions you drew. In particular, explain what kinds of plagiarism you detected and how. Please note:
    • The report should not be a step-by-step walkthrough of your code
    • Don't waste space explaining implementation details like program structure, classes, functions, etc.
    • Describe what you tried, why you tried it, how it improved results, etc.
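As an illustration of how a first detector for task 2 might look, and assuming purely for the sake of this sketch that type 1 plagiarism means verbatim duplication up to trivial formatting differences (the actual definition must be inferred from the training set), exact duplicates can be found cheaply by hashing a normalized form of each document:

```python
import hashlib

def normalize(text):
    # Lowercase and collapse whitespace. This treats type 1 as verbatim
    # duplication up to trivial formatting -- an assumption for this
    # sketch, not the official definition.
    return " ".join(text.lower().split())

def exact_duplicate_pairs(docs):
    # docs: dict mapping document id -> raw text.
    # Returns sorted (low_id, high_id) pairs whose normalized texts match.
    by_digest = {}
    pairs = set()
    for doc_id, text in docs.items():
        digest = hashlib.md5(normalize(text).encode("utf-8")).hexdigest()
        for other in by_digest.get(digest, []):
            pairs.add((min(doc_id, other), max(doc_id, other)))
        by_digest.setdefault(digest, []).append(doc_id)
    return sorted(pairs)
```

Hashing keeps the comparison linear in the number of documents, which matters under the 30-minute limit; whether this normalization is the right one depends on what the training set actually shows.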

Data

The news agency aims to detect three kinds of plagiarism, which it creatively denotes as type 1, type 2 and type 3. The company has collected examples of type 1 and type 2 plagiarism, spotted by human proofreaders in the past. The agency further suspects that interesting sections are more likely to be subject to plagiarism. In particular, it assumes that sections containing many numbers are likely to be plagiarised, as these require more research effort. This kind of plagiarism is denoted type 3.


Alpine Dale has provided you with a training set containing examples of type 1 and type 2 plagiarism. There are no examples of type 3 plagiarism, however. Please note that perfect accuracy on this set does not guarantee the same on future data; this depends on the methods and parameters you use.


Additionally, you are provided with data sets containing samples from Alpine Dale's news stream. These batches are suspected to contain all three kinds of plagiarism.

Output Format

For tasks 4 and 5 above, provide three text files containing the cases of plagiarism you found, in the following format:

107 435
95 573
562 3711

This would mean that file 107 is a plagiarised copy of file 435, and that the pairs 95, 573 and 562, 3711 are likewise plagiarised. Put exactly one pair on each line, with the two numbers separated by a single space (and nothing else). Within each pair, sort the documents in ascending order, so that the left number is lower than the right. Do not list the same pair twice. Please stick to this format and do not add anything else to the file. If we cannot process your submission, you may lose a significant fraction of the marks.

Put all detected cases of plagiarism of type 1 in a file called type1.dup
Put all detected cases of plagiarism of type 2 in a file called type2.dup
Put all detected cases of plagiarism of type 3 in a file called type3.dup

If a detected pair fits more than one type of plagiarism, list it in all relevant files. Also, please make sure that running your submitted source code will output the exact same .dup files you submitted, provided it has sufficient time and access to the data you are working on (please do not submit the data set).

Marking

The assignment is worth 7.5% of your total course mark and will be scored out of 10 points as follows:

1 point for identifying all cases of type 1 plagiarism (task 4).
3 points for identifying all cases of type 2 plagiarism (task 4).
3 points for identifying all cases of type 3 plagiarism (task 5).
3 points for a clear and detailed report addressing the questions above.

Note that you will be marked on both the recall and the precision of the cases of plagiarism you detect within the 30-minute limit.

Rules

You must use Python 2.6 (the default on DICE) as the programming language for this assignment. You may use other libraries and packages subject to the following conditions:
  1. You should not use any software packages or libraries that provide out-of-the-box duplicate/plagiarism detection.
  2. The core logic of the algorithm should be your own work.
  3. You are free to use any libraries or external programs that perform generic tasks such as string operations, hashtables, etc.
  4. Do not use parallel processing of any kind: no threading, multi-processing or vector/GPU code.
  5. Do not use just-in-time compilers, bytecode optimizers or native interfaces of any kind (e.g. PyPy, Psyco, ctypes, SWIG).
  6. Your code will be run as follows: /usr/bin/python2.6 detector.py
  7. Your code should not rely on any command-line arguments, environment variables or files not in the current directory.
  8. Your code should read the stories from data.test and data.finn in the current directory.
  9. Your code should write the results into type1.dup, type2.dup and type3.dup in the current directory.
  10. Your code should run on a standard DICE machine with 2GB RAM and no network connection.
  11. Your code will be terminated after 30 minutes of running. It is your responsibility to make sure the output is saved to type[1,2,3].dup
  12. If you consult external sources, previous assignments or labs, please cite them (this also includes personal communication).
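Because the process is terminated hard at the 30-minute mark (rule 11), one defensive pattern is to schedule your own alarm slightly under the limit and flush results when it fires. A sketch, assuming detected pairs are accumulated in memory (`results` and `save_all` are illustrative names, not requirements):

```python
import signal
import sys

TIME_LIMIT = 29 * 60  # seconds; a safety margin under the 30-minute cap

# Detected (low, high) pairs, accumulated while the detector runs.
results = {"type1": set(), "type2": set(), "type3": set()}

def save_all():
    # Flush whatever has been found so far into the required .dup files.
    for name in ("type1", "type2", "type3"):
        f = open(name + ".dup", "w")
        for low, high in sorted(results[name]):
            f.write("%d %d\n" % (low, high))
        f.close()

def on_alarm(signum, frame):
    save_all()   # write partial results before exiting
    sys.exit(0)

signal.signal(signal.SIGALRM, on_alarm)
signal.alarm(TIME_LIMIT)

# ... run the detector here, adding pairs to results ...

save_all()  # normal completion also writes the files
```

An alternative is to write each pair to disk as soon as it is found; either way, partial results should survive the cutoff rather than being lost in an unwritten buffer.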
Please ask questions on the discussion forum.
