Based on previous versions by Michael Gutmann, Charles Sutton and Stefanos Angelidis.

In the project, you will use data science methods in a realistic setting. We have a list of potential projects and corresponding datasets (same as for the IRDS course). For each dataset, the web page gives a description of the task to be undertaken. If you wish to propose your own project, feel free to contact the TA. You will produce a project report that will be assessed.


You will have considerable freedom in the projects. However, each project should involve most parts of the data analysis process described in the first lecture. An example project involves

  • reading up on relevant background to understand the task well and to learn what has been done previously (via Google Scholar or an internet search; in some cases references are provided)
  • some exploratory data analysis
  • if classification is the goal, choosing some methods that might work well on the task, based on the first two steps
  • evaluating the results of the different methods on the task (e.g. by assessing the generalisation performance).
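
As a rough illustration of the last step, generalisation performance can be estimated with k-fold cross-validation. The sketch below is only one possible approach, written with the Python standard library. The helper names (k_fold_indices, cross_val_accuracy, fit_majority, predict_majority) and the toy data are assumptions made for this example and are not part of any project. The majority-class predictor is only a baseline against which your own methods could be compared.

```python
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    """Split the indices 0..n-1 into k disjoint, shuffled folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(X, y, fit, predict, k=5, seed=0):
    """Mean accuracy over k held-out folds."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        # Fit on the training part only, then score on the held-out fold.
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / len(scores)

# Hypothetical baseline: always predict the most frequent training label.
def fit_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

# Toy data (made up for illustration): 80 examples of "a", 20 of "b".
X = list(range(100))
y = ["a"] * 80 + ["b"] * 20
acc = cross_val_accuracy(X, y, fit_majority, predict_majority, k=5)
print(f"baseline accuracy: {acc:.2f}")
```

On this toy data the baseline accuracy equals the share of the majority class. A method that does not clearly beat such a baseline on held-out data has probably not learned much.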

You do not need to outperform previous methods. What is important is that the path taken is reasonable, methodologically correct, and clearly described in the report. Good projects would nonetheless discuss possible differences in performance.

Some of the data sets may be too large to be used directly in the software that you have. In such cases, you are allowed to appropriately sub-sample the data set.
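
One way to sub-sample appropriately is to draw the same fraction from every class, so that the class proportions of the full data set are roughly preserved. The following is a minimal sketch under that assumption, using only the Python standard library. The function name stratified_subsample, its parameters, and the toy rows are hypothetical and chosen for illustration. Other sampling schemes may suit your data set better.

```python
import random
from collections import defaultdict

def stratified_subsample(rows, label_of, fraction, seed=0):
    """Return a random subset keeping roughly `fraction` of each class."""
    rng = random.Random(seed)  # fixed seed so the sub-sample is reproducible
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    sample = []
    for group in by_label.values():
        # Keep at least one example of every class.
        n_keep = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n_keep))
    rng.shuffle(sample)  # avoid returning the rows grouped by class
    return sample

# Toy data (made up): 90 rows of class "a" and 10 rows of class "b".
rows = [(i, "a") for i in range(90)] + [(i, "b") for i in range(90, 100)]
subset = stratified_subsample(rows, label_of=lambda r: r[1], fraction=0.2)
print(len(subset), sum(1 for r in subset if r[1] == "b"))
```

Whatever scheme you use, state it in the report, together with the random seed, so that the results can be reproduced.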


You will work in groups of at most 4 people. You can use Piazza to find team-mates.

By Friday 16 February 2018, 4pm, each group should send an email to the TA with a ranked list of 3 datasets you would like to work on. Please also indicate the names and student numbers of all group members. We'll try to keep everyone happy, but there's a chance you won't be allocated your 1st choice.

By Tuesday 13 March 2018, 4pm, each group must email the instructor an interim report. It should describe what you have done so far and your plans for completing the mini-project by the final deadline. While not necessary, you may want to treat it as a draft of the final report (and you can use the same template if you like). This report will not form part of your numerical mark for the course. The goal of the interim report is to make sure that your project has the right scope and that you are on track.

The work on the mini-project will be evaluated through a written final report. Each group need only submit one report. Except in special circumstances, all members of the group will receive the same grade.

By Friday 6 April 2018, 4pm, the final report is due. You will need to submit the report, its LaTeX source, and all source code needed to reproduce the results in the report (tbc: submission will be via the submit command). The grade will be based on the final report only.

The report

The report should be at most 8 pages long, using this report template (adapted from the NIPS conference). It should contain the following in some manner:

  • description of the task
  • relevant background and related previous work
  • explanation of the significance/relevance of the objective/task
  • information on the data preparation
  • exploratory data analysis
  • description of the learning (e.g. classification) methods used
  • results and evaluation
  • conclusions

At the end of the report, you must include a short description of how each member of the group contributed to the project, which can be on an additional ninth page.

Marking Breakdown

The marking criteria include the appropriateness of the machine learning methods chosen, the quality of the analysis, the quality of the evaluation, the amount of work, and the quality of the explanation in the report (both text and graphics). A guide to the letter marks is:

  • A: Well-explained description of the points above, plus extra achievement in understanding or analysis of the results. Clear explanations and evidence of creative or deeper thought will contribute to a higher grade.
  • B: Well-explained description of the points above.
  • C: Good description of the points above, but with significant deficiencies.
  • D: Evidence that the student has gained some understanding, but has not addressed the specified task properly.
  • E/F/G: Serious error or slack work.


Late penalties: The policy of the School of Informatics is that no late submissions are allowed except on valid grounds agreed in advance with the year organiser.

Plagiarism Policy: The projects are (usually) group projects. Hence you are expected to discuss the work within your group, and to work on your report together. You should write up the project as a whole, including the work of the others in your project. At the end of the report, there should be a short description of how each member of the group contributed to the project. Please familiarize yourself with the School Plagiarism Policy.