Mini-Projects

In the project, you will use data science methods in a realistic setting. We have a short list of projects and corresponding datasets[1] that you can choose from. You will work in groups of 4 people[2], using Piazza to find teammates, and will collaborate to produce a project report that will be assessed.

Each group need only submit one report. All members of the group will receive the same grade, except in special or extenuating circumstances. The grade will be based on the final report only.

You will have considerable freedom in the projects, but each project should involve most parts of the data analysis process described in the lectures. This would typically involve

Note that you’re not required to outperform prior methods. The important thing is that the approaches taken are reasonable, methodologically correct, and clearly described in the report. Good projects would nonetheless discuss how their performance differs from that of current state-of-the-art approaches.

Important Dates

Wed 10 Feb, noon (wk 5)
Each group should fill in the project details form with the following information:
  • Student numbers and names
  • Ranked project choices in decreasing order of preference

We’ll try to keep everyone happy, but there’s a chance you won’t be allocated your first choice.

Wed 17 March, noon (wk 8)
Each group must upload a progress report using the project interim report form. The report should detail what you have done so far and your plans for completion by the final deadline. You may want to treat this as a draft of the final report, using the report template itself, but this will not be enforced. The interim report will not form part of your numerical mark for the course. The goal here is to ensure your project has the right scope and that you are on track.

Wed 7 April, 4pm (wk 12)
Final report due. You should submit the report (in PDF), the complete LaTeX source (so we can re-create the PDF), and the complete source code needed to reproduce the results in the report.

Project Assignments

Project assignments: https://edin.ac/3pBouiV

Details on where to submit the final report and any specifications for run-scripts in the source code will be provided later in the course.

Authors’ Instructions

The report should be at most 8 pages long, using this report template (adapted from the NeurIPS conference). It should contain the following in some manner:

At the end of the report, you must include a short description of how each member of the group contributed to the project, which can be on an additional ninth page.

Marking

The marking criteria include the appropriateness of the machine learning methods chosen, the quality of the analysis, the quality of the evaluation, the amount of work, and the quality of the explanation in the report (both text and graphics). While you will be marked out of 100 in line with the common marking scheme, the scheme can be interpreted in terms of letter grades as follows:

A:
Well-explained description of points above, plus extra achievement in understanding or analysis of results. Clear explanations and evidence of creative or deeper thought will contribute to a higher grade.
B:
Well-explained description of points above.
C:
Good description of points above but significant deficiencies.
D:
Evidence that the student has gained some understanding, but not addressed the specified task properly.
E/F/G:
Serious error or slack work.

Late penalties

The policy of the School of Informatics is that no late submissions are allowed except on valid grounds agreed in advance with the year organiser.

Plagiarism Policy

You are expected to discuss the work within your group, and to work on your report together. You should write up the project as a whole, including the work of the others in your project. At the end of the report, there should be a short description of how each member of the group contributed to the project. Please familiarize yourself with the School Plagiarism Policy.

Projects

Sentiment Classification from Movie Reviews

This data set contains 215,154 phrases from movie reviews on Rotten Tomatoes, labeled with the degree of sentiment that the phrase expresses, on a 5-point scale from positive to negative.

Task
  • Predict the sentiment from a given phrase
  • Explore unsupervised clustering methods
    • explore clustering metrics and compare performance across methods
    • analyse what basis each method uses to form its clusters
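As one possible starting point for the prediction task, the sketch below trains a multinomial Naive Bayes classifier with add-one smoothing on bag-of-words counts, using only the Python standard library. The four toy phrases, their labels, and the 0 to 4 label encoding are invented for illustration and are not taken from the dataset; the function names are likewise hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(phrases, labels):
    """Count words per class; returns everything needed for prediction."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for phrase, label in zip(phrases, labels):
        tokens = phrase.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, phrase):
    """Return the label with the highest smoothed log-posterior."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in phrase.lower().split():
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, NOT taken from the Rotten Tomatoes dataset.
train_phrases = ["a wonderful moving film", "truly wonderful acting",
                 "a dull tedious mess", "tedious and dull plot"]
train_labels = [4, 4, 0, 0]  # assumed encoding: 0 = most negative, 4 = most positive

model = train_naive_bayes(train_phrases, train_labels)
print(predict(model, "wonderful film"))
print(predict(model, "dull plot"))
```

A baseline of this kind is mainly useful as a reference point: more elaborate models in the report can then be compared against it on held-out phrases.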
Data
A copy of the data is available on AFS at /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data/sentiment-movie, obtained from http://nlp.stanford.edu/sentiment/ .
References
  • Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

Predicting Cuisines of Recipes

This dataset includes 4236 recipes from 12 cuisines with 709 distinct ingredients. Each recipe is labelled with the type of cuisine (e.g., Italian, Japanese, Chinese, etc.) it comes from.

Task
  • Predict type of cuisine from a given list of ingredients in a recipe
  • Explore collaborative filtering
    • predict ingredients to complete a partial recipe
    • explore relevant metrics and compare performance to appropriate baselines
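To make the recipe-completion task concrete, here is a minimal sketch of item-based collaborative filtering: ingredients are scored by their Jaccard similarity (over shared recipes) to the ingredients already present. This is only one of many possible approaches, and the toy recipes and function names are hypothetical rather than drawn from the course data.

```python
from collections import defaultdict
from itertools import combinations


def ingredient_similarity(recipes):
    """Jaccard similarity between ingredients, based on shared recipes."""
    appears_in = defaultdict(set)  # ingredient -> indices of recipes containing it
    for idx, recipe in enumerate(recipes):
        for ingredient in recipe:
            appears_in[ingredient].add(idx)
    sim = defaultdict(dict)
    for a, b in combinations(sorted(appears_in), 2):
        shared = len(appears_in[a] & appears_in[b])
        if shared:
            union = len(appears_in[a] | appears_in[b])
            sim[a][b] = sim[b][a] = shared / union
    return sim


def complete_recipe(partial, sim, top_k=2):
    """Rank unseen ingredients by summed similarity to the partial recipe."""
    scores = defaultdict(float)
    for have in partial:
        for candidate, s in sim.get(have, {}).items():
            if candidate not in partial:
                scores[candidate] += s
    # Sort by descending score, breaking ties alphabetically.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    return ranked[:top_k]


# Hypothetical toy recipes, NOT drawn from the course dataset.
recipes = [
    {"tomato", "basil", "olive oil", "garlic"},
    {"tomato", "basil", "mozzarella"},
    {"soy sauce", "ginger", "garlic"},
    {"soy sauce", "ginger", "rice"},
]
sim = ingredient_similarity(recipes)
print(complete_recipe({"tomato", "garlic"}, sim, top_k=2))
```

One way to evaluate such a recommender is to hide one ingredient from each held-out recipe and check how often it appears in the top-k suggestions, comparing against a baseline that always suggests the most frequent ingredients.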
Data
This data is available locally at /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data/recipes, and was collected by Facundo Bellosi.
References
  • Machine learning for cuisine discovery, Facundo Bellosi, MSc thesis (UoE) 2012.

Activity prediction from brain-computer interface data

This dataset was used in the BCI Competition III (dataset V). Using a cap with 32 integrated electrodes, EEG data were collected from three subjects while they performed three activities: imagining moving their left hand, imagining moving their right hand, and thinking of words beginning with the same letter. As well as the raw EEG signals, the data set provides precomputed features obtained by spatially filtering these signals and calculating the power spectral density.

Task
  • Exploratory data analysis and appropriate preparation/cleanup of data
  • Predict label of test data with different classifiers, using appropriate baselines (say, using precomputed features) and classifier choices.
  • Explore the utility of incorporating the temporal aspect of data through the use of time-series models (e.g. Hidden Markov Models).
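As a rough illustration of how temporal structure might be used, the sketch below smooths per-record class scores with Viterbi decoding under a "sticky" HMM, in which staying in the same class is more likely than switching. Treating classifier outputs as emission likelihoods is an approximation, and the `stay_prob` value and the example scores are hypothetical, not derived from the dataset.

```python
import math


def viterbi(frame_probs, stay_prob=0.9):
    """Most likely label sequence under a sticky-transition HMM.

    frame_probs[t][k] is a strictly positive score for class k at time t,
    used here in place of an emission likelihood (an approximation).
    """
    n_classes = len(frame_probs[0])
    switch_prob = (1.0 - stay_prob) / (n_classes - 1)
    log_trans = [[math.log(stay_prob if i == j else switch_prob)
                  for j in range(n_classes)] for i in range(n_classes)]

    # Uniform prior over the initial class.
    scores = [math.log(p) - math.log(n_classes) for p in frame_probs[0]]
    backpointers = []
    for probs in frame_probs[1:]:
        new_scores, pointers = [], []
        for j in range(n_classes):
            best_i = max(range(n_classes), key=lambda i: scores[i] + log_trans[i][j])
            new_scores.append(scores[best_i] + log_trans[best_i][j] + math.log(probs[j]))
            pointers.append(best_i)
        scores = new_scores
        backpointers.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_classes), key=lambda k: scores[k])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical classifier outputs for 5 consecutive records and 3 classes.
frame_probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.3, 0.5, 0.2],  # isolated record favouring class 1
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
]
independent = [max(range(3), key=lambda k: p[k]) for p in frame_probs]
smoothed = viterbi(frame_probs)
print(independent)  # per-record argmax
print(smoothed)     # temporally smoothed sequence
```

In this toy example the isolated switch to class 1 is removed by the smoothing, because two transitions cost more than the single record gains. Whether such smoothing helps on the real data, and what transition probabilities are appropriate, would need to be checked empirically.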
Data
The dataset is available on AFS at /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data/bbci, and is described in detail here.
  • 31216 records in training data, 10464 in test data
  • Each record has 96 continuous values and a numerical label
  • 63 MB as uncompressed text
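A possible first step with records of this shape is sketched below: parsing text lines into feature vectors and labels, then computing a majority-class baseline. The file layout (whitespace-separated values with the label in the last column) is an assumption that should be checked against the dataset description, and the demo values and labels are synthetic.

```python
from collections import Counter

N_FEATURES = 96  # from the dataset description above


def parse_records(lines):
    """Split text lines into (features, labels).

    Assumes whitespace-separated values with the label in the last column;
    verify this against the actual files before relying on it.
    """
    features, labels = [], []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != N_FEATURES + 1:
            raise ValueError(
                f"line {line_no}: expected {N_FEATURES + 1} fields, got {len(fields)}")
        features.append([float(v) for v in fields[:-1]])
        labels.append(int(float(fields[-1])))
    return features, labels


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(label == majority for label in test_labels) / len(test_labels)


# Synthetic stand-in for the real files (hypothetical values and labels).
demo_lines = [" ".join(["0.5"] * N_FEATURES + [label])
              for label in ["2", "2", "3", "7"]]
X, y = parse_records(demo_lines)
print(len(X), len(X[0]), y)
print(majority_baseline_accuracy(y, [2, 3, 2, 2]))
```

Because `parse_records` accepts any iterable of lines, an open file handle can be passed in directly once the real file format has been confirmed.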
Reference
  • On the need for on-line learning in brain-computer interfaces by J. del R. Millán

  1. Taken from the IRDS course.

  2. Excepting special/extenuating circumstances.