

Suggested Datasets: Introduction to Research in Data Science (IRDS)

Here is a list of suggested project ideas for the mini-project for IRDS. If you wish, you may instead propose a project that is not on this list. At the bottom of this page, you will find some examples of datasets which we judged as inappropriate for the projects — this may help you to avoid some pitfalls. Other sources of ideas for data sets include:

It could also be very interesting to consider a project that involves databases or data management systems rather than data analysis. For example, you might compare the runtime of different variants of a system on a set of test queries. The lecturers from your databases course would be happy to help with ideas for this.

Many of the data sets on this page are locally available in the directory /afs/


Data Integration and Visualization of Grant Proposals

Note: This project is proposed by one of the CDT partners, the Digital Catapult. This is a real-world problem, and your findings could have an impact on future grant allocations.

Change History of Wikipedia Infoboxes

This data set was kindly collected by Google.

Predicting Links and Communities in Social Networks

Yelp Dataset Challenge

Predicting Dropouts in MOOCs

On-time Performance of Commercial Air Travel

Thanks to MathWorks for alerting us to this public data set.

Ninapro EMG hand gesture recognition and kinematics reconstruction

Data Transformation and Querying

Consider any of the datasets listed elsewhere on this page - the larger and/or less structured, the better. Identify five common operations or "queries" that one might want to compute over this data as part of a larger analysis. Choose two publicly available database systems using different data models, from the following list:

For each of the two systems, carry out the following:

  1. Set up an appropriate database schema (if necessary) suitable for storing the dataset.
  2. Implement a mapping from the raw form of the data to the model used by the database; load the data and measure the time and space needed for it.
  3. Implement each of the five queries using the query language supported by the database, and measure the performance of each of the queries (following appropriate experimental methodology). (If a query turns out not to be answerable using a single query in the database's query language, it may be implemented using several queries or by loading the data from the database and evaluating in-memory.)
  4. Identify any performance issues and investigate ways of ameliorating them using the database, for example, by identifying appropriate indexing or constraints in the RDBMS setting, or adding user-defined functions to the database to support specific operations that are not present in the query language.
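The steps above can be sketched for one relational system as follows. This is a minimal illustration, not a prescribed setup: it assumes SQLite (via Python's standard `sqlite3` module) as one of the two systems, and the `flights` table, its columns, the synthetic rows, and the helper names `load_rows` and `time_query` are all hypothetical stand-ins for a real dataset and query.

```python
import os
import sqlite3
import tempfile
import time


def load_rows(db_path, rows):
    """Create the schema, bulk-load rows, and return (seconds, bytes on disk)."""
    start = time.perf_counter()
    conn = sqlite3.connect(db_path)
    # Step 1: a schema for the (hypothetical) dataset.
    conn.execute(
        "CREATE TABLE flights (carrier TEXT, origin TEXT, delay_min INTEGER)"
    )
    # Step 2: map raw records to rows and load them.
    conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)
    conn.commit()
    conn.close()
    elapsed = time.perf_counter() - start
    return elapsed, os.path.getsize(db_path)


def time_query(db_path, sql, params=()):
    """Run one query and return (result rows, elapsed seconds)."""
    conn = sqlite3.connect(db_path)
    start = time.perf_counter()
    result = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    conn.close()
    return result, elapsed


# Synthetic stand-in for a real dataset (values are made up for illustration).
rows = [("C%d" % (i % 7), "A%d" % (i % 50), i % 120) for i in range(20000)]
db_path = os.path.join(tempfile.mkdtemp(), "flights.db")

load_seconds, size_bytes = load_rows(db_path, rows)

# Step 3: one of the five "queries", timed once without an index.
query = (
    "SELECT carrier, AVG(delay_min) FROM flights "
    "WHERE origin = ? GROUP BY carrier ORDER BY carrier"
)
before, t_before = time_query(db_path, query, ("A3",))

# Step 4: add an index on the filtered column and time the same query again.
conn = sqlite3.connect(db_path)
conn.execute("CREATE INDEX idx_origin ON flights (origin)")
conn.commit()
conn.close()
after, t_after = time_query(db_path, query, ("A3",))

print(f"load: {load_seconds:.3f}s, size: {size_bytes} bytes")
print(f"query before index: {t_before:.5f}s, after index: {t_after:.5f}s")
```

A single timing like this is only a starting point: an index also increases the space used, so the database size should be measured again after adding it, and each timing should be repeated several times before drawing conclusions.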

The report for a mini-project of this form should present an empirical comparison of the query performance and space usage of the two systems being compared. This should include a detailed analysis of any uncertainty involved in the evaluation - for example, to account for variation in observations across different runs. The report can also incorporate qualitative observations (or quantitative measurements) relating to the ease of accomplishing the above tasks using the different tools.
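One simple way to account for variation across runs is to repeat each measurement and report a mean together with a spread. The sketch below is an assumption about how this might be done, not a required methodology: the helper names `measure` and `summarise`, the number of runs and warm-up runs, and the placeholder workload are all hypothetical, and the interval uses a normal approximation (z = 1.96), which is crude for small numbers of runs; a t-based interval would be more appropriate there.

```python
import math
import statistics
import time


def measure(task, runs=10, warmup=2):
    """Time task() over several runs and return the per-run durations in seconds."""
    for _ in range(warmup):
        task()  # warm-up runs are discarded (cold caches, lazy initialisation)
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations, z=1.96):
    """Return (mean, sample standard deviation, half-width of an approximate 95% interval)."""
    mean = statistics.mean(durations)
    stdev = statistics.stdev(durations)
    half_width = z * stdev / math.sqrt(len(durations))
    return mean, stdev, half_width


def example_task():
    """Placeholder workload standing in for a real database query."""
    return sum(i * i for i in range(50000))


durations = measure(example_task)
mean, stdev, half_width = summarise(durations)
print(f"mean {mean:.5f}s, std dev {stdev:.5f}s, approx. 95% interval +/- {half_width:.5f}s")
```

Reporting the individual durations (for example as a box plot) alongside the summary makes outliers visible, such as a single slow run caused by other load on the machine.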

Object Classification from Images (Caltech101)

Sentiment Classification from Movie Reviews

Identifying Malaria Parasites from Images

Predicting Cuisines of Recipes

Human Activity Recognition Using Smartphones Data Set

Web quality assessment

Short term movements in stock prices

Energy Usage in the Informatics Forum

Student performance on mathematical problems

Marketing: Predicting Customer Churn

Performance of Distributed Data Analysis: Map-Reduce, etc.

Comparing Stochastic Optimization Algorithms

Particle physics data set

Brain-Computer Interface data set

Prediction of Gene/Protein Localization data set

Prediction of Molecular Bioactivity for Drug Design: Binding to Thrombin dataset

The 4 Universities dataset

Internet advertisements dataset

The Reuters-21578 text dataset

The charitable donations dataset

The caravan insurance data

The yeast S. cerevisiae gene expression vectors

The colon cancer data

The leukemia data set

The human splice site data

Volcanoes on Venus

Network intrusion data

The SuperCOSMOS Sky Survey objects catalogue

Less interesting datasets

You are allowed to come up with your own dataset for this project. In order to guide you in this search, we present here some examples of datasets which were considered less interesting.

The Landsat image data from Statlog

The OHSUMED document collection

The predictive toxicology dataset

20 News Groups dataset

Yeast Gene Regulation Prediction dataset

CATS benchmark

This page was originally written by Frederick Ducatelle for the Data Mining and Exploration course. This IRDS version is maintained by Charles Sutton.

Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh