Suggested Datasets: Introduction to Research in Data Science (IRDS)

Here is a list of suggested project ideas for the IRDS mini-project. If you wish, you may instead propose a project that is not on this list. At the bottom of this page, you will find some examples of datasets that we judged inappropriate for the projects; these may help you avoid some pitfalls. Other sources of ideas for data sets include:

It could also be very interesting to consider a project that involves databases or data management systems rather than data analysis. For example, you might compare the runtime of different variants of a system on a set of test queries. The lecturers from your databases course would be happy to help with ideas for this.

Many of the data sets on this page are locally available in the directory /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data.



Data Integration and Visualization of Grant Proposals

Note: This is proposed by one of the CDT partners, the Digital Catapult. This is a real-world problem, and your findings could have an impact on future grant allocations.

Change History of Wikipedia Infoboxes

This data set was kindly collected by Google.


Predicting Links and Communities in Social Networks


Yelp Dataset Challenge


Predicting Dropouts in MOOCs


On-time Performance of Commercial Air Travel

Thanks to MathWorks for alerting us to this public data set.


Ninapro EMG hand gesture recognition and kinematics reconstruction


Data Transformation and Querying

Consider any of the datasets listed elsewhere on this page - the larger and/or less structured, the better. Identify five common operations or "queries" that one might want to compute over this data as part of a larger analysis. Choose two publicly available database systems using different data models, from the following list:

For each of the two systems, carry out the following:

  1. Set up an appropriate database schema (if necessary) suitable for storing the dataset.
  2. Implement a mapping from the raw form of the data to the model used by the database; load the data and measure the time and space needed for it.
  3. Implement each of the five queries using the query language supported by the database, and measure the performance of each of the queries (following appropriate experimental methodology). (If a query turns out not to be answerable using a single query in the database's query language, it may be implemented using several queries or by loading the data from the database and evaluating in-memory.)
  4. Identify any performance issues and investigate ways of ameliorating them using the database, for example, by identifying appropriate indexing or constraints in the RDBMS setting, or adding user-defined functions to the database to support specific operations that are not present in the query language.
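
As a rough illustration of steps 2 to 4, the sketch below times a bulk load and a repeated query, then repeats the query after adding an index. It uses Python's built-in sqlite3 module purely as a stand-in: SQLite is an assumption here, not necessarily one of the systems to be compared, and the table name, column names, and synthetic rows are all hypothetical.

```python
import random
import sqlite3
import time


def make_rows(n, seed=0):
    """Generate hypothetical (id, category, value) rows in place of a real dataset."""
    rng = random.Random(seed)
    return [(i, rng.randrange(100), rng.random()) for i in range(n)]


def timed(fn, *args):
    """Return (result, elapsed seconds) for one call, using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def load(conn, rows):
    """Steps 1 and 2: create the schema and bulk-load the rows."""
    conn.execute(
        "CREATE TABLE obs (id INTEGER PRIMARY KEY, category INTEGER, value REAL)"
    )
    conn.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)
    conn.commit()


def run_query(conn, category):
    """Step 3: one example query, a count and average for a single category."""
    return conn.execute(
        "SELECT COUNT(*), AVG(value) FROM obs WHERE category = ?", (category,)
    ).fetchone()


def benchmark(n_rows=50_000, repeats=5):
    conn = sqlite3.connect(":memory:")
    rows = make_rows(n_rows)
    _, load_time = timed(load, conn, rows)

    before = [timed(run_query, conn, 7)[1] for _ in range(repeats)]
    # Step 4: add an index on the filtered column and measure again.
    conn.execute("CREATE INDEX idx_obs_category ON obs (category)")
    after = [timed(run_query, conn, 7)[1] for _ in range(repeats)]

    answer = run_query(conn, 7)
    conn.close()
    return {"load_s": load_time, "before_s": before, "after_s": after, "answer": answer}


if __name__ == "__main__":
    results = benchmark()
    print(f"load: {results['load_s']:.4f} s")
    print(f"query without index (fastest run): {min(results['before_s']):.6f} s")
    print(f"query with index (fastest run): {min(results['after_s']):.6f} s")
```

An in-memory database is used only to keep the sketch self-contained; it hides disk I/O, so a real experiment would more likely use an on-disk database and record its size as the space measurement.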

The report for a mini-project of this form should present an empirical comparison of the query performance and space usage of the two systems being compared. This should include a detailed analysis of any uncertainty involved in the evaluation - for example, to account for variation in observations across different runs. The report can also incorporate qualitative observations (or quantitative measurements) relating to the ease of accomplishing the above tasks using the different tools.
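
One possible way to account for variation across runs is to report the mean time together with a bootstrap confidence interval. The sketch below uses only the Python standard library. The timing values are made up for illustration, and the choice of a percentile bootstrap at the 95% level is an assumption, not a course requirement.

```python
import random
import statistics


def bootstrap_ci(samples, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of samples."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


def summarise(samples):
    """Mean, sample standard deviation, and a 95% bootstrap interval."""
    low, high = bootstrap_ci(samples)
    return {
        "mean": statistics.fmean(samples),
        "stdev": statistics.stdev(samples),
        "ci95": (low, high),
    }


if __name__ == "__main__":
    # Hypothetical query times in seconds for two systems, ten runs each.
    system_a = [0.212, 0.205, 0.219, 0.208, 0.231, 0.207, 0.210, 0.215, 0.206, 0.224]
    system_b = [0.187, 0.260, 0.191, 0.185, 0.254, 0.189, 0.192, 0.188, 0.249, 0.190]
    for name, runs in (("A", system_a), ("B", system_b)):
        s = summarise(runs)
        print(
            f"system {name}: mean={s['mean']:.3f} s, stdev={s['stdev']:.3f} s, "
            f"95% CI=({s['ci95'][0]:.3f}, {s['ci95'][1]:.3f})"
        )
```

If the intervals for two systems overlap substantially, the observed difference may not be distinguishable from run-to-run noise, and more runs may be needed before drawing a conclusion.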


Object Classification from Images (Caltech101)


Sentiment Classification from Movie Reviews


Identifying Malaria Parasites from Images


Predicting Cuisines of Recipes


Human Activity Recognition Using Smartphones Data Set


Web quality assessment


Short term movements in stock prices


Energy Usage in the Informatics Forum


Student performance on mathematical problems


Marketing: Predicting Customer Churn


Performance of Distributed Data Analysis: Map-Reduce, etc.


Comparing Stochastic Optimization Algorithms


Particle physics data set



Brain-Computer Interface data set


Prediction of Gene/Protein Localization data set


Prediction of Molecular Bioactivity for Drug Design: Binding to Thrombin dataset


The 4 Universities dataset


Internet advertisements dataset


The Reuters-21578 text dataset


The charitable donations dataset


The caravan insurance data


The yeast S. cerevisiae gene expression vectors


The colon cancer data


The leukemia data set


The human splice site data


Volcanoes on Venus


Network intrusion data


The SuperCOSMOS Sky Survey objects catalogue


Less interesting datasets

You are allowed to come up with your own dataset for this project. In order to guide you in this search, we present here some examples of datasets which were considered less interesting.

The Landsat image data from Statlog


The OHSUMED document collection


The predictive toxicology dataset


20 News Groups dataset


Yeast Gene Regulation Prediction dataset


CATS benchmark


This page was originally written by Frederick Ducatelle for the Data Mining and Exploration course. This IRDS version is maintained by Charles Sutton.


