
Suggested Datasets: Introduction to Research in Data Science (IRDS)

Here is a list of suggested project ideas for the mini-project for IRDS. If you wish, you may instead propose a project that is not on this list. At the bottom of this page, you will find some examples of datasets which we judged as inappropriate for the projects — this may help you to avoid some pitfalls. Other sources of ideas for data sets include:

It could also be very interesting to consider a project that involves databases or data management systems rather than data analysis. For example, you might compare the runtime of different variants of a system on a set of test queries. The lecturers from your databases course would be happy to help with ideas for this.

Many of the data sets on this page are locally available in the directory /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data.

Data Integration and Visualization of Grant Proposals

Note: This project is proposed by one of the CDT partners, the Digital Catapult. This is a real-world problem, and your findings could have an impact on future grant allocations.
• Description: The Department for Business, Innovation and Skills (BIS) funds business and research development through a network of sub-organisations. The project goal is to build a unified view from the separate datasets collected by parts of the network, and to understand "where the money is going".
• Size: 4 datasets of thousands of entries each
• Getting the data: The data is available from four different web sites:
• Task: The main objective is to understand grant data from different organisations in a unified way:
• Create a unique identifier to join the different data sets (a join-key sketch follows this section)
• Unify the datasets and explore/visualise distribution of grants by sector, geography, type of grant, time, etc
• Based on past applications recommend the best grant option for a new application
• Challenges: Extracting information from semi-structured data (the content of the proposals, which first needs to be scraped from HTML), and building predictive models based on natural language processing.
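To make the join-key idea concrete, here is a minimal pandas sketch. The file names (innovateuk.csv, epsrc.csv) and column names are placeholders for whatever the four sources actually provide, and real organisation names will usually need fuzzier matching than this.

```python
# Sketch: build a normalised join key across grant datasets.
# File and column names are hypothetical; adapt to the real data.
import re
import pandas as pd

def normalise(name):
    """Lowercase an organisation name and strip punctuation and extra whitespace."""
    name = re.sub(r"[^a-z0-9 ]", " ", str(name).lower())
    return re.sub(r"\s+", " ", name).strip()

frames = []
for path, org_col in [("innovateuk.csv", "participant"), ("epsrc.csv", "organisation")]:
    df = pd.read_csv(path)
    df["org_key"] = df[org_col].map(normalise)  # shared identifier across sources
    frames.append(df)

# An outer join keeps grants that appear in only one source.
merged = frames[0].merge(frames[1], on="org_key", how="outer", suffixes=("_iuk", "_epsrc"))
print(merged.groupby("org_key").size().sort_values(ascending=False).head())
```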

Change History of Wikipedia Infoboxes

• Description: This data set contains the edit history of 1.8 million infoboxes in Wikipedia pages. These are the structured set of information on the top right of Wikipedia pages in many popular categories. Attributes on Wikipedia change over time, and some of them change more than others. Understanding attribute change is important for extracting accurate and useful information from Wikipedia.
• Task: Although this data set does not have an explicit task associated with it, several predictive tasks could be derived from it. One is predicting which data items are most likely to be updated, based on the category of the page, the record key, the time of the last update, etc. This could be useful for alerting Wikipedia editors to data items that might need updating soon, or for alerting editors when other users make suspicious revisions to data items that do not usually change over time (a real-life example). Time series modelling techniques could be useful here (a simple edit-counting sketch follows this section).
• Data set: Dataset of Wikipedia changes in JSON format (126 GB). Available locally at /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data/wikipedia-changes
• Challenges: Semi-structured time series data with a textual component (with all problems of ambiguity etc).
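As a starting point for the prediction task above, the sketch below counts edits per infobox attribute from a JSON dump. The field names ("attribute", "timestamp") and the one-record-per-line layout are assumptions; check the dump's actual schema first.

```python
# Sketch: estimate how often each infobox attribute changes.
# Field names and file layout are assumptions about the dump's schema.
import json
from collections import defaultdict

timestamps = defaultdict(list)
with open("wikipedia-changes.json") as f:
    for line in f:                      # one JSON record per line (assumed)
        rec = json.loads(line)
        timestamps[rec["attribute"]].append(rec["timestamp"])

# Rank attributes by number of recorded edits; a time series model could
# instead work with the gaps between consecutive timestamps.
by_volatility = sorted(timestamps.items(), key=lambda kv: len(kv[1]), reverse=True)
for attr, ts in by_volatility[:10]:
    print(attr, len(ts))
```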

This data set was kindly collected by Google.

Predicting Links and Communities in Social Networks

• Description: A wide variety of network data is available as part of the Stanford Large Network Dataset Collection. Examples include subsets of social networks from Facebook and Google+. Two types of prediction problems you could consider are: (1) Choose a data set from the list which contains ground-truth communities. For example, the LiveJournal data set contains information from groups that users created and joined. Explore to what extent it is possible to predict whether a user is a member of a group based on the social network structure. (2) Choose a data set with metadata on each node in the network. For example, the high-energy physics citation network contains information about the title and abstract of the papers. Explore the possibility of link prediction (predict whether paper A cites paper B) based on the metadata.
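For problem (2), a minimal link-prediction baseline can be built with networkx: hold out some true edges, mix them with sampled non-edges, and score all candidates with a neighbourhood similarity measure. The edge-list file name is a placeholder, and materialising all non-edges is only feasible for small graphs.

```python
# Sketch: score candidate links with the Jaccard coefficient in networkx.
# "edges.txt" is a hypothetical whitespace-separated edge list.
import random
import networkx as nx

G = nx.read_edgelist("edges.txt")

# Hold out a sample of true edges, then mix with an equal number of non-edges.
held_out = random.sample(list(G.edges()), 100)
G.remove_edges_from(held_out)
non_edges = random.sample(list(nx.non_edges(G)), 100)  # fine for small graphs only

candidates = held_out + non_edges
for u, v, score in nx.jaccard_coefficient(G, candidates):
    print(u, v, score)   # high scores should concentrate on the held-out true edges
```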

Yelp Dataset Challenge

• Description: The dataset contains 2.7M reviews and 649K tips by 687K users for 86K businesses. Recently, 200K pictures from the included businesses have been added.
• Task: This data set does not have an explicit task associated with it, however many example tasks have been suggested by the challenge organisers.
• Data set: 952 MB, 6.3 GB (pictures). A copy of the data is available on AFS at /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data/. They were obtained from https://www.yelp.co.uk/dataset_challenge/dataset.
• References: Publications using this dataset can be found on Google Scholar.

Predicting Dropouts in MOOCs

• Description: The goal is to predict whether a student will drop out of a MOOC based on their interaction with the online platform. The data contain 8,157,278 "events", each recording one student's visit to one element of a course. The data is from XuetangX, a Chinese MOOC provider that was launched in 2013. This data was the basis of the KDD Cup 2015 competition. The KDD Cup web site contains more information about the data, as well as links to methods that people tried on this data during the competition.
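A minimal sketch of the dropout task, assuming hypothetical file and column names (log_train.csv with student_id and event_type columns, truth_train.csv with a dropout label): aggregate the raw events into per-student counts, then cross-validate a simple classifier.

```python
# Sketch: per-student event counts plus logistic regression for dropout.
# File and column names are assumptions about the KDD Cup 2015 files.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

events = pd.read_csv("log_train.csv")
X = events.pivot_table(index="student_id", columns="event_type",
                       aggfunc="size", fill_value=0)   # one row per student
y = pd.read_csv("truth_train.csv", index_col="student_id").loc[X.index, "dropout"]

clf = LogisticRegression(max_iter=1000)
print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())
```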

On-time Performance of Commercial Air Travel

• Description: Data from the United States Department of Transportation regarding the on-time statistics of various commercial flights. Here is a description of the data set. The original data can be obtained from that link (several hundred thousand entries per year). Alternatively, you can find slightly processed versions which might be easier to use, but which only go up to 2008. That version of the data set is also available locally at /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data/us-flights
• Task: Predict whether a flight will arrive on time based on carrier, route, date, etc. More advanced questions that you can ask include: Does the performance of your classifier degrade over time? e.g., if you train on 2006 and test on 2008, how much do you lose versus training and testing on 2006? Is it better to periodically retrain the classifier? If so, how often and with how much old data?
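The year-shift experiment could look like the following sketch. The column names match the commonly used processed CSVs but should be treated as assumptions, and a 15-minute threshold is used as the (arbitrary) definition of "on time".

```python
# Sketch: train on one year, test on a later one, to measure temporal degradation.
# Column names are assumptions about the processed flight CSVs.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

cols = ["Month", "DayOfWeek", "UniqueCarrier", "Origin", "Dest", "ArrDelay"]
train = pd.read_csv("2006.csv", usecols=cols).dropna()
test = pd.read_csv("2008.csv", usecols=cols).dropna()

def prepare(df):
    X = pd.get_dummies(df.drop(columns="ArrDelay"),
                       columns=["UniqueCarrier", "Origin", "Dest"])
    y = (df["ArrDelay"] <= 15)          # "on time" = at most 15 minutes late
    return X, y

X_tr, y_tr = prepare(train)
X_te, y_te = prepare(test)
X_te = X_te.reindex(columns=X_tr.columns, fill_value=0)  # align dummy columns

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("2006 -> 2008 accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```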

Thanks to MathWorks for alerting us to this public data set.

Data Transformation and Querying

Consider any of the datasets listed elsewhere on this page - the larger and/or less structured, the better. Identify five common operations or "queries" that one might want to compute over this data as part of a larger analysis. Choose two publicly available database systems using different data models, from the following list:

• MySQL or PostgreSQL (relational)
• eXist, MarkLogic (XML database)
• OpenRDF Sesame, Stardog, or Virtuoso (RDF triple stores)

For each of the two systems, carry out the following:

1. Set up an appropriate database schema (if necessary) suitable for storing the dataset.
2. Implement a mapping from the raw form of the data to the model used by the database; load the data and measure the time and space needed for it.
3. Implement each of the five queries using the query language supported by the database, and measure the performance of each of the queries, following appropriate experimental methodology (a timing sketch follows this list). (If a query turns out not to be answerable using a single query in the database's query language, it may be implemented using several queries or by loading the data from the database and evaluating in-memory.)
4. Identify any performance issues and investigate ways of ameliorating them using the database, for example, by identifying appropriate indexing or constraints in the RDBMS setting, or adding user-defined functions to the database to support specific operations that are not present in the query language.
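For step 3, a timing harness along these lines would do (shown for PostgreSQL via psycopg2; the database name, table, and query are placeholders). Repeated runs after a warm-up let you report a mean and a spread rather than a single, possibly unlucky, measurement.

```python
# Sketch of a timing harness for the query-performance step.
# Connection details and the example query are placeholders.
import statistics
import time
import psycopg2

conn = psycopg2.connect(dbname="miniproject")
cur = conn.cursor()

def time_query(sql, runs=10):
    cur.execute(sql); cur.fetchall()            # warm-up run (caches, query plan)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        cur.execute(sql)
        cur.fetchall()
        samples.append(time.perf_counter() - t0)
    return statistics.mean(samples), statistics.stdev(samples)

mean, sd = time_query("SELECT grant_type, count(*) FROM grants GROUP BY grant_type")
print(f"{mean:.4f}s +/- {sd:.4f}s")
```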

The report for a mini-project of this form should present an empirical comparison of the query performance and space usage of the two systems being compared. This should include a detailed analysis of any uncertainty involved in the evaluation - for example, to account for variation in observations across different runs. The report can also incorporate qualitative observations (or quantitative measurements) relating to the ease of accomplishing the above tasks using the different tools.

Object Classification from Images (Caltech101)

• Description: This data set (called the "Caltech 101" data set) contains pictures of 8680 objects. The goal is to classify the image by category. Most categories have about 50 images. The size of each image is roughly 300 x 200 pixels.
• Data Set: http://www.vision.caltech.edu/Image_Datasets/Caltech101/
• Discussion: The images are fairly straightforward, having relatively low variability within a class, especially with respect to pose and occlusion. Many more modern object recognition and localization data sets exist that are more difficult in various ways, including the PASCAL Visual Object Classes challenges. We have suggested a simpler one for this project so that you do not have to focus on the more time-consuming aspects of feature engineering for computer vision, but if you have more experience in the field you could select a more complex data set.
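A minimal classification sketch for two categories, using HOG features from scikit-image and a linear SVM. The directory layout follows the standard Caltech 101 download, but treat the paths as assumptions.

```python
# Sketch: HOG features plus a linear SVM on two Caltech 101 categories.
# Paths are assumptions; extend the category list for the full task.
from pathlib import Path
import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.transform import resize
from skimage.feature import hog
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X, y = [], []
for label, cat in enumerate(["accordion", "airplanes"]):
    for path in Path("101_ObjectCategories", cat).glob("*.jpg"):
        img = imread(path)
        if img.ndim == 3:
            img = rgb2gray(img)
        img = resize(img, (128, 128))                 # fixed size for comparable features
        X.append(hog(img, pixels_per_cell=(16, 16)))
        y.append(label)

X, y = np.array(X), np.array(y)
print(cross_val_score(LinearSVC(), X, y, cv=5).mean())
```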

Sentiment Classification from Movie Reviews

• Description: This data set contains 215,154 phrases from movie reviews on Rotten Tomatoes, labeled with the degree of sentiment that the phrase expresses, on a 5-point scale from positive to negative. The problem is to predict the sentiment from the phrase (a bag-of-words sketch follows this section).
• Data: A copy of the data is available on AFS at /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data/. They were obtained from http://nlp.stanford.edu/sentiment/.
• Reference: Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher Manning, Andrew Ng and Christopher Potts. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
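A bag-of-words baseline is a natural first attempt; the sketch below assumes the phrases and labels have already been extracted into a two-column TSV file (the real treebank files need a small amount of parsing first).

```python
# Sketch: TF-IDF bag-of-words baseline for phrase-level sentiment.
# "phrases.tsv" is a hypothetical (phrase, label) file derived from the treebank.
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = pd.read_csv("phrases.tsv", sep="\t", names=["phrase", "label"])
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                      LogisticRegression(max_iter=1000))
print(cross_val_score(model, data["phrase"], data["label"], cv=5).mean())
```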

Identifying Malaria Parasites from Images

• Description: This data set was collected as part of a research project to develop an automated system for diagnosing malaria from blood samples. The data consists of 3,243,600 image patches taken from microscopy images of blood cells from 133 different patients. Each patch is labelled 1 if there is any plasmodium (the blood parasite that causes malaria) visible in the patch, and 0 otherwise. Each patch is a 30x30 pixel grayscale image (so in order to visualise the patches, first reshape each 900-dimensional feature vector accordingly).
• Reference: See Quinn et al. Automated Blood Smear Analysis for Mobile Malaria Diagnosis
• Task: Train a classifier to distinguish between positive and negative patches. Using cross-validation, compare precision and recall at different detection thresholds (a sketch follows this section).
• Bonus task: Whilst efforts were made to ensure that the annotation of images by clinical experts was done consistently, due to the large quantity of data it is likely that some of these patches are mislabelled. Imagine there was some budget to get a second opinion on a proportion of the patches; in order to suggest candidates, try to find examples for which the classifier developed in the first task strongly predicts the "wrong" label.
• Challenges: The shape of a parasite may appear differently depending on the stage of the life cycle it is at, and parasites may be at any orientation. Appropriate shape features therefore need to be extracted from the raw pixel data.
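A minimal sketch of the basic task, assuming the patches and labels have been loaded into numpy arrays (the loading step is a placeholder): visualise one patch via the 30x30 reshape, then trace precision and recall across thresholds.

```python
# Sketch: view a patch and trace precision/recall across detection thresholds.
# The .npy loading step is a placeholder for however the data is actually stored.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_curve

X = np.load("patches.npy")          # n x 900 patch matrix (placeholder)
y = np.load("labels.npy")           # 0/1 labels (placeholder)
plt.imshow(X[0].reshape(30, 30), cmap="gray")   # reshape the 900-dim vector to view it
plt.show()

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probs = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, probs)
plt.plot(recall, precision)
plt.xlabel("recall"); plt.ylabel("precision"); plt.show()
```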

Predicting Cuisines of Recipes

• Description: Given a list of ingredients in a recipe, predict what type of cuisine it is from (e.g., Italian, Japanese, Chinese, etc.)
• Size: 4236 recipes from 12 cuisines; 709 distinct ingredients.
• Data set: This is the data set that we used in one of the labs. It is available locally at /afs/inf.ed.ac.uk/group/ML/csutton/msc/2012/bellosi/DATA
• References: This data set was collected by Facundo Bellosi, and is described in Bellosi, F. Machine learning for cuisine discovery. MSc thesis, University of Edinburgh, 2012.
• Task: This data set could be used for clustering (group together similar recipes) or collaborative filtering (given a partial recipe, suggest other ingredients that could go along with it). The supervised cuisine prediction task is perhaps too easy, but these other tasks could be more interesting. It is also easy to get many additional recipes from the internet to build a larger data set, which might make collaborative filtering, etc., easier.
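A simple co-occurrence recommender for the ingredient-suggestion task could look like this sketch. The binary recipes-by-ingredients matrix is stood in for by random data; building the real one from the files above is assumed done.

```python
# Sketch: suggest ingredients for a partial recipe from co-occurrence counts.
# R is a binary recipes x ingredients matrix; random data stands in for the real one.
import numpy as np

rng = np.random.default_rng(0)
R = (rng.random((4236, 709)) < 0.02).astype(int)   # stand-in for the real matrix

C = R.T @ R                        # C[i, j] = number of recipes containing both i and j
np.fill_diagonal(C, 0)

partial = [3, 57, 200]             # indices of ingredients already in the recipe
scores = C[partial].sum(axis=0)    # total co-occurrence with the partial recipe
scores[partial] = -1               # don't re-suggest what is already there
print(np.argsort(scores)[::-1][:5])   # five most compatible ingredients
```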

Human Activity Recognition Using Smartphones Data Set

• Description: Determine what activity a person is engaging in (e.g., WALKING, WALKING_UPSTAIRS, WALKING_DOWNSTAIRS, SITTING, STANDING, LAYING) from data recorded by a smartphone (Samsung Galaxy S II) worn on the subject's waist. Captured using the phone's embedded accelerometer and gyroscope, the data include 3-axial linear acceleration and 3-axial angular velocity sampled at a constant rate of 50 Hz.
• Size
• Training set: 7352 time steps with 561 features at each time step.
• References:
• Data
• The UCI web page links to a research paper that was written with this data set.
• Task: Classification, Clustering. This Human Activity Recognition database was built from recordings of 30 subjects performing activities of daily living (ADL) while carrying a waist-mounted smartphone with embedded inertial sensors. The paper describes the use of an SVM on this data set, classifying each time step into one of the activities without taking temporal structure into account. This is a fairly naive method, and it could be interesting to compare it to a time series model like an HMM. Alternatively, you could look at different classification methods or different ways of handling variability between subjects. Also, in an SVM you could include the activity detected at the previous time step as a feature (this is a "poor man's" version of a time series model; a sketch follows this section). Do not bother trying to reproduce the hardware-limited (fixed-precision) SVM on this data.
• Challenges: Large number of features, time series data. May need to take into account the fact that different subjects have slightly different profiles over activities.
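Here is a minimal sketch of the previous-label idea, assuming the standard UCI file layout (X_train.txt, y_train.txt). Note the caveats in the comments: this is deliberately the crude version.

```python
# Sketch of the "poor man's" time series idea: append the previous time step's
# activity label as an extra feature. File names follow the UCI download layout.
import numpy as np
from sklearn.svm import LinearSVC

X = np.loadtxt("train/X_train.txt")   # 7352 x 561 feature matrix
y = np.loadtxt("train/y_train.txt")   # activity label per time step

prev = np.roll(y, 1)
prev[0] = y[0]                        # no predecessor for the first step
X_aug = np.column_stack([X, prev])

# Caveats: at test time the previous *predicted* label must be fed in step by
# step, and subjects' sequences should not be chained across their boundaries.
clf = LinearSVC().fit(X_aug, y)
```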

Web quality assessment

• Description: Site-level classification of web sites by genre (editorial, news, commercial, educational, deep Web, Web spam, and more), as well as by their readability, authoritativeness, trustworthiness and neutrality. The data set consists of sample Web hosts from Europe. The training and testing samples are biased towards the interesting aspects and manually cleansed of mixed sites, Web hosting, and adult content. Features similar to those used to filter Web spam, based on content and linkage information, are provided at the host level, along with natural language processing annotations of a large set of sample pages.
• Size
• 4019 total examples; 1522 labeled (for English subset)
• multiple web based and NLP based variables
• 2GB compressed text
• References:
• Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. Train at least two classifiers to put the websites into one of the 6 classes and assess their neutrality, bias and trustworthiness. Due to the size of the data set we recommend focusing only on the English subset.
• Challenges: diversity of the provided explanatory variables; the graph nature of the data.

Short term movements in stock prices

• Description: Based on trading data showing stock price movements at five-minute intervals, together with sectoral data, economic data, experts' predictions and indices, predict short-term stock movements.
• Size
• The training dataset contains 5922 observations. The test dataset contains 2539 observations and follows chronologically from the training dataset.
• 609 explanatory variables
• 33 MB as uncompressed text
• References:
• Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. Train at least two classifiers to predict the probability of a stock price increase.
• Challenges: the complex nature of the stock market; the temporal character of the data.

Energy Usage in the Informatics Forum

• Description: Office buildings have relatively complex metering systems for electricity and heating usage. But just knowing how much energy is being used does not tell us which activities are using the energy. Knowing this would help managers of the building make policy decisions to reduce energy usage. This project involves prediction and visualization based on data about energy usage and about activities in our own building.
• Data: The data covers meter readings and activities in the Informatics Forum for 2012-2014. This includes electricity usage (collected every 10 minutes) and hot water (heating, per day), plus room booking and weather data. The data is available on AFS at /afs/inf.ed.ac.uk/group/ANC/IRDS/miniproject-data
• Task: There are several tasks that could be considered. Forecasting electricity or hot water demand would be one idea that is probably not too complex. A more interesting possibility would be unsupervised learning based on all the time streams: If room booking and weather data are included in the clustering process, how much of a difference does this make in the resulting clusters? Do they correspond more closely to events that explain energy usage?
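As an illustration of the unsupervised idea, the sketch below clusters daily electricity profiles with k-means. Loading the meter readings into a flat array is a placeholder step; with one reading every 10 minutes, a day is a 144-dimensional vector.

```python
# Sketch: cluster daily electricity profiles (144 ten-minute readings per day).
# The .npy loading step is a placeholder for the real meter files.
import numpy as np
from sklearn.cluster import KMeans

readings = np.load("electricity.npy")          # one value per 10 minutes (placeholder)
days = readings[: len(readings) // 144 * 144].reshape(-1, 144)
days = (days - days.mean(axis=1, keepdims=True)) / days.std(axis=1, keepdims=True)

labels = KMeans(n_clusters=4, n_init=10).fit_predict(days)
print(np.bincount(labels))    # e.g. weekday vs weekend vs holiday profiles
```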

Student performance on mathematical problems

• Description: This data set was used in the KDD Cup 2010. The challenge asks you to predict student performance on mathematical problems from logs of student interaction with Intelligent Tutoring Systems.
• Size
• 3,310 students with 9,426,966 steps
• 20 attributes
• 3 GB as uncompressed text
• References:
• Task: For the purpose of this class we recommend using only the smaller challenge data set: algebra_2008_2009. Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. Train at least two classifiers to predict the probability of Correct First Attempt for a given task. Evaluation is based on Root Mean Square Error (a baseline sketch follows the challenges below).
• Challenges:
1. The data matrix is sparse: not all students are given every problem, and some problems have only 1 or 2 students who completed each item. So, the contestants need to exploit relationships among problems to bring to bear enough data to hope to learn.
2. There is a strong temporal dimension to the data: students improve over the course of the school year, students must master some skills before moving on to others, and incorrect responses to some items lead to incorrect assumptions in other items. So, contestants must pay attention to temporal relationships as well as conceptual relationships among items.
3. Which problems a given student sees is determined in part by student choices or past success history: e.g., students only see remedial problems if they are having trouble with the non-remedial problems. So, contestants need to pay attention to causal relationships in order to avoid selection bias.
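Before training real classifiers, a per-student mean baseline gives a useful RMSE reference point. The sketch below holds out part of the training file (the official test labels were withheld); the column names are those used in the challenge files, but verify them against your copy.

```python
# Sketch: per-student mean baseline for Correct First Attempt, scored by RMSE.
# Column names are taken from the challenge files; check them in your copy.
import numpy as np
import pandas as pd

data = pd.read_csv("algebra_2008_2009_train.txt", sep="\t")
train = data.sample(frac=0.8, random_state=0)
val = data.drop(train.index)                       # our own held-out split

student_mean = train.groupby("Anon Student Id")["Correct First Attempt"].mean()
overall = train["Correct First Attempt"].mean()
pred = val["Anon Student Id"].map(student_mean).fillna(overall)  # unseen students

rmse = np.sqrt(((pred - val["Correct First Attempt"]) ** 2).mean())
print(rmse)
```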

Marketing: Predicting Customer Churn

• Task: This data was used in the KDD Cup 2009. The task is to predict customer behaviour, namely churn (whether they will switch providers), appetency (whether they will buy new products), and upselling (whether they may be willing to buy upgrades).
• Size: Number of data items: 50,000. Number of features: 15,000 (large data set), 230 (small data set). For this task we recommend that you start with the small data.
• Note: You may wish to focus on only one of the three prediction tasks (churn, appetency, or upselling). Alternatively, if you are ambitious, you may try predicting all three in one classifier.
• Challenges: Large number of features, data may be noisy, due to anonymity constraints you do not have a lot of information about what the features mean.
• References:
1. Proceedings of KDD-Cup 2009 competition
2. KDD-Cup 2009 competition web site (Note: post-challenge entries are still ranked on the leaderboard.)

Performance of Distributed Data Analysis: Map-Reduce, etc.

• Description: Statistical analysis of modern large data sets makes extensive use of embarrassing parallelism, by splitting the data set onto multiple machines. There are several easy-to-use open source frameworks for this type of computation, including Apache Hadoop, Apache Spark, Apache Storm, and others.

• Project: Set up a Hadoop cluster on a few DICE machines, running either Map-Reduce or Spark. Choose a few simple benchmarks, such as word count or logistic regression. Run some experiments to probe the performance of the system as a function of a number of parameters, e.g., the size of the data set, the number of cores, and the number of reduce steps required (e.g., logistic regression will require one map-reduce pass to compute each gradient). For what types of problems is distributed processing best? For what types of problems does Spark have an advantage over Map-Reduce? Another type of experiment you could run is to study how the running time is affected by contention for CPU and I/O from other processes on the worker machines. Is it possible to bring the overall cluster to a halt just through contention on a few machines?
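For orientation, the word-count benchmark in PySpark is only a few lines; this sketch assumes a working Spark installation and a text file at a placeholder path.

```python
# Sketch: the classic word-count benchmark in PySpark.
# "big_corpus.txt" is a placeholder path on a running Spark installation.
from operator import add
from pyspark import SparkContext

sc = SparkContext(appName="wordcount")
counts = (sc.textFile("big_corpus.txt")
            .flatMap(lambda line: line.split())   # one record per word
            .map(lambda word: (word, 1))
            .reduceByKey(add))                    # sum the counts per word
print(counts.takeOrdered(10, key=lambda kv: -kv[1]))   # ten most frequent words
sc.stop()
```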

Comparing Stochastic Optimization Algorithms

• Description: Optimization is a fundamental technology in data science, underlying many methods in statistics and machine learning. In this project you will study methods for solving an optimization problem of the form $$\min_x f(x) = \frac{1}{n}\left(f_1(x) + \ldots + f_n(x)\right),$$ where $x$ is a $d$-dimensional vector.

Problems of this type arise frequently in many domains of data science; one can for instance think of $n$ being the number of examples, $d$ the number of features and $x$ being a classifier and $f_i(x)$ being the loss of classifier $x$ on data example $i$. Usually it is assumed that the functions $f_i$ are smooth and convex.

For big data optimization problems, stochastic optimization methods, which avoid scanning all of the data at every iteration, are necessary. This can be done, for example, by randomly selecting one of the $f_i$ to consider at each iteration. In this project, you will pick 2 (or more if you wish) modern stochastic optimization methods, implement them (in any language; even MATLAB is fine for this project), and compare their performance in a realistic setting (a GD/SGD sketch in Python appears after the list below).

A baseline implementation will be provided if necessary. Peter Richtarik will be able to provide this and more details about the methods.

You can pick from the following list:

1. Gradient descent (GD): This is the only non-stochastic baseline method in the list. In one iteration, the gradient of $f$ is computed and $x$ is updated using this information. Hence, one needs to scan all the data in each iteration.
2. Stochastic gradient descent (SGD): In one iteration, one picks a random function $f_i$ and computes its gradient. Only this information is used to update $x$. Hence, one iteration is $n$ times cheaper than GD.
3. Coordinate descent (CD): In each iteration, CD picks a random coordinate from among the $d$ coordinates of $x$, and computes the relevant partial derivative of $f$. Only this information is used to construct the next $x$. Again, each iteration can be much cheaper than that of GD.
4. Parallel coordinate descent (PCD): As in 3, but one can pick a random subset of coordinates and update more of them in each iteration.
5. Stochastic dual coordinate ascent (SDCA): Here one solves the dual optimization problem of the problem above using CD — it turns out this is very efficient and one solves the original problem.
6. Semi-stochastic gradient descent (S2GD): This method in each iteration does work similar to SGD, but it has a built-in variance reduction strategy which makes it much more efficient (similar to SDCA).
7. Mini-batch SGD: Similar to SGD, but one picks more functions $f_i$ in each iteration.
8. Mini-batch SDCA: Similar to SDCA, but one updates more coordinates in each iteration.
9. Mini-batch S2GD: Similar to mini-batch SGD, but has built-in variance reduction property. This is a mini-batch version of S2GD.
10. Random search: This is a method that does not evaluate gradients, only function values. In each iteration, a random direction is chosen and $x$ is updated by moving in that direction. This method is useful when one does not have access to derivatives; say, when $f$ is the output of some simulation.
11. MISO: This is a method with similar benefits as 6.
12. SAGA algorithm: This is a method with similar benefits as 6.
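To illustrate the GD/SGD comparison (methods 1 and 2), here is a self-contained numpy sketch on a synthetic least-squares problem with $f_i(x) = \frac{1}{2}(a_i^\top x - b_i)^2$; the step sizes are hand-tuned for this instance, not general-purpose.

```python
# Sketch: gradient descent vs stochastic gradient descent on synthetic
# least squares, f(x) = (1/n) sum_i (a_i . x - b_i)^2 / 2.
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

# Gradient descent: one full pass over the data per update.
x = np.zeros(d)
for _ in range(50):
    x -= 0.1 * (A.T @ (A @ x - b)) / n
print("GD: ", f(x))

# SGD: each update touches a single f_i; 50 epochs for a comparable data budget.
x = np.zeros(d)
step = 0.01
for epoch in range(50):
    for i in rng.permutation(n):
        x -= step * (A[i] @ x - b[i]) * A[i]   # gradient of one f_i
    step *= 0.95                               # decaying step damps the noise
print("SGD:", f(x))
```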

Particle physics data set

• Description: This data set was used in the KDD Cup 2004 data mining competition. The training data is from high-energy collision experiments. There are 50 000 training examples, describing the measurements taken in experiments where two different types of particle were observed. Each training example has 78 numerical attributes.
• Size
• 50 000 training examples, 100 000 test examples
• 78 numerical attributes
• 147 MB as uncompressed text
• References:
• Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. Train at least two classifiers to distinguish between the two types of particle generated in high-energy collider experiments. The original competition asked participants to provide four separate sets of predictions, optimising separately for accuracy, area under the ROC curve, cross-entropy, and the q-score. Software to calculate these measures can be downloaded from the competition website (a sketch computing three of them follows).
• Challenges: No labels are given to the attributes to help interpret them. There is missing data for 8 of the attributes (with out-of-range values of 999 and 9999 used as placeholders).
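Three of the four measures are available directly in scikit-learn, as this sketch shows with placeholder predictions; the q-score needs the organisers' own software.

```python
# Sketch: three of the four competition measures via scikit-learn.
# y_true and probs are placeholders for your classifier's outputs.
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss

y_true = np.array([0, 1, 1, 0, 1])            # placeholder labels
probs = np.array([0.2, 0.8, 0.6, 0.3, 0.9])   # placeholder predicted probabilities

print("accuracy:", accuracy_score(y_true, probs > 0.5))
print("AUC:     ", roc_auc_score(y_true, probs))
print("log loss:", log_loss(y_true, probs))   # cross-entropy
```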

Brain-Computer Interface data set

• Description: This data set was used in the BCI Competition III (dataset V). Using a cap with 32 integrated electrodes, EEG data were collected from three subjects while they performed three activities: imagining moving their left hand, imagining moving their right hand, and thinking of words beginning with the same letter. As well as the raw EEG signals, the data set provides precomputed features obtained by spatially filtering these signals and calculating the power spectral density.
• Size:
• 31216 records in training data, 10464 in test data
• Each record has 96 continuous values and a numerical label
• 63 MB as uncompressed text
• Reference:
• Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. Train at least two different classifiers to assign class labels to the test data to indicate which activity the subject was performing while the data were collected.
• Challenges: This data set represents time series of EEG readings. A baseline approach could be based on the given precomputed features. It might also be possible to train a classifier on a window of some size around each time step. Both of these approaches ignore the fact that the data is really a time series; one might consider using an explicit time-series model such as a Hidden Markov Model.

Prediction of Gene/Protein Localization data set

• Description: This dataset was used in the KDD Cup 2001 data mining competition. There were in fact two tasks in the competition with this dataset: prediction of the "Function" attribute, and prediction of the "Localization" attribute. Here we focus on the latter (this is somewhat easier, as genes can have many functions but only one localization, at least in this dataset). The dataset provides a variety of details about several genes of one particular organism. The main dataset (the downloadable files are Genes_relation.{data,test}) contains rows of the following form:

Gene ID, Essential, Class, Complex, Phenotype, Motif, Chromosome Number, Function, Localization.

The first attribute is a discrete variable identifying the gene (there are 1243 gene values). The remaining 8 attributes are also discrete variables, most of them related to the proteins coded by the gene; e.g., the "Function" attribute describes some crucial functions the respective protein is involved in, and "Localization" is simply the part of the cell where the protein is localized. In addition to the data of the above form, there are also data files (Interactions.relations.{data,test}) which contain information about interactions between pairs of genes.
• Size
• Gene_relation files: 6275 examples (4346 training, 1929 test), 9 categorical attributes.
• Interaction_relation files: 1806 records, 2 attributes (one categorical; one numerical)
• 1 MB
• References:
• Task: Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. The task in this dataset is to predict the attribute "Localization". Compare at least 2 different classifiers. Another possibility is to compare performance with and without the use of the interactions data. One possible classifier that handles missing data easily (but does not use the interaction data) could be a belief network that has learned relationships between the Essential, Class, Complex, Phenotype, Motif, Chromosome Number, and Localization attributes.
• Challenges: This dataset is a great challenge. From a data mining point of view the important challenge is to find a way to efficiently use the Interaction_relation data files, which is not obvious. Another issue is the high proportion of missing values in the Genes_relation data.

Prediction of Molecular Bioactivity for Drug Design: Binding to Thrombin dataset

• Description: This dataset was used in the KDD Cup 2001 data mining competition. It was produced by DuPont Pharmaceuticals Research Laboratories and concerns drug design. Drugs are typically small organic molecules. The first step in the discovery of a new drug is usually to identify and isolate the receptor to which it should bind (in this case the thrombin site), followed by testing many small molecules for their ability to bind to the target site. Molecules that can bind the site are "active", while the others remain "inactive". It would be interesting to learn how to separate active from inactive molecules. This dataset provides data for these two classes of molecules (active and inactive).
• Size:
• 2545 data points: 1909 for training, 636 for testing
• 139,351 binary attributes, 2 classes
• 694 MB
• References:
• Task: Carefully read all the information given on the KDD Cup 2001 competition site about this data. Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. The task is to learn a classifier using the training set that predicts the behaviour of a drug (active or inactive). Note that the number of attributes is much larger than the number of training examples, so an efficient classifier should use feature reduction. Train and compare at least two classifiers. You can check your answers on the test set by looking at the corresponding separate file, which can be downloaded from the KDD Cup 2001 site.
• Challenges: This is a difficult data set. Firstly, there is a great imbalance between the two classes: only 42 of the 1909 training examples belong to the active class. The larger data mining challenge, however, concerns the huge number of binary attributes (139,351). Selecting "good" features will be the most important part of developing a good classifier (a feature-scoring sketch follows).
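A common way to attack the 139,351 binary attributes is univariate feature scoring inside a pipeline; the sketch below uses a random sparse stand-in for the real matrix, with the class balance of the training set.

```python
# Sketch: chi-squared feature scoring to shrink the binary attribute space
# before fitting a classifier. Random sparse data stands in for the real matrix.
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

X = sparse_random(1909, 139351, density=0.001, format="csr", random_state=0)
X.data[:] = 1.0                                # binary attributes
y = np.array([1] * 42 + [0] * 1867)            # 42 actives, as in the training set

model = make_pipeline(SelectKBest(chi2, k=500), LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean())
```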

The 4 Universities dataset

• Description: This data set contains WWW pages collected from computer science departments of various universities in January 1997 by the World Wide Knowledge Base (WebKB) project of the CMU text learning group. The 8,282 pages were manually classified into 7 classes: 1) student, 2) faculty, 3) staff, 4) department, 5) course, 6) project and 7) other. For each class the data set contains pages from four universities (Cornell, Texas, Washington and Wisconsin) plus 4,120 miscellaneous pages from other universities. The files are organized into a directory structure, one directory for each class. Each of these seven directories contains 5 subdirectories, one for each of the 4 universities and one for the miscellaneous pages. These directories in turn contain the Web pages.
• Size:
• 8,282 web pages, 7 classes
• 60.8 MB
• References:
• Task: Prepare the data for mining and perform an exploratory data analysis (these steps will probably not be independent). The data mining task is to classify the texts according to the 7 classes. You should compare at least 2 different classifiers. Since each university's web pages have their own idiosyncrasies, it is not recommended to do training and testing on pages from the same university. We recommend training on three of the universities plus the misc collection, and testing on the pages from a fourth, held-out university (four-fold cross validation). An additional topic might be to look at labelled/unlabelled data, as in the reference.
• Challenges: An important challenge from the web mining point of view will be preprocessing the dataset. Since the data are HTML files, you have to remove all the irrelevant markup, such as HTML tags, and convert the rest of the text into a bag-of-words format (a preprocessing sketch follows). See the help on the 4 Universities Data Set web page about doing this with rainbow.
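A crude but serviceable preprocessing sketch: strip tags with a regular expression and let CountVectorizer do the rest. The directory layout and the three-class subset are assumptions; rainbow or a proper HTML parser would do a more careful job.

```python
# Sketch: strip HTML tags and build a bag-of-words matrix.
# The "webkb/<class>" directory layout is an assumption about how you unpack the data.
import re
from pathlib import Path
from sklearn.feature_extraction.text import CountVectorizer

def read_page(path):
    text = Path(path).read_text(errors="ignore")
    return re.sub(r"<[^>]+>", " ", text)       # crude tag removal

docs, labels = [], []
for cls in ["student", "faculty", "course"]:   # extend to all seven classes
    for path in Path("webkb", cls).rglob("*"):
        if path.is_file():
            docs.append(read_page(path))
            labels.append(cls)

X = CountVectorizer(min_df=5, stop_words="english").fit_transform(docs)
print(X.shape)
```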

The Internet advertisements dataset

• Size:
• 1558 attributes (3 continuous, the rest binary)
• 10 MB
• References:
• Task: Prepare the data for mining and perform an exploratory data analysis. The data mining task is to predict whether an image is an advertisement ("ad") or not ("nonad"). As you are not given an explicit training/test split you need to decide on a reasonable way of assessing performance. You should perform feature reduction in order to significantly reduce the number of features. Consider at least two different classifiers.
• Challenges: There is an imbalance in the number of data per class. Also, the number of attributes is very high compared to the size of the dataset, which suggests that efficient feature reduction is very important. One or more of the three continuous features are missing in 28% of the data.

The Reuters-21578 text dataset

• Description: This is a very often used test set for text categorisation tasks. It contains 21578 Reuters news documents from 1987. They were labeled manually by Reuters personnel. Labels belong to 5 different category classes, such as 'people', 'places' and 'topics'. The total number of categories is 672, but many of them occur only very rarely. Some documents belong to many different categories, others to only one, and some have no category. Over the past decade, there have been many efforts to clean the database up and improve it for use in scientific research. The present format is divided into 22 files of 1000 documents delimited by SGML tags (here is one of these files as an example). Extensive information on the structure and the contents of the dataset can be found in the README file. In the past, this dataset has been split up into training and test data in many different ways. You should use the 'Modified Apte' split as described in the README file.
• Size:
• 21578 documents; according to the 'ModApte' split: 9603 training docs, 3299 test docs and 8676 unused docs.
• 27 MB
• References: This is a popular dataset for text mining experiments. The aim is usually to predict to which categories of the 'topics' category class a text belongs. Different splits into training, test and unused data have been considered. Previous use of the Reuters dataset includes:
• Task: Carefully read the README file provided by Lewis to get an idea what the data are about. Select the documents as specified in the description of the 'Modified Apte' split. Prepare the data for mining and perform an exploratory data analysis (these steps will probably not be independent). The data mining task is to classify the texts according to the categories in the 'topics' field. You should compare at least 2 different classifiers. An extra task could be document clustering.
• Challenges: An important challenge will be the preprocessing of the dataset. The file is delimited by SGML tags, and the text is just plain text format. For any text mining task, this will have to be converted into bag-of-words format. Apart from this, you will have to deal with texts that belong to a varying number of categories. Most classification programs can only take one category per case.

The charitable donations dataset

• Description: This dataset was used in the 1998 KDD Cup data mining competition. It was collected by PVA, a non-profit organisation which provides programs and services for US veterans with spinal cord injuries or disease. They raise money via direct mailing campaigns. The organisation is interested in lapsed donors: people who have stopped donating for at least 12 months. The available dataset contains a record for every donor who received the 1997 mailing and did not make a donation in the 12 months before that. For each of them, it is recorded whether and how much they donated in response. Apart from that, data are given about the previous and the current mailing campaign, as well as personal information and the giving history of each lapsed donor. Overlay demographics were also added. See the documentation and the data dictionary for more information.
• Size:
• 191779 records: 95412 training cases and 96367 test cases
• 481 attributes
• 236.2 MB: 117.2 MB training data and 119 MB test data
• References:
• Task: Carefully read the information available about the dataset. Perform exploratory data analysis to get a good feel for the data and prepare the data for data mining. It will be important to do good feature and case selection to reduce the data dimensionality. The data mining task is in the first place to classify people as donors or not. Try at least 2 different classifiers, such as logistic regression or Naive Bayes. As an extra, you can go on to predict the amount someone is going to give. A good way of going about this is described in Zadrozny and Elkan's paper. The success of a solution can then be assessed by calculating the profit of a mailing campaign targeting all the test individuals who are predicted to give more than the cost of sending the mail. The profit when targeting the entire test set is $10,560.
• Challenges: This is definitely not an easy dataset. To start with, some of the attributes have quite a lot of missing values, and there are some records with formatting errors. An important issue is feature selection. There are far too many features, and it will be necessary to select the most relevant ones, or to construct your own features by combining existing ones (the KDD Cup winners claim that the secret of their success lies in good feature selection). Case selection will also be important: the training set is huge (95,412 cases), but contains only 5% positive examples. Finally, building a useful model for this dataset is made more difficult by the fact that there is an inverse relationship between the probability to donate and the amount donated.

The caravan insurance data

• Description: This dataset was used for the CoIL 2000 data mining competition. It contains customer data for an insurance company. The feature of interest is whether or not a customer buys a caravan insurance policy. For each possible customer, 86 attributes are given: 43 socio-demographic variables derived from the customer's ZIP area code, and 43 variables about ownership of other insurance policies.
• Size:
• 9822 records: 5822 training records and 4000 test records
• 86 attributes
• 1.7 MB
• References:
• Task: The data mining task is to predict whether someone will buy a caravan insurance policy. You should first do some exploratory data analysis. Visualising the data should give you some insight into certain particularities of this dataset. Then prepare the data for data mining. It will be important to select the right features, and to construct new features from existing ones, as is described in the paper of the prediction competition winner. Try out at least 2 different data mining algorithms, and compare the use of mere feature selection with intelligent feature construction. As an extra, you could try to do the second task laid out in the Coil competition: to derive information about the profile of a typical caravan insurance buyer.
• Challenges: As with the KDD Cup data, feature selection and extraction will be very important. This can only be done properly after you have spent a considerable amount of time getting to know the data. Also as with the KDD Cup data, the data are unbalanced: only 5 to 6% of the customers in the training data set actually buy the insurance policy. There are no missing or noisy data.

The yeast S. cerevisiae gene expression vectors

• Description: These are the data from the paper Support Vector Machine Classification of Microarray Gene Expression Data. For 2467 genes, gene expression levels were measured in 79 different situations (here is the raw data set). Some of the measurements follow each other in time, but in the paper they were not treated as time series (although to a certain extent that would be possible). For each of these genes, it is given whether they belong to one of 6 functional classes (class labels on-line). The paper is concerned with classifying genes into 5 of these classes (one class is unpredictable). The data contain many genes that belong to other functional classes than these 5, but those are not discernible on the basis of their gene expression levels alone.
• Size:
• 2467 genes
• 79 measurements, 6 class labels
• 1.8 MB: 1.7 MB measurement data and 125 KB labels
• References:
• Support Vector Machine Classification of Microarray Gene Expression Data (1999) by M. P. S. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. Sugnet, T. S. Furey, M. Ares Jr. and D. Haussler (local copy): This is the original paper from which the data were obtained. It uses SVMs to classify the genes, and compares this to other methods like decision trees. A good description of difficulties with the data can also be found here.
• Cluster analysis and display of genome-wide expression patterns (1998) by M. B. Eisen, P. T. Spellman, P. O. Brown and D. Botstein: This paper describes clustering of genes. Its results showed that the 5 different classes Brown et al. are trying to predict more or less cluster together, indicating that these classes are discernible based on gene expression levels. This was the basis for the selection of these 5 functional classes for the SVM classification task.
• Task: Read the data descriptions in the SVM paper and do exploratory data analysis to understand the characteristics of this dataset. The data mining task is to predict whether a gene belongs to one of the 5 functional classes, based on its expression levels. Try at least two different classification algorithms. The low frequency of the smallest classes will probably pose specific problems. You can also do clustering as performed by Eisen et al.
• Challenges: This dataset is quite noisy and contains a rather high number of missing values. Furthermore, it is very unbalanced: there are only a few positive examples of each of the 5 classes, and most of the genes don't belong to any of them. Finally, there are some genes that belong to a certain class but have different expression levels, and there are genes that don't belong to the class whose expression patterns they share. These cases will unavoidably lead to false negatives and false positives. An overview of these difficult cases can be found in the SVM classification paper.

The colon cancer data

• Description: This dataset is similar to the yeast gene expression dataset: it contains expression levels of 2000 genes taken in 62 different samples. For each sample it is indicated whether it came from a tumor biopsy or not. Numbers and descriptions for the different genes are also given. This dataset is used in many different research papers on gene expression data. It can be used in two ways: you can treat the 62 samples as records in a high-dimensional space, or you can treat the genes as records with 62 attributes.
• Size:
• 2000 genes
• 62 samples
• 1.9 MB data, 529 KB names, 207 bytes labels
• References:
• Tissue Classification with Gene Expression Profiles (2000) by A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer and Z. Yakhini: This paper describes classification of tissues on the colon cancer and the leukemia (see below) datasets. It also describes how gene selection can be done.
• Coupled Two-Way Clustering Analysis of Gene Microarray Data (2001) by G. Getz, E. Levine and E. Domany: This paper exploits the fact that the gene expression dataset can be viewed in two ways. The authors describe a way of alternating between clustering in the gene domain and in the sample domain. This method should give insight into which genes are defining for sample classifications (and possibly vice versa).
• Gene expression data analysis (2000) by A. Brazma and J. Vilo: An overview of the research in the new domain of microarray data analysis. Much of the work described here makes use of the colon cancer and/or the leukemia dataset.
• Task: First perform exploratory data analysis to get familiar with the data and prepare them for mining. The data mining task is to classify samples as cancerous or not. Compare at least two different classification algorithms. You will have to deal with issues arising from the fact that there are many attributes and only a small number of samples. Some classifiers will be more robust to this than others. Some ideas about how to deal with this can be found in the papers referred to above (and the feature selection paper referenced below). As an extra you can perform clustering, in the two different domains (genes and samples). The tissue classification paper describes a way of using clustering for classification: the parameters of the unsupervised learning procedure are defined in a supervised way to make the clusters correspond to classes.
• Challenges: The data are quite noisy, due to sample contamination. The real challenge, however, is the shape of the data matrix. When the genes are treated as attributes, the dimensionality of the feature space is very high compared to the number of cases. It will be important to avoid overfitting. Use simple classifiers, or select the most predictive genes. Also, the number of cases is very low, which means that splitting into a training and a test set is not really a good option (although it has been done for the very similar leukemia dataset, as described in the gene expression analysis overview paper and in the feature selection paper referenced below). When combining feature selection with cross-validation, be careful not to use the classifier's test data during the feature selection phase (a pipeline sketch illustrating this follows).
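The sketch below demonstrates that caveat on random labels, where the true accuracy is 50%: putting SelectKBest inside a Pipeline means the selector is re-fitted on each training fold only, never on the fold used for testing.

```python
# Sketch: feature selection nested correctly inside cross-validation.
# Random data with the colon-cancer shape (62 samples x 2000 genes) stands in.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((62, 2000))            # stand-in for the expression matrix
y = rng.integers(0, 2, 62)                     # random labels: true accuracy is 50%

model = make_pipeline(SelectKBest(f_classif, k=50), LinearSVC())
print(cross_val_score(model, X, y, cv=5).mean())   # stays near 0.5, as it should

# Selecting features on all 62 samples first would leak information and report
# wildly optimistic accuracy on these random labels.
```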

The leukemia data set

• Description: The leukemia data set contains expression levels of 7129 genes taken over 72 samples. Labels indicate which of two variants of leukemia is present in the sample (AML, 25 samples, or ALL, 47 samples). This dataset is of the same type as the colon cancer dataset and can therefore be used for the same kind of experiments. In fact, most of the papers that use the colon cancer data also use the leukemia data.
• Size:
• 72 samples, split into 38 training and 34 test samples
• 7129 genes
• 3.8 MB
• References:
• All of the references mentioned above for the colon cancer dataset also use the leukemia data.
• Feature selection for high-dimensional genomic microarray data (2001) by E. P. Xing, M. I. Jordan and R. M. Karp: They describe a three-phase feature selection method to identify the most predictive genes. They use the division into 38 training and 34 test samples. They find that feature selection works better than regularization.
• Task: The task is the same as for the colon cancer data. First perform exploratory data analysis and prepare the data for mining. Then compare at least two different classifiers to identify the kind of leukemia of the sample. Again you will have to deal with problems of high feature dimensionality. You can choose to use the training-test set division the data are presented in, or you can use techniques like cross-validation, as described in the tissue classification paper. Also here, as an extra you can perform clustering in the two different data spaces.
• Challenges: The same comments as for the colon cancer dataset can be made: the data are noisy, and the most important challenge is the unusual shape of the data matrix.

The human splice site data

• Description: This dataset contains sequences of human DNA material around splice sites. Gene DNA sequence data contain coding regions (exons) and non-coding regions (introns). A splice site is the general term for the point at the beginning (donor site) or at the end (acceptor site) of an intron. Donor and acceptor sites typically correspond to certain patterns, but the same patterns can also be found in other places in the DNA sequence, so it is important to learn better classifiers to identify real splice sites. In the past, people have used probability matrices which encode the probability of having a certain nucleotide in a certain position. A disadvantage of this method is that dependencies between positions are not taken into account. Other methods have tried to solve this by building a conditional probability matrix, for example, or by using neural networks. To get the best results, many methods use not only base positions but also other features, like the presence of certain combinations of nucleotides. Most recently, people have turned to probabilistic models that model the whole gene structure at once; prediction of splice sites is then helped by the detection of coding and non-coding areas around it (see for example Prediction of complete gene structures in human genomic DNA (1997) by Burge and Karlin). Some information on the problem of gene finding can be found on-line. Information about existing methods can be found in Fickett's overview paper. The dataset presented here contains windows of fixed size around true and false donor sites and true and false acceptor sites.
• Size: This dataset is divided along three binary dimensions: acceptor (a) versus donor (d) sites, training (t) versus test (e) data, and true (t) versus false (f) examples.
• 13123 cases, divided as follows: a-e-f: 881 / d-e-f: 782 / a-e-t: 208 / d-e-t: 208 / a-t-f: 4672 / d-t-f: 4140 / a-t-t: 1116 / d-t-t: 1116
• Window length:
donor data: 15 base positions
acceptor data: 90 base positions
• 198 KB, divided as follows: a-e-f: 16k / d-e-f: 7k / a-e-t: 7k / d-e-t: 2k / a-t-f: 82k / d-t-f: 36k / a-t-t: 37k / d-t-t: 11k
• References:
• Task: Perform exploratory data analysis and prepare the data for mining. Develop a classifier for donor sites and one for acceptor sites. Compare at least 2 different classifiers for each. As an extra, you can try to run your classifiers on the Burset and Guigo DNA sequence dataset. This dataset contains full gene DNA sequences, together with indications of where coding regions start and stop.
• Challenges: The data are well prepared, so building a predictor should be quite straightforward. The best existing predictors use other features than just nucleotide positions; maybe it is possible to detect and use some of these features to improve the classifier. When testing the classifiers on Burset and Guigo's DNA dataset, you will need to make some adaptations.

Volcanoes on Venus

• Description: This dataset contains images collected by the Magellan mission to Venus. Venus is the planet most similar to the Earth in size, and therefore researchers want to learn about its geology. Venus' surface is scattered with volcanoes, and the aim of this dataset was to develop a program that can automatically identify them (from training data labeled by human experts from 1, a volcano with 98% probability, to 4, a volcano with 50% probability). A tool called JARtool was developed to do this. The makers of this tool made the data publicly available to allow more research and to establish a benchmark in this field. They provide in total 134 images of 1024*1024 8-bit pixels (out of the 30000 images of the original project). The dataset you will use is a preprocessed version of these images: possibly interesting 15*15 pixel frames ('chips') were taken from the images by the image recognition program of JARtool, and each was labeled from 0 (not labeled by the human experts, so definitely not a volcano) through 1 (98% certainty a volcano) to 4 (50% certainty according to the human experts). More information can be found in the data documentation.
• Size: The image chips are spread over groups, according to experiments carried out for the JARtool software. The training and test sets for experiments C1 and D4 together cover all chips (see the experiments table):
• Records: 37280 image chips, divided as follows: C1_trn: 12018 / C1_tst: 16608 / D4_trn: 6398 / D4_tst: 2256
• Features: 15 * 15 pixels
• 8.4 MB
• References:
• These data were used for the development of JARtool, a software system that learns to recognize volcanoes in images from Venus. The technical details about this tool are described in the paper Learning to Recognize Volcanoes on Venus (1998) by M. C. Burl, L. Asker, P. Smyth, U. Fayyad, P. Perona, L. Crumpler and J. Aubele. This paper should give you a good example of how data mining can be performed on this dataset (you can ignore the part about Focus of Attention, because that has already been done for you).
• Task: Perform exploratory data analysis. Prepare the data for data mining. Feature space reduction will be necessary, because the number of features is very high compared to the number of positive volcano examples. Then build at least two classifiers to detect volcanoes: implement the basic classifier from Burl et al.'s paper, and at least one other. You can follow Burl et al.'s paper, where classes 1 up to 4 are considered positive examples. As an extra, you can try to perform clustering to find the different types of volcanoes mentioned in Burl et al.'s paper.
• Challenges: It will be necessary to normalise the pixel frames, as there is a difference in brightness between the different images and even between different parts of the same image. Also, feature extraction will be necessary, because there are quite a lot of pixels per frame. This is especially a problem because the dataset is highly unbalanced: the number of positive examples is very low. Finally, the volcanoes are of different kinds, and it is difficult to build one classifier for all of them together.

Network intrusion data

• Description: These data were used for the 1999 kdd cup. They were gathered by Lincoln Labs: nine weeks of raw TCP dump data were collected from a typical US air force LAN. During the use of the LAN, several attacks were performed on it. The raw packet data were then aggregated into connection data. Per record, extra features were derived, based on domain knowledge about network attacks. There are 38 different attack types, belonging to 4 main categories. Some attack types appear only in the test data, and the frequency of attack types in test and training data is not the same (to make it more realistic). More information about the data can be found in the task file, and in the overview of the KDDcup results. On that page, it is also indicated that there is a cost matrix associated with misclassifications. The winner of the KDDcup99 competition used C5 decision trees in combination with boosting and bagging.
• Size:
• 8,050,290 records, divided as follows: 4,940,000 training records and 3,110,290 test records. A 10% sample is available for both.
• 41 attributes and 1 label
• 1,173 MB: 743 MB training data and 430 MB test data
• References:
• Task: Perform exploratory data analysis and prepare the data for mining. The data mining task is to classify connections as legitimate or belonging to one of the 4 fraud categories. The misclassification costs should be taken into account. Compare at least two different classification algorithms.
• Challenges: The amount of data preprocessing needed is quite limited. You will need data reduction to deal with the sheer size of the dataset. The major difficulty, however, is probably the class distribution: while the DoS attack type appears in 79% of the connections, the u2r attack type appears in only 0.01 percent of the records. And this least frequent attack type is at the same time the most difficult to predict and the most costly to miss.

The SuperCOSMOS Sky Survey objects catalogue

• Description: The SuperCOSMOS Sky Survey programme is carried out at the University of Edinburgh. The project used the SuperCOSMOS machine, a high-precision plate scanning facility, to scan in the Schmidt photographic atlas material. This has produced a digitised survey of the entire sky in three colours (B, R and I), with one colour (R) at two epochs. From these digital images, objects have been extracted and an objects catalogue has been composed. For each object, useful astronomical characteristics have been registered, such as size, brightness and position. A project was then carried out to classify the objects as stars or galaxies. External labelling to evaluate the classification algorithm was obtained from the more precise data of the Sloan Digital Sky Survey.
• Size:
• There are 4 object sets: one each for B and I, and two for R (one set from plates taken in the 1950s and one more recent set). Each of these is divided into a set of paired objects (for which a corresponding SDSS object was found) and a set of unpaired ones:
• B-paired: 34663 / B-unpaired: 68987
• R-paired (recent): 26791 / R-unpaired: 54920
• I-paired: 15645 / I-unpaired: 41596
• R-paired (1950s): 15834 / R-unpaired: 34426
• Paired datasets have 40 attributes (including some from SDSS), unpaired 34.
• The size of the datasets is as follows:
• B-paired: 16.4MB / B-unpaired: 23.5MB
• R-paired (recent): 12.6MB / R-unpaired: 18.7MB
• I-paired: 14MB / I-unpaired: 7.3MB
• R-paired (1950s): 7.4MB / R-unpaired: 11.7MB
• References:
• Task: First read the information in the README file and in paper II (and the paper by Weir) referenced above. Then perform exploratory data analysis and prepare the data for data mining. You can concentrate on one of the paired datasets. Classify sky objects as stars or galaxies, using the SDSS classification as the label. Compare at least two different classification algorithms. Test the effect of excluding/including fields 19 and 31, which contain the classifications produced by the SSS team. Also, evaluate performance as a function of magnitude, as was done in paper II (a starting-point sketch follows this list).
• Challenges: These are astronomical data, and all the documentation is written in 'astronomical language', so it can be quite difficult to understand what the data are about and how the previous research was carried out. Furthermore, the dataset is quite big, so case reduction might be necessary.
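
As a starting-point sketch, assuming one paired set has been converted to CSV. The file and column names here ("b_paired.csv", "sdss_class", "field19", "field31", "magnitude") are hypothetical placeholders; substitute the real ones from the README.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("b_paired.csv")   # hypothetical file name for one paired set
    y = df["sdss_class"]               # hypothetical name of the SDSS label column
    # Exclude fields 19 and 31 (the SSS team's own classifications) to start with.
    X = df.drop(columns=["sdss_class", "field19", "field31"])

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)

    # Evaluate as a function of magnitude, as in paper II: bin the test
    # objects by magnitude and report accuracy per bin.
    correct = (clf.predict(X_te) == y_te)
    print(correct.groupby(pd.cut(X_te["magnitude"], bins=10)).mean())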

Less interesting datasets

You are allowed to propose your own dataset for this project. To guide your search, we present here some examples of datasets which we considered less interesting.

The Landsat image data from Statlog

• Description: This dataset contains pixel information for an 82×100-pixel part of an image of the Earth's surface taken from a satellite. For each pixel, 4 frequency values (each between 0 and 255) are given, because the photographs are taken in 4 different spectral bands. Each pixel represents an area of 80×80 metres on the Earth's surface. Every line in the dataset gives the 4 values for 9 pixels (a 3×3 neighbourhood frame around a central pixel). The aim is to classify the central pixel (a piece of land) into 1 of 6 classes based on its values and those of its immediate neighbours. More information is given in the documentation file.
• Objections: The data have been perfectly preprocessed, and the classes are quite well balanced. This dataset is not very challenging, as very good results can be obtained very easily, as the sketch below illustrates.
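
For illustration, here is such an easy baseline, assuming the usual Statlog file layout (36 space-separated attribute values followed by the class label, in a file named sat.trn — verify against the documentation file).

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    data = np.loadtxt("sat.trn")   # 36 attributes + class label per line
    X, y = data[:, :-1], data[:, -1]
    # A plain k-NN on the raw pixel values already performs well here.
    print(cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean())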

The OHSUMED document collection

• Description: This is the dataset used for the filtering track in TREC-9, the 2000 Text Retrieval Conference. It contains 348,566 texts to be searched and classified. These texts are records of medical articles containing fields for the author, the title, the source, the publication type, a number of human-assigned terms, and in about two thirds of the cases also the abstract. Full texts are not available. In the TREC filtering task, the program gets a user profile (a query) and a sample of texts that match this profile. The aim is to search the massive text database to find more texts that match the profile. Solutions to the filtering problem presented in TREC-9 include, for example, a kNN method and an adaptive term and threshold selection method.
• Objections: This dataset is too hard, mainly because of its sheer magnitude. The paper An Evaluation of Statistical Approaches to Text Categorization (1997) examined the use of different text classification methods on several text datasets, one of which was the OHSUMED set, and pointed out that it is much harder than, for example, the Reuters set. (The sketch below shows the basic filtering mechanics; scaling them to the full collection is the real difficulty.)
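
A minimal sketch of the filtering mechanics, with toy strings standing in for the 348,566 records and for the sample of texts known to match the profile:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Toy stand-ins: replace with the real records and the profile's sample.
    docs = ["myocardial infarction treatment", "renal dialysis outcome",
            "aspirin in acute myocardial infarction"]
    relevant_sample = ["treatment of acute myocardial infarction"]

    vec = TfidfVectorizer(stop_words="english")
    D = vec.fit_transform(docs)
    profile = np.asarray(vec.transform(relevant_sample).mean(axis=0))

    # Rank the whole collection by cosine similarity to the profile; the top
    # of the ranking is the candidate set of further matching texts.
    scores = cosine_similarity(D, profile).ravel()
    print(scores.argsort()[::-1])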

The predictive toxicology dataset

• Description: The PKDD conferences organize discovery challenges, not as competitions, but with the aim that different researchers work together to find solutions for certain KDD problems. One of the tasks in the 2001 challenge used a dataset of chemical structures. For each structure, it was indicated whether or not the substance caused cancer in mice or rats. The aim was to create a predictor of toxicity for substances based on their chemical structure.
• Objections: This is in fact a very difficult task. Chemical structures come in an enormous variety of shapes and sizes; they are a perfect example of data that do not fit the classical attribute-value format, which means that traditional data mining techniques cannot be applied directly. The dataset website links to solution papers, all of which appear to derive attribute-value representations from the chemical structures using domain background knowledge (the sketch after this list illustrates that idea). This is definitely an interesting dataset, but certainly too difficult for a mini-project.
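
To make the idea of deriving an attribute-value representation concrete, here is a sketch using molecular fingerprints from the RDKit library. It assumes the structures are available as SMILES strings, which is not necessarily the format of the challenge data; the two toy molecules are illustrative only.

    import numpy as np
    from rdkit import Chem
    from rdkit.Chem import AllChem

    smiles = ["CCO", "c1ccccc1"]   # toy molecules: ethanol and benzene
    mols = [Chem.MolFromSmiles(s) for s in smiles]

    # Each molecule becomes a fixed-length bit vector: a classical
    # attribute-value representation that standard classifiers can consume.
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius=2, nBits=1024)
           for m in mols]
    X = np.array(fps)
    print(X.shape)   # (2, 1024)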

20 Newsgroups dataset

• Description: This is a well-known data set for text classification, used mainly for training classifiers using both labelled and unlabelled data. The data set is a collection of 20,000 messages, collected from UseNet postings over a period of several months in 1993. The data are divided almost evenly among 20 different UseNet discussion groups. Many of the categories fall into overlapping topics; for example, 5 of them are computer-related discussion groups and 3 of them discuss religion. Other topics include politics, sports, science and miscellaneous.
• Objections: This dataset is too well known and is in fact used as the example dataset in the rainbow software documentation. As the sketch below shows, a strong baseline takes only a few lines with standard tools.
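
For example, scikit-learn even ships a loader for this dataset, and a bag-of-words Naive Bayes baseline is only a few lines (the data are downloaded on first use):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train = fetch_20newsgroups(subset="train")
    test = fetch_20newsgroups(subset="test")

    clf = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
    clf.fit(train.data, train.target)
    print(clf.score(test.data, test.target))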

Yeast Gene Regulation Prediction dataset

• Description: This dataset was used in the 2002 KDD Cup data mining competition. The data describe the activity of some (hidden) biological system in yeast cells. Specifically, a set of yeast strains has been generated, each characterized by a single gene being knocked out (i.e. disabled). Thus each example in the data set corresponds to a single knocked-out gene, which is labelled with a discrete measurement of how active the hidden system in the cell is when this gene is knocked out. There are three labels:
• "nc": This label indicates that the activity of the hidden system (i.e. yeast strain) was not significantly different than the baseline.
• "control": Indicates that the activity of the hidden system was significantly different than the baseline for the given instance, but that the activity of another hidden system (the control) was also significantly changed versus its baseline.
• "change": Describes examples which the activity of the hidden system was significantly changed, but the activity of the control system was not significantly changed.
A variety of other information accompanies the above data. These data sources include categorical features describing gene localization and function, abstracts from the scientific literature (MEDLINE), and a table of protein-protein interactions that relate the products of pairs of genes. A local copy of a file describing the whole database is available here.
• Objections: Too complex/challenging for a mini-project. There are three sources of data to take into account (including carrying out information extraction from the scientific abstracts). One of these sources is information on protein-protein interactions, which would have to be coded in an appropriate way for machine learning. Also, the classes are very imbalanced.

CATS benchmark

• Description: The data set consists of a time series 5,000 time steps long. 5 blocks of 20 values are missing from the training data (elements 981–1000, 1981–2000, 2981–3000, 3981–4000, 4981–5000). The task is to predict the 100 missing values; solutions are evaluated by the mean square error against the true values (a sketch of the setup with a simple baseline follows this list).
• Objections: This is an artificial data set, not a genuine data-mining problem.
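
A sketch of the setup and evaluation, using a toy random-walk series in place of the real one. Note that the last missing block sits at the very end of the series, so it has no right-hand boundary to interpolate towards.

    import numpy as np

    rng = np.random.default_rng(0)
    series = np.cumsum(rng.standard_normal(5000))   # toy stand-in for the CATS series
    blocks = [slice(980, 1000), slice(1980, 2000), slice(2980, 3000),
              slice(3980, 4000), slice(4980, 5000)] # the 5 missing blocks (0-indexed)

    truth, preds = [], []
    for b in blocks:
        truth.append(series[b])
        # Baseline: linear interpolation between the gap's boundary values.
        left = series[b.start - 1]
        right = series[b.stop] if b.stop < len(series) else left  # last block: no right edge
        preds.append(np.linspace(left, right, b.stop - b.start + 2)[1:-1])

    mse = np.mean((np.concatenate(preds) - np.concatenate(truth)) ** 2)
    print(mse)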

This page was originally written by Frederick Ducatelle for the Data Mining and Exploration course. This IRDS version is maintained by Charles Sutton.
