Data sets

This page contains links to mineable data sets and information about them.
I am trying to organise the interesting ones a bit better in a table at the bottom of the page.

Summary table of possibly interesting datasets
Each entry below gives the table fields: name, general description, records, attributes, size, type of dataset, difficulties, possible tasks, and whether I find it interesting.
Reuters-21578
This is a frequently used test set for text categorisation tasks. It contains 21578 news documents, delimited by SGML tags. Each document belongs to a list of different categories. The total number of categories is 672, but the number of categories appearing in at least one text is 445, and the number appearing in at least 20 texts is only 148. More information on the exact contents of the test set and how it can be divided into training and test data (a good division is the 'Apte' split) can be found in the README file. Here is a 1000-document sample of the data set.
Records: 21578. Attributes: not applicable. Size: ~27 MB. Type: text data set.
Difficulties: an important problem is that the text is in raw text format, so a conversion to feature tuple format is necessary (a small sketch of such a conversion follows at the end of this entry). Apparently a feature tuple version was available for the previous release of this dataset, but that is no longer the case; I will keep looking for such a format. Apart from this, I downloaded Smart, a text preprocessing tool, but so far I have not been able to make it work.
Possible tasks: text classification, text clustering, text retrieval.
Interesting? Definitely.
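
As a rough illustration of the feature tuple conversion mentioned under the difficulties, here is a minimal word-count sketch in Python. It assumes the SGML parsing into (categories, body text) pairs has already been done, and it is not meant to reproduce the format that Smart would produce.

    import re
    from collections import Counter

    def to_feature_tuples(doc_text):
        """Turn one raw document into sorted (term, count) feature tuples."""
        tokens = re.findall(r"[a-z]+", doc_text.lower())   # crude tokenisation
        return sorted(Counter(tokens).items())

    # Hypothetical usage: `docs` stands for (categories, body text) pairs
    # that would first have to be extracted from the SGML files.
    docs = [(["earn"], "Profits rose sharply in the first quarter, the company said.")]
    for categories, text in docs:
        print(categories, to_feature_tuples(text)[:5])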
20 newsgroups
1000 Usenet articles were taken from each of 20 newsgroups (20000 in total). This is the test set used in section 6.10 of Mitchell's Machine Learning book. He classifies these articles per newsgroup using a naive Bayesian classifier, with very good results (89% accuracy). As in the Reuters database, the articles are not preprocessed (they are actually quite messy), but Mitchell describes in his book how to do it (and it's easy: removing the most and the least frequent words, which should not be too difficult using Perl, I guess). Also, a link is provided to the Bow project, which is a library of C code for text analysis.
Records: 20000 (a mini-version of 2000 is also available). Attributes: not applicable. Size: 61.6 MB (6.2 MB for the mini-version). Type: text data set.
Difficulties: here too the text is in its original format and a conversion to feature vectors is necessary (Mitchell explains how to do it in his book; a small sketch follows at the end of this entry). The texts are quite messy (take a look at this example); some hardly contain any text. Still, the fact that each text belongs to exactly one of twenty categories makes the classification task a bit easier.
Possible tasks: text classification, text clustering, text retrieval.
Interesting? Yes, maybe; the fact that it is used by Mitchell in his book can be interesting.
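
A minimal sketch of the preprocessing Mitchell describes: drop the most and the least frequent words, then count the rest. The cut-off values below are arbitrary placeholders, not the ones used in the book.

    from collections import Counter

    def build_vocabulary(token_lists, min_count=3, drop_most_frequent=100):
        """Keep words that are neither very rare nor among the most frequent."""
        counts = Counter(tok for toks in token_lists for tok in toks)
        too_frequent = {w for w, _ in counts.most_common(drop_most_frequent)}
        return {w for w, c in counts.items() if c >= min_count and w not in too_frequent}

    def to_feature_vector(tokens, vocabulary):
        """Bag-of-words counts restricted to the pruned vocabulary."""
        return Counter(t for t in tokens if t in vocabulary)

    # toy usage on two tiny 'articles'
    articles = [["the", "space", "shuttle", "launch"], ["the", "hockey", "game", "the"]]
    vocab = build_vocabulary(articles, min_count=1, drop_most_frequent=1)
    print(to_feature_vector(articles[0], vocab))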
The 1998 KDD cup
PVA is a not-for-profit organisation that provides programs and services for US veterans with spinal cord injuries or disease. They raise money via direct mailing campaigns. The organisation is interested in lapsed donors: people who have stopped donating for at least 12 months. The available dataset contains a record for every donor who received the 1997 mailing and did not make a donation in the 12 months before that. Data are given about the previous and the current mailing campaign, as well as personal information and the giving history of each lapsed donor. Overlay demographics were also added. See the documentation and the data dictionary for more info. The aim is to predict the amount (if anything) each lapsed donor is likely to give. One interesting aspect is that there tends to be an inverse relationship between the probability of donating and the amount donated, making a simple classification (as donor or not) insufficient.
Records: 95412 (train), 96367 (test). Attributes: 481. Size: 117.2 MB (train), 119 MB (test). Type: customer DB, non-profit.
Difficulties:
Missing values: some attributes have quite a lot of them. The KDD cup documentation gives some advice on how to treat the different attributes with missing values.
Noisy data: the documentation points out that there are some records with formatting errors.
Feature extraction: there are far too many features, and it will be important to choose a good subset (or to construct good new or aggregating features from existing ones); the winners claim that the secret of their success lies in good feature selection.
Unbalanced data: only 5% of the records are donors.
Possible tasks: classification and regression. The main aim is to predict the amount donated, but a classification as donor or not can also be useful (the winners used a double model: one regression gave the probability of donating, the other the conditional donation amount; a small sketch of this combination follows at the end of this entry). The task for the KDD cup was to estimate the expected donation amount and to keep those customers in the test set who were likely to give more than the cost of mailing an individual; for them, the real profit would be evaluated using the labels of the test set.
Interesting? Yes: a lot of useful, real-world data and a clear-cut task.
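
The winners' two-model idea can be sketched as follows: multiply the predicted probability of donating by the predicted conditional amount, and mail only when that expected donation exceeds the mailing cost. This is a hypothetical illustration; the cost value and the donor tuples below are placeholders, not figures from the cup.

    def expected_profit(p_donate, amount_if_donate, mailing_cost):
        """Expected net gain of mailing one lapsed donor under the two-model view."""
        return p_donate * amount_if_donate - mailing_cost

    def select_mailing_list(donors, mailing_cost=0.68):
        """Keep donors whose expected donation exceeds the cost of one mailing.
        `donors` is a list of (id, predicted probability, predicted amount);
        the cost value here is a placeholder, check the cup documentation."""
        return [d for d, p, amt in donors if expected_profit(p, amt, mailing_cost) > 0]

    # toy usage: a low-probability/high-amount donor versus the opposite
    print(select_mailing_list([("A", 0.02, 20.0), ("B", 0.10, 15.0)]))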
The CoIL KDD challenge
The question for this KDD challenge is to predict which customers will buy a caravan insurance policy and to describe their profile (so there are two tasks: prediction and description). The data set is derived from a real business problem and contains 86 variables for every potential customer: 43 socio-demographic variables derived from the zip code area and 43 variables about ownership of insurance policies. The challenge's homepage, with 29 reports of submitted solutions, is available on-line.
Records: 5822 (train), 4000 (test). Attributes: 86. Size: 1.7 MB. Type: customer DB, profit.
Difficulties: feature selection and extraction. According to the winners of the prediction task, there are only a few useful features. They augmented these with two self-derived features to get an optimal prediction. The winners of the description task used more features (but still no more than 21). The papers of the winners are on-line.
Unbalanced data: only 5 to 6% of the customers in the training data set actually buy the insurance policy.
I don't think there are any missing or noisy data. All the data have been converted to numeric format.
Possible tasks:
Classification: the first formal task of the challenge was to identify the 800 most likely buyers of a caravan insurance policy in the test data set (a small sketch of this selection follows at the end of this entry).
Description: the second task was to describe the profile of a typical buyer.
Interesting? Yes: real data with two clear data mining tasks. Probably easier than the previous set, though, as the data are better prepared.
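
The prediction task boils down to ranking the 4000 test customers by some predicted buying probability and keeping the top 800. A trivial sketch (with made-up ids and scores, and k reduced for the example):

    def top_buyers(scores, k=800):
        """Ids of the k customers with the highest predicted probability of buying."""
        return sorted(scores, key=scores.get, reverse=True)[:k]

    # toy usage with three customers and k=2
    print(top_buyers({"c1": 0.03, "c2": 0.12, "c3": 0.07}, k=2))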
Anonymous web data of Microsoft
This dataset contains data about page accesses on the Microsoft site for one week: per user, a list of the website areas he accessed. It was used for testing collaborative filtering systems in the paper Empirical Analysis of Predictive Algorithms for Collaborative Filtering. As every user only visits a few pages (or rather page categories) per week (the average in the dataset is 3), the dataset is very sparse. The data are in a special format for sparse datasets.
Records: 32711 users (train), 5000 users (test). Attributes: 294 pages. Size: 305 KB (train), 52 KB (test). Type: web usage data.
Difficulties: the fact that the dataset is very sparse might cause some problems.
Possible tasks: collaborative filtering. Recommend pages to new users based on what others visited before them. This comes down to predicting which other pages a user will visit, given a few he has already visited. In the tests in the paper referred to above, the experimenters reveal a certain number of randomly selected visits for every user in the test set and try to predict the remaining pages (a small sketch of this protocol follows at the end of this entry).
Interesting? Maybe: I think the use of this dataset is limited to what was done in the paper.
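
A minimal sketch of the kind of test protocol described above: hide one of each test user's visits, recommend from the rest, and check whether the hidden area is recovered. The popularity ranking is just a baseline of my own choosing (the paper evaluates better algorithms), and the area names here are invented.

    from collections import Counter
    import random

    def popularity_model(train_visits):
        """Rank page areas by how many training users visited them."""
        counts = Counter(area for visits in train_visits for area in visits)
        return [area for area, _ in counts.most_common()]

    def evaluate_all_but_one(model_ranking, test_visits, seed=0):
        """For each test user, hide one visited area and check whether it appears
        in the top 10 recommendations built from the remaining visits
        (a simplified version of the protocol used in the paper)."""
        rng = random.Random(seed)
        hits = total = 0
        for visits in test_visits:
            if len(visits) < 2:
                continue
            held_out = rng.choice(sorted(visits))
            observed = set(visits) - {held_out}
            recommendations = [a for a in model_ranking if a not in observed][:10]
            hits += held_out in recommendations
            total += 1
        return hits / total if total else 0.0

    # toy usage with invented area names
    train = [{"frontpage", "support"}, {"support", "downloads"}, {"frontpage"}]
    test = [{"frontpage", "downloads"}]
    print(evaluate_all_but_one(popularity_model(train), test))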
EachMovie collaborative filtering dataset
This dataset contains 18 months of votes for movies, gathered by Compaq for collaborative filtering experiments. There is some information about the voters (age, gender, zip code) and about the movies (video release, theater release, genre, ...). Every vote is on a 0-5 scale. This dataset was used by Compaq to build a collaborative filtering system (some info is available on-line, but it is not really interesting), in Empirical Analysis of Predictive Algorithms for Collaborative Filtering (the same paper as mentioned above), and in Mining the Network Value of Customers (2001) by P. Domingos and M. Richardson (this paper contains useful information about the difficulties with the data and the necessary preprocessing). To access the data you need the username (ducatelle) and password (ianc$Cog.Sci).
Records: 72916 voters, 1628 movies, 2811983 votes. Attributes: 4 (voters), 9 (movies), 5 (votes). Size: 17.6 MB (compressed). Type: movie voting data.
Difficulties: I have not seen the data yet, but I think they will need some preprocessing (there is, for example, a high number of 0 votes, which might indicate a problem).
Possible tasks: collaborative filtering, i.e. predicting someone's votes for other movies, using the same test setup as for the web data above (a small sketch of a memory-based prediction follows at the end of this entry).
Description: as more details are available than in the web usage data, clustering and association rules should also be possible.
Interesting? More interesting than the web usage data: there are more details available, and the dataset is less sparse (on average 39 votes per user).
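
A sketch of the kind of memory-based prediction evaluated in the collaborative filtering paper mentioned above: weight other voters by how well their votes correlate with the active voter's, and add their weighted deviations to the active voter's mean. The tiny vote dictionaries below are made up.

    import numpy as np

    def pearson_weight(a, b):
        """Correlation between two voters over the movies both have rated (0-5 votes)."""
        common = sorted(set(a) & set(b))
        if len(common) < 2:
            return 0.0
        x = np.array([a[m] for m in common], dtype=float)
        y = np.array([b[m] for m in common], dtype=float)
        if x.std() == 0 or y.std() == 0:
            return 0.0
        return float(np.corrcoef(x, y)[0, 1])

    def predict_vote(active, others, movie):
        """Active voter's mean plus the correlation-weighted deviations of the others."""
        mean_active = np.mean(list(active.values()))
        num = den = 0.0
        for other in others:
            if movie in other:
                w = pearson_weight(active, other)
                num += w * (other[movie] - np.mean(list(other.values())))
                den += abs(w)
        return mean_active + (num / den if den else 0.0)

    u1 = {"m1": 5, "m2": 4}             # the voter we predict for
    u2 = {"m1": 5, "m2": 3, "m3": 4}    # another voter who has seen m3
    print(predict_vote(u1, [u2], "m3"))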
The yeast S. cerevisiae gene expression vectors
These are the data from the paper Support Vector Machine Classification of Microarray Gene Expression Data. For 2467 genes, gene expression levels were measured in 79 different situations (here is the raw data set). Some of the measurements follow each other in time, but in the paper they were not treated as time series (although to a certain extent that would be possible). For each of these genes, it is indicated whether it belongs to one of 6 functional classes (class labels on-line). The paper is concerned with classifying the data into 5 of these classes (one class is unpredictable). In Cluster analysis and display of genome-wide expression patterns (Eisen et al.), clustering is performed on more or less the same data.
Records: 2467 genes. Attributes: 79 measurements, 6 class labels. Size: 1.7 MB (measurements), 125 KB (labels). Type: biological data.
Difficulties: like many of the other data sets, this set is unbalanced. Some of the classes have very few cases:
  • class 1: 17
  • class 2: 30
  • class 3: 121
  • class 4: 35
  • class 5: 11
  • class 6: 16
The data set is also quite small. In the SVM paper, this is overcome by using three-fold cross-validation (a small sketch of stratified folds follows at the end of this entry).
Thirdly, some instances have different expression levels even though they belong to the same class; these cannot possibly be classified correctly. Fortunately, an overview of unavoidable false positives and false negatives is given in the paper.
Fourthly, there are quite a lot of missing values: 3760. They are quite evenly distributed over rows and columns (there are no columns without missing values, and only 25% of the records have no missing values).
Finally, I think the data might contain a fair amount of noise.
Possible tasks: classification can be done as in the SVM paper.
Clustering can also be done, as in Eisen's paper; the clustering should reflect the classes. The number of classes provided in this dataset is quite limited, though (the authors used only these classes because not all gene classes are predictable on the basis of expression information; the classes used here were the ones that clustered together in Eisen's work, which is also based on expression data). More class information can be found in the on-line MIPS yeast genome database.
Interesting? Useful but quite challenging, I think.
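
Because some classes have only a dozen or so genes, an even split matters when doing the three-fold cross-validation mentioned above. A small sketch of stratified fold assignment (the labels below are made up):

    import numpy as np

    def stratified_folds(labels, n_folds=3, seed=0):
        """Assign each gene to one of n_folds so that the small classes
        are spread as evenly as possible across the folds."""
        rng = np.random.RandomState(seed)
        fold = np.empty(len(labels), dtype=int)
        for cls in np.unique(labels):
            idx = np.where(labels == cls)[0]
            rng.shuffle(idx)
            fold[idx] = np.arange(len(idx)) % n_folds
        return fold

    # toy usage: 10 genes, a rare positive class of size 3
    labels = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
    folds = stratified_folds(labels)
    for k in range(3):
        print("fold", k, "positives:", labels[folds == k].sum())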
The Landsat image data from Statlog
This dataset contains pixel information for an 82*100 pixel part of an image of the earth's surface taken from a satellite. For each pixel, 4 frequency values (each between 0 and 255) are given, because photos are taken in 4 different spectral bands. Each pixel represents an area of 80*80 metres on the earth's surface. Every line in the dataset gives the 4 values for 9 pixels (a 3*3 neighbourhood frame around a central pixel). The aim is to classify the central pixel (a piece of land) into 1 of 6 classes based on its values and those of its immediate neighbours. More info is given in the documentation file.
Records: 4435 (training), 2000 (test). Attributes: 36 values (4 values * 9 pixels). Size: 514 KB (training), 231 KB (test). Type: spatial image data.
Difficulties: there are no missing data. All data are in numeric form, and I don't think any reformatting is necessary (a small sketch of how a record can be unpacked follows at the end of this entry).
Possible tasks: the obvious task is classification.
Interesting? I am not sure whether these data are really useful. They are already preprocessed (which is probably the most challenging aspect of spatial data), and the one obvious task is classification. This is more of a machine learning test set. For an exercise with spatial data you would probably want something more challenging.
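
A small sketch of how one record could be unpacked into its 9 pixels of 4 spectral values each, assuming the 36 values are grouped per pixel with the central pixel fifth (check the documentation file for the actual ordering); the example line is made up.

    import numpy as np

    def parse_record(line):
        """Split one data line into a 9x4 array (pixel, band) plus the class label.
        Assumes the 36 values are grouped per pixel, central pixel fifth."""
        values = np.array(line.split(), dtype=int)
        pixels = values[:36].reshape(9, 4)
        label = int(values[36])
        return pixels, label

    line = " ".join(["92 115 120 94"] * 9) + " 3"   # made-up record
    pixels, label = parse_record(line)
    print("central pixel bands:", pixels[4], "class:", label)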
Colon cancer data
Contains expression levels of 2000 genes measured in 62 different samples. For each sample it is indicated whether it is a tumor biopsy or not. Numbers and descriptions for the different genes are also given. This dataset is used in many research papers on gene expression data. It can be used in two ways: you can treat the 62 samples as records in a high-dimensional space and do classification for cancer (see Tissue Classification with Gene Expression Profiles (2000) by Ben-Dor et al.), or you can treat the genes as records and do cluster analysis to find out which genes are similar (as in the clustering paper for the yeast data set mentioned above). A combination of both is also possible, as described in Coupled two-way clustering analysis of gene microarray data (2001) by Getz, Levine and Domany. A short overview of work in this domain (much of which uses this dataset) can be found in Gene expression data analysis (2000) by Brazma and Vilo.
Records: 2000 genes or 62 samples (depending on the view taken). Attributes: 62 samples or 2000 genes. Size: 1.9 MB (data), 529 KB (names), 207 bytes (labels). Type: gene expression data.
Difficulties: when samples are treated as records (probably the most interesting case), two important problems arise: there are only a few records (a good reason to use techniques like leave-one-out cross-validation; a small sketch follows at the end of this entry) and a very high number of features (so feature selection, in the form of gene selection, becomes important).
Another problem is sample contamination. Samples are usually quite noisy. Also, there are differences between cancerous and non-cancerous samples that are not interesting for the kind of research performed here: e.g. non-tumor samples usually contain more muscle-specific genes, and tumor samples usually contain more ribosomal genes. According to Ben-Dor et al., all this does not pose too many problems (see section 5 of their paper for an overview).
There are no missing values.
Possible tasks: the one obvious task is classification of cancer samples. As said in the general description, clustering can also be done, in two possible directions.
Interesting? I think this dataset is quite challenging. It could be interesting, though, because the specific problems it poses differ from those of other datasets.
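
A sketch of leave-one-out cross-validation with gene selection redone inside every fold, using a crude mean-difference gene score and a nearest-centroid classifier (both my own simplifications, not the methods from the papers); the data are random.

    import numpy as np

    def select_genes(X, y, n_genes=50):
        """Rank genes by the difference in class means divided by the pooled std
        (a crude score; the papers use more careful selection criteria)."""
        m1, m0 = X[y == 1].mean(0), X[y == 0].mean(0)
        s = X.std(0) + 1e-9
        return np.argsort(-np.abs(m1 - m0) / s)[:n_genes]

    def loocv_nearest_centroid(X, y, n_genes=50):
        """Leave-one-out: gene selection is redone inside every fold to avoid selection bias."""
        correct = 0
        for i in range(len(y)):
            train = np.arange(len(y)) != i
            genes = select_genes(X[train], y[train], n_genes)
            Xtr, ytr = X[train][:, genes], y[train]
            c1, c0 = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
            x = X[i, genes]
            pred = 1 if np.linalg.norm(x - c1) < np.linalg.norm(x - c0) else 0
            correct += pred == y[i]
        return correct / len(y)

    # toy data: 20 samples x 100 genes, 5 of which carry a class signal
    rng = np.random.RandomState(0)
    y = np.array([0] * 10 + [1] * 10)
    X = rng.randn(20, 100) + np.outer(y, np.r_[np.ones(5), np.zeros(95)])
    print("LOO accuracy:", loocv_nearest_centroid(X, y, n_genes=10))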
Leukemia data set
The Leukemia data set contains expression levels of 7129 genes taken over 72 samples. Labels indicate which of two variants of leukemia is present in the sample (AML, 25 samples, or ALL, 47 samples). This data set is of the same type as the previous one and can therefore be used for the same kind of experiments. In fact, most of the papers that use the colon cancer data also use the leukemia data. One paper that uses only the leukemia data set is Feature selection for high-dimensional genomic microarray data (2001) by Xing et al.
Records: 38 samples (train), 34 samples (test), or 7129 genes (depending on the view taken). Attributes: 7129 genes or 72 samples. Size: 2 MB (train), 1.8 MB (test). Type: gene expression data.
Difficulties: the same comments as for the colon cancer data apply here.
Possible tasks: see the colon cancer data.
Interesting? See the colon cancer data.
Human splice site data
This data set contains sequences of DNA material around human splice sites. A splice site is the point at the beginning (donor site) or at the end (acceptor site) of an intron (a non-coding part of the gene's DNA sequence). These sites typically correspond to certain patterns, but the same patterns can also be found in other places, so it is important to learn better classifiers to identify splice sites. In the past, people have used probability matrices which encode the probability of having a certain nucleotide in a certain position (or variations on that, see Fickett's overview paper). A disadvantage of this method is that dependencies between positions are not taken into account. Other methods build a conditional probability matrix to account for dependencies between adjacent positions (see Salzberg's paper), or split up the dataset before they build probability matrices, so that the probabilities are in fact conditional on the split of the dataset (the MDD method: see pages 13 to 15 of Prediction of complete gene structures in human genomic DNA (1997) by Burge and Karlin; page 9 of this paper gives a schematic overview of the structure of a gene's DNA sequence). Neural networks have also been used quite successfully (see Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences (1991), which was trained on older data available at the UCI machine learning repository). Many methods use not only base positions but also other features, such as the presence of certain groups of nucleotides (see Fickett's overview paper), to get the best results. The dataset described here was used to train and test GENIE, a gene finding system. It is divided into 8 different sets: acceptor versus donor sequences, training versus test data, and true versus false examples.
Records, per subset (acceptor (a) / donor (d), train (t) / test (e), true (t) / false (f)):
  • a-e-f: 881
  • d-e-f: 782
  • a-e-t: 208
  • d-e-t: 208
  • a-t-f: 4672
  • d-t-f: 4140
  • a-t-t: 1116
  • d-t-t: 1116
Attributes: 15 base positions (donor data), 90 base positions (acceptor data).
Size, per subset (same coding):
  • a-e-f: 16k
  • d-e-f: 7k
  • a-e-t: 7k
  • d-e-t: 2k
  • a-t-f: 82k
  • d-t-f: 36k
  • a-t-t: 37k
  • d-t-t: 11k
Type: DNA sequence data.
Difficulties: the data are well prepared; I don't think there are any missing values.
Possible tasks: there are two binary classification tasks: donor site or not, and acceptor site or not. Running rule extraction algorithms could also be interesting (a small sketch of a position probability matrix classifier follows at the end of this entry).
Interesting? I'm not sure whether this is good as a data mining exercise. It could be, as neural networks and decision trees have been used before; maybe SVMs are also possible. The best results are obtained by combining base position information with other features (such as the presence of certain pairs and triplets of nucleotides), but even without those it should be possible to find useful information. I wonder whether it is possible to extract these other features automatically without the right biological background knowledge (maybe by detecting frequent episodes in sequences, see Smyth's book, section 13.5). Once classifiers have been built, they could be tested on the Burset and Guigo set described below (although there might be an overlap between the training data of this set and the data in Burset and Guigo's set).
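
A minimal sketch of the simplest of the methods mentioned above: a per-position probability matrix scored against a uniform background. The example sequences are invented 9-mers, not taken from the dataset (the real donor and acceptor windows are 15 and 90 positions long).

    import math
    from collections import Counter

    BASES = "ACGT"

    def position_probabilities(sequences, pseudocount=1.0):
        """Per-position nucleotide probabilities estimated from aligned
        true-site sequences (a plain probability matrix, no dependencies)."""
        length = len(sequences[0])
        matrix = []
        for pos in range(length):
            counts = Counter(seq[pos] for seq in sequences)
            total = len(sequences) + pseudocount * len(BASES)
            matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
        return matrix

    def log_odds_score(seq, matrix, background=0.25):
        """Sum of per-position log odds against a uniform background."""
        return sum(math.log(matrix[i][b] / background) for i, b in enumerate(seq))

    true_donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGG"]   # made-up 9-mers
    matrix = position_probabilities(true_donors)
    print(log_odds_score("CAGGTAAGT", matrix))   # should score high
    print(log_odds_score("TTTCCCTTT", matrix))   # should score low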
Volcanoes on Venus
This dataset contains images collected by the Magellan expedition to Venus. Venus is the planet most similar to the Earth in size, and therefore researchers want to learn about its geology. Venus' surface is scattered with volcanoes, and the task for this dataset is to develop a program that can automatically identify volcanoes (from training data that have been labeled by humans on a scale from 1, definitely a volcano, to 4, definitely not one). A tool called JARtool was developed to do this. The makers of this tool provide the data to allow more research and to establish a benchmark in this field. They provide in total 134 images of 1024*1024 8-bit pixels (out of the 30000 images of the original project). Very interesting is the fact that they provide a preprocessed version of the data: possibly interesting pixel frames ('chips') taken from the images by their image recognition tool, each labeled between 0 (not labeled by the humans), 1 (definitely a volcano) and 4 (definitely not a volcano). This is a good format for a data mining exercise, as the feature extraction from pixels (e.g. with PCA) still has to be done, but the difficult task of identifying interesting regions is taken care of. For more information, read the data documentation or the following paper by Burl, Smyth, Fayyad and others: Learning to Recognize Volcanoes on Venus (1998).
Records: the chips are spread over different groups, according to experiments carried out for the JARtool software. There is some overlap between the groups, but I think the training and test sets of experiments C1 and D4 together cover all chips (see the experiments table):
  • C1_trn: 12018 records
  • C1_tst: 16608 records
  • D4_trn: 6398 records
  • D4_tst: 2256 records
Attributes: 15*15 pixel frames. Size: 187 MB including everything (also the original images); restricted to the pixel frames alone it is 31 MB, and restricted to the frames covering the complete dataset (see records) it is 8.4 MB. Type: image data.
Difficulties: it will be necessary to normalise the pixel frames, as there is a difference in brightness between the different images and even between different parts of the same image.
Feature extraction (in the original JARtool project they use PCA) will be necessary, because each frame contains quite a lot of pixels, resulting in a high number of attributes compared to the total number of training examples (a small sketch of normalisation plus PCA follows at the end of this entry).
An important challenge, as in most of the previous datasets, will be the fact that the data are unbalanced (few positive examples). This is the class distribution:
  • 0 (not considered by human experts): 36058
  • 1 (definitely a volcano): 127
  • 2 (probably): 254
  • 3 (possibly): 397
  • 4 (only a pit): 444
Possible tasks: obviously, the main task is to classify pixel frames as being volcanoes or not, on a scale from 0 to 4. As described in the paper Learning to Recognize Volcanoes on Venus (1998), clustering is also possible, as there are in fact different kinds of volcanoes; a classifier for each kind of volcano can then be built.
Interesting? This could be a useful spatial dataset. It won't be easy, though, mainly because of the small number of positive examples: as described in the paper mentioned before, feature extraction will be important to reduce the number of pixels compared to the number of training records.
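
A sketch of the normalisation and PCA-style reduction discussed above, done with a plain SVD on randomly generated 15*15 chips (so the numbers mean nothing; JARtool's actual pipeline is described in the paper).

    import numpy as np

    def normalise_chips(chips):
        """Remove per-chip brightness differences: zero mean, unit variance per chip."""
        flat = chips.reshape(len(chips), -1).astype(float)
        flat -= flat.mean(axis=1, keepdims=True)
        flat /= flat.std(axis=1, keepdims=True) + 1e-9
        return flat

    def pca_features(flat_chips, n_components=6):
        """Project the 225-pixel chips onto their first principal components
        (the same kind of reduction JARtool does, here via a plain SVD)."""
        centred = flat_chips - flat_chips.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        return centred @ vt[:n_components].T

    rng = np.random.RandomState(0)
    chips = rng.randint(0, 256, size=(100, 15, 15))   # fake 8-bit chips
    features = pca_features(normalise_chips(chips), n_components=6)
    print(features.shape)   # (100, 6)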
Network intrusion data
These data were used for the 1999 KDD cup. They were gathered by Lincoln Labs, who collected nine weeks of raw TCP dump data from a typical US air force LAN. During the use of the LAN, they performed several attacks on it. The raw packet data were aggregated into connection data. Per record, extra features were derived, based on domain knowledge about network attacks. There are 38 different attack types, belonging to 4 main categories. Some attack types appear only in the test data, and the frequency of attack types in test and training data is not the same (to make it more realistic). More information about the data can be found in the task file, in Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project, and in the overview of the KDD cup results. That page also indicates that there is a cost matrix associated with misclassifications (a small sketch of cost-based evaluation follows at the end of this entry). Another paper that uses the same dataset is PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection) (2000); it uses simple general-to-specific rule learning in two stages. See also Data Mining Techniques for Intrusion Detection: apart from giving a good solution architecture, this paper provides a thorough description of the data and the needed preprocessing. The winner of the KDD cup 99 competition used C5 decision trees in combination with boosting.
A similar data set, but with raw data instead of aggregated records is described below.
Records: 4,940,000 (train; a smaller 10% set is also available), around 3,110,290 (test; a 10% set is again available). Attributes: 41 attributes and 1 label. Size: 743 MB (full training set), 75 MB (10% training set), 430 MB (full test set), 45 MB (10% test set). Type: LAN connection fraud data.
Difficulties: the major problem is probably that the data set is very unbalanced: while DoS attacks appear in 79% of the connections, u2r attacks appear in only 0.01 percent of the records. And this least frequent attack type is at the same time the most difficult to predict and the most costly to miss. Also, many of the attributes have a sparse distribution with a large variance. And according to Data Mining Techniques for Intrusion Detection, all of the attributes are important, so no data reduction can be done.
A little data preprocessing is necessary, but that is well described in Data Mining Techniques for Intrusion Detection.
Possible tasks: the task is classification of connections as 1 of the 4 intrusion types or normal. Techniques for dealing with infrequent classes will be necessary. An interesting approach is proposed in Data Mining Techniques for Intrusion Detection, where the most obvious class is tested before the others, so that the other classifiers are in fact conditional on the record not belonging to the first class. Maybe for the least frequent fraud class, techniques for outlier analysis can be used (see Han and Kamber, section 8.9). Visualisation could also be interesting (although difficult if indeed all 41 attributes are important).
Interesting? I think this could be a good data set: the task is clear but not trivial, and the type of dataset is different from the others so far.
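
Since entries are scored with a misclassification cost matrix, the evaluation can be sketched as below. The matrix entries here are placeholders (apart from the zero diagonal); the real ones come with the task description.

    import numpy as np

    # the four main attack categories plus normal traffic
    CLASSES = ["normal", "probe", "dos", "u2r", "r2l"]

    # Placeholder cost matrix: COST[true][predicted]. The real entries are given
    # with the KDD cup task description; these values are illustrative only.
    COST = np.array([
        [0, 1, 1, 1, 1],
        [1, 0, 1, 1, 1],
        [1, 1, 0, 1, 1],
        [3, 3, 3, 0, 3],   # missing a rare u2r attack is made expensive here
        [2, 2, 2, 2, 0],
    ], dtype=float)

    def average_cost(true_labels, predicted_labels):
        """Mean misclassification cost over the test connections (the kind of
        score used to rank cup entries, with the official matrix)."""
        index = {c: i for i, c in enumerate(CLASSES)}
        costs = [COST[index[t], index[p]] for t, p in zip(true_labels, predicted_labels)]
        return float(np.mean(costs))

    print(average_cost(["normal", "u2r", "dos"], ["normal", "normal", "dos"]))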

Probably less interesting datasets:

