Data sets

This page contains links to mineable data sets and information about them.
I am trying to organise the interesting ones a bit better in a table at the bottom of the page.

Summary table of possibly interesting datasets
Each entry below gives the table fields: name, general description, records, attributes, size, type of dataset, difficulties, possible tasks, and whether I find it interesting.
Reuters-21578
This is a frequently used test set for text categorisation tasks. It contains 21578 news documents, delimited by SGML tags. Each document belongs to a list of different categories. The total number of categories is 672, but the number of categories appearing in at least one text is 445, and the number appearing in at least 20 texts is only 148. More information on the exact contents of the test set and how it can be divided into training and test data (a good division is the 'Apte' split) can be found in the README file. Here is a 1000-document sample of the data set.
Records: 21578. Attributes: not applicable. Size: ~27 MB. Type: text data set.
Difficulties: an important problem is that the text is in raw text format, so a conversion to feature tuple format is necessary (a small sketch of such a conversion follows at the end of this entry). Apparently a feature tuple version was available for the previous release of this dataset, but that is no longer the case; I will keep looking for such a format. Apart from this, I downloaded Smart, a text preprocessing tool, but so far I have not been able to make it work.
Possible tasks: text classification, text clustering, text retrieval.
Interesting? Definitely.
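
As a rough illustration of the feature tuple conversion mentioned under the difficulties, here is a minimal word-count sketch in Python. It assumes the SGML parsing into (categories, body text) pairs has already been done, and it is not meant to reproduce the format that Smart would produce.

    import re
    from collections import Counter

    def to_feature_tuples(doc_text):
        """Turn one raw document into sorted (term, count) feature tuples."""
        tokens = re.findall(r"[a-z]+", doc_text.lower())   # crude tokenisation
        return sorted(Counter(tokens).items())

    # Hypothetical usage: `docs` stands for (categories, body text) pairs
    # that would first have to be extracted from the SGML files.
    docs = [(["earn"], "Profits rose sharply in the first quarter, the company said.")]
    for categories, text in docs:
        print(categories, to_feature_tuples(text)[:5])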
20 newsgroups
1000 Usenet articles were taken from each of 20 newsgroups (20000 in total). This is the test set used in section 6.10 of Mitchell's Machine Learning book. He classifies these articles per newsgroup using a naive Bayesian classifier, with very good results (89% accuracy). As in the Reuters database, the articles are not preprocessed (they are actually quite messy), but Mitchell describes in his book how to do it (and it's easy: removing the most and the least frequent words, which should not be too difficult using Perl, I guess). Also, a link is provided to the Bow project, which is a library of C code for text analysis.
Records: 20000 (a mini-version of 2000 is also available). Attributes: not applicable. Size: 61.6 MB (6.2 MB for the mini-version). Type: text data set.
Difficulties: here too the text is in its original format and a conversion to feature vectors is necessary (Mitchell explains how to do it in his book; a small sketch follows at the end of this entry). The texts are quite messy (take a look at this example); some hardly contain any text. Still, the fact that each text belongs to exactly one of twenty categories makes the classification task a bit easier.
Possible tasks: text classification, text clustering, text retrieval.
Interesting? Yes, maybe; the fact that it is used by Mitchell in his book can be interesting.
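
A minimal sketch of the preprocessing Mitchell describes: drop the most and the least frequent words, then count the rest. The cut-off values below are arbitrary placeholders, not the ones used in the book.

    from collections import Counter

    def build_vocabulary(token_lists, min_count=3, drop_most_frequent=100):
        """Keep words that are neither very rare nor among the most frequent."""
        counts = Counter(tok for toks in token_lists for tok in toks)
        too_frequent = {w for w, _ in counts.most_common(drop_most_frequent)}
        return {w for w, c in counts.items() if c >= min_count and w not in too_frequent}

    def to_feature_vector(tokens, vocabulary):
        """Bag-of-words counts restricted to the pruned vocabulary."""
        return Counter(t for t in tokens if t in vocabulary)

    # toy usage on two tiny 'articles'
    articles = [["the", "space", "shuttle", "launch"], ["the", "hockey", "game", "the"]]
    vocab = build_vocabulary(articles, min_count=1, drop_most_frequent=1)
    print(to_feature_vector(articles[0], vocab))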
The 1998 KDD cup
PVA is a not-for-profit organisation that provides programs and services for US veterans with spinal cord injuries or disease. They raise money via direct mailing campaigns. The organisation is interested in lapsed donors: people who have stopped donating for at least 12 months. The available dataset contains a record for every donor who received the 1997 mailing and did not make a donation in the 12 months before that. Data are given about the previous and the current mailing campaign, as well as personal information and the giving history of each lapsed donor. Overlay demographics were also added. See the documentation and the data dictionary for more info. The aim is to predict the amount (if anything) each lapsed donor is likely to give. One interesting aspect is that there tends to be an inverse relationship between the probability of donating and the amount donated, making a simple classification (as donor or not) insufficient.
Records: 95412 (train), 96367 (test). Attributes: 481. Size: 117.2 MB (train), 119 MB (test). Type: customer DB, non-profit.
Difficulties:
Missing values: some attributes have quite a lot of them. The KDD cup documentation gives some advice on how to treat the different attributes with missing values.
Noisy data: the documentation points out that there are some records with formatting errors.
Feature extraction: there are far too many features, and it will be important to choose a good subset (or to construct good new or aggregating features from existing ones); the winners claim that the secret of their success lies in good feature selection.
Unbalanced data: only 5% of the records are donors.
Possible tasks: classification and regression. The main aim is to predict the amount donated, but a classification as donor or not can also be useful (the winners used a double model: one regression gave the probability of donating, the other the conditional donation amount; a small sketch of this combination follows at the end of this entry). The task for the KDD cup was to estimate the expected donation amount and to keep those customers in the test set who were likely to give more than the cost of mailing an individual; for them, the real profit would be evaluated using the labels of the test set.
Interesting? Yes: a lot of useful, real-world data and a clear-cut task.
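
The winners' two-model idea can be sketched as follows: multiply the predicted probability of donating by the predicted conditional amount, and mail only when that expected donation exceeds the mailing cost. This is a hypothetical illustration; the cost value and the donor tuples below are placeholders, not figures from the cup.

    def expected_profit(p_donate, amount_if_donate, mailing_cost):
        """Expected net gain of mailing one lapsed donor under the two-model view."""
        return p_donate * amount_if_donate - mailing_cost

    def select_mailing_list(donors, mailing_cost=0.68):
        """Keep donors whose expected donation exceeds the cost of one mailing.
        `donors` is a list of (id, predicted probability, predicted amount);
        the cost value here is a placeholder, check the cup documentation."""
        return [d for d, p, amt in donors if expected_profit(p, amt, mailing_cost) > 0]

    # toy usage: a low-probability/high-amount donor versus the opposite
    print(select_mailing_list([("A", 0.02, 20.0), ("B", 0.10, 15.0)]))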
The CoIL KDD challenge
The question for this KDD challenge is to predict which customers will buy a caravan insurance policy and to describe their profile (so there are two tasks: prediction and description). The data set is derived from a real business problem and contains 86 variables for every potential customer: 43 socio-demographic variables derived from the zip code area and 43 variables about ownership of insurance policies. The challenge's homepage, with 29 reports of submitted solutions, is available on-line.
Records: 5822 (train), 4000 (test). Attributes: 86. Size: 1.7 MB. Type: customer DB, profit.
Difficulties: feature selection and extraction. According to the winners of the prediction task, there are only a few useful features. They augmented these with two self-derived features to get an optimal prediction. The winners of the description task used more features (but still no more than 21). The papers of the winners are on-line.
Unbalanced data: only 5 to 6% of the customers in the training data set actually buy the insurance policy.
I don't think there are any missing or noisy data. All the data have been converted to numeric format.
Possible tasks:
Classification: the first formal task of the challenge was to identify the 800 most likely buyers of a caravan insurance policy in the test data set (a small sketch of this selection follows at the end of this entry).
Description: the second task was to describe the profile of a typical buyer.
Interesting? Yes: real data with two clear data mining tasks. Probably easier than the previous set, though, as the data are better prepared.
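
The prediction task boils down to ranking the 4000 test customers by some predicted buying probability and keeping the top 800. A trivial sketch (with made-up ids and scores, and k reduced for the example):

    def top_buyers(scores, k=800):
        """Ids of the k customers with the highest predicted probability of buying."""
        return sorted(scores, key=scores.get, reverse=True)[:k]

    # toy usage with three customers and k=2
    print(top_buyers({"c1": 0.03, "c2": 0.12, "c3": 0.07}, k=2))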
Anonymous web data of Microsoft
This dataset contains data about page accesses on the Microsoft site for one week: per user, a list of the website areas he accessed. It was used for testing collaborative filtering systems in the paper Empirical Analysis of Predictive Algorithms for Collaborative Filtering. As every user only visits a few pages (or rather page categories) per week (the average in the dataset is 3), the dataset is very sparse. The data are in a special format for sparse datasets.
Records: 32711 users (train), 5000 users (test). Attributes: 294 pages. Size: 305 KB (train), 52 KB (test). Type: web usage data.
Difficulties: the fact that the dataset is very sparse might cause some problems.
Possible tasks: collaborative filtering. Recommend pages to new users based on what others visited before them. This comes down to predicting which other pages a user will visit, given a few he has already visited. In the tests in the paper referred to above, the experimenters reveal a certain number of randomly selected visits for every user in the test set and try to predict the remaining pages (a small sketch of this protocol follows at the end of this entry).
Interesting? Maybe: I think the use of this dataset is limited to what was done in the paper.
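
A minimal sketch of the kind of test protocol described above: hide one of each test user's visits, recommend from the rest, and check whether the hidden area is recovered. The popularity ranking is just a baseline of my own choosing (the paper evaluates better algorithms), and the area names here are invented.

    from collections import Counter
    import random

    def popularity_model(train_visits):
        """Rank page areas by how many training users visited them."""
        counts = Counter(area for visits in train_visits for area in visits)
        return [area for area, _ in counts.most_common()]

    def evaluate_all_but_one(model_ranking, test_visits, seed=0):
        """For each test user, hide one visited area and check whether it appears
        in the top 10 recommendations built from the remaining visits
        (a simplified version of the protocol used in the paper)."""
        rng = random.Random(seed)
        hits = total = 0
        for visits in test_visits:
            if len(visits) < 2:
                continue
            held_out = rng.choice(sorted(visits))
            observed = set(visits) - {held_out}
            recommendations = [a for a in model_ranking if a not in observed][:10]
            hits += held_out in recommendations
            total += 1
        return hits / total if total else 0.0

    # toy usage with invented area names
    train = [{"frontpage", "support"}, {"support", "downloads"}, {"frontpage"}]
    test = [{"frontpage", "downloads"}]
    print(evaluate_all_but_one(popularity_model(train), test))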
EachMovie collaborative filtering dataset
This dataset contains 18 months of votes for movies, gathered by Compaq for collaborative filtering experiments. There is some information about the voters (age, gender, zip code) and about the movies (video release, theater release, genre, ...). Every vote is on a 0-5 scale. This dataset was used by Compaq to build a collaborative filtering system (some info is available on-line, but it is not really interesting), in Empirical Analysis of Predictive Algorithms for Collaborative Filtering (the same paper as mentioned above), and in Mining the Network Value of Customers (2001) by P. Domingos and M. Richardson (this paper contains useful information about the difficulties with the data and the necessary preprocessing). To access the data you need the username (ducatelle) and password (ianc$Cog.Sci).
Records: 72916 voters, 1628 movies, 2811983 votes. Attributes: 4 (voters), 9 (movies), 5 (votes). Size: 17.6 MB (compressed). Type: movie voting data.
Difficulties: I have not seen the data yet, but I think they will need some preprocessing (there is, for example, a high number of 0 votes, which might indicate a problem).
Possible tasks: collaborative filtering, i.e. predicting someone's votes for other movies, using the same test setup as for the web data above (a small sketch of a memory-based prediction follows at the end of this entry).
Description: as more details are available than in the web usage data, clustering and association rules should also be possible.
Interesting? More interesting than the web usage data: there are more details available, and the dataset is less sparse (on average 39 votes per user).
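
A sketch of the kind of memory-based prediction evaluated in the collaborative filtering paper mentioned above: weight other voters by how well their votes correlate with the active voter's, and add their weighted deviations to the active voter's mean. The tiny vote dictionaries below are made up.

    import numpy as np

    def pearson_weight(a, b):
        """Correlation between two voters over the movies both have rated (0-5 votes)."""
        common = sorted(set(a) & set(b))
        if len(common) < 2:
            return 0.0
        x = np.array([a[m] for m in common], dtype=float)
        y = np.array([b[m] for m in common], dtype=float)
        if x.std() == 0 or y.std() == 0:
            return 0.0
        return float(np.corrcoef(x, y)[0, 1])

    def predict_vote(active, others, movie):
        """Active voter's mean plus the correlation-weighted deviations of the others."""
        mean_active = np.mean(list(active.values()))
        num = den = 0.0
        for other in others:
            if movie in other:
                w = pearson_weight(active, other)
                num += w * (other[movie] - np.mean(list(other.values())))
                den += abs(w)
        return mean_active + (num / den if den else 0.0)

    u1 = {"m1": 5, "m2": 4}             # the voter we predict for
    u2 = {"m1": 5, "m2": 3, "m3": 4}    # another voter who has seen m3
    print(predict_vote(u1, [u2], "m3"))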
The yeast S. cerevisiae gene expression vectors
These are the data from the paper Support Vector Machine Classification of Microarray Gene Expression Data. For 2467 genes, gene expression levels were measured in 79 different situations (here is the raw data set). Some of the measurements follow each other in time, but in the paper they were not treated as time series (although to a certain extent that would be possible). For each of these genes, it is indicated whether it belongs to one of 6 functional classes (class labels on-line). The paper is concerned with classifying the data into 5 of these classes (one class is unpredictable). In Cluster analysis and display of genome-wide expression patterns (Eisen et al.), clustering is performed on more or less the same data.
Records: 2467 genes. Attributes: 79 measurements, 6 class labels. Size: 1.7 MB (measurements), 125 KB (labels). Type: biological data.
Difficulties: like many of the other data sets, this set is unbalanced. Some of the classes have very few cases:
  • class 1: 17
  • class 2: 30
  • class 3: 121
  • class 4: 35
  • class 5: 11
  • class 6: 16
The data set is also quite small. In the SVM paper, this is overcome by using three-fold cross-validation (a small sketch of stratified folds follows at the end of this entry).
Thirdly, some instances have different expression levels even though they belong to the same class; these cannot possibly be classified correctly. Fortunately, an overview of unavoidable false positives and false negatives is given in the paper.
Fourthly, there are quite a lot of missing values: 3760. They are quite evenly distributed over rows and columns (there are no columns without missing values, and only 25% of the records have no missing values).
Finally, I think the data might contain a fair amount of noise.
Possible tasks: classification can be done as in the SVM paper.
Clustering can also be done, as in Eisen's paper; the clustering should reflect the classes. The number of classes provided in this dataset is quite limited, though (the authors used only these classes because not all gene classes are predictable on the basis of expression information; the classes used here were the ones that clustered together in Eisen's work, which is also based on expression data). More class information can be found in the on-line MIPS yeast genome database.
Interesting? Useful but quite challenging, I think.
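
Because some classes have only a dozen or so genes, an even split matters when doing the three-fold cross-validation mentioned above. A small sketch of stratified fold assignment (the labels below are made up):

    import numpy as np

    def stratified_folds(labels, n_folds=3, seed=0):
        """Assign each gene to one of n_folds so that the small classes
        are spread as evenly as possible across the folds."""
        rng = np.random.RandomState(seed)
        fold = np.empty(len(labels), dtype=int)
        for cls in np.unique(labels):
            idx = np.where(labels == cls)[0]
            rng.shuffle(idx)
            fold[idx] = np.arange(len(idx)) % n_folds
        return fold

    # toy usage: 10 genes, a rare positive class of size 3
    labels = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
    folds = stratified_folds(labels)
    for k in range(3):
        print("fold", k, "positives:", labels[folds == k].sum())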
The Landsat image data from Statlog
This dataset contains pixel information for an 82*100 pixel part of an image of the earth's surface taken from a satellite. For each pixel, 4 frequency values (each between 0 and 255) are given, because photos are taken in 4 different spectral bands. Each pixel represents an area of 80*80 metres on the earth's surface. Every line in the dataset gives the 4 values for 9 pixels (a 3*3 neighbourhood frame around a central pixel). The aim is to classify the central pixel (a piece of land) into 1 of 6 classes based on its values and those of its immediate neighbours. More info is given in the documentation file.
Records: 4435 (training), 2000 (test). Attributes: 36 values (4 values * 9 pixels). Size: 514 KB (training), 231 KB (test). Type: spatial image data.
Difficulties: there are no missing data. All data are in numeric form, and I don't think any reformatting is necessary (a small sketch of how a record can be unpacked follows at the end of this entry).
Possible tasks: the obvious task is classification.
Interesting? I am not sure whether these data are really useful. They are already preprocessed (which is probably the most challenging aspect of spatial data), and the one obvious task is classification. This is more of a machine learning test set. For an exercise with spatial data you would probably want something more challenging.
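
A small sketch of how one record could be unpacked into its 9 pixels of 4 spectral values each, assuming the 36 values are grouped per pixel with the central pixel fifth (check the documentation file for the actual ordering); the example line is made up.

    import numpy as np

    def parse_record(line):
        """Split one data line into a 9x4 array (pixel, band) plus the class label.
        Assumes the 36 values are grouped per pixel, central pixel fifth."""
        values = np.array(line.split(), dtype=int)
        pixels = values[:36].reshape(9, 4)
        label = int(values[36])
        return pixels, label

    line = " ".join(["92 115 120 94"] * 9) + " 3"   # made-up record
    pixels, label = parse_record(line)
    print("central pixel bands:", pixels[4], "class:", label)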
Colon cancer data
Contains expression levels of 2000 genes measured in 62 different samples. For each sample it is indicated whether it is a tumor biopsy or not. Numbers and descriptions for the different genes are also given. This dataset is used in many research papers on gene expression data. It can be used in two ways: you can treat the 62 samples as records in a high-dimensional space and do classification for cancer (see Tissue Classification with Gene Expression Profiles (2000) by Ben-Dor et al.), or you can treat the genes as records and do cluster analysis to find out which genes are similar (as in the clustering paper for the yeast data set mentioned above). A combination of both is also possible, as described in Coupled two-way clustering analysis of gene microarray data (2001) by Getz, Levine and Domany. A short overview of work in this domain (much of which uses this dataset) can be found in Gene expression data analysis (2000) by Brazma and Vilo.
Records: 2000 genes or 62 samples (depending on the view taken). Attributes: 62 samples or 2000 genes. Size: 1.9 MB (data), 529 KB (names), 207 bytes (labels). Type: gene expression data.
Difficulties: when samples are treated as records (probably the most interesting case), two important problems arise: there are only a few records (a good reason to use techniques like leave-one-out cross-validation; a small sketch follows at the end of this entry) and a very high number of features (so feature selection, in the form of gene selection, becomes important).
Another problem is sample contamination. Samples are usually quite noisy. Also, there are differences between cancerous and non-cancerous samples that are not interesting for the kind of research performed here: e.g. non-tumor samples usually contain more muscle-specific genes, and tumor samples usually contain more ribosomal genes. According to Ben-Dor et al., all this does not pose too many problems (see section 5 of their paper for an overview).
There are no missing values.
Possible tasks: the one obvious task is classification of cancer samples. As said in the general description, clustering can also be done, in two possible directions.
Interesting? I think this dataset is quite challenging. It could be interesting, though, because the specific problems it poses differ from those of other datasets.
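
A sketch of leave-one-out cross-validation with gene selection redone inside every fold, using a crude mean-difference gene score and a nearest-centroid classifier (both my own simplifications, not the methods from the papers); the data are random.

    import numpy as np

    def select_genes(X, y, n_genes=50):
        """Rank genes by the difference in class means divided by the pooled std
        (a crude score; the papers use more careful selection criteria)."""
        m1, m0 = X[y == 1].mean(0), X[y == 0].mean(0)
        s = X.std(0) + 1e-9
        return np.argsort(-np.abs(m1 - m0) / s)[:n_genes]

    def loocv_nearest_centroid(X, y, n_genes=50):
        """Leave-one-out: gene selection is redone inside every fold to avoid selection bias."""
        correct = 0
        for i in range(len(y)):
            train = np.arange(len(y)) != i
            genes = select_genes(X[train], y[train], n_genes)
            Xtr, ytr = X[train][:, genes], y[train]
            c1, c0 = Xtr[ytr == 1].mean(0), Xtr[ytr == 0].mean(0)
            x = X[i, genes]
            pred = 1 if np.linalg.norm(x - c1) < np.linalg.norm(x - c0) else 0
            correct += pred == y[i]
        return correct / len(y)

    # toy data: 20 samples x 100 genes, 5 of which carry a class signal
    rng = np.random.RandomState(0)
    y = np.array([0] * 10 + [1] * 10)
    X = rng.randn(20, 100) + np.outer(y, np.r_[np.ones(5), np.zeros(95)])
    print("LOO accuracy:", loocv_nearest_centroid(X, y, n_genes=10))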
Leukemia data set
The Leukemia data set contains expression levels of 7129 genes taken over 72 samples. Labels indicate which of two variants of leukemia is present in the sample (AML, 25 samples, or ALL, 47 samples). This data set is of the same type as the previous one and can therefore be used for the same kind of experiments. In fact, most of the papers that use the colon cancer data also use the leukemia data. One paper that uses only the leukemia data set is Feature selection for high-dimensional genomic microarray data (2001) by Xing et al.
Records: 38 samples (train), 34 samples (test), or 7129 genes (depending on the view taken). Attributes: 7129 genes or 72 samples. Size: 2 MB (train), 1.8 MB (test). Type: gene expression data.
Difficulties: the same comments as for the colon cancer data apply here.
Possible tasks: see the colon cancer data.
Interesting? See the colon cancer data.
Human splice site data
This data set contains sequences of DNA material around human splice sites. A splice site is the point at the beginning (donor site) or at the end (acceptor site) of an intron (a non-coding part of the gene's DNA sequence). These sites typically correspond to certain patterns, but the same patterns can also be found in other places, so it is important to learn better classifiers to identify splice sites. In the past, people have used probability matrices which encode the probability of having a certain nucleotide in a certain position (or variations on that, see Fickett's overview paper). A disadvantage of this method is that dependencies between positions are not taken into account. Other methods build a conditional probability matrix to account for dependencies between adjacent positions (see Salzberg's paper), or split up the dataset before they build probability matrices, so that the probabilities are in fact conditional on the split of the dataset (the MDD method: see pages 13 to 15 of Prediction of complete gene structures in human genomic DNA (1997) by Burge and Karlin; page 9 of this paper gives a schematic overview of the structure of a gene's DNA sequence). Neural networks have also been used quite successfully (see Training Knowledge-Based Neural Networks to Recognize Genes in DNA Sequences (1991), which was trained on older data available at the UCI machine learning repository). Many methods use not only base positions but also other features, such as the presence of certain groups of nucleotides (see Fickett's overview paper), to get the best results. The dataset described here was used to train and test GENIE, a gene finding system. It is divided into 8 different sets: acceptor versus donor sequences, training versus test data, and true versus false examples.
Records, per subset (acceptor (a) / donor (d), train (t) / test (e), true (t) / false (f)):
  • a-e-f: 881
  • d-e-f: 782
  • a-e-t: 208
  • d-e-t: 208
  • a-t-f: 4672
  • d-t-f: 4140
  • a-t-t: 1116
  • d-t-t: 1116
Attributes: 15 base positions (donor data), 90 base positions (acceptor data).
Size, per subset (same coding):
  • a-e-f: 16k
  • d-e-f: 7k
  • a-e-t: 7k
  • d-e-t: 2k
  • a-t-f: 82k
  • d-t-f: 36k
  • a-t-t: 37k
  • d-t-t: 11k
Type: DNA sequence data.
Difficulties: the data are well prepared; I don't think there are any missing values.
Possible tasks: there are two binary classification tasks: donor site or not, and acceptor site or not. Running rule extraction algorithms could also be interesting (a small sketch of a position probability matrix classifier follows at the end of this entry).
Interesting? I'm not sure whether this is good as a data mining exercise. It could be, as neural networks and decision trees have been used before; maybe SVMs are also possible. The best results are obtained by combining base position information with other features (such as the presence of certain pairs and triplets of nucleotides), but even without those it should be possible to find useful information. I wonder whether it is possible to extract these other features automatically without the right biological background knowledge (maybe by detecting frequent episodes in sequences, see Smyth's book, section 13.5). Once classifiers have been built, they could be tested on the Burset and Guigo set described below (although there might be an overlap between the training data of this set and the data in Burset and Guigo's set).
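
A minimal sketch of the simplest of the methods mentioned above: a per-position probability matrix scored against a uniform background. The example sequences are invented 9-mers, not taken from the dataset (the real donor and acceptor windows are 15 and 90 positions long).

    import math
    from collections import Counter

    BASES = "ACGT"

    def position_probabilities(sequences, pseudocount=1.0):
        """Per-position nucleotide probabilities estimated from aligned
        true-site sequences (a plain probability matrix, no dependencies)."""
        length = len(sequences[0])
        matrix = []
        for pos in range(length):
            counts = Counter(seq[pos] for seq in sequences)
            total = len(sequences) + pseudocount * len(BASES)
            matrix.append({b: (counts[b] + pseudocount) / total for b in BASES})
        return matrix

    def log_odds_score(seq, matrix, background=0.25):
        """Sum of per-position log odds against a uniform background."""
        return sum(math.log(matrix[i][b] / background) for i, b in enumerate(seq))

    true_donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGG"]   # made-up 9-mers
    matrix = position_probabilities(true_donors)
    print(log_odds_score("CAGGTAAGT", matrix))   # should score high
    print(log_odds_score("TTTCCCTTT", matrix))   # should score low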
Volcanoes on Venus
This dataset contains images collected by the Magellan expedition to Venus. Venus is the planet most similar to the Earth in size, and therefore researchers want to learn about its geology. Venus' surface is scattered with volcanoes, and the task for this dataset is to develop a program that can automatically identify volcanoes (from training data that have been labeled by humans on a scale from 1, definitely a volcano, to 4, definitely not one). A tool called JARtool was developed to do this. The makers of this tool provide the data to allow more research and to establish a benchmark in this field. They provide in total 134 images of 1024*1024 8-bit pixels (out of the 30000 images of the original project). Very interesting is the fact that they provide a preprocessed version of the data: possibly interesting pixel frames ('chips') taken from the images by their image recognition tool, each labeled between 0 (not labeled by the humans), 1 (definitely a volcano) and 4 (definitely not a volcano). This is a good format for a data mining exercise, as the feature extraction from pixels (e.g. with PCA) still has to be done, but the difficult task of identifying interesting regions is taken care of. For more information, read the data documentation or the following paper by Burl, Smyth, Fayyad and others: Learning to Recognize Volcanoes on Venus (1998).
Records: the chips are spread over different groups, according to experiments carried out for the JARtool software. There is some overlap between the groups, but I think the training and test sets of experiments C1 and D4 together cover all chips (see the experiments table):
  • C1_trn: 12018 records
  • C1_tst: 16608 records
  • D4_trn: 6398 records
  • D4_tst: 2256 records
Attributes: 15*15 pixel frames. Size: 187 MB including everything (also the original images); restricted to the pixel frames alone it is 31 MB, and restricted to the frames covering the complete dataset (see records) it is 8.4 MB. Type: image data.
Difficulties: it will be necessary to normalise the pixel frames, as there is a difference in brightness between the different images and even between different parts of the same image.
Feature extraction (in the original JARtool project they use PCA) will be necessary, because each frame contains quite a lot of pixels, resulting in a high number of attributes compared to the total number of training examples (a small sketch of normalisation plus PCA follows at the end of this entry).
An important challenge, as in most of the previous datasets, will be the fact that the data are unbalanced (few positive examples). This is the class distribution:
  • 0 (not considered by human experts): 36058
  • 1 (definitely a volcano): 127
  • 2 (probably): 254
  • 3 (possibly): 397
  • 4 (only a pit): 444
Possible tasks: obviously, the main task is to classify pixel frames as being volcanoes or not, on a scale from 0 to 4. As described in the paper Learning to Recognize Volcanoes on Venus (1998), clustering is also possible, as there are in fact different kinds of volcanoes; a classifier for each kind of volcano can then be built.
Interesting? This could be a useful spatial dataset. It won't be easy, though, mainly because of the small number of positive examples: as described in the paper mentioned before, feature extraction will be important to reduce the number of pixels compared to the number of training records.
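
A sketch of the normalisation and PCA-style reduction discussed above, done with a plain SVD on randomly generated 15*15 chips (so the numbers mean nothing; JARtool's actual pipeline is described in the paper).

    import numpy as np

    def normalise_chips(chips):
        """Remove per-chip brightness differences: zero mean, unit variance per chip."""
        flat = chips.reshape(len(chips), -1).astype(float)
        flat -= flat.mean(axis=1, keepdims=True)
        flat /= flat.std(axis=1, keepdims=True) + 1e-9
        return flat

    def pca_features(flat_chips, n_components=6):
        """Project the 225-pixel chips onto their first principal components
        (the same kind of reduction JARtool does, here via a plain SVD)."""
        centred = flat_chips - flat_chips.mean(axis=0)
        _, _, vt = np.linalg.svd(centred, full_matrices=False)
        return centred @ vt[:n_components].T

    rng = np.random.RandomState(0)
    chips = rng.randint(0, 256, size=(100, 15, 15))   # fake 8-bit chips
    features = pca_features(normalise_chips(chips), n_components=6)
    print(features.shape)   # (100, 6)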
Network intrusion data
These data were used for the 1999 KDD cup. They were gathered by Lincoln Labs, who collected nine weeks of raw TCP dump data from a typical US air force LAN. During the use of the LAN, they performed several attacks on it. The raw packet data were aggregated into connection data. Per record, extra features were derived, based on domain knowledge about network attacks. There are 38 different attack types, belonging to 4 main categories. Some attack types appear only in the test data, and the frequency of attack types in test and training data is not the same (to make it more realistic). More information about the data can be found in the task file, in Cost-based Modeling and Evaluation for Data Mining With Application to Fraud and Intrusion Detection: Results from the JAM Project, and in the overview of the KDD cup results. That page also indicates that there is a cost matrix associated with misclassifications (a small sketch of cost-based evaluation follows at the end of this entry). Another paper that uses the same dataset is PNrule: A New Framework for Learning Classifier Models in Data Mining (A Case-Study in Network Intrusion Detection) (2000); it uses simple general-to-specific rule learning in two stages. See also Data Mining Techniques for Intrusion Detection: apart from giving a good solution architecture, this paper provides a thorough description of the data and the needed preprocessing. The winner of the KDD cup 99 competition used C5 decision trees in combination with boosting.
A similar data set, but with raw data instead of aggregated records is described below.
Records: 4,940,000 (train; a smaller 10% set is also available), around 3,110,290 (test; a 10% set is again available). Attributes: 41 attributes and 1 label. Size: 743 MB (full training set), 75 MB (10% training set), 430 MB (full test set), 45 MB (10% test set). Type: LAN connection fraud data.
Difficulties: the major problem is probably that the data set is very unbalanced: while DoS attacks appear in 79% of the connections, u2r attacks appear in only 0.01 percent of the records. And this least frequent attack type is at the same time the most difficult to predict and the most costly to miss. Also, many of the attributes have a sparse distribution with a large variance. And according to Data Mining Techniques for Intrusion Detection, all of the attributes are important, so no data reduction can be done.
A little data preprocessing is necessary, but that is well described in Data Mining Techniques for Intrusion Detection.
Possible tasks: the task is classification of connections as 1 of the 4 intrusion types or normal. Techniques for dealing with infrequent classes will be necessary. An interesting approach is proposed in Data Mining Techniques for Intrusion Detection, where the most obvious class is tested before the others, so that the other classifiers are in fact conditional on the record not belonging to the first class. Maybe for the least frequent fraud class, techniques for outlier analysis can be used (see Han and Kamber, section 8.9). Visualisation could also be interesting (although difficult if indeed all 41 attributes are important).
Interesting? I think this could be a good data set: the task is clear but not trivial, and the type of dataset is different from the others so far.
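
Since entries are scored with a misclassification cost matrix, the evaluation can be sketched as below. The matrix entries here are placeholders (apart from the zero diagonal); the real ones come with the task description.

    import numpy as np

    # the four main attack categories plus normal traffic
    CLASSES = ["normal", "probe", "dos", "u2r", "r2l"]

    # Placeholder cost matrix: COST[true][predicted]. The real entries are given
    # with the KDD cup task description; these values are illustrative only.
    COST = np.array([
        [0, 1, 1, 1, 1],
        [1, 0, 1, 1, 1],
        [1, 1, 0, 1, 1],
        [3, 3, 3, 0, 3],   # missing a rare u2r attack is made expensive here
        [2, 2, 2, 2, 0],
    ], dtype=float)

    def average_cost(true_labels, predicted_labels):
        """Mean misclassification cost over the test connections (the kind of
        score used to rank cup entries, with the official matrix)."""
        index = {c: i for i, c in enumerate(CLASSES)}
        costs = [COST[index[t], index[p]] for t, p in zip(true_labels, predicted_labels)]
        return float(np.mean(costs))

    print(average_cost(["normal", "u2r", "dos"], ["normal", "normal", "dos"]))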

Probably less interesting datasets:

