Summary table of possibly interesting datasets. Fields per dataset: Name | General description | Records | Attributes | Size (in Mb) | Type of dataset | Difficulties | Possible tasks | Interesting?
Reuters-21578
|
This is a widely used test set for text
categorisation tasks. It contains 21578 news documents, delimited by
SGML tags. Each document is assigned to one or more categories. The
total number of categories is 672, but the number of categories
appearing in at least one text is 445, and the number appearing in at
least 20 texts is only 148. More information on the exact contents of
the test set and how it can be divided into training and test data (a
good division is the 'Apte' split) can be found in the
README file. Here is a 1000
document sample of the data set.
|
|
21578 |
not applicable |
about 27 Mbyte |
text data set |
An important problem is that the documents are raw text, so a
conversion to feature-tuple format is necessary (a minimal sketch
follows below). Apparently a feature-tuple version was available for
the previous release of this dataset, but that is no longer the case; I
will keep looking for one. Apart from this, I downloaded Smart, a
text preprocessing tool, but so far I have not been able to make it work.
|
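To make this concrete, here is a minimal sketch (in Python) of converting raw
document text into feature tuples, i.e. term counts over a fixed vocabulary.
It assumes the documents have already been extracted from the SGML markup;
the tokenisation and the stop-word list are illustrative choices, not part of
the dataset.

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

    def build_vocabulary(documents, max_terms=1000):
        # Keep the most frequent non-stop-word terms over the whole collection.
        totals = Counter()
        for doc in documents:
            totals.update(t for t in re.findall(r"[a-z]+", doc.lower())
                          if t not in STOP_WORDS)
        return {term for term, _ in totals.most_common(max_terms)}

    def to_feature_tuple(document_text, vocabulary):
        # One raw document -> {term: count}, restricted to the fixed vocabulary.
        tokens = re.findall(r"[a-z]+", document_text.lower())
        counts = Counter(t for t in tokens if t in vocabulary)
        return {term: counts.get(term, 0) for term in vocabulary}

    docs = ["Grain shipments rose sharply this quarter.",
            "The central bank cut interest rates again."]
    vocab = build_vocabulary(docs)
    print(to_feature_tuple(docs[0], vocab))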
|
text classification, text clustering, text retrieval |
definitely |
20
newsgroups
|
1000 Usenet articles were taken from each of 20 newsgroups (20000 in total).
This is the test set used in section 6.10 of Mitchell's Machine Learning
book. He classifies the articles by newsgroup using a naive Bayes
classifier, with very good results (89% accuracy). As with the Reuters
data, the articles are not preprocessed (they are actually quite messy),
but Mitchell describes in his book how to do it (and it's easy: removing
the most and the least frequent words, which should not be too difficult
using perl, I guess). A link is also provided to the Bow project, a
library of C code for text analysis.
|
|
20000 (a mini-version of 2000 documents is also available) |
not applicable |
61.6Mbyte (6.2 for the mini-version) |
text data set |
Here too the text is in its original format, so a conversion to feature
vectors is necessary (Mitchell explains how in his book; a rough sketch
of the frequency-based filtering follows below). The texts are quite
messy (take a look at this example); some hardly contain any text.
Still, the fact that each text belongs to exactly one of twenty
categories makes the classification task a bit easier.
|
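The frequency-based filtering Mitchell describes can be sketched roughly as
follows; the cut-off values are guesses of mine, not the ones used in the book.

    from collections import Counter

    def filter_vocabulary(tokenised_articles, min_df=3, top_k_to_drop=100):
        # Drop words occurring in fewer than min_df articles (too rare)
        # and the top_k_to_drop most frequent words (too common).
        document_frequency = Counter()
        for tokens in tokenised_articles:
            document_frequency.update(set(tokens))
        too_common = {w for w, _ in document_frequency.most_common(top_k_to_drop)}
        return {w for w, df in document_frequency.items()
                if df >= min_df and w not in too_common}

    articles = [["the", "space", "shuttle", "launch"],
                ["the", "hockey", "game", "tonight"],
                ["space", "station", "orbit", "the"]]
    print(filter_vocabulary(articles, min_df=2, top_k_to_drop=1))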
|
text classification, text clustering, text retrieval |
yes, maybe: the fact that it is used by Mitchell in his book could
be interesting. |
The 1998 KDD cup
|
PVA is a not-for-profit organisation that provides programs and services
for US veterans with spinal cord injuries or disease. It raises money
via direct mailing campaigns. The organisation is interested in lapsed
donors: people who have stopped donating for at least 12 months. The
available dataset contains a record for every donor who received the
1997 mailing and did not make a donation in the 12 months before that.
Data are given about the previous and the current mailing campaign, as
well as personal information and the giving history of each lapsed
donor. Overlay demographics were also added. See
the documentation and
the data dictionary for more info. The aim is to predict the
amount (if anything) each lapsed donor is likely to give. One
interesting aspect is that there tends to be an inverse relationship
between the probability of donating and the amount donated, making a
simple classification (as donor or not)
insufficient.
|
|
Train: 95412 Test: 96367 |
481 |
Train: 117.2 Mbyte Test: 119 Mbyte |
Customer DB, non-profit |
Missing values: some attributes have quite a lot of them. The KDD cup
documentation gives some advice on how to treat the different
attributes with missing values. Noisy data: the documentation
points out that there are some records with formatting errors.
Feature extraction: there are far too many features, and it will be
important to choose a good subset (or to construct good new or
aggregated features from existing ones); the winners claim that the
secret of their success lies in good feature selection. Unbalanced
data: only about 5% of the records are donors.
|
|
Classification and regression: the main aim is to predict the amount
donated, but a classification as donor or not can also be useful (the
winners used a double regression model: one gave the probability of
donating and the other the conditional donation amount; a minimal
sketch of this two-stage idea follows below). The task for the KDD cup
was to estimate the expected donation amount and to keep those
customers in the test set who were likely to give more than the cost of
mailing an individual. For that selection, the real profit was then
evaluated using the labels of the test set.
|
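A minimal sketch of the winners' two-stage idea: one model for the probability
of donating, one for the conditional donation amount, and a mailing decision
based on the expected value. The feature matrix, the model choices and the
0.68 default mailing cost are my own placeholders, not taken from the cup
documentation.

    import numpy as np
    from sklearn.linear_model import LinearRegression, LogisticRegression

    def expected_donation(X_train, donated, amount, X_test, mailing_cost=0.68):
        # Stage 1: P(donate); Stage 2: E[amount | donate]; mail if expectation > cost.
        clf = LogisticRegression(max_iter=1000).fit(X_train, donated)
        reg = LinearRegression().fit(X_train[donated == 1], amount[donated == 1])
        expected = clf.predict_proba(X_test)[:, 1] * reg.predict(X_test)
        return expected, expected > mailing_cost

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))                      # toy stand-in for the 481 attributes
    donated = (rng.random(200) < 0.05).astype(int)     # roughly 5% donors, as in the data
    amount = np.where(donated == 1, rng.uniform(5, 50, 200), 0.0)
    expected, mail = expected_donation(X, donated, amount, rng.normal(size=(20, 5)))
    print(mail.sum(), "of 20 test records would be mailed")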
|
yes: a lot of useful, real-world data and
a clear-cut task |
|
The CoIL KDD challenge
|
The task in this KDD challenge is to predict which customers will buy a
caravan insurance policy and to describe their profile (so there are two
tasks: prediction and description). The data set is derived from a real
business problem and contains 86 variables for every potential customer:
43 socio-demographic variables derived from the zip area code and 43
variables about ownership of insurance policies.
The challenge's homepage, with 29 reports of submitted solutions, is
available on-line.
|
|
Train: 5822 Test: 4000 |
86 |
1.7 Mbyte |
Customer DB, profit |
Feature selection and extraction: according to the winners of the
prediction task, there are only a few useful features. They augmented
these with two self-derived features to get an optimal prediction. The
winners of the description task used more features (but still no more
than 21). The winners' papers are available on-line. Unbalanced data:
only 5 to 6% of the customers in the training data set actually buy the
insurance policy. I don't think there are any missing or noisy data;
all the data have been converted to numeric format.
|
|
Classification: the first formal task of the challenge was to identify
the 800 most likely buyers of a caravan insurance policy in the test
data set (a minimal ranking sketch follows below). Description: the
second task was to describe the profile of a typical buyer.
|
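A minimal sketch of the prediction task: train any probabilistic classifier
and keep the test customers with the highest predicted purchase probability.
The random forest and the synthetic data are placeholders; the winners used
far fewer, carefully selected features.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def top_buyers(X_train, y_train, X_test, n=800):
        # Rank test customers by predicted purchase probability, return indices of the top n.
        model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)
        scores = model.predict_proba(X_test)[:, 1]
        return np.argsort(scores)[::-1][:n]

    rng = np.random.default_rng(1)
    X_train = rng.normal(size=(500, 86))               # toy stand-in for the 86 variables
    y_train = (rng.random(500) < 0.06).astype(int)     # roughly 6% buyers
    X_test = rng.normal(size=(100, 86))
    print(top_buyers(X_train, y_train, X_test, n=10))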
|
yes: real data with two clear data mining
tasks. Probably easier than the previous set, though, as the data are
better prepared. |
|
Anonymous web data of Microsoft
|
This dataset contains data about page accesses on the Microsoft web site
during one week: for each user, a list of the site areas they accessed.
It was used for testing collaborative filtering systems in the paper
Empirical Analysis of Predictive Algorithms for Collaborative Filtering.
As every user only visits a few pages (or rather page categories) per
week (the average in the dataset is 3), the dataset is very sparse. The
data are in a special format for sparse datasets.
|
|
Train: 32711 users Test: 5000 users |
294 pages |
Train: 305K Test: 52K |
Web usage data |
The fact that the dataset is very sparse might cause some difficulties.
|
|
Collaborative filtering: recommend pages to new users based on what
others visited before them. This comes down to predicting what other
pages a user will visit given a few they have already visited. In the
tests in the paper referred to above, the experimenters reveal a number
of randomly selected visits for every user in the test set and try to
predict the remaining pages (a simple co-occurrence baseline is sketched
below).
|
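A simple co-occurrence baseline for this protocol (not one of the algorithms
from the paper): score unseen page areas by how often they co-occur with the
pages a user has already been shown.

    from collections import Counter, defaultdict

    def build_cooccurrence(train_visits):
        # Count how often two page areas are visited by the same user.
        co = defaultdict(Counter)
        for pages in train_visits:
            for p in pages:
                for q in pages:
                    if p != q:
                        co[p][q] += 1
        return co

    def recommend(observed_pages, co, k=3):
        # Score candidate pages by co-occurrence with the observed ones.
        scores = Counter()
        for p in observed_pages:
            scores.update(co.get(p, Counter()))
        for p in observed_pages:
            scores.pop(p, None)
        return [page for page, _ in scores.most_common(k)]

    train = [{"1001", "1002"}, {"1001", "1003"}, {"1002", "1003", "1004"}]
    print(recommend({"1001"}, build_cooccurrence(train), k=2))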
|
maybe: I think the use of this dataset is limited to what was done in the paper.
|
|
EachMovie collaborative filtering dataset |
This dataset contains 18 months of votes for movies, gathered by Compaq
for collaborative filtering experiments. There is some information about
the voters (age, gender, zipcode) and about the movies (video_release,
theater_release, genre, ...). Every vote is on a 0-5 scale. This dataset
was used by Compaq to build a collaborative filtering system (some info
is available on-line, but it is not really interesting), in Empirical
Analysis of Predictive Algorithms for Collaborative Filtering (the same
paper as mentioned above), and in Mining the Network Value of Customers
(2001) by P. Domingos and M. Richardson (this paper contains useful
information about the difficulties with the data and the necessary
preprocessing). To access the data you need
username (ducatelle) and password (ianc$Cog.Sci).
|
|
Voters: 72916 Movies: 1628 Votes: 2811983 |
Voters: 4 Movies: 9 Votes: 5 |
17.6 Mbyte (compressed) |
movie voting data |
I have not seen the data yet, but I think they will need some preprocessing
(there is for example a high number of 0 votes, which might indicate a problem).
|
|
Collaborative filtering: predicting someone's votes for other movies,
using the same test setup as for the web data above (a correlation-based
baseline is sketched below). Description: as more details are available
than in the web usage data, I think clustering and association rules
should also be possible.
|
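For the vote prediction task, a classic memory-based baseline (of the family
compared in the Empirical Analysis paper) is Pearson-correlation weighting
over other voters; this is only a sketch on toy data, not the Compaq system.

    import numpy as np

    def predict_vote(ratings, user, movie):
        # Predict ratings[user, movie]; NaN marks missing votes.
        target = ratings[user]
        baseline = np.nanmean(target)
        numerator, weight_sum = 0.0, 0.0
        for other in range(ratings.shape[0]):
            if other == user or np.isnan(ratings[other, movie]):
                continue
            both = ~np.isnan(target) & ~np.isnan(ratings[other])
            if both.sum() < 2:
                continue
            w = np.corrcoef(target[both], ratings[other][both])[0, 1]
            if np.isnan(w):
                continue
            numerator += w * (ratings[other, movie] - np.nanmean(ratings[other]))
            weight_sum += abs(w)
        return baseline + numerator / weight_sum if weight_sum else baseline

    votes = np.array([[4.0, 5.0, np.nan, 2.0],
                      [4.0, 4.0, 3.0, 2.0],
                      [1.0, 2.0, 5.0, np.nan]])
    print(round(predict_vote(votes, user=0, movie=2), 2))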
|
More interesting than the web usage data: more details are available,
and the dataset is less sparse (on average 39 votes per user).
|
|
The yeast
S. cerevisiae gene expression vectors
|
|
2467 genes |
79 measurements 6 class labels |
Measurements: 1.7 Mb Labels: 125 Kb |
Biological data |
Like many of the other data sets, this set is unbalanced. Some of the
classes actually have very few cases:
- class 1: 17
- class 2: 30
- class 3: 121
- class 4: 35
- class 5: 11
- class 6: 16
Also, the data set is quite small. In the SVM paper, this is overcome
by using three-fold cross-validation (a rough sketch follows below).
Thirdly, some instances have expression levels that deviate from those
of the rest of their class; these cannot possibly be classified
correctly. Fortunately, an overview of unavoidable false positives and
negatives is given in the paper.
Fourthly, there are quite a lot of missing values: 3760. They are
fairly evenly distributed over rows and columns (there are no columns
without missing values, and only 25% of the records have no missing
values).
Finally, I think the data might contain quite a lot of noise.
|
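A rough sketch of the three-fold evaluation, with simple column-mean
imputation for the missing values; the imputation strategy and SVM settings
are my assumptions, not the ones from the SVM paper, and the data below are
synthetic stand-ins.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def evaluate(expressions, labels):
        # Impute missing measurements, then classify with an SVM, scored by 3-fold CV.
        model = make_pipeline(SimpleImputer(strategy="mean"),
                              SVC(kernel="rbf", class_weight="balanced"))
        return cross_val_score(model, expressions, labels, cv=3)

    rng = np.random.default_rng(2)
    X = rng.normal(size=(60, 79))                  # 79 measurements per gene
    X[rng.random(X.shape) < 0.05] = np.nan         # scatter some missing values
    y = rng.integers(0, 2, size=60)                # toy binary class labels
    print(evaluate(X, y))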
|
Classification can be done as in the SVM paper. Clustering can also be
done, as in Eisen's paper; the clustering should be able to reflect the
classes. The number of classes provided in this dataset is quite
limited, though (the authors used only these classes because not all
gene classes are predictable on the basis of expression information;
the classes used here are the ones that clustered together in Eisen's
work, which is also based on expression data). More class information
can be found in the on-line MIPS
yeast genome database.
|
|
Useful but quite challenging, I think.
|
|
The Landsat image data from Statlog
|
This dataset contains pixel information
for an 82*100 pixel part of an image of the earth's surface taken from
a satellite. For each pixel, four intensity values (each between 0 and
255) are given, because the photos are taken in four different spectral
bands. Each pixel represents an area of 80*80 metres on the earth's
surface. Every line in the dataset gives the 4 values for 9 pixels (a
3*3 neighbourhood frame around a central pixel). The aim is to classify
the central pixel (a piece of land) into 1 of 6 classes based on its
values and those of its immediate neighbours. More info about this is
given in the documentation file.
|
|
Training: 4435 Test: 2000 |
36 values (4 values * 9 pixels) |
Training: 514Kb Test: 231Kb |
Spatial image data |
There are no missing data. All data are in numeric form,
and I don't think any reformatting is necessary.
|
|
The obvious task is classification.
|
|
I am not sure if these data are really
useful. They are already preprocessed (which is probably the most
challenging aspect of spatial data), and the one obvious task is
classification. This is more a machine learning test set. For an exercise
with spatial data you'd probably want something more
challenging.
|
|
Colon cancer data
|
Contains expression levels of 2000 genes measured in 62 different
samples. For each sample it is indicated whether it is a tumor biopsy
or not. Numbers and descriptions for the different genes are also
given. This dataset is used in many different research papers on gene
expression data. It can be used in two ways: you can treat the 62
samples as records in a high-dimensional space and classify them as
cancerous or not (see Tissue Classification with Gene Expression
Profiles (2000) by Ben-Dor et al.), or you can treat the genes as
records and do cluster analysis to find out which genes are similar
(like in the clustering paper for the yeast data set mentioned above).
A combination of both is also possible, as described in
Coupled two-way clustering analysis of gene microarray data
(2001) by Getz, Levine and Domany. A short overview of work
in this domain (much of which uses this dataset) can be found in
Gene expression data analysis (2000) by Brazma and Vilo.
|
|
2000 genes 62 samples |
2000 genes 62 samples |
data: 1.9 Mb names: 529 Kb labels: 207 bytes |
gene expression data |
When samples are treated as records
(probably the most interesting case), two important problems arise:
there are only a few records (a good reason to use techniques like
leave-one-out cross-validation), and there is a very high number of
features (so feature selection, in the form of gene selection, becomes
important); a minimal sketch of both follows below. Another problem is
sample contamination: samples are usually quite noisy. Also, there are
differences between cancerous and non-cancerous samples that are not
interesting for the kind of research performed here: e.g. non-tumor
samples usually contain more muscle-specific genes, and tumor samples
usually contain more ribosomal genes. According to Ben-Dor et al., all
this does not pose too many problems, though (see section 5 of their
paper for an overview). There are no missing values.
|
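A minimal sketch of leave-one-out cross-validation with gene selection done
inside each fold (so the selection never sees the held-out sample); the number
of genes kept and the classifier are my own choices, and the matrix below is a
synthetic stand-in for the real 62 x 2000 data.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import SVC

    def loocv_accuracy(expressions, tumour_labels, n_genes=50):
        # Gene selection and classification are refit in every leave-one-out fold.
        model = make_pipeline(SelectKBest(f_classif, k=n_genes),
                              SVC(kernel="linear"))
        return cross_val_score(model, expressions, tumour_labels, cv=LeaveOneOut()).mean()

    rng = np.random.default_rng(3)
    X = rng.normal(size=(62, 2000))        # 62 samples x 2000 genes
    y = rng.integers(0, 2, size=62)        # 1 = tumor, 0 = normal (toy labels)
    print("LOOCV accuracy:", round(loocv_accuracy(X, y), 2))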
|
The one obvious task is classification of cancer samples. As mentioned
in the general description, clustering can also be done, in either
direction (over samples or over genes).
|
|
I think this dataset is quite challenging. It could be interesting, though,
because the specific problems it poses are different from those of other datasets.
|
|
Leukemia data set
|
The Leukemia data set contains expression
levels of 7129 genes taken over 72 samples. Labels indicate which of
two variants of leukemia is present in each sample (AML, 25 samples,
or ALL, 47 samples). This data set is of the same type as the previous
one and can therefore be used for the same kind of experiments. In
fact, most of the papers that use the colon cancer data also use the
leukemia data. One paper that uses only the leukemia data set is
Feature selection for high-dimensional genomic microarray data
(2001) by Xing et al.
|
|
Train: 38 samples Test: 34 samples 7129 genes |
Train: 38 samples Test: 34 samples 7129 genes |
Train: 2Mb Test: 1.8Mb |
gene expression data |
The same comments as for the colon cancer
data can be made here.
|
|
See the colon cancer data. |
|
See the colon cancer data. |
|
Human splice site data
|
This data set contains DNA sequences around human splice sites. A
splice site is the point at the beginning (donor site) or at the end
(acceptor site) of an intron (a non-coding part of the gene's DNA
sequence). These sites typically correspond to certain patterns, but
the same patterns can also be found in other places, so it is important
to learn better classifiers to identify splice sites. In the past,
people have used probability matrices which encode the probability of
having a certain nucleotide in a certain position (or variations on
that, see Ficket's overview paper); a small position-weight-matrix
sketch follows below. A disadvantage of this method is that
dependencies between positions are not taken into account. Other
methods build a conditional probability matrix to account for
dependencies between adjacent positions (see Salzberg's paper), or
split up the dataset before building probability matrices, so that the
probabilities are in fact conditional on the split of the dataset (the
MDD method: see pages 13 to 15 of Prediction of complete gene
structures in human genomic DNA (1997) by Burge and Karlin; page 9 of
this paper gives a schematic overview of the structure of a gene's DNA
sequence). Neural networks have also been used quite successfully (see
Training Knowledge-Based Neural Networks to Recognize Genes in DNA
Sequences (1991), which was trained on older data available at the UCI
machine learning repository). Many methods rely not only on base
positions but also on other features, such as the presence of certain
groups of nucleotides (see Ficket's overview paper), to get the best
results. The dataset described here was used to train and test GENIE, a
gene finding system. It is divided into 8 different sets: acceptor
versus donor sequences, training versus test data, and true versus
false examples.
|
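A small sketch of the position-weight-matrix idea mentioned above: estimate
per-position base probabilities from the true sites and score new windows by
summed log-probabilities. The toy 9-base donor windows are invented; the real
donor data have 15 positions and the acceptor data 90.

    import numpy as np

    BASES = "ACGT"

    def position_weight_matrix(true_sites, pseudocount=1.0):
        # Estimate P(base at position i) from aligned true splice-site windows.
        length = len(true_sites[0])
        counts = np.full((length, len(BASES)), pseudocount)
        for seq in true_sites:
            for i, base in enumerate(seq):
                counts[i, BASES.index(base)] += 1
        return counts / counts.sum(axis=1, keepdims=True)

    def log_score(sequence, pwm):
        # Higher score = more splice-site-like, under the independence assumption.
        return sum(np.log(pwm[i, BASES.index(b)]) for i, b in enumerate(sequence))

    donors = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGG"]   # toy donor windows
    pwm = position_weight_matrix(donors)
    print(log_score("CAGGTAAGT", pwm), ">", log_score("ACACACACA", pwm))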
|
acceptor (a) / donor (d) - train (t) / test (e) - true (t) / false (f):
a-e-f: 881
d-e-f: 782
a-e-t: 208
d-e-t: 208
a-t-f: 4672
d-t-f: 4140
a-t-t: 1116
d-t-t: 1116
|
donor data: 15 base positions; acceptor data: 90 base positions
|
acceptor (a) / donor (d) - train (t) / test (e) - true (t) / false (f):
a-e-f: 16k
d-e-f: 7k
a-e-t: 7k
d-e-t: 2k
a-t-f: 82k
d-t-f: 36k
a-t-t: 37k
d-t-t: 11k
|
dna sequence data |
The data are well prepared; I don't think there are any missing
values.
|
|
There are two binary classification tasks:
donor site or not, and acceptor site or not. Maybe running rule
extraction algorithms could also be interesting.
|
|
I'm not sure whether this is good as a data mining exercise. It could
be, as neural networks and decision trees have been used before; maybe
SVMs are also possible. The best results are obtained by combining base
position information with other features (such as the presence of
certain pairs and triplets of nucleotides), but even without those it
should be possible to find useful information. I wonder whether it is
possible to extract these other features automatically without the
right biological background knowledge (maybe by detecting frequent
episodes in sequences, see Smyth's book section 13.5). Once classifiers
have been built, they could actually be tested on the Burset and Guigo
set described below (there might be an overlap, though, between the
training data of this set and the data in Burset and Guigo's set).
|
|
Volcanoes on Venus
|
This dataset contains images collected by
the Magellan expedition to Venus. Venus is the planet most similar to
Earth in size, and therefore researchers want to learn about its
geology. Venus' surface is scattered with volcanoes, and the task for
this dataset is to develop a program that can automatically identify
them (from training data that have been labeled by humans on a scale
from 1, definitely a volcano, to 4, definitely not one). A tool called
JARtool was developed to do this. The makers of this tool provide the
data to allow more research and to establish a benchmark in this field.
They provide 134 images of 1024*1024 8-bit pixels in total (out of the
30000 images of the original project). Very interestingly, they also
provide a preprocessed version of the data: possibly interesting pixel
frames ('chips') taken from the images by their image recognition tool,
each labeled between 0 (not labeled by the humans), 1 (definitely a
volcano) and 4 (definitely not a volcano). This is a good format for a
data mining exercise, as the feature extraction from pixels (e.g. with
PCA) still has to be done, but the difficult task of identifying
interesting regions is taken care of. For more information, read the
data documentation or the following paper by Burl, Smyth, Fayyad and
others: Learning to Recognize Volcanoes on Venus (1998).
|
|
The chips are spread over different groups, according to experiments
carried out for the JARtool software. There is some overlap between the
groups, but I think the training and test sets of experiments C1 and D4
together cover all chips (see the experiments
table):
- C1_trn: 12018 records
- C1_tst: 16608 records
- D4_trn: 6398 records
- D4_tst: 2256 records
|
They are 15*15 pixel frames.
|
187 Mbyte, but that includes everything, even the original images.
Restricted to the pixel frames it is 31 Mb; restricted to the frames
covering the complete dataset (see records), it is 8.4 Mb.
|
image data |
It will be necessary to normalise the pixel frames, as there is a
difference in brightness between the different images and even between
different parts of the same image. Feature extraction (the original
JARtool project uses PCA) will also be necessary, because each frame
contains quite a lot of pixels, resulting in a high number of
attributes compared to the total number of training examples; a rough
sketch of both steps follows after the class distribution below. An
important challenge, as in most of the previous datasets, will be the
fact that the data are unbalanced (few positive examples). This is the
class distribution:
- 0 (not considered by human experts): 36058
- 1 (definitely a volcano): 127
- 2 (probably): 254
- 3 (possibly): 397
- 4 (only a pit): 444
|
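A rough sketch of the two preprocessing steps mentioned above: per-chip
brightness normalisation followed by PCA to reduce the 225 pixels per chip to
a handful of features. The chip array below is synthetic, and the number of
components is an arbitrary choice, not the one used by JARtool.

    import numpy as np
    from sklearn.decomposition import PCA

    def normalise_chips(chips):
        # Subtract each chip's mean and divide by its standard deviation
        # (per-chip brightness correction).
        flat = chips.reshape(len(chips), -1).astype(float)
        flat -= flat.mean(axis=1, keepdims=True)
        flat /= flat.std(axis=1, keepdims=True) + 1e-8
        return flat

    rng = np.random.default_rng(4)
    chips = rng.integers(0, 256, size=(500, 15, 15))    # stand-in for the 15*15 pixel chips
    features = PCA(n_components=6).fit_transform(normalise_chips(chips))
    print(features.shape)                               # 500 chips x 6 principal components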
|
Obviously, the main task is to classify pixel frames as volcanoes or
not, on a scale between 0 and 4. As described in the paper Learning to
Recognize Volcanoes on Venus (1998), clustering is also possible, as
there are in fact different kinds of volcanoes; a classifier can then
be built for each kind.
|
|
This could be a useful spatial dataset. It won't be easy, though,
mainly because of the small number of positive examples: as described in
the paper mentioned before, feature extraction will be important to reduce
the number of pixels compared to the number of training records.
|
|
Network
intrusion data
|
These data were used for the '99 KDD cup. They were gathered by Lincoln
Labs, who collected nine weeks of raw TCP dump data from a typical US
Air Force LAN. While the LAN was in use, they performed several attacks
on it. The raw packet data were aggregated into connection records, and
for each record extra features were derived based on domain knowledge
about network attacks. There are 38 different attack types, belonging
to 4 main categories. Some attack types only appear in the test data,
and the frequency of attack types in test and training data is not the
same (to make it more realistic). More information about the data can
be found in the task file, in Cost-based Modeling and Evaluation for
Data Mining With Application to Fraud and Intrusion Detection: Results
from the JAM Project, and in the overview of the KDD cup results. On
that page it is also indicated that there is a cost matrix associated
with misclassifications. Another paper that uses the same dataset is
PNrule: A New Framework for Learning Classifier Models in Data Mining
(A Case-Study in Network Intrusion Detection) (2000); it uses simple
general-to-specific rule learning in two stages. See also Data Mining
Techniques for Intrusion Detection: apart from giving a good solution
architecture, this paper provides a thorough description of the data
and the needed preprocessing. The winner of the KDDcup99 competition
used C5 decision trees in combination with boosting. A similar data
set, but with raw data instead of aggregated records, is described
below.
|
|
train: 4,940,000 (a smaller set of 10% is also available)
test: around 3,110,290 (a 10% set is again available) |
41 attributes and 1 label |
Full training set: 743 Mb; 10% training set: 75 Mb; full test set: 430
Mb; 10% test set: 45 Mb |
LAN connection fraud data |
The major problem is probably that the data set is very unbalanced:
while the DoS attacks appear in 79% of the connections, the u2r attacks
only appear in 0.01% of the records, and this least frequent attack
type is at the same time the most difficult to predict and the most
costly to miss. Also, many of the attributes have a sparse distribution
with a large variance. According to Data Mining Techniques for
Intrusion Detection, all of the attributes are important, so no data
reduction can be done. A little data preprocessing is necessary, but it
is well described in the same paper.
|
|
The task is classification of connections: one of the 4 intrusion
categories or normal. Techniques for dealing with infrequent classes
will be necessary. An interesting approach is proposed in Data Mining
Techniques for Intrusion Detection, where the most obvious class is
tested before the others, so that the remaining classifiers are in fact
conditional on the record not belonging to the first class (a cascade
of this kind is sketched below). Maybe techniques for outlier analysis
can be used for the least frequent attack class (see Han and Kamber
section 8.9). Visualisation could also be interesting (although
difficult if indeed all 41 attributes are important).
|
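A sketch of the cascaded idea mentioned above: one binary detector per class,
applied in a fixed order, where each later detector is trained only on records
not claimed by earlier ones. Everything here (features, class proportions, the
decision-tree settings) is synthetic and illustrative, not the setup from the
paper.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def fit_cascade(X, y, ordered_classes):
        # Train one binary detector per class; later detectors only see the remainder.
        stages, remaining = [], np.ones(len(y), dtype=bool)
        for cls in ordered_classes:
            clf = DecisionTreeClassifier(max_depth=8)
            clf.fit(X[remaining], (y[remaining] == cls).astype(int))
            stages.append((cls, clf))
            remaining &= (y != cls)
        return stages

    def predict_cascade(stages, X, default="normal"):
        # Apply the detectors in order; the first one that fires decides the label.
        pred = np.array([default] * len(X), dtype=object)
        undecided = np.ones(len(X), dtype=bool)
        for cls, clf in stages:
            hit = undecided & (clf.predict(X) == 1)
            pred[hit] = cls
            undecided &= ~hit
        return pred

    rng = np.random.default_rng(5)
    X = rng.normal(size=(2000, 10))
    y = rng.choice(["dos", "probe", "r2l", "u2r", "normal"], size=2000,
                   p=[0.79, 0.10, 0.05, 0.01, 0.05])
    stages = fit_cascade(X, y, ordered_classes=["dos", "probe", "r2l", "u2r"])
    print("training accuracy:", (predict_cascade(stages, X) == y).mean())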
|
I think this could be a good data set, because the task is clear but
not too obvious, and the type of dataset is different from the others
so far.
|
|