Datasets for Data Mining
This page contains a list of datasets that were selected for the
projects for Data Mining and Exploration. Students can choose one of
these datasets to work on, or can propose data of their own choice. At the bottom of this page, you will find some
examples of datasets which we judged as inappropriate for the projects.
- Description: This data set was used in the KDD Cup 2004 data
mining competition. The training data is from high-energy collision
experiments. There are 50 000 training examples, describing the
measurements taken in experiments where two different types of
particle were observed. Each training example has 78 numerical
attributes.
- Size:
- 50 000 training examples, 100 000 test examples
- 78 numerical attributes
- 147 MB as uncompressed text
- References:
- Task:
Perform
exploratory data analysis to get a good feel for the data and prepare
the data for data mining.
Train at least two classifiers to distinguish between two types
of particle generated in high-energy collider experiments. The
original competition asked participants to provide four separate sets
of predictions, optimising separately the accuracy, area under the ROC
curve, cross-entropy, and q-score. Software to calculate these
measures can be downloaded from the
competition website.
- Challenges: No labels are given to the attributes to help
interpret them. There is missing data for 8 of the attributes (with
out-of-range values of 999 and 9999 used as placeholders); one way to
handle these placeholders is sketched below.
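A minimal sketch (in Python with scikit-learn, which is an assumption:
no tools are prescribed) of one way to deal with these placeholders:
convert the sentinels to proper missing values, then impute. The small
array stands in for the real 50 000 x 78 attribute matrix.

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Tiny stand-in for the real attribute matrix; load the actual
    # training file here instead.
    X = np.array([[1.2, 999.0,    3.4],
                  [2.1,   0.5, 9999.0],
                  [0.7,   0.6,    1.1]])

    # Treat the documented sentinels (999 and 9999) as missing values.
    X[np.isin(X, [999.0, 9999.0])] = np.nan

    # Median imputation is one simple choice; more careful treatment
    # of the 8 affected attributes may well pay off.
    X_imputed = SimpleImputer(strategy="median").fit_transform(X)
    print(X_imputed)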
- Description: This data set was made available for the
Physiological
Data Modeling Contest at ICML 2004. The data was collected from
subjects using BodyMedia wearable body monitors while performing their
usual activities. These monitors record acceleration, heat flux,
galvanic skin response, skin temperature, and near-body
temperature. The training data set includes several sessions for each of
multiple subjects, with measurements stored each minute during a
session. The test data set includes further sessions from the same
subjects, as well as sessions recording measurements from new subjects
who did not feature in the training data. Each record in the data
includes an annotation code giving information about the kind of
activity that the subject was performing at that time. Participants in the
competition were asked to train classifiers to apply two of these annotation
codes to the test data, and also to train a classifier to identify
subjects as men or women (this information is given in the training
data sequences).
- Size:
- About 10 000 hours of training data, 12 000 hours of
test data
- One record per minute in a session
- 16 fields in each record, including 9 fields of physiological data
- 138 MB as uncompressed text
- References:
- Task:
Perform
exploratory data analysis to get a good feel for the data and prepare
the data for data mining.
Train at least two different classifiers to detect entries in the test data
corresponding to two annotated states in the training data. Train a
classifier to predict the gender of the subjects in the test data.
(You may wish to focus on only a subset of the predictive tasks.)
- Challenges: Only a small proportion of the training data
corresponds to the two annotation states of interest, so there are
many more negative than positive examples (one way to handle this is
sketched below). Much of the data is
not annotated (the annotation field contains zero).
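One cheap way to cope with the negative/positive imbalance is to
reweight the classes rather than discard data. The sketch below uses
scikit-learn on synthetic stand-in data and scores with AUC, since
plain accuracy is misleading on such skewed labels.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in with ~5% positives; replace with the
    # per-minute records and a 0/1 label for one annotation code.
    X, y = make_classification(n_samples=2000, n_features=9,
                               weights=[0.95], random_state=0)

    # class_weight="balanced" upweights the rare positive examples
    # instead of throwing negatives away.
    clf = LogisticRegression(max_iter=1000, class_weight="balanced")
    print(cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean())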
- Description: This data set was used in the BCI
Competition III (dataset V). Using a cap with 32 integrated
electrodes, EEG data were collected from three subjects
while they performed three activities: imagining moving their left hand,
imagining moving their right hand, and thinking of words beginning
with the same letter. As well as the raw EEG signals, the data set
provides precomputed features obtained by spatially filtering these
signals and calculating the power spectral density.
- Size:
- 31216 records in training data, 10464 in test data
- Each record has 96 continuous values and a numerical label
- 63 MB as uncompressed text
- Reference:
- Task:
Perform
exploratory data analysis to get a good feel for the data and prepare
the data for data mining. Train at least two different classifiers
to assign class labels to the test data to indicate
which activity the subject was performing while the data were
collected.
- Challenges: This data set represents time series of EEG
readings. A baseline approach could be based on the given
precomputed features.
It might also be possible to train a classifier on a window of
some size around each time step (see the sketch below). Both of these
approaches ignore the
fact that the data is really a time series; one might consider using
an explicit time-series
model such as a Hidden Markov Model.
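A sketch of the windowing idea mentioned above: stack a few
consecutive time steps into one feature vector labelled by the centre
step. Pure NumPy; the random arrays are stand-ins for one recording
session.

    import numpy as np

    def make_windows(series, labels, width):
        """Stack `width` consecutive time steps into one feature
        vector, labelled by the centre step."""
        half = width // 2
        X = [series[t - half:t + half + 1].ravel()
             for t in range(half, len(series) - half)]
        return np.array(X), labels[half:len(series) - half]

    # Stand-in for one session: 1000 steps of 96 precomputed features.
    series = np.random.randn(1000, 96)
    labels = np.random.randint(3, size=1000)   # three imagined tasks
    Xw, yw = make_windows(series, labels, width=5)
    print(Xw.shape)   # (996, 480): 5 steps x 96 features per example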
- Description: This dataset was used in the KDD Cup 2001 data
mining competition. The competition in fact posed two tasks on this
dataset: prediction of the "Function" attribute and prediction of the
"Localization" attribute. Here we focus on the latter, which is
somewhat easier because genes can have many functions but only one
localization, at least in this dataset. The dataset provides
a variety of details about the genes of one particular
organism. The main dataset (the downloadable files are
Genes_relation.{data,test}) contains rows of the
following form:
Gene ID, Essential, Class, Complex, Phenotype, Motif, Chromosome
Number, Function, Localization.
The first attribute is a discrete variable
identifying the gene (there are 1243 gene values).
The remaining 8 attributes are also
discrete variables, most of
them related to the proteins coded for by the gene: e.g. the "Function"
attribute describes crucial functions the respective protein is involved
in, and "Localization" is simply the part of the cell where the
protein is localized.
In addition to the data of the above form,
there are also data files (Interactions.relations.{data,test})
which contain information
about interactions between pairs of genes.
- Size:
- Gene_relation files: 6275 examples (4346 training, 1929 test),
9 categorical attributes.
- Interaction_relation files: 1806 records, 2 attributes
(one categorical; one numerical)
- 1 MB
- References:
- Task:
Perform
exploratory data analysis to get a good feel for the data and prepare
the data for data mining.
The task on this dataset is to predict
the "Localization" attribute.
Compare at least 2 different classifiers. Another possible comparison
is between performance with and without the use of the
interactions data. One possible classifier that handles missing data
easily (but does not use the interaction data)
is a belief network that has learned relationships between the
Essential, Class, Complex, Phenotype, Motif, Chromosome Number, and
Localization attributes.
- Challenges:
This dataset is a great challenge. From a data
mining point of view, the important challenge is to find a
way to use the
Interaction_relation data files efficiently, which is not obvious.
Another issue is the high proportion
of missing values in the Genes_relation data; one cheap way of coping
with them is sketched below.
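A toy sketch of that cheap treatment (the attribute values shown are
invented, not from the real files): let "?" be a category of its own,
one-hot encode, and fit a simple classifier. This ignores the
interactions data entirely.

    import pandas as pd
    from sklearn.naive_bayes import BernoulliNB

    # Invented stand-in rows for a few categorical attributes from
    # Genes_relation.data.
    df = pd.DataFrame({
        "Essential":    ["Yes", "No", "?",  "Yes"],
        "Chromosome":   ["1",   "4",  "4",  "?"],
        "Localization": ["nucleus", "cytoplasm", "nucleus", "cytoplasm"],
    })

    # One-hot encoding turns "?" into just another indicator column,
    # so missing values need no special handling downstream.
    X = pd.get_dummies(df.drop(columns="Localization"))
    y = df["Localization"]
    clf = BernoulliNB().fit(X, y)
    print(clf.predict(X))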
- Description: This dataset was used in the KDD Cup 2001
data mining competition. It was produced by
DuPont Pharmaceuticals Research Laboratories and concerns drug design.
Drugs are typically small organic molecules. The first step in
the discovery of a new drug is usually to identify and isolate the
receptor to which it should bind (in this case the
thrombin site),
followed by testing many small molecules for their ability to bind
to the target site. Molecules that bind to the site are called
"active", while the others
remain "inactive". It would be interesting to learn how to
separate active from
inactive molecules. This dataset provides data for these two
classes of drugs (active and inactive).
- Size:
- 2545 data points: 1909 for training, 636 for testing
- 139,351 binary attributes, 2 classes
- 694 MB
- References:
- Task: Carefully read all the information given on the
KDD Cup 2001 competition site about this data.
Perform
exploratory data analysis to get a good feel for the data and prepare
the data for data mining.
The task is to
learn a classifier on the training set that predicts the behaviour
of a drug (active or inactive). Note that the number of attributes
is much larger than the
number of training examples,
so an effective classifier will need feature reduction.
Train and
compare at least two classifiers. You can check your
answers on the test set by looking at the corresponding
separate file which can be downloaded from the KDD Cup
2001 site.
- Challenges: This is a difficult data set. Firstly, there is a
great imbalance between the two classes: only 42 of the 1909 training
examples belong to the active class. The larger
data mining challenge, however, concerns the huge number of binary
attributes (139,351). Selecting "good" features will be the most
important part of developing a good classifier; a minimal
feature-selection sketch follows below.
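A minimal sketch under the assumption that a simple univariate filter
is acceptable as a baseline: keep the binary attributes most
associated with the label (chi-squared score) and feed them to a
classifier. Synthetic stand-in data; the real matrix is 1909 x 139,351
and should be kept sparse.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.naive_bayes import BernoulliNB
    from sklearn.pipeline import make_pipeline

    # Small synthetic stand-in for the huge binary attribute matrix.
    X, y = make_classification(n_samples=300, n_features=1000,
                               n_informative=20, weights=[0.97],
                               random_state=0)
    X = (X > 0).astype(int)          # binarise, as in the real data

    # Keep the 200 attributes with the highest chi-squared association
    # with the class, then fit a classifier suited to binary data.
    clf = make_pipeline(SelectKBest(chi2, k=200), BernoulliNB())
    clf.fit(X, y)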
- Description: This data set contains WWW-pages collected
from computer science departments of various universities in
January 1997 by the World Wide Knowledge Base (WebKb) project
of the CMU text learning group. The 8,282 pages were manually
classified into 7 classes:
1) student, 2) faculty, 3) staff, 4) department,
5) course, 6) project and 7) other.
For each class the data set contains pages from
four universities (Cornell, Texas, Washington and
Wisconsin), plus 4,120 miscellaneous pages from other universities.
The files are organized into a
directory structure, one directory for each class. Each of
these seven directories contains 5 subdirectories, one for
each of the 4 universities and one for the miscellaneous
pages. These directories in turn contain the Web-pages.
- Size:
- 8,282 web pages, 7 classes
- 60.8 MB
- References:
- Task:
Prepare the data for mining and perform an
exploratory data analysis (these steps will probably not be
independent). The data mining task is to classify the texts according
to the 7 classes. You should compare at least 2
different classifiers.
Since each university's web pages have their own idiosyncrasies,
it is not recommended to do training and testing on pages from the same
university. We recommend training on three of the universities
plus the misc collection, and testing on the pages from a fourth,
held-out university (four-fold cross validation). An additional
topic might be to look at labelled/unlabelled data, as in the
reference.
- Challenges: An important challenge from a web mining point of
view will be the preprocessing of the dataset. Since the data are HTML
files, you have to remove all the irrelevant markup, such as HTML
tags, and convert the rest of the text into a
bag-of-words format (a sketch of this and of the held-out-university
split is given below). See the help on the 4
Universities Data Set web page about doing this with rainbow.
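A sketch of the held-out-university evaluation with a bag-of-words
representation, assuming the pages have already been stripped of HTML;
the four toy pages stand in for the cleaned texts.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy stand-ins for cleaned page texts, their class labels, and
    # the university each page came from.
    texts  = ["course homework syllabus exam",
              "professor research publications",
              "course lecture notes assignment",
              "faculty member teaching research"]
    labels = ["course", "faculty", "course", "faculty"]
    uni    = ["cornell", "cornell", "texas", "texas"]

    # Train on all universities but one, test on the held-out one.
    train = [i for i, u in enumerate(uni) if u != "texas"]
    test  = [i for i, u in enumerate(uni) if u == "texas"]

    vec = CountVectorizer()                     # bag-of-words counts
    Xtr = vec.fit_transform([texts[i] for i in train])
    Xte = vec.transform([texts[i] for i in test])
    clf = MultinomialNB().fit(Xtr, [labels[i] for i in train])
    print(clf.predict(Xte))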
- Description: These data are from the paper
Learning to remove Internet advertisements.
The dataset represents a set of possible advertisements on
Internet pages. The attributes encode the geometry of the image (if
available) as well as phrases occurring in the URL, the image's URL and
alt text, the anchor text, and words occurring near the
anchor text. There are two class labels: advertisement ("ad") and
not advertisement ("nonad"). What makes this data interesting
is that one might wish to filter irrelevant
advertisements out of web pages, e.g. as a
preprocessing step before
subsequent classification of the website.
- Size:
- 3279 (2821 nonads, 458 ads)
- 1558 attributes (3 continuous, the rest binary)
- 10 MB
- References:
- Task:
Prepare the data for mining and perform an
exploratory data analysis.
The data mining task is to predict whether an image is
an advertisement ("ad") or not ("nonad"). As you are not given
an explicit training/test split you need to decide on a reasonable
way of assessing performance. You should perform
feature reduction in order to significantly reduce the number
of features. Consider at least two different classifiers.
- Challenges: There is an imbalance in the number of examples per
class. Also, the number of attributes is very high compared to the
size of the dataset, which suggests that effective feature reduction
is very important. One or more of the three continuous features are
missing in 28% of the data. A sketch combining imputation with
cross-validated assessment is given below.
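Since no train/test split is given, stratified cross-validation is one
reasonable way to assess performance. The sketch below also chains
mean imputation for the partially missing continuous features into the
pipeline, so each fold is imputed from its own training data. All data
here are synthetic stand-ins.

    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline

    # Stand-in: 3 continuous columns (NaN in ~28% of rows) followed by
    # binary attributes, and an imbalanced ad/nonad label.
    rng = np.random.default_rng(0)
    X = rng.random((500, 20))
    X[rng.random(500) < 0.28, :3] = np.nan
    y = (rng.random(500) < 0.14).astype(int)

    pipe = make_pipeline(SimpleImputer(strategy="mean"),
                         LogisticRegression(max_iter=1000,
                                            class_weight="balanced"))
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    print(cross_val_score(pipe, X, y, cv=cv, scoring="f1").mean())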
- Description:
This is a frequently used test set for text
categorisation tasks. It contains 21578 Reuters news documents from 1987. They
were labeled manually by Reuters personnel. Labels belong to 5 different
category classes, such as 'people', 'places' and 'topics'. The total number
of categories is 672, but many of them occur only very rarely. Some
documents belong to many different categories, others to only one, and
some have no category. Over
the past decade, there have been many efforts to clean the database
up and improve it for use in scientific research. The present format
is divided into 22 files of 1000 documents delimited by SGML tags (here
is one of these
files as an example). Extensive information on the structure and the contents of
the dataset can be found in the
README file. In the past, this dataset has been split into
training and test data in many different ways. You should use the
'Modified Apte' split as described in the README file.
- Size:
- 21578 documents; according to the 'ModApte' split: 9603 training docs,
3299 test docs and 8676 unused docs.
- 27 MB
- References: This is a popular dataset for text mining
experiments. The aim is usually to predict to which categories of the
'topics' category class a text belongs. Different splits into
training, test and unused data have been considered. Previous use of the
Reuters dataset includes:
- Task:
Carefully read the
README file provided by Lewis to get an idea what the data are
about. Select the documents as specified in the description of the
'Modified Apte' split. Prepare the data for mining and perform an
exploratory data analysis (these steps will probably not be
independent). The data mining task is to classify the texts according
to the categories in the 'topics' field. You should compare at least 2
different classifiers. An extra task could be document clustering.
- Challenges:
An important challenge will be the preprocessing of the
dataset. The file is delimited by SGML tags, and the text is just
plain text format. For any text mining task, this will have to be
converted into bag-of-words format. Apart from this, you will have to
deal with texts that belong to a varying number of categories; most
classification programs can only take one category per case (a
standard workaround is sketched below).
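The standard workaround for multiply-labelled documents is one binary
classifier per topic; a sketch with toy documents, using scikit-learn's
one-vs-rest wrapper, is below.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.preprocessing import MultiLabelBinarizer

    # Toy documents, each with zero or more 'topics' labels.
    docs   = ["grain wheat exports rise", "crude oil prices fall",
              "wheat and corn shipments", "oil and gas drilling"]
    topics = [["grain", "wheat"], ["crude"],
              ["grain", "corn"], ["crude"]]

    Y = MultiLabelBinarizer().fit_transform(topics)  # column per topic
    X = TfidfVectorizer().fit_transform(docs)

    # One independent binary classifier per topic copes with documents
    # that belong to several categories at once.
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
    print(clf.predict(X))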
- Description:
This dataset was used in the 1998
KDD Cup data mining competition. It
was collected by PVA, a non-profit organisation which provides
programs and services for US veterans with spinal cord injuries or
disease. They raise money via direct mailing campaigns. The
organisation is interested in lapsed donors: people who have stopped
donating for at least 12 months. The available dataset contains a
record for every donor who received the 1997 mailing and did not make
a donation in the 12 months before that. For each of them, the data
record whether and how much they donated in response. Apart from
that, data are given about the previous and the current mailing
campaign, as well as personal information and the giving history of
each lapsed donor. Overlay demographics were also added. See
the documentation and
the data dictionary for more information.
- Size:
- 191779 records: 95412 training cases and 96367 test cases
- 481 attributes
- 236.2 MB: 117.2 MB training data and 119 MB test data
- References:
- Task:
Carefully read the information available about the dataset. Perform
exploratory data analysis to get a good feel for the data and prepare
the data for data mining. It will be important to do good feature and
case selection to reduce the data dimensionality. The data mining task
is in the first place to classify people as donors or not. Try at
least 2 different classifiers, for example logistic regression or
Naive Bayes. As an extra, you can go on to predict the amount someone
is going to give. A good way of going about this is described in
Zadrozny and Elkan's paper. The success of a solution can then be
assessed by calculating the profit of a mailing campaign targeting
all the test individuals who are predicted to give more than the cost
of sending the mail (a sketch of this calculation is given below). The
profit when targeting the entire test set is $10,560.
- Challenges:
This is definitely not an easy dataset. To start with, some of the
attributes have quite a lot of missing values, and there are some
records with formatting errors. An important issue is feature
selection. There are far too many features, and it will be necessary
to select the most relevant ones, or to construct your own features
by combining existing ones (the
KDD Cup winners claim that the secret of their success lies in
good feature selection). Case selection will also be important: the
training set is huge (95,412 cases), but contains only 5% positive
examples. Finally, building a useful model for this dataset is made
more difficult by the fact that there is an inverse relationship
between the probability of donating and the amount donated.
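A sketch of the profit evaluation mentioned in the task: sum the
actual donations of the people you chose to mail and subtract the
mailing cost. The $0.68 per-mail cost is the figure used in the
competition; the donation amounts below are invented.

    import numpy as np

    def mailing_profit(donation, selected, cost=0.68):
        """Net profit of mailing only the selected test individuals.
        `donation` holds the actual amount each person gave (0 if
        none); $0.68 is the cost of sending one mail."""
        return donation[selected].sum() - cost * selected.sum()

    # Invented toy amounts: compare mailing everyone with mailing
    # only the people a classifier picked out.
    donation = np.array([0.0, 10.0, 0.0, 0.0, 25.0])
    everyone = np.ones(5, dtype=bool)
    targeted = np.array([False, True, False, False, True])
    print(mailing_profit(donation, everyone),
          mailing_profit(donation, targeted))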
- Description:
This dataset was used for the CoIL 2000
data mining competition. It
contains customer data for an insurance company. The feature of
interest is whether or not a customer buys a caravan insurance
policy. For each
prospective customer, 86 attributes are given: 43 socio-demographic
variables derived from the customer's ZIP area code, and 43 variables
about ownership of other insurance policies.
- Size:
- 9822 records: 5822 training records and 4000 test records
- 86 attributes
- 1.7 MB
- References:
- Task:
The data mining task is to predict whether someone will buy a caravan
insurance policy. You should first do some exploratory data
analysis. Visualising the data should give you some insight into
certain particularities of this dataset. Then prepare the data for
data mining. It will be important to select the right features, and to
construct new features from existing ones, as described in the paper
by the prediction competition winner. Try out at least 2 different
data mining algorithms, and compare the use of mere feature selection
with intelligent feature construction. As an extra, you could try
the second task laid out in the CoIL competition: deriving
information about the profile of a typical caravan insurance buyer.
- Challenges:
As with the KDD Cup data, feature selection and extraction will be
very important (a toy feature-construction sketch is given below).
This can only be done properly after you have spent a
considerable amount of time getting to know the data. Also like
the KDD Cup data, the data are unbalanced: only 5 to 6% of the
customers in the training data set actually buy the insurance
policy. There are no missing or noisy data.
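A toy illustration of feature construction: collapsing several related
attributes into one new feature. The column names here are invented
for the sketch; consult the data dictionary for the real attribute
codes.

    import pandas as pd

    # Invented column names standing in for the policy-ownership
    # attributes.
    df = pd.DataFrame({"car_policies":  [1, 0, 2],
                       "fire_policies": [0, 1, 1],
                       "boat_policies": [0, 0, 1]})

    # A single constructed feature can be more informative to a
    # simple classifier than the raw counts it summarises.
    df["total_policies"] = df[["car_policies", "fire_policies",
                               "boat_policies"]].sum(axis=1)
    print(df)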
- Description:
These are the data from the paper
Support Vector Machine Classification of Microarray Gene Expression
Data. For 2467 genes, gene expression levels were measured in
79 different situations (here is the raw data
set). Some of the measurements follow each other in time, but
in the paper they were not treated as time series (although to a
certain extent that would be possible). For each of these genes, it is
given whether they belong to one of 6 functional classes (class
labels on-line). The paper is concerned with classifying
genes into 5 of these classes (one class is unpredictable). The data
contain many genes that belong to other functional classes than these
5, but those are not discernible on the basis of their gene expression
levels alone.
- Size:
- 2467 genes
- 79 measurements, 6 class labels
- 1.8 MB: 1.7 MB measurement data and 125 KB labels
- References:
-
Support Vector Machine Classification of Microarray Gene Expression
Data (1999) by M. P. S. Brown, W. N. Grundy, D. Lin,
N. Cristianini, C. Sugnet, T. S. Furey, M. Ares Jr. and
D. Haussler (local copy): This
is the original paper from which the data were obtained. It uses SVMs
to classify the genes, and compares this to other methods like
decision trees. A good description of difficulties with the data can
also be found here.
-
Cluster analysis and display of genome-wide expression patterns
(1998) by M. B. Eisen, P. T. Spellman, P. O. Brown and
D. Botstein: This paper
describes clustering of genes. Its results showed
that the 5 different classes Brown et al. are trying to predict more or less
cluster together, indicating that these classes are discernible
based on the gene expression levels. This was the basis for the selection
of these 5 functional classes for the SVM classification task.
- Task:
Read the data descriptions in the SVM paper and do exploratory data
analysis to understand the characteristics of this dataset. The data
mining task is to predict whether a gene belongs to one of the 5
functional classes, based on its expression levels. Try at least two
different classification algorithms. The low frequency of the smallest
classes will probably pose specific problems. You can also do
clustering as performed by Eisen et al.
- Challenges:
This dataset is quite noisy and contains a rather high number of
missing values. Furthermore, it is very unbalanced: there are
only a few positive examples of each of the 5 classes, and most of the
genes don't belong to any of them. Finally, there are some genes that
belong to a certain class but have different expression levels, and
there are genes that don't belong to the class whose expression
patterns they share. These cases will unavoidably lead to false
negatives and positives. An overview of these difficult cases can be
found in the SVM classification paper.
- Description:
This dataset is similar to the yeast gene expression dataset: it
contains expression levels of 2000 genes taken in 62 different
samples. For each sample it is indicated whether it came from a tumor
biopsy or not. Numbers and descriptions for the different genes are
also given. This dataset is used in many different research papers on
gene expression data. It can be used in two ways: you can treat the
62 samples as records in a high-dimensional space, or you can treat
the genes as records with 62 attributes.
- Size:
- 2000 genes
- 62 samples
- 1.9 MB data, 529 KB names, 207 bytes labels
- References:
-
Tissue Classification with Gene Expression Profiles (2000) by
A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer and
Z. Yakhini: This
paper describes classification of tissues on the colon cancer and the
leukemia (see below) datasets. It also describes how gene selection
can be done.
- Coupled Two-Way
Clustering Analysis of Gene Microarray Data (2001) by G. Getz,
E. Levine and E. Domany: This paper exploits the fact that the gene expression
dataset can be viewed in two ways. The authors describe a way of
alternating between clustering in the gene domain and in the sample
domain. This method should give insight into which genes are defining
for sample classifications (and possibly vice versa).
- Gene
expression data analysis (2000) by A. Brazma and J. Vilo: An
overview of the research in the new domain of microarray data
analysis. Much of the work described here makes use of the colon
cancer and/or the leukemia dataset.
- Task:
First perform exploratory data analysis to get familiar with the data and
prepare them for mining.
The data mining task is to classify samples as cancerous or
not. Compare at least two different classification algorithms. You
will have to deal with issues arising from the fact that there are
many attributes and only a small number of samples. Some classifiers
will be more robust to this than others. Some ideas about how to deal
with this can be found in the papers referred to above (and the feature
selection paper referenced below). As an extra you
can perform clustering, in the two different domains (genes and
samples). The tissue classification paper describes a way of using
clustering for classification: the parameters of the unsupervised
learning procedure are defined in a supervised way to make the
clusters correspond to classes.
- Challenges:
The data are quite noisy, due to sample contamination. The real
challenge, however, is the shape of the data matrix. When the genes
are treated as attributes, the dimensionality of the feature space is
very high compared to the number of cases. It will be important to
avoid overfitting. Use simple classifiers, or select the
most predictive genes. Also, the number of cases is very low, which
means that splitting into a training and a test set is not really a
good option (although it has been done for the very similar leukemia
dataset, as described in the gene expression analysis overview paper
and in the feature selection paper referenced below). When combining
feature selection with cross-validation, be careful not to use the
classifier's test data during the feature selection phase; one way to
enforce this is sketched below.
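One way to enforce that separation is to put the gene selection step
inside a pipeline, so it is re-fitted on each training fold. A sketch
with a synthetic stand-in for the 62 x 2000 matrix:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Synthetic stand-in for the 62 samples x 2000 genes matrix.
    X, y = make_classification(n_samples=62, n_features=2000,
                               n_informative=20, random_state=0)

    # Because SelectKBest sits inside the pipeline, the 50 genes are
    # re-selected on each training fold, and the held-out fold never
    # influences the selection.
    pipe = make_pipeline(SelectKBest(f_classif, k=50),
                         LinearSVC(max_iter=10000))
    print(cross_val_score(pipe, X, y, cv=5).mean())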
- Description:
The leukemia data set contains expression levels of 7129 genes taken
over 72 samples. Labels indicate which of two variants of leukemia is
present in the sample (AML, 25 samples, or ALL, 47 samples). This
dataset is of the same type as the colon cancer dataset and can therefore
be used for the same kind of experiments. In fact, most of the papers
that use the colon cancer data also use the leukemia data.
- Size:
- 72 samples, split into 38 training and 34 test samples
- 7129 genes
- 3.8 MB
- References:
- All of the references mentioned above for the colon cancer
dataset also use the leukemia data.
-
Feature selection for high-dimensional genomic microarray data
(2001) by E. P. Xing, M. I. Jordan and R. M. Karp: They
describe a three-phase feature selection method to identify the most
predictive genes. They use the division into 38 training and 34 test
samples. They find that feature selection works better than
regularization.
- Task:
The task is the same as for the colon cancer data. First perform
exploratory data analysis and prepare the data for mining. Then
compare at least two different
classifiers to identify the kind of leukemia of the sample. Again you
will have to deal with problems of high feature dimensionality. You
can choose to use the training-test set division the data are
presented in, or you can use techniques like cross-validation, as
described in the tissue classification paper. Also here, as an extra
you can perform clustering in the two different data spaces.
- Challenges:
The same comments as for the colon cancer dataset can be made: the
data are noisy, and the most important challenge is the unusual
shape of the data matrix.
- Description:
This dataset contains sequences of human DNA material around
splice sites. Gene DNA sequence data contain coding regions (exons) and
non-coding regions (introns). A splice site is the general term for the
point at the beginning (donor site) or at the end (acceptor site) of
an intron. Donor and acceptor sites typically correspond to certain
patterns, but the same patterns can also be found in other places in
the DNA sequence, so it is important to learn better classifiers to
identify real splice sites. In the past, people have used probability
matrices which encode the probability of having a certain nucleotide
in a certain position. A disadvantage of this method is that
dependencies between positions are not taken into account. Other
methods have tried to solve this, for example by building a conditional
probability matrix, or by using neural networks. To get
the best results, many methods use not only base positions but also
other features, like the presence of certain combinations of
nucleotides. Most recently, people have turned to probabilistic models
that model the whole gene structure at once. Prediction of splice sites
is then helped by the detection of coding and non-coding areas around
it (see for example Prediction of
complete gene structures in human genomic DNA (1997)
by Burge and Karlin).
Some information on the problem of gene finding can be found
on-line. Information about existing methods can be found in
Fickett's overview paper. The dataset presented here contains
windows of fixed size around true and false donor sites and true and
false acceptor sites.
- Size:
This dataset is divided along three binary dimensions: acceptor (a) versus
donor (d) sites, training (t) versus test (e) data, and true (t)
versus false (f) examples.
- 13123 cases, divided as follows: a-e-f: 881 / d-e-f: 782 / a-e-t:
208 / d-e-t: 208 / a-t-f: 4672 / d-t-f: 4140 / a-t-t: 1116 / d-t-t: 1116
- Window length: 15 base positions for donor data, 90 base
positions for acceptor data
- 198 KB, divided as follows: a-e-f: 16k / d-e-f: 7k / a-e-t: 7k /
d-e-t: 2k / a-t-f: 82k / d-t-f: 36k / a-t-t: 37k / d-t-t: 11k
- References:
- Task:
Perform exploratory data analysis and prepare the data for
mining. Develop a classifier for donor sites and one for acceptor
sites. Compare at least 2 different classifiers for each. As an extra,
you can try to run your classifiers on the Burset
and Guigo DNA sequence dataset. This dataset contains full gene DNA
sequences, together with indications of where coding regions start and
stop.
- Challenges:
The data are well prepared, so building a predictor should be quite
straightforward (a sketch of one simple feature encoding is given
below). The best existing predictors use features other than
just nucleotide positions; maybe it is possible to detect and use some
of these features to improve the classifier. When testing the
classifiers on Burset and Guigo's DNA dataset, you will need to make
some adaptations.
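A sketch of one simple feature encoding for the fixed-size windows:
four indicator bits per base position, so a 15-base donor window
becomes a 60-dimensional binary vector.

    import numpy as np

    def one_hot_window(seq):
        """Encode a fixed-length nucleotide window as a flat 0/1
        vector, four indicator bits per position."""
        table = {"A": 0, "C": 1, "G": 2, "T": 3}
        out = np.zeros((len(seq), 4))
        for i, base in enumerate(seq.upper()):
            if base in table:          # unknown symbols stay all-zero
                out[i, table[base]] = 1
        return out.ravel()

    # An invented 15-base donor window -> 60 binary features.
    print(one_hot_window("CAGGTAAGTATCGAT").shape)   # (60,)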
- Description:
This dataset contains images collected by the Magellan mission to
Venus. Venus is the planet most similar to Earth in size, and
therefore researchers want to learn about its geology. Venus' surface
is scattered with volcanoes, and the aim of this dataset was to develop a
program that can automatically identify volcanoes from training data
that have been labeled by human experts, from 1 (a volcano with 98%
probability) to 4 (a volcano with 50% probability). A tool called JARtool
was developed to do this. The makers of this tool made the data
publicly available to allow more research and establish a benchmark in
this field. They provide in total 134 images of 1024*1024 8-bit pixels
(out of the 30000 images of the original project). The dataset you
will use is a preprocessed version of these images: possibly
interesting 15*15 pixel frames ('chips') were taken from the images by the
image recognition program of JARtool, and each was labeled from 0
(not labeled by the human experts, so definitely not a volcano),
through 1 (98% certainty a volcano according to the human experts),
to 4 (50% certainty). More information can be found in the data
documentation.
- Size:
The image chips are spread over groups, according to experiments
carried out for the JARtool software. The training and test sets for
experiments C1 and D4 together cover all chips (see the experiments
table):
- Records: 37280 image chips, divided as follows: C1_trn: 12018 / C1_tst:
16608 / D4_trn: 6398 / D4_tst: 2256
- Features: 15 * 15 pixels
- 8.4 MB
- References:
- These data were used for the development of JARtool, a software
system that learns to recognize volcanoes in images from Venus. The
technical details about this tool are described in the paper Learning to
Recognize Volcanoes on Venus (1998) by M. C. Burl, L. Asker,
P. Smyth, U. Fayyad, P. Perona, L. Crumpler and J. Aubele. This paper
should give you a good example of how data mining can be performed on
this dataset (you can ignore the part about Focus of Attention,
because that has already been done for you).
- Task:
Perform exploratory data analysis. Prepare the data for data
mining. Feature space reduction will be necessary, because the number
of features is very high compared to the number of positive volcano
examples. Then build at least two classifiers to detect volcanoes:
implement the basic classifier from Burl et al.'s paper, and at least
one other. You can follow Burl et al.'s paper, where classes 1 up to 4 are
considered positive examples. As an extra, you can try to perform
clustering to find the different types of volcanoes mentioned in
Burl et al.'s paper.
- Challenges:
It will be necessary to normalise the pixel frames, as there is a
difference in brightness between the different images and even between
different parts of the same image (a sketch of this, together with
simple feature extraction, is given below). Feature extraction will
also be necessary, because there are quite a lot of pixels per frame.
This is especially a problem because the dataset is highly
unbalanced: the number of positive examples is very low. Finally,
the volcanoes are of different kinds, and it is
difficult to build one classifier for all of them together.
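A sketch of per-chip normalisation followed by projection onto a few
principal components, in the spirit of (but not identical to) the
JARtool classifier; random arrays stand in for the real chips.

    import numpy as np
    from sklearn.decomposition import PCA

    # Stand-in for 200 chips of 15x15 pixels, flattened to 225 values.
    rng = np.random.default_rng(0)
    chips = rng.random((200, 225))

    # Normalise each chip separately to remove brightness differences
    # between images and between parts of the same image.
    chips = (chips - chips.mean(axis=1, keepdims=True)) \
            / chips.std(axis=1, keepdims=True)

    # Reduce 225 pixels to a handful of components before classifying.
    Xp = PCA(n_components=6).fit_transform(chips)
    print(Xp.shape)   # (200, 6)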
- Description:
These data were used for the 1999 KDD Cup. They were gathered by
Lincoln Labs: nine weeks of raw TCP dump data were collected from a
typical US Air Force LAN. During the use of the LAN, several attacks
were performed on it. The raw packet data were then aggregated into
connection data. Per record, extra features were derived based on domain
knowledge about network attacks. There are 38 different attack types,
belonging to 4 main categories. Some attack types appear only in the
test data, and the frequency of attack types in test and training data
is not the same (to make it more realistic). More information about
the data can be found in the task
file and in the overview of
the KDD Cup results. That page also indicates that
there is a cost matrix associated with misclassifications. The winner
of the KDD Cup 1999 competition used C5 decision trees in combination
with boosting and bagging.
- Size:
- 8,050,290 records, divided as follows: 4,940,000 training records
and 3,110,290 test records. A 10% sample is available for both.
- 41 attributes and 1 label
- 1,173 MB: 743 MB training data and 430 MB test data
- References:
- Task:
Perform exploratory data analysis and prepare the data for mining. The
data mining task is to classify connections as legitimate or as
belonging to one of the 4 attack categories. The misclassification
costs should be taken into account. Compare at least two different
classification algorithms.
- Challenges:
The amount of data preprocessing needed is quite limited. You will
need data reduction to deal with the sheer size of the dataset. The major
difficulty, however, is probably the class distribution: while
the DoS attack category appears in 79% of the connections, the u2r
category only appears in 0.01% of the records. And this least frequent
attack category is at the same time the most difficult to predict and
the most costly to miss. A sketch of evaluating classifiers under the
cost matrix is given below.
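A sketch of scoring predictions under a misclassification cost matrix.
The matrix values below are invented placeholders; use the real one
from the competition results page.

    import numpy as np

    # Invented 5x5 cost matrix (rows: true class, columns: predicted);
    # class 0 is 'normal', 1-4 the attack categories. Replace with the
    # official matrix from the KDD Cup 1999 results page.
    cost = np.array([[0, 1, 2, 2, 2],
                     [1, 0, 2, 2, 2],
                     [2, 1, 0, 2, 2],
                     [3, 2, 2, 0, 2],
                     [4, 2, 2, 2, 0]])

    def average_cost(y_true, y_pred):
        """Mean misclassification cost over all connections."""
        return cost[y_true, y_pred].mean()

    y_true = np.array([0, 1, 4, 4, 2])
    y_pred = np.array([0, 1, 0, 4, 2])
    print(average_cost(y_true, y_pred))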
- Description:
The SuperCOSMOS Sky
Survey programme is carried out at the University of
Edinburgh. The project used the SuperCOSMOS machine, a
high-precision plate scanning facility, to scan in the Schmidt
photographic atlas material. This has produced a digitised
survey of the entire sky in three colours (B, R and I), with one colour
(R) at two epochs. From these digital images, objects
have been extracted, and an object catalogue has been composed. For
each object, useful astronomical characteristics have been registered,
such as the size, the brightness, the position, etc. A project was
then carried out to classify the objects as stars or
galaxies. External labeling to evaluate the classification algorithm
was obtained from the more precise data of the Sloan Digital Sky Survey.
- Size:
- There are 4 object sets: one for B, one for I, and two for R (one
set from pictures taken in the 1950s and one more recent set). Each of
these is divided into a set of paired objects (for which a corresponding
SDSS object was found) and a set of unpaired ones:
- B-paired: 34663 / B-unpaired: 68987
- R-paired (recent): 26791 / R-unpaired: 54920
- I-paired: 15645 / I-unpaired: 41596
- R-paired (1950s): 15834 / R-unpaired: 34426
- Paired datasets have 40 attributes (including some from SDSS),
unpaired 34.
- The size of the datasets is as follows:
- B-paired: 16.4MB / B-unpaired: 23.5MB
- R-paired (recent): 12.6MB / R-unpaired: 18.7MB
- I-paired: 14MB / I-unpaired: 7.3MB
- R-paired (1950s): 7.4MB / R-unpaired: 11.7MB
- References:
- The SuperCOSMOS Sky
Survey - I. Introduction and description (2001) by N. Hambly,
H. MacGillivray, M. Read, S. Tritton, E. Thomson, D. Kelly, D. Morgan,
R. Smith, S. Driver, J. Williamson, Q. Parker, M. Hawkins, P. Williams
and A. Lawrence: This paper is an introduction to the SSS project.
- The SuperCOSMOS Sky
Survey. Paper II: Image detection, parameterisation, classification
and photometry (2001) by N. Hambly, M. Irwin and H. MacGillivray:
A description of the methods for image detection, parameterisation,
classification and photometry. A useful paper for you to read, as it
gives explanations about how the data were obtained and what they
mean, and about the object classification efforts by the SSS people.
- The SuperCOSMOS Sky
Survey. Paper III: Astrometry (2001) by N. Hambly, A. Davenhall,
M. Irwin and H. MacGillivray: An overview of how the astrometric
parameters of the data were derived. Probably less interesting for
you.
-
Automated Star/Galaxy Classification for Digitized POSS-II
(1995) by N. Weir, U. M. Fayyad and S. Djorgovski: This paper
uses a similar astronomical dataset. It is quite interesting, as it is
much more understandable than paper II above. It uses a similar
two-step classification method and should therefore give you some
insight in what is happening in paper II.
- Task:
First read the information in the README
file, and in paper II (and the paper by Weir) referenced
above. Then perform exploratory data analysis and prepare the data for
data mining. You can concentrate on one of the paired
datasets. Classify sky objects as stars or galaxies (use the SDSS
classification as label). Compare at least two different classification
algorithms. Try the effect of excluding/including fields 19 and 31,
the classification efforts of the SSS team. Also, do a performance
evaluation with respect to the magnitude as was done in paper II.
- Challenges:
These are astronomical data, and all the documentation is written in
'astronomical language', so it is quite difficult to understand what
the data are all about and how the previous research was carried
out. Furthermore, the dataset is quite big, so case reduction might be
necessary.
Less interesting datasets
You are allowed to come up with your own dataset for this project. In
order to guide you in this search, we present here some examples of
datasets which were considered less interesting.
- Description:
This dataset contains pixel information for an 82*100 pixel part of an
image of the Earth's surface taken from a satellite. For each pixel, 4
frequency values (each between 0-255) are given, because
photos are taken in 4 different spectral bands. Each pixel represents
an area on the Earth's surface of 80*80 metres. Every line in the
dataset gives the 4 values for 9 pixels (a 3*3 neighbourhood frame
around a central pixel). The aim is to classify the central pixel (a
piece of land) into 1 of 6 classes based on its values and those of
its immediate neighbours. More info is given in the documentation file.
- Objections:
The data have been perfectly preprocessed, and the classes are quite
well balanced. This dataset is not very challenging, as very good
results can be obtained very easily.
- Description:
This is the dataset used for the filtering track in TREC-9, the 2000
Text Retrieval Conference. It
contains 348,566 texts to be searched and classified. These texts are
records of medical articles containing fields for the author, the title,
the source, the publication type, a number of human-assigned relevant
terms, and in about two thirds of the cases also the abstract. Full
texts are not available. In the TREC filtering task, the program gets
a user profile (a query) and a sample of texts that match this
profile. The aim is to search the massive text database to find more
texts that match the profile. Solutions to the filtering problem
presented in TREC-9 include e.g. a kNN method and an adaptive term and
threshold selection method.
- Objections:
This dataset is too hard, mainly because of its sheer magnitude. In An Evaluation of Statistical
Approaches to Text Categorization (1997), the use of
different text classification methods on different text datasets was
examined. One of the datasets used was the OHSUMED set. It was
pointed out that this set is much harder than for example the Reuters
set.
- Description:
The PKDD conferences organize discovery challenges, not as a
competition, but with the aim that different researchers would work
together to try and find solutions for certain KDD problems. One of
the tasks in the 2001 challenge used a dataset of chemical
structures. For each structure, it was indicated whether or not the
substance caused cancer in mice or rats. The aim was to create a
predictor of toxicology for substances based on their chemical structure.
- Objections:
This is in fact a very difficult task. Chemical structures can take on
an enormous variety of shapes and sizes. They are a perfect example of
data that do not fit in the classical attribute-value format. This
means that traditional data mining techniques cannot be used on these
data. On the dataset website, there are links to solution papers. All
of these seem to try to derive attribute-value formats from the chemical
structures, using domain background knowledge. This is definitely an
interesting dataset, but certainly too difficult for a
mini-project.
- Description:
This dataset contains webpages that have been rated by users. The
pages belong to 4 different categories, and a few different users were
asked to rate the pages as interesting or not. The dataset contains
the html source of each webpage, and a rating by a single user on a 3
point scale. The aim is to build a predictor for the pages' rating.
- Objections:
This dataset is too small for the kind of exercise we are looking
for (only 332 texts were rated).
- Description: This is a well-known data set for text
classification, used mainly for training classifiers using both
labeled and unlabeled data (see references below).
The data set is a collection of 20,000 messages,
collected from UseNet postings over a
period of several months in 1993. The data are
divided almost evenly among 20 different
UseNet discussion groups. Many of the categories
fall into overlapping topics; for example, 5 of them are
computer discussion groups and 3 of them discuss
religion. Other topics included in the newsgroups are
politics, sports,
science and miscellaneous.
- Objections: This dataset is too well known and is
in fact used as the example dataset for the rainbow
software documentation.
- Description: This dataset was used in the KDD Cup 2002
data mining competition. The data describe the
activity of some (hidden) biological system in yeast
cells. In particular, a set of yeast strains has been generated,
each of which is characterized by
a single gene being knocked out (i.e. disabled).
Thus each example in the data set
is related to a single knocked-out gene, which is
labeled with a discrete measurement of how active the hidden
system in the cell is when this gene is knocked out. There are three
labels:
- "nc": This label indicates that the
activity of the hidden system (i.e. yeast strain) was
not significantly different from the baseline.
- "control": Indicates that the activity of the hidden system
was significantly different from the baseline for the given instance,
but that the activity of another hidden system (the control) was also
significantly changed versus its baseline.
- "change": Describes examples in which the activity of the hidden system
was significantly changed, but the activity of the control system was
not significantly changed.
A variety of other information accompanies the above data. These data
sources include categorical features describing gene
localization and function, abstracts from the scientific literature (MEDLINE),
and a table of protein-protein interactions that relate the
products of pairs of genes. A local copy of a file
describing the whole database is available
here.
- Objections: Too complex/challenging for a
DME mini-project. There are three
sources of data to take into account (including carrying out
information extraction from the scientific abstracts). One of these sources
is information on protein-protein interactions, which will have to be
coded in an appropriate way for machine learning. Also the classes
are very imbalanced.
- Description: The data set consists of a time series 5000
time steps long. 5 blocks of 20 values are missing from the training
data (elements 981–1000, 1981–2000, 2981–3000,
3981–4000, 4981–5000). The task is to predict the 100
missing values. Solutions are evaluated by calculating the mean
square error compared to the true values.
- Objections: This is an artificial data set, not a genuine
data-mining problem.
This page was originally written by Frederick Ducatelle and is maintained by
Charles Sutton.