Software for the data mining course
The following software packages are available on the inf system, and
we recommend that you use them for the data mining projects. There
are of course many more tools available on the web, and you are
free to use any of those if you find them more suitable. Bear
in mind, however, that many publicly available tools contain a
lot of bugs and are hard to install. It is also not advisable to combine
too many different tools, as each of them needs its own
data format, and a lot of time and effort will be wasted on data
conversions.
SAS:
SAS is a large-scale statistical analysis, data handling, and business
intelligence package. It is used extensively in different business
domains as a primary analysis tool. It has good data handling functions, and
is built around a comprehensive language. Base SAS provides the core
tools that are needed. Visual extensions such as SAS Enterprise Miner provide
more graphical interface tools for basic data mining. If you want a job in this
area, then you would do well to learn to use SAS.
Weka:
- General Description:
Weka is open source data mining
software. It supports not only machine learning algorithms, but
also data preparation and meta-learners like bagging and boosting. The
whole suite is written in Java, so it can be run on any platform. The
package has three different interfaces: a command line interface, an
Explorer GUI interface (which allows you to try out different
preparation, transformation and modelling algorithms on a dataset),
and an Experimenter GUI interface (which allows you to run different
algorithms in batch and to compare the results). A
good introduction to Weka is the
tutorial given in chapter 8 of Data Mining (2000) by
I. H. Witten and E. Frank. This is an introduction to the command line
interface. For the Explorer GUI interface there is no documentation, but
the functionalities are more or less the same as for the command
line interface. For the Experimenter
interface a short, clear tutorial is available.
- Functionalities:
The functionalities of Weka more or less boil down to the algorithms
described in Witten and Frank's data mining book. A
complete overview of the implemented algorithms is given in the
on-line documentation (also, when using the command line
interface, the '-h' option will list the available options for any
algorithm). A short overview of the Weka functionalities:
- SVMs: only polynomial kernels are supported, and
support vector regression is not supported.
- Decision trees: ID3 and C4.5 are implemented, as well as
M5', a model tree induction algorithm for predicting numeric
values (each leaf node holds a regression model). PART is a
rule learner that derives rules by repeatedly building partial decision
trees and each time keeping the leaf with the largest coverage.
- Memory-based methods: kNN and locally weighted
regression.
- Neural Networks: only backpropagation with momentum is supported.
- Simpler methods: naive Bayes (for numeric values, a normal
distribution is assumed, but 'kernel density estimation' can be used
instead to avoid that assumption) and linear regression
are useful simple methods. Two-class logistic regression is also
supported; the algorithm uses a 'ridge estimator'.
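To illustrate how naive Bayes handles numeric attributes with the normal-distribution assumption, here is a minimal Python sketch (not Weka's implementation; all names are invented for this example):

```python
import math

def gaussian_pdf(x, mean, var):
    # Density of a normal distribution; a small floor on the
    # variance avoids division by zero for constant attributes.
    var = max(var, 1e-9)
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def train_nb(rows, labels):
    # Estimate a class prior and per-class mean/variance
    # for every numeric attribute.
    model = {}
    for cls in set(labels):
        members = [r for r, l in zip(rows, labels) if l == cls]
        stats = []
        for i in range(len(rows[0])):
            col = [r[i] for r in members]
            m = sum(col) / len(col)
            v = sum((x - m) ** 2 for x in col) / len(col)
            stats.append((m, v))
        model[cls] = (len(members) / len(labels), stats)
    return model

def predict_nb(model, row):
    # Pick the class maximising prior * product of attribute likelihoods.
    best, best_p = None, -1.0
    for cls, (prior, stats) in model.items():
        p = prior
        for x, (m, v) in zip(row, stats):
            p *= gaussian_pdf(x, m, v)
        if p > best_p:
            best, best_p = cls, p
    return best
```

Replacing gaussian_pdf with a kernel density estimate over the training values gives the non-parametric variant mentioned above.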
- Other simple methods: decision tables, 1R (a rule
based on a single attribute) and decision stumps (one-level decision
trees). Although methods this simple might seem useless, they can be
combined via boosting or bagging to form a strong classifier
out of several weak ones.
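The 1R idea is simple enough to sketch in a few lines of Python (an illustration only, not Weka's code; it assumes nominal attribute values):

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    # For each attribute, map every value to its majority class and
    # count the training errors; keep the attribute with the fewest errors.
    best = None
    for i in range(len(rows[0])):
        buckets = defaultdict(list)
        for row, label in zip(rows, labels):
            buckets[row[i]].append(label)
        rule = {v: Counter(ls).most_common(1)[0][0] for v, ls in buckets.items()}
        errors = sum(l != rule[row[i]] for row, l in zip(rows, labels))
        if best is None or errors < best[2]:
            best = (i, rule, errors)
    return best  # (attribute index, value -> class mapping, training errors)
```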
Meta-learning schemes are also supported:
- Bagging
- Stacking: using a range of base classifiers and a meta
classifier which classifies their output.
- AdaBoost: a boosting method based on Freund and Schapire's
AdaBoost.M1 method (see
Experiments with a New Boosting Algorithm (1996))
- MultiClassClassifier: solves multiclass problems using
two-class classifiers: one classifier is built per class (it is also
possible to ask for one classifier per pair of classes).
- CVParameterSelection: tries out a defined range over a set of
parameters for a given classifier using cross-validation, and uses
the best combination to build the final model.
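The bagging scheme above can be sketched generically (a simplified illustration, not Weka's implementation; the `train` callback and other names are invented):

```python
import random
from collections import Counter

def bagging_predict(rows, labels, query, train, n_models=11, seed=0):
    # Train n_models base classifiers on bootstrap resamples of the
    # training data and combine their predictions by majority vote.
    # `train(rows, labels)` must return a fitted predict(row) function;
    # any weak learner can be plugged in.
    rng = random.Random(seed)
    n = len(rows)
    votes = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        model = train([rows[i] for i in idx], [labels[i] for i in idx])
        votes.append(model(query))
    return Counter(votes).most_common(1)[0][0]
```

Boosting differs in that the resampling (or reweighting) is driven by the errors of the previous models rather than being uniform.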
Weka also includes a package that contains clustering
algorithms. It supports:
- The EM algorithm: works with numeric as well as nominal values,
but assumes that all attributes are independent.
- Incremental clustering: a clustering technique that builds
a tree using the category utility measure (see section 6.6 in Witten
and Frank's book).
- kMeans is also provided.
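For reference, the standard k-means (Lloyd's) iteration can be sketched as follows (a minimal illustration, not Weka's code; it uses a naive initialisation from the first k points):

```python
def kmeans(points, k, iters=20):
    # Lloyd iteration: assign each point to its nearest centre,
    # then move each centre to the mean of its cluster.
    centres = points[:k]  # naive initialisation: first k points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[j].append(p)
        # Empty clusters keep their old centre.
        centres = [[sum(xs) / len(c) for xs in zip(*c)] if c else centres[j]
                   for j, c in enumerate(clusters)]
    return centres, clusters
```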
- Association rules: the Apriori algorithm is supported.
- Some data preprocessing support is provided: you can add
new attributes (based on calculations of existing ones), transform
attribute values, manually select certain attributes, discretise
numeric values, remove attributes with only one distinct value, select
records on the basis of attribute values, transform nominal values
into binary ones, merge two nominal values into one, normalise numeric
values, randomise the order of the dataset, replace missing values by
the mean or the mode, create random subsamples, ...
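Two of the preprocessing steps listed above, mean imputation and normalisation of numeric values, can be sketched together in Python (an invented helper for illustration, not a Weka filter):

```python
def impute_and_normalise(column):
    # Replace missing values (represented as None) by the column mean,
    # then rescale the column linearly to the [0, 1] range.
    present = [x for x in column if x is not None]
    mean = sum(present) / len(present)
    filled = [mean if x is None else x for x in column]
    lo, hi = min(filled), max(filled)
    span = hi - lo or 1.0  # avoid division by zero for constant columns
    return [(x - lo) / span for x in filled]
```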
- Weka supports several schemes for attribute selection. Both
filter methods and wrapper methods are supported. Among the provided filter methods are the chi-squared
method, the information gain and gain ratio measures, the performance
of the OneR classifier based on each single attribute, PCA, the
relief-F method (based on distances between sampled instances),
... In wrapper methods the feature subset
is evaluated using the actual classifier that is going to be used for
classification. This is obviously quite time-consuming.
When
searching for an optimal feature subset, a search strategy has to be
selected. Possibilities are: best-first, exhaustive search, forward
selection, ranking, genetic search, random search, ...
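Of the filter measures mentioned above, information gain is easy to state concretely: it is the entropy of the class labels minus the weighted entropy after splitting on the attribute. A minimal Python sketch (illustration only, assuming nominal attribute values):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    # Shannon entropy of a list of class labels, in bits.
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    # Entropy of the labels minus the weighted entropy of the label
    # subsets obtained by splitting on the attribute's values.
    buckets = defaultdict(list)
    for v, l in zip(values, labels):
        buckets[v].append(l)
    remainder = sum(len(ls) / len(labels) * entropy(ls)
                    for ls in buckets.values())
    return entropy(labels) - remainder
```

An attribute that perfectly separates the classes gets the full entropy as its gain; an uninformative attribute gets a gain of zero.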
- Visualisation: Weka provides limited visualisation
possibilities. There are at most three dimensions: two axes and one
overlay colour.
- Advantages:
The obvious advantage of a package like Weka is that a whole range
of data preparation, feature selection and data mining algorithms are
integrated. This means that only one data format is needed, and
trying out and comparing different approaches becomes really easy. The
package also comes with a GUI, which should make it easier to
use.
- Disadvantages:
Probably the most important disadvantage of data mining suites like this is
that they do not implement the newest techniques. For example, the MLP
implemented has a very basic training algorithm (backpropagation with
momentum), and the SVM only uses polynomial kernels and does not
support numeric estimation. It will therefore sometimes be necessary to
combine Weka with some of the other tools like Netlab or SVMTorch. Another
important disadvantage arises from the fact that the software is for
free: the documentation for the GUI is quite
limited. Witten and Frank's data mining book is more
or less a summary of the functionalities of the program, and chapter 8
is a tutorial for it. It does not describe anything about the GUI
though. As the software is constantly growing, the documentation is
not up to date with everything either (the most up to date and
complete information about algorithm options can be obtained by using
the -h option in the command line interface). A third possible problem is
scaling. For difficult tasks on large datasets, the running time
can become quite long, and Java sometimes gives an OutOfMemory
error. This problem can be reduced by passing the '-mx' memory option
when calling java (e.g. '-mx50m' for 50 megabytes). For large
datasets it will always be necessary to reduce the size of the data to be
able to work within reasonable time limits. A fourth problem is that the
GUI does not implement all the possible options. Things that could
be very useful, like scoring of a test set, are not provided in the
GUI, but can be called from the command line interface. So sometimes
it will be necessary to switch between GUI and command line. Finally,
the data preparation and visualisation techniques offered might not
be enough. Most of them are very useful, but in most data
mining tasks you will probably need more to get to know the data well
and to get it into the right format.
Bow:
- General Description:
Bow is a library of C code for statistical text analysis, language
modeling and information retrieval. Together with this library, four
executable programs based on it are distributed. They are
Rainbow (for document classification), Arrow and Archer (for document
retrieval) and Crossbow (for document clustering).
- Functionalities:
- Rainbow:
Rainbow is the front-end to the library that supports text
classification. It is the best documented of the four, and is
the most useful for the projects. It has the following
functionalities:
- Text preprocessing: this supports the conversion of texts
into bag-of-words format. Different options let you choose a sparse or a full
matrix model, display a count or a binary present/absent flag, and
include the word names or not. Feature selection methods can also be
used, based on word frequency, word-document counts or information
gain.
- Naive Bayes classification
- kNN classification
- Support Vector Machine classification
- Tfidf classification (see A
Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text
Categorization)
- Expectation Maximisation
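The bag-of-words conversion that Rainbow performs can be sketched as follows (a toy illustration of the idea, not Rainbow's C implementation; it builds the full-matrix, count-valued variant):

```python
from collections import Counter

def bag_of_words(documents):
    # Build a shared vocabulary over all documents, then turn each
    # document into a vector of word counts (full, not sparse, matrix).
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    matrix = []
    for doc in documents:
        row = [0] * len(vocab)
        for w, c in Counter(doc.lower().split()).items():
            row[index[w]] = c
        matrix.append(row)
    return vocab, matrix
```

Replacing the counts by `1 if c > 0 else 0` gives the binary present/absent variant mentioned above.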
- Crossbow: Crossbow supports text clustering. No documentation is
available for it yet. Some indications of how to use it can be found
in the help file.
- Arrow: Arrow does document retrieval based on tfidf. There
is a little bit of information available in the README file. I don't think this function is
useful for the projects though.
- Archer: Archer also does document retrieval. No
documentation is available yet though.
- Evaluation:
The text preprocessing functionalities in particular are very useful
for text-related projects. Once the conversion to bag-of-words format
is done, you have the option to use the classification and clustering
features of the Bow package, or to export the matrix to other data
mining programs. The classification functionalities of Rainbow give
very fast and accurate results. The documentation for
Rainbow is quite good if you combine the on-line
tutorial with the help file.
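The TFIDF weighting used by Arrow and by Rainbow's Tfidf classifier can be sketched on a bag-of-words count matrix (a common log-based variant, shown for illustration; Bow's exact weighting formula may differ):

```python
import math

def tfidf(matrix):
    # matrix[d][t] is the count of term t in document d. Weight each
    # count by log(N / df_t), where df_t is the number of documents
    # containing term t: frequent-everywhere terms are downweighted.
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for d in range(n_docs) if matrix[d][t] > 0)
          for t in range(n_terms)]
    return [[matrix[d][t] * math.log(n_docs / df[t]) if df[t] else 0.0
             for t in range(n_terms)] for d in range(n_docs)]
```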
Netlab:
- General description:
Netlab is a toolbox of Matlab functions and scripts, originally based
on Bishop's Neural Networks for Pattern Recognition book. It
provides Matlab implementations of some of the newest machine learning
algorithms.
Worked examples of how to use them can be downloaded in
zip format.
- Functionalities:
This is a list of the functionalities supported by Netlab.
- PCA
- Mixtures of probabilistic PCA
- Gaussian mixture model with EM training algorithm
- Linear and logistic regression with IRLS training algorithm
- Multi-layer perceptron with linear, logistic and softmax outputs
and appropriate error functions
- Radial basis function (RBF) networks with both Gaussian and
non-local basis functions
- Optimisers, including quasi-Newton methods,
conjugate gradients and scaled conjugate gradients
- Multi-layer perceptron with Gaussian mixture outputs (mixture
density networks)
- Gaussian prior distributions over parameters for the MLP, RBF and
GLM including multiple hyper-parameters
- Laplace approximation framework for Bayesian inference (evidence
procedure)
- Automatic Relevance Determination (ARD) for input selection
- Markov chain Monte-Carlo including simple Metropolis and
hybrid Monte-Carlo
- K-nearest neighbour classifier
- K-means clustering
- Generative Topographic Map
- Neuroscale topographic projection
- Gaussian Processes
- Hinton diagrams for network weights
- Self-organising map
Help on how to
use these algorithms can be found on the Netlab homepage. Also
useful are
some worked
examples, which can be downloaded in zip format. There are also
demos available (their functionality is described in the help files;
to see their content, type 'type demoname.m' in Matlab).
For some of the algorithms, use is pretty straightforward (knn,
mlp, gmm, ...). Other algorithms are less easy to use (rbf, ard, ...). In
general, a thorough background knowledge of the algorithms is needed
to use them.
Some of the algorithms can be used for data preprocessing: PCA and
ARD. Also, there are functions to read from and write to ascii files,
which is very useful. Finally, a function is provided to calculate and
display confusion matrices and number of correct classifications.
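The confusion matrix calculation that Netlab provides is worth understanding; a minimal sketch of the idea (illustration in Python rather than Matlab; function name invented):

```python
def confusion_matrix(actual, predicted, classes):
    # matrix[i][j] counts instances of class classes[i] that were
    # predicted as classes[j]; the diagonal holds the correct
    # classifications.
    idx = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[idx[a]][idx[p]] += 1
    correct = sum(matrix[i][i] for i in range(len(classes)))
    return matrix, correct
```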
- Advantages:
Working with Netlab certainly offers some important advantages. Compared
to a package like Weka, the greatest advantage is probably that the
algorithms implemented are up to date with the newest developments in
the field. Also, using the functions requires considerable knowledge of
the theory behind the algorithms, so at least the data mining
analyst knows what is going on. Finally, the integration in
Matlab gives some important advantages: all the matrix calculation
and visualisation functions are available, which can be very useful
for data preparation and exploration, and analysis of results
afterwards. Also, Matlab makes scripting possible, which should
increase the efficiency of the analysis.
- Disadvantages:
Disadvantages are that some important data preprocessing
functionalities are missing: dealing with missing data, feature
selection, ... Another problem is that Matlab can only work with
numeric data, so categorical data will have to be converted in
advance, and techniques like Apriori and decision trees cannot be
implemented.
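The conversion of categorical data mentioned above is usually done by one-hot (indicator) encoding, sketched here in Python for illustration (the helper name is invented):

```python
def one_hot(column):
    # Map each distinct categorical value to its own 0/1 indicator
    # column, so that numeric-only tools such as Matlab can use the data.
    values = sorted(set(column))
    return values, [[1 if x == v else 0 for v in values] for x in column]
```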
SVMTorch:
- General Description:
SVMTorch is an implementation of Vapnik's Support Vector Machine that
works both for classification and regression problems, and that has
been specifically tailored for large-scale problems (such as more than
20000 examples, even for input dimensions higher than 100).
- Functionalities:
SVMTorch implements support vector classification and regression. The
data have to be floating point values. It is possible to do
classification for multiple classes: then a separate classifier is
built for every class. A small user guide
for SVMTorch is available on-line.
- Evaluation:
SVMTorch offers many options for the use of SVMs. Linear, polynomial,
Gaussian, sigmoidal and even user-defined kernels are
possible. Regression as well as classification can be done, and there
is support for multiple classes. This extended functionality is in
sharp contrast with the limited SVM support in Weka. Also, SVMTorch is
very fast and scales well to large amounts of data. A disadvantage of
using SVMTorch is that a separate data format is needed.
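For reference, the kernel families listed above have standard definitions, sketched here in Python (textbook formulas for illustration; SVMTorch's parameter names and defaults may differ):

```python
import math

def linear_kernel(x, y):
    # Plain dot product.
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, degree=3, c=1.0):
    # (x . y + c)^degree; c and degree are kernel parameters.
    return (linear_kernel(x, y) + c) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    # exp(-||x - y||^2 / (2 sigma^2)); similarity decays with distance.
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-d2 / (2 * sigma ** 2))
```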
XGobi/GGobi:
- General Description:
This is an interactive visualisation tool for high-dimensional
data. It supports 2-D plots, 3-D rotations, scaling of axes, linked
brushing (allowing you to colour certain points, which will then stand out
in different views of the data), and much more. A good overview of the
functionalities of XGobi is given in
XGobi: Interactive Dynamic Data Visualisation in the X Window System
(1998) by D. F. Swayne, D. Cook and A. Buja. More information
can be found in the man pages (type 'man xgobi') or in the on-line
help files (click 'info' when XGobi is running).
- Evaluation:
This is definitely a useful tool when you do exploratory data
analysis. Weka has visualisation functionality, and Matlab
also allows you to plot data, but they don't offer the dynamic and
interactive features of XGobi.
c4.5:
- General Description:
This is software distributed by Quinlan to build decision trees using
his c4.5 algorithm. Information about how to use it can be found in
the man pages (type 'man c4.5'). More information about how c4.5
works can be found in Mitchell's Machine Learning book.
- Evaluation:
The c4.5 algorithm is also embedded in Weka, so it makes sense to use
it there if you are using the other Weka functionalities. The separate
c4.5 program seems to run a little faster than the Java implementation
in Weka, however, so for large files it might be preferable.
cluster:
- General Description:
This program supports hierarchical clustering and PCA. Information
about how to use it can be found in the man pages (type 'man 1 cluster').
- Evaluation:
Hierarchical clustering is not provided in any of the other software
packages presented above, so if you want to do that, you'll have to
use this program. It is quite slow though, so don't try to use it on
large amounts of data. The PCA functionality is also available in
Netlab and Weka, so you can use it there.
This page was written by Frederick Ducatelle (fredduc@dai.ed.ac.uk)
and is maintained by Amos Storkey (a.storkey@ed.ac.uk).