Software for the data mining course

The following software packages are available on the inf system, and you are recommended to use them for the data mining projects. There are obviously many more tools available on the web, and you are of course free to use any of those if you find them more suitable. Bear in mind, however, that many of the publicly available tools contain a lot of bugs and are hard to install. Also, it is not advisable to use too many different tools together, as each of them will need its own data format, and a lot of time and effort will be wasted doing data conversions.

SAS:

General Description:

SAS is a large-scale statistical analysis, data handling, and business intelligence package. It is used extensively in different business domains as a primary analysis tool. It has good data handling functions, and is based around a comprehensive language. Base SAS provides the core tools that are needed. Visual extensions such as SAS Enterprise Miner provide more graphical interface tools for basic data mining. If you want a job in this area, then you would do well to learn to use SAS.

Weka:

General Description:

Weka is open source data mining software. It supports not only machine learning algorithms, but also data preparation and meta-learners like bagging and boosting. The whole suite is written in Java, so it can be run on any platform. The package has three different interfaces: a command line interface, an Explorer GUI interface (which allows you to try out different preparation, transformation and modelling algorithms on a dataset), and an Experimenter GUI interface (which allows you to run different algorithms in batch and to compare the results). A good introduction to Weka is the tutorial given in chapter 8 of Data Mining (2000) by I. H. Witten and E. Frank. This is an introduction to the command line interface. For the Explorer GUI interface there is no documentation, but the functionalities are more or less the same as for the command line interface. For the Experimenter interface a short, clear tutorial is available.

Functionalities:

The functionalities of Weka more or less boil down to the algorithms described in Witten and Frank's data mining book. A complete overview of the algorithms implemented is given in the on-line documentation (also, in the command line interface, the option '-h' lists the possible options for any algorithm). A short overview of the Weka functionalities:

SVMs: only polynomial kernels are supported. Support vector regression is not supported.
Decision trees: ID3 and C4.5 are implemented, as well as M5', a model tree induction algorithm for predicting numeric values (each leaf node has a regression model). PART is a rule learner that builds rules by repeatedly constructing decision trees and each time keeping the leaf with the largest coverage.
Memory-based methods: kNN and locally weighted regression.
Neural Networks: only backpropagation with momentum is supported.
Simpler methods: naive Bayes (for numeric values, a normal distribution is used, but also 'kernel density estimation' can be used to avoid assuming a normal distribution) and linear regression are useful simple methods. Two-class logistic regression is also supported. The algorithm uses a 'ridge estimator'.
Other simple methods: decision tables, 1R (make a rule based on only one attribute) and decision stumps (one-level decision trees). Although methods this simple might seem useless, they can be combined via boosting or bagging to form a strong classifier out of several weak ones.
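To give an idea of how simple 1R really is, here is a minimal Python sketch. It is not Weka's implementation: the function names `one_r` and `predict` and the toy data are made up for this example, and it assumes nominal attributes only. For each attribute it maps every value to its majority class, and it keeps the attribute whose rules make the fewest errors on the training data.

```python
from collections import Counter, defaultdict

def one_r(rows, labels):
    """Return (attribute index, {value: class}, default class) with the fewest training errors."""
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    best = None
    for a in range(len(rows[0])):
        # Count the classes seen for every value of attribute a.
        counts = defaultdict(Counter)
        for row, label in zip(rows, labels):
            counts[row[a]][label] += 1
        # One rule per value: predict the majority class for that value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[row[a]] != label for row, label in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, a, rule)
    return best[1], best[2], default

def predict(model, row):
    attribute, rule, default = model
    return rule.get(row[attribute], default)

# Hypothetical toy data: (outlook, windy) -> play
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"),
        ("rainy", "yes"), ("overcast", "no")]
labels = ["yes", "yes", "no", "no", "yes"]
model = one_r(rows, labels)
print(model[0], predict(model, ("rainy", "no")))
```

On this toy data the rule is built on the first attribute (outlook), because it separates the classes without errors.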

Also meta-learning schemes are supported:

Bagging
Stacking: using a range of base classifiers and a meta classifier which classifies their output.
AdaBoost: a boosting method based on Freund & Schapire's AdaBoost.M1 method (see Experiments with a New Boosting Algorithm (1996))
MultiClassClassifier: to solve multiclass problems using two-class classifiers: one classifier is built per class (it is also possible to ask for one classifier per pair of classes).
CVParameterSelection: will try out a defined range over a set of parameters for a certain classifier using cross-validation, and use the best combination to build the final model.
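The idea behind CVParameterSelection can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Weka's code: the names `train_knn`, `cross_val_accuracy` and `select_parameter` are invented here, the classifier is a toy one-dimensional kNN, only one parameter is searched, and the folds are formed by taking every n-th instance instead of a random split.

```python
from collections import Counter

def train_knn(X, y, k):
    """Return a 1-D k-nearest-neighbour classifier (the 'model' is just a function)."""
    def classify(x):
        nearest = sorted(range(len(X)), key=lambda i: abs(X[i] - x))[:k]
        return Counter(y[i] for i in nearest).most_common(1)[0][0]
    return classify

def cross_val_accuracy(train_fn, X, y, param, folds=3):
    """Accuracy of train_fn with the given parameter value, estimated by cross-validation."""
    correct = 0
    for f in range(folds):
        test_idx = set(range(f, len(X), folds))  # hold out every folds-th instance
        X_train = [x for i, x in enumerate(X) if i not in test_idx]
        y_train = [t for i, t in enumerate(y) if i not in test_idx]
        model = train_fn(X_train, y_train, param)
        correct += sum(model(X[i]) == y[i] for i in test_idx)
    return correct / len(X)

def select_parameter(train_fn, X, y, candidates, folds=3):
    """Try every candidate value, keep the best one, and rebuild the model on all data."""
    scores = {p: cross_val_accuracy(train_fn, X, y, p, folds) for p in candidates}
    best = max(candidates, key=lambda p: scores[p])
    return best, train_fn(X, y, best), scores

# Hypothetical toy data: two well separated groups on a line.
X = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5]
y = ["a"] * 6 + ["b"] * 6
best_k, final_model, scores = select_parameter(train_knn, X, y, [1, 3, 5])
print(best_k, scores, final_model(1.25))
```

Note that the final model is trained on the complete dataset once the best value is known; the cross-validation is only used to compare the candidate values.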

Weka also includes a package that contains clustering algorithms. It supports:

The EM algorithm: working with numeric as well as nominal values, but assuming that all attributes are independent.
Incremental clustering: a clustering technique that builds a tree using the category utility measure (see section 6.6 in Witten and Frank's book)

kMeans is also provided.
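For those who have not seen it before, k-means can be sketched as follows. This is a bare-bones Python version written for this page, not the Weka code: the function name `kmeans` is made up, and to keep the example deterministic it simply takes the first k points as starting centroids, whereas real implementations usually start from randomly chosen points.

```python
def kmeans(points, k, iterations=20):
    """Cluster tuples of numbers into k groups; returns (centroids, clusters)."""
    centroids = list(points[:k])  # simplification: first k points as initial centroids
    for _ in range(iterations):
        # Assignment step: every point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        # Update step: every centroid moves to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centroids[c]
            for c, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # nothing moved: converged
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data: two small groups of 2-D points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, 2)
print(centroids)
print(clusters)
```

The result depends on the starting centroids, so in practice it is worth running the algorithm several times from different starting points.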

Association rules: the APriori algorithm is supported.
Some data preprocessing support is provided: you can add new attributes (based on calculations of existing ones), transform attribute values, manually select certain attributes, discretise numeric values, remove attributes with only one distinct value, select records on the basis of attribute values, transform nominal values into binary ones, merge two nominal values into one, normalise numeric values, randomise the order of the dataset, replace missing values by the mean or the mode, create random subsamples, ...
Weka supports several schemes for attribute selection. Both filter methods and wrapper methods are supported. Among the provided filter methods are the chi-squared method, the information gain and gain ratio measures, the performance of the OneR classifier based on each single attribute, PCA, the relief-F method (based on distances between sampled instances), ... In wrapper methods the feature subset is evaluated using the actual classifier that is going to be used for classification. This is obviously quite time-consuming.
When searching for an optimal feature subset, a search strategy has to be selected. Possibilities are: best-first, exhaustive search, forward selection, ranking, genetic search, random search, ...
Visualisation: Weka provides limited visualisation possibilities. There are at most three dimensions: two axes and one overlay colour.
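As an illustration of a filter method for attribute selection, the following Python sketch ranks nominal attributes by information gain. It is a minimal example written for this page, not the Weka implementation; the function names and the toy data are invented, and numeric attributes would have to be discretised first.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Reduction in class entropy obtained by splitting on one nominal attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_attributes(rows, labels):
    """Return (gain, attribute index) pairs, best attribute first."""
    gains = [(information_gain(rows, labels, a), a) for a in range(len(rows[0]))]
    return sorted(gains, reverse=True)

# Hypothetical toy data: (outlook, windy) -> play
rows = [("sunny", "no"), ("sunny", "yes"), ("rainy", "no"), ("rainy", "yes")]
labels = ["yes", "yes", "no", "no"]
print(rank_attributes(rows, labels))
```

Here the first attribute determines the class completely and gets the full gain of one bit, while the second attribute tells nothing about the class and gets a gain of zero. A wrapper method would replace `information_gain` by the cross-validated performance of the actual classifier on each candidate subset, which is why it takes so much longer.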

Advantages:

The obvious advantage of a package like Weka is that a whole range of data preparation, feature selection and data mining algorithms are integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes really easy. The package also comes with a GUI, which should make it easier to use.

Disadvantages:

Probably the most important disadvantage of data mining suites like this is that they do not implement the newest techniques. For example, the MLP implemented has a very basic training algorithm (backprop with momentum), and the SVM only uses polynomial kernels and does not support numeric estimation. Therefore, it will be necessary to combine Weka with some of the other tools like Netlab or SVMTorch.

Another important disadvantage arises from the fact that the software is free: the documentation for the GUI is quite limited. Witten and Frank's data mining book is more or less a summary of the functionalities of the program, and chapter 8 is a tutorial for it. It does not describe anything about the GUI though. As the software is constantly growing, the documentation is not up to date with everything either (the most up to date and complete information about algorithm options can be obtained by using the -h option in the command line interface).

A third possible problem is scaling. For difficult tasks on large datasets, the running time can become quite long, and Java sometimes gives an OutOfMemory error. This problem can be reduced by passing the '-mx' option followed by the maximum memory size when calling java (e.g. '-mx50m'; on more recent JVMs the same option is written '-Xmx50m'). For large datasets it will always be necessary to reduce the size to be able to work within reasonable time limits.

A fourth problem is that the GUI does not implement all the possible options. Things that could be very useful, like scoring of a test set, are not provided in the GUI, but can be called from the command line interface. So sometimes it will be necessary to switch between GUI and command line.

Finally, the data preparation and visualisation techniques offered might not be enough. Most of them are very useful, but I think in most data mining tasks you will need more to get to know the data well and to get it in the right format.

Bow:

General Description: Bow is a library of C code for statistical text analysis, language modeling and information retrieval. Together with this library, four executable programs based on it are distributed. They are Rainbow (for document classification), Arrow and Archer (for document retrieval) and Crossbow (for document clustering).
Functionalities:
- Rainbow: Rainbow is the front-end to the library that supports text classification. It is the best documented of the four, and is the most useful for the projects. It has the following functionalities:
  - Text preprocessing: This supports the conversion of texts into bag-of-words format. Different options allow you to choose a sparse or a full matrix model, to display a count or a binary present/absent flag, and to include the word names or not. Also, feature selection methods can be used, based on word frequency, word-document counts or information gain.
  - Naive Bayes classification
  - kNN classification
  - Support Vector Machine classification
  - Tfidf classification (see A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization)
  - Expectation Maximisation
- Crossbow: Crossbow supports text clustering. No documentation is available for it yet. Some indications of how to use it can be found in the help file.
- Arrow: Arrow does document retrieval based on tfidf. There is a little bit of information available in the README file. I don't think this function is useful for the projects though.
- Archer: Archer also does document retrieval. No documentation is available yet though.
Evaluation: The text preprocessing functionalities in particular are very useful for text related projects. Once the conversion to bag-of-words format is done, you have the option to use the classification and clustering features of the Bow package, or to export the matrix to other data mining programs. The classification functionalities of Rainbow give very fast and accurate results. The documentation for Rainbow is quite good, if you combine the on-line tutorial with the help file.
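To make clear what the bag-of-words conversion produces, here is a small Python sketch. It only mimics the idea and has nothing to do with Rainbow's actual code or command line options: the function names `bag_of_words` and `to_sparse`, the parameter names and the tokenisation rule (lowercase runs of letters) are all assumptions made for this example. It shows the full versus sparse representation, counts versus binary flags, and feature selection on word-document counts.

```python
import re
from collections import Counter

def bag_of_words(documents, binary=False, min_doc_count=1):
    """Return (vocabulary, matrix) with one row per document and one column per word."""
    tokenised = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # Word-document counts: in how many documents does each word occur?
    doc_counts = Counter(word for tokens in tokenised for word in set(tokens))
    # Feature selection: keep only words that occur in enough documents.
    vocabulary = sorted(w for w, n in doc_counts.items() if n >= min_doc_count)
    matrix = []
    for tokens in tokenised:
        counts = Counter(tokens)
        row = [counts[w] for w in vocabulary]
        if binary:
            row = [int(c > 0) for c in row]  # present/absent flag instead of a count
        matrix.append(row)
    return vocabulary, matrix

def to_sparse(row):
    """Sparse form of one row: only the non-zero columns, as {column index: value}."""
    return {i: c for i, c in enumerate(row) if c}

# Hypothetical toy documents.
docs = ["The cat sat on the mat", "The dog sat", "A cat and a dog"]
vocabulary, matrix = bag_of_words(docs, min_doc_count=2)
print(vocabulary)
print(matrix)
print(to_sparse(matrix[0]))
```

With real text collections most entries of the matrix are zero, which is why the sparse format is usually the more practical one to export to other programs.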

Netlab:

General description: Netlab is a toolbox of Matlab functions and scripts, originally based on Bishop's book Neural Networks for Pattern Recognition. It provides Matlab implementations of some of the newest machine learning algorithms. Worked examples of how to use them can be downloaded in zip-format.
Functionalities: This is a list of the functionalities supported by Netlab.
- PCA
- Mixtures of probabilistic PCA
- Gaussian mixture model with EM training algorithm
- Linear and logistic regression with IRLS training algorithm
- Multi-layer perceptron with linear, logistic and softmax outputs and appropriate error functions
- Radial basis function (RBF) networks with both Gaussian and non-local basis functions
- Optimisers, including quasi-Newton methods, conjugate gradients and scaled conjugate gradients
- Multi-layer perceptron with Gaussian mixture outputs (mixture density networks)
- Gaussian prior distributions over parameters for the MLP, RBF and GLM including multiple hyper-parameters
- Laplace approximation framework for Bayesian inference (evidence procedure)
- Automatic Relevance Determination (ARD) for input selection
- Markov chain Monte-Carlo including simple Metropolis and hybrid Monte-Carlo
- K-nearest neighbour classifier
- K-means clustering
- Generative Topographic Map
- Neuroscale topographic projection
- Gaussian Processes
- Hinton diagrams for network weights
- Self-organising map
Help on how to use these algorithms can be found on the Netlab homepage. Also useful are some worked examples, which can be downloaded in zip-format. There are also demos available (their functionality is described in the help files; to see their content, type 'type demoname.m' in Matlab). For some of the algorithms, the use is pretty straightforward (knn, mlp, gmm, ...). Other algorithms are less easy to use (rbf, ard, ...). In general a thorough background knowledge of the algorithms is needed to use them.
Some of the algorithms can be used for data preprocessing: PCA and ARD. Also, there are functions to read from and write to ascii files, which is very useful. Finally, a function is provided to calculate and display confusion matrices and number of correct classifications.
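For readers who have not worked with confusion matrices, the computation is sketched below in Python. This is not the Netlab function, just an illustration with an invented function name: rows stand for the actual class, columns for the predicted class, and the number of correct classifications is the sum of the diagonal.

```python
def confusion_matrix(actual, predicted):
    """Return (classes, matrix, number correct); matrix[i][j] counts actual i predicted as j."""
    classes = sorted(set(actual) | set(predicted))
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    correct = sum(matrix[i][i] for i in range(len(classes)))
    return classes, matrix, correct

# Hypothetical labels for five test instances.
actual = ["a", "a", "b", "b", "b"]
predicted = ["a", "b", "b", "b", "a"]
classes, matrix, correct = confusion_matrix(actual, predicted)
print(classes, matrix, correct)
```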
Advantages: Working with Netlab surely offers some important advantages. Compared to a package like Weka, the greatest advantage is probably that the algorithms implemented are up to date with the newest developments in the field. Also, using the functions assumes quite some knowledge of the theory behind the algorithms, so at least the data mining analyst knows what is going on. Finally, the integration in Matlab gives some important advantages: all the matrix calculation and visualisation functions are available, which can be very useful for data preparation and exploration, and for analysis of results afterwards. Also, Matlab makes scripting possible, which should increase the efficiency of the analysis.
Disadvantages: Disadvantages are that some important data preprocessing functionalities are missing: dealing with missing data, feature selection, ... Another problem is that matlab can only work with numeric data, so categorical data will have to be converted in advance, and techniques like apriori and decision trees cannot be implemented.
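One common way to convert categorical data to numbers in advance is a one-of-n (or "one-hot") coding, in which every category gets its own 0/1 column. The Python sketch below illustrates the idea; the function name is made up, and this is only one of several possible codings, so treat it as an example and not as a prescription.

```python
def one_hot_encode(values):
    """Return (categories, rows): one 0/1 column per distinct category."""
    categories = sorted(set(values))
    rows = [[1.0 if v == c else 0.0 for c in categories] for v in values]
    return categories, rows

# Hypothetical nominal attribute.
categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
print(categories)
print(encoded)
```

The resulting rows can be written to an ascii file and read into Matlab as an ordinary numeric matrix.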

SVMTorch:

General Description: SVMTorch is an implementation of Vapnik's Support Vector Machine that works both for classification and regression problems, and that has been specifically tailored for large-scale problems (such as more than 20000 examples, even for input dimensions higher than 100).
Functionalities: SVMTorch implements support vector classification and regression. The data have to be floating point values. It is possible to do classification for multiple classes: then a separate classifier is built for every class. A small user guide for SVMTorch is available on-line.
Evaluation: SVMTorch offers many options for the use of SVMs. Linear, polynomial, Gaussian, sigmoidal and even user-defined kernels are possible. Regression as well as classification can be done, and there is support for multiple classes. This extended functionality is in sharp contrast with the limited SVM support in Weka. Also, SVMTorch is very fast and scales well to large amounts of data. A disadvantage of using SVMTorch is that a separate data format is needed.
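The kernel types mentioned above can be written down in a few lines of Python. These are the common textbook forms, given here only as an illustration: the function and parameter names are invented, and the exact parameterisation used by SVMTorch (or any other package) may well differ, so check the user guide before comparing parameter values.

```python
from math import exp, tanh

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=2, scale=1.0, offset=1.0):
    return (scale * dot(x, y) + offset) ** degree

def gaussian_kernel(x, y, sigma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-squared_distance / (2 * sigma ** 2))

def sigmoid_kernel(x, y, scale=1.0, offset=0.0):
    return tanh(scale * dot(x, y) + offset)

# Hypothetical input vectors.
x, y = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, x))
```

A user-defined kernel is then simply any other function of two vectors that returns a number, although for the SVM training to be well behaved it should be a valid (positive semi-definite) kernel.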

XGobi/GGobi:

General Description: This is an interactive visualisation tool for high-dimensional data. It supports 2-D plots, 3-D rotations, scaling of axes, linked brushing (which lets you colour certain points, so that they stand out in different views of the data), and much more. A good overview of the functionalities of XGobi is given in XGobi: Interactive Dynamic Data Visualisation in the X Window System (1998) by D. F. Swayne, D. Cook and A. Buja. More information can be found in the man pages (type 'man xgobi') or in the on-line help files (click 'info' when xgobi is running).
Evaluation: This is definitely a useful tool when you do exploratory data analysis. Weka has a visualisation functionality, and Matlab also allows you to plot data, but they don't offer the dynamic and interactive features of XGobi.

c4.5:

General Description: This is software distributed by Quinlan to build decision trees using his c4.5 algorithm. Information about how to use it can be found by consulting the man pages (type 'man c4.5'). More information about the working of c4.5 can be found in Mitchell's Machine Learning book.
Evaluation: The c4.5 algorithm is also embedded in Weka, so it makes sense to use it there if you are using the other Weka functionalities. The separate c4.5 program seems to go a little faster than the java implementation in Weka, however, so for large files, it might be useful to use this.

cluster:

General Description: This program supports hierarchical clustering and PCA. Information about how to use it can be found in the man pages (type 'man 1 cluster').
Evaluation: Hierarchical clustering is not provided in any of the other software packages presented above, so if you want to do that, you'll have to use this program. It is quite slow though, so don't try to use it on large amounts of data. The PCA functionality is also available in Netlab and in Weka, so you can use it there.
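To show what hierarchical (agglomerative) clustering does, here is a naive Python sketch using single linkage. It is written for this page and says nothing about how the cluster program works internally: the function name, the choice of single linkage and the Euclidean distance are all assumptions made for the example.

```python
def single_linkage(points, num_clusters):
    """Merge the two closest clusters until num_clusters remain; returns (clusters, merges)."""
    clusters = [[p] for p in points]  # start with every point in its own cluster

    def distance(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
                   for p in c1 for q in c2)

    merges = []  # history of merges, i.e. the tree from the bottom up
    while len(clusters) > num_clusters:
        pairs = ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, merges

# Hypothetical one-dimensional toy data.
points = [(0,), (1,), (5,), (6,), (20,)]
clusters, merges = single_linkage(points, 2)
print(clusters)
```

This naive version recomputes all pairwise cluster distances at every merge, which gives some feeling for why hierarchical clustering becomes slow on large amounts of data.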

This page was written by Frederick Ducatelle (fredduc@dai.ed.ac.uk) and is maintained by Amos Storkey (a.storkey@ed.ac.uk)
