I am starting to make some notes on using the software packages on the data sets.
This is a library of C++ code for machine learning algorithms. There is quite good documentation available, but the problem is that the algorithms are quite old: all documents and code seem to be dated '96 or '97. As a result, only older techniques like decision trees and kNN are implemented (not even MLPs are available).
Some options for data preprocessing and evaluation of results are given, but in the end the choice of induction algorithms is quite limited.
I did not download and try out the code, because it is obviously insufficient for us. It is a shame, because the documentation is in fact really good.
Weka is open-source data mining software. It supports not only machine learning algorithms, but also data preparation and meta-learners like bagging and boosting. The whole suite is written in Java, so it can be run on any platform. It is also possible to embed the classes in your own code, or to add your own machine learning algorithms.
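As an illustration of embedding the classes, here is a minimal sketch of what a small Java program using Weka could look like (the ARFF file names are made up, and the exact package names differ between Weka versions; older releases have the decision tree learner under weka.classifiers.j48.J48 instead of weka.classifiers.trees.J48):

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class WekaSketch {
        public static void main(String[] args) throws Exception {
            // Load training and test sets (hypothetical file names) and mark the last attribute as the class
            Instances train = new Instances(new BufferedReader(new FileReader("sat-train.arff")));
            Instances test = new Instances(new BufferedReader(new FileReader("sat-test.arff")));
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Build a J48 decision tree on the training set and score the test set
            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
        }
    }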
The functionalities of Weka more or less boil down to the algorithms described in Witten and Frank's data mining book (chapter 8 of that book is a tutorial for the software). A complete overview of the implemented algorithms is given in the on-line documentation.
The obvious advantage of a package like Weka is that a whole range of data preparation, feature selection and data mining algorithms is integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes really easy. The package also comes with a GUI, which should make it easier to use. A final important advantage is obviously that it is free, which is usually not the case for integrated data mining suites like this.
Probably the most important disadvantage of most data mining suites is that they do not implement the newest techniques. For example, the MLP implemented has a very basic training algorithm (backprop with momentum), and the SVM only uses polynomial kernels and does not support numeric estimation.

Another important disadvantage arises from the fact that the software is free: the documentation (especially for the GUI) is quite limited. Fortunately Witten and Frank's data mining book is more or less a summary of the functionalities of the program, and chapter 8 is a tutorial for it (it does not describe anything about the GUI though). As the software is constantly growing, the documentation is not up to date with everything either.

A third possible problem is scaling. I ran several experiments with the landsat dataset, which contains 4000 training instances and 2000 test instances, with 37 attributes. For most algorithms there was no problem, but some more difficult tasks (like using the MultiClassClassifier in combination with SVMs) caused an OutOfMemoryError, which quite often arises in Java. So I am not sure how well the package would scale. In any case, this is a problem that will probably come up for most packages, and even if it does not, students will have to sample large datasets in order to work on them within reasonable time limits.

A fourth problem is that the GUI does not implement all the possible options. Things that could be very useful, like scoring of a test set, are not provided in the GUI, but can be called from the command line interface (see the example below). So sometimes it will be necessary to switch between GUI and command line.

Finally, the data preparation and visualisation techniques offered might not be enough. Most of them are very useful, but I think in most data mining tasks you will need more to get to know the data well and to get it into the right format.
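As an example of command-line scoring of a test set (a sketch only, with weka.jar on the classpath; the exact package name of the classifier and the available options depend on the Weka version):

    java weka.classifiers.trees.J48 -t train.arff -T test.arff

Here -t names the training file and -T the test file; evaluation statistics for the test set are printed to standard output.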
Torch is a library of C++ code for machine learning algorithms. The classes provided make it possible to build C++ programs that use modern machine learning algorithms quickly and easily. Three working examples of machine learning programs that use this library are given: an MLP, an HMM and a GMM. On the whole, however, this is more a library for C++ programmers or machine learning researchers: before the classes can effectively be used, a program has to be written that uses them. An example is the simple MLP program that is provided with the source code.
SVM (classification and regression), MLP, RBF, GMM and HMM, k-means, kNN, and Parzen regression and density estimation
I am sure this library is very valuable for machine learning researchers and C++ programmers. It makes it possible to integrate some of the most recent developments in machine learning algorithms into C++ programs. However, I don't think we can use this for data mining exercises. The students would need knowledge of C++, would have to get acquainted with the library, and would then have to build their own data mining programs. This is a lot of work, which would keep them from trying out different algorithms and techniques.
This is a software package for text processing. It can convert texts into a bag-of-words format, and has classification and clustering functionalities.
Most important for us is probably the text preprocessing functionality, which makes it possible to convert a group of texts into a bag-of-words format (hence the name BOW). This seems to work properly and very fast. Different options make it possible to choose a sparse or a full matrix model, to output a count or a binary present/absent flag, and to include the word names or not. Feature selection methods can also be used, based on word frequency, word-document counts or information gain.
Furthermore, the package allows text classification, text retrieval and text clustering. For classification, the naive Bayes classifier seems to work without any problems. kNN is also supported, but does not work, for one reason or another. SVM only partly works (I am probably doing something wrong there). TFIDF classification (see "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization") seems to work well. EM is also implemented and works.
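For reference, the TFIDF weighting these methods are built on assigns to a term t in a document d the weight

    w(t, d) = tf(t, d) · log(N / df(t))

where tf(t, d) is the number of occurrences of t in d, df(t) is the number of documents containing t, and N is the total number of documents; the TFIDF classifier then assigns a document to the class whose prototype vector is most similar in the cosine sense. The exact variant Rainbow implements may differ slightly from this standard formulation.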
Crossbow is the interface that would support text clustering,
but there is no documentation available yet, and I have been unable to
make it work.
Arrow does document retrieval based on tfidf. No documentation
is available yet.
This package seems to be very useful. Especially the text preprocessing features are very interesting. They work perfectly well for the Usenet newsgroup texts, but for the Reuters data I can see some difficulties lying ahead: the tool accepts only one class per text (which is the case for only 57% of the Reuters data), and the documents are supposed to be organised in directories, one class per directory, one text per file (Reuters is organised in 22 large files). Also, the Reuters dataset seems to be much more of a real-life dataset, with some rare classes and misspelled class names. So extra preprocessing will be necessary. Maybe this is actually an advantage, because with a well-prepared data set like the newsgroups, the possible tasks for the students are quite limited.
The classification algorithms seem to be less solid. So far, I have not been able to make the SVMs and the kNN work properly. However, it should be possible to preprocess the texts with Rainbow and then use the matrix in Weka or Matlab (although the high dimensionality of the bag-of-words matrix will probably pose a big challenge to these packages).
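Taking that route would mean converting the bag-of-words matrix into Weka's ARFF format; a minimal sketch of such a converter in Java (the words, counts, class names and output file name are made up for illustration):

    import java.io.PrintWriter;

    public class BowToArff {
        public static void main(String[] args) throws Exception {
            // Hypothetical toy data: word counts per document plus a class label per document
            String[] words = {"wheat", "oil", "trade"};
            int[][] counts = {{3, 0, 1}, {0, 2, 0}};
            String[] labels = {"grain", "crude"};

            PrintWriter out = new PrintWriter("bow.arff");
            out.println("@relation bow");
            for (String w : words)
                out.println("@attribute " + w + " numeric");
            out.println("@attribute class {grain, crude}");
            out.println("@data");
            for (int i = 0; i < counts.length; i++) {
                StringBuilder row = new StringBuilder();
                for (int c : counts[i])
                    row.append(c).append(",");
                row.append(labels[i]);
                out.println(row);
            }
            out.close();
        }
    }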
Finally, the documentation is fairly good (at least for Rainbow), combining the on-line tutorial with the help file.
This is software for text retrieval. Its text preprocessing functionalities are often used in scientific work on text analysis. This software, too, I have been unable to install, because of C compilation problems: the output type of several functions has to be defined in a separate header file (src/h/sysfunc.h), depending on the system configuration. I have been able to solve some of the problems, but others remain.
Netlab is a toolbox of Matlab functions and scripts, originally based on Bishop's Neural Networks for Pattern Recognition book. The software is accompanied by a book, Netlab: Algorithms for Pattern Recognition, which describes the theory behind the different algorithms and gives one or more worked examples for each. The book costs 24.5 pounds, but the examples are freely available as zip files and provide some insight into how to use the Netlab algorithms. The help pages for the different functions in Netlab are also freely available, and there are some demos of how to use the functions.
The new release of Netlab includes, among others, kNN, MLP, RBF, GMM, PCA and ARD.
For some of these algorithms the use is pretty straightforward (kNN, MLP, GMM, ...); others are less easy to use (RBF, ARD, ...). In general, you could say that thorough background knowledge about the algorithms is needed to use them, which is more an advantage than a disadvantage.
Some of the algorithms (PCA and ARD) can be used for data preprocessing. There are also functions to read from and write to ASCII files, which is very useful. Finally, a function is provided to calculate and display confusion matrices and the number of correct classifications.
Working with Netlab surely offers some important advantages. Compared to a package like Weka, the greatest advantage is probably that the algorithms implemented are up to date with the newest developments in the field. Also, using the functions requires quite some knowledge of the theory behind the algorithms, so at least the students will know what they are doing. Finally, the integration in Matlab brings some important advantages: all the matrix calculation functions are available, which can be very useful for data preparation and for analysis of results afterwards, and scripting is possible, which should increase the efficiency of the analysis.
Disadvantages are that some important data preprocessing functionalities are missing: dealing with missing data, feature selection, and so on. Another problem is that Matlab can only work with numeric data, so categorical data will have to be converted in advance, and techniques like Apriori and decision trees cannot be implemented.
This is an SVM toolbox for Matlab. It performs training with algorithms coded in C++, both to speed up training and to avoid having to use extra Matlab optimisation packages.
Linear, polynomial and RBF kernels can be used. The basic training algorithm is two-class, but multiclass classification is possible using a max-win approach or a pairwise classification approach. This, however, makes the process much slower (as can be expected).
Although training is indeed pretty fast, labelling afterwards seems to be a problem. For two-class problems, reasonable speeds can be obtained (using the 'fixduplicates' and 'strip' functions from the demo), but especially for multiclass problems, using fwd to label the test data takes forever (I don't know exactly how long, as I let it run overnight). Another problem is that the documentation is very limited: you have to get an idea by looking at the demo and at the different .m files.
Like the previous one, this is an SVM toolbox for Matlab which uses C++ code to speed up training. The C++ code was originally written for PCs, but alternatives for use on Unix machines are provided.
Linear, polynomial and RBF kernels are supported. Multiclass classification is no problem. Support vector regression is not supported.
A full tutorial is given as documentation. This covers everything about how to use the program, but does not say anything about the data format (it turns out to want cases in columns and attributes in rows, with the target class as one attribute, numbered from 1 to t). The algorithm is quite fast.
This is a toolbox to create and work with Bayesian nets in Matlab. The toolbox makes it possible to build all sorts of networks (including naive Bayes ones), but I don't think it was developed to work with large amounts of data. Also, you have to do quite some programming yourself; there is no algorithm provided that does the classification for you out of the box. I tried to build and train a naive Bayes network for the landsat data, but that seems to take forever (maybe I did not program it in the right way). The scoring gave an accuracy of 74.5%, which is lower than that of Weka's naive Bayes classifier. A positive point about this toolbox is that the documentation is really good.
This software package can be purchased alongside the book 'Predictive Data Mining' by Weiss and Indurkhya. It implements the algorithms described in this book. The software is available for Unix and for Windows NT. It is downloadable from the web (normally for 25 pounds, but temporarily for free).
I have not installed this package yet, but from the description on the website it does not seem to be too great. It includes some features for data preprocessing: normalisation, discretisation, feature reduction and selection, sampling, and so on. The classification and regression techniques supported are quite limited: linear regression, neural nets (without specification of how they are implemented), kNN, decision trees and rules, and association rules. The meta-learning schemes bagging and boosting are also supported. The text preprocessing features are interesting: converting text to bag-of-words format is supported.
This is data visualisation software in Java. The description on the website is very promising, but I have not been able to make it work properly: it reads in the data but does not want to start working with it (Java throws a NullPointerException).
Look at http://www.mlnet.org/.
Also: consider Matlab as a first data exploration tool (and for preprocessing - sampling?).
Intelligent Miner for Text.
IBM offers free use of DB2 and Intelligent Miner for academic purposes: http://www.almaden.ibm.com/cs/quest/scholar.html. I think this can be ignored, because they are in fact aiming for a long-term cooperation with the university (they have certain conditions on the use of their software).
I am going to check out WizWhy, InfoZoom ...