I am starting to make some notes on using the software packages on the data sets.
This is a library of C++ code for machine learning algorithms. There is quite good documentation available, but the problem is that the algorithms are quite old: all documents and code seem to be dated '96 or '97. As a result, only older techniques like decision trees and kNN are implemented (not even MLPs are available).
Some options for data preprocessing and evaluation of results are given, but in the end the choice of induction algorithms is quite limited.
I did not download and try out the code, because it is obviously insufficient for us. It is a shame, because the documentation is in fact really good.
Weka is open-source data mining software. It supports not only machine learning algorithms, but also data preparation and meta-learners like bagging and boosting. The whole suite is written in Java, so it can be run on any platform. It is also possible to embed the classes in your own code, or to add your own machine learning algorithms.
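As an illustration of embedding the classes, here is a minimal sketch of what a small Java program using Weka could look like (the ARFF file names are made up, and the exact package names differ between Weka versions; older releases have the decision tree learner under weka.classifiers.j48.J48 instead of weka.classifiers.trees.J48):

    import java.io.BufferedReader;
    import java.io.FileReader;

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;

    public class WekaSketch {
        public static void main(String[] args) throws Exception {
            // Load training and test sets (hypothetical file names) and mark the last attribute as the class
            Instances train = new Instances(new BufferedReader(new FileReader("sat-train.arff")));
            Instances test = new Instances(new BufferedReader(new FileReader("sat-test.arff")));
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Build a J48 decision tree on the training set and score the test set
            J48 tree = new J48();
            tree.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString());
        }
    }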
The functionalities of Weka more or less boil down to the algorithms described in Witten and Frank's data mining book (chapter 8 of that book is a tutorial for the software). A complete overview of the implemented algorithms is given in the on-line documentation.
The obvious advantage of a package like Weka is that a whole range of data preparation, feature selection and data mining algorithms is integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes really easy. The package also comes with a GUI, which should make it easier to use. A final important advantage is obviously that it is free, which is usually not the case for integrated data mining suites like this.
Probably the most important disadvantage of most data mining suites is that they do not implement the newest techniques. For example, the MLP implemented has a very basic training algorithm (backprop with momentum), and the SVM only uses polynomial kernels and does not support numeric estimation.

Another important disadvantage arises from the fact that the software is free: the documentation (especially for the GUI) is quite limited. Fortunately Witten and Frank's data mining book is more or less a summary of the functionalities of the program, and chapter 8 is a tutorial for it (it does not describe anything about the GUI though). As the software is constantly growing, the documentation is not up to date with everything either.

A third possible problem is scaling. I ran several experiments with the landsat dataset, which contains 4000 training instances and 2000 test instances, with 37 attributes. For most algorithms there was no problem, but some more difficult tasks (like using the MultiClassClassifier in combination with SVMs) caused an OutOfMemoryError, which quite often arises in Java. So I am not sure how well the package would scale. In any case, this is a problem that will probably come up for most packages, and even if it does not, students will have to sample large datasets in order to work on them within reasonable time limits.

A fourth problem is that the GUI does not implement all the possible options. Things that could be very useful, like scoring of a test set, are not provided in the GUI, but can be called from the command line interface (see the example below). So sometimes it will be necessary to switch between GUI and command line.

Finally, the data preparation and visualisation techniques offered might not be enough. Most of them are very useful, but I think in most data mining tasks you will need more to get to know the data well and to get it into the right format.
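As an example of command-line scoring of a test set (a sketch only, with weka.jar on the classpath; the exact package name of the classifier and the available options depend on the Weka version):

    java weka.classifiers.trees.J48 -t train.arff -T test.arff

Here -t names the training file and -T the test file; evaluation statistics for the test set are printed to standard output.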
Torch is a library of C++ code for machine learning algorithms. The classes provided make it possible to build C++ programs that use modern machine learning algorithms quickly and easily. Three working examples of machine learning programs that use this library are given: an MLP, an HMM and a GMM. On the whole, however, this is more a library for C++ programmers or machine learning researchers: before the classes can effectively be used, a program has to be written that uses them. An example is the simple MLP program that is provided with the source code.
SVM (classification and regression), MLP, RBF, GMM and HMM, k-means, kNN, and Parzen regression and density estimation
I am sure this library is very valuable for machine learning researchers and C++ programmers. It makes it possible to integrate some of the most recent developments in machine learning algorithms into C++ programs. However, I don't think we can use this for data mining exercises. The students would need knowledge of C++, would have to get acquainted with the library, and would then have to build their own data mining programs. This is a lot of work, which would keep them from trying out different algorithms and techniques.
This is a software package for text processing. It can convert texts into a bag-of-words format, and has classification and clustering functionalities.
Most important for us is probably the text preprocessing functionality, which makes it possible to convert a group of texts into a bag-of-words format (hence the name BOW). This seems to work properly and very fast. Different options make it possible to choose a sparse or a full matrix model, to output a count or a binary present/absent flag, and to include the word names or not. Feature selection methods can also be used, based on word frequency, word-document counts or information gain.
Furthermore, the package allows text classification, text retrieval and text clustering. For classification, the naive Bayes classifier seems to work without any problems. kNN is also supported, but does not work, for one reason or another. SVM only partly works (I am probably doing something wrong there). TFIDF classification (see "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization") seems to work well. EM is also implemented and works.
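For reference, the TFIDF weighting these methods are built on assigns to a term t in a document d the weight

    w(t, d) = tf(t, d) · log(N / df(t))

where tf(t, d) is the number of occurrences of t in d, df(t) is the number of documents containing t, and N is the total number of documents; the TFIDF classifier then assigns a document to the class whose prototype vector is most similar in the cosine sense. The exact variant Rainbow implements may differ slightly from this standard formulation.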
Crossbow is the interface that would support text clustering,
but there is no documentation available yet, and I have been unable to
make it work.
Arrow does document retrieval based on tfidf. No documentation
is available yet.
This package seems to be very useful. Especially the text preprocessing features are very interesting. They work perfectly well for the Usenet newsgroup texts, but for the Reuters data I can see some difficulties lying ahead: the tool accepts only one class per text (which is the case for only 57% of the Reuters data), and the documents are supposed to be organised in directories, one class per directory, one text per file (Reuters is organised in 22 large files). Also, the Reuters dataset seems to be much more of a real-life dataset, with some rare classes and misspelled class names. So extra preprocessing will be necessary. Maybe this is actually an advantage, because with a well-prepared data set like the newsgroups, the possible tasks for the students are quite limited.
The classification algorithms seem to be less solid. So far, I have not been able to make the SVMs and the kNN work properly. However, it should be possible to preprocess the texts with Rainbow and then use the matrix in Weka or Matlab (although the high dimensionality of the bag-of-words matrix will probably pose a big challenge to these packages).
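Taking that route would mean converting the bag-of-words matrix into Weka's ARFF format; a minimal sketch of such a converter in Java (the words, counts, class names and output file name are made up for illustration):

    import java.io.PrintWriter;

    public class BowToArff {
        public static void main(String[] args) throws Exception {
            // Hypothetical toy data: word counts per document plus a class label per document
            String[] words = {"wheat", "oil", "trade"};
            int[][] counts = {{3, 0, 1}, {0, 2, 0}};
            String[] labels = {"grain", "crude"};

            PrintWriter out = new PrintWriter("bow.arff");
            out.println("@relation bow");
            for (String w : words)
                out.println("@attribute " + w + " numeric");
            out.println("@attribute class {grain, crude}");
            out.println("@data");
            for (int i = 0; i < counts.length; i++) {
                StringBuilder row = new StringBuilder();
                for (int c : counts[i])
                    row.append(c).append(",");
                row.append(labels[i]);
                out.println(row);
            }
            out.close();
        }
    }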
Finally, the documentation is fairly good (at least for Rainbow), combining the on-line tutorial with the help file.
This is software for text retrieval. Its text preprocessing functionalities are often used in scientific work on text analysis. This software, too, I have been unable to install, because of C compilation problems: the output type of several functions has to be defined in a separate header file (src/h/sysfunc.h), depending on the system configuration. I have been able to solve some of the problems, but others remain.
Netlab is a toolbox of Matlab functions and scripts, originally based on Bishop's Neural Networks for Pattern Recognition book. The software is accompanied by a book, Netlab: Algorithms for Pattern Recognition, which describes the theory behind the different algorithms and gives one or more worked examples for each. The book costs 24.5 pounds, but the examples are freely available as zip files and provide some insight into how to use the Netlab algorithms. The help pages for the different functions in Netlab are also freely available, and there are some demos of how to use the functions.
The new release of Netlab includes, among others, kNN, MLP, RBF, GMM, PCA and ARD.
For some of these algorithms the use is pretty straightforward (kNN, MLP, GMM, ...); others are less easy to use (RBF, ARD, ...). In general, you could say that thorough background knowledge about the algorithms is needed to use them, which is more an advantage than a disadvantage.
Some of the algorithms (PCA and ARD) can be used for data preprocessing. There are also functions to read from and write to ASCII files, which is very useful. Finally, a function is provided to calculate and display confusion matrices and the number of correct classifications.
Working with Netlab surely offers some important advantages. Compared to a package like Weka, the greatest advantage is probably that the algorithms implemented are up to date with the newest developments in the field. Also, using the functions requires quite some knowledge of the theory behind the algorithms, so at least the students will know what they are doing. Finally, the integration in Matlab brings some important advantages: all the matrix calculation functions are available, which can be very useful for data preparation and for analysis of results afterwards, and scripting is possible, which should increase the efficiency of the analysis.
Disadvantages are that some important data preprocessing functionalities are missing: dealing with missing data, feature selection, and so on. Another problem is that Matlab can only work with numeric data, so categorical data will have to be converted in advance, and techniques like Apriori and decision trees cannot be implemented.
This is an SVM toolbox for Matlab. It performs training with algorithms coded in C++, both to speed up training and to avoid having to use extra Matlab optimisation packages.
Linear, polynomial and RBF kernels can be used. The basic training algorithm is two-class, but multiclass classification is possible using a max-win approach or a pairwise classification approach. This, however, makes the process much slower (as can be expected).
Although training is indeed pretty fast, labelling afterwards seems to be a problem. For two-class problems, reasonable speeds can be obtained (using the 'fixduplicates' and 'strip' functions from the demo), but especially for multiclass problems, using fwd to label the test data takes forever (I don't know exactly how long, as I let it run overnight). Another problem is that the documentation is very limited: you have to get an idea by looking at the demo and at the different .m files.
Like the previous one, this is an SVM toolbox for Matlab which uses C++ code to speed up training. The C++ code was originally written for PCs, but alternatives for use on Unix machines are provided.
Linear, polynomial and RBF kernels are supported. Multiclass classification is no problem. Support vector regression is not supported.
A full tutorial is given as documentation. This covers everything about how to use the program, but does not say anything about the data format (it turns out to want cases in columns and attributes in rows, with the target class as one attribute, numbered from 1 to t). The algorithm is quite fast.
This is a toolbox to create and work with Bayesian nets in Matlab. The toolbox makes it possible to build all sorts of networks (including naive Bayes ones), but I don't think it was developed to work with large amounts of data. Also, you have to do quite some programming yourself; there is no algorithm provided that does the classification for you out of the box. I tried to build and train a naive Bayes network for the landsat data, but that seems to take forever (maybe I did not program it in the right way). The scoring gave an accuracy of 74.5%, which is lower than that of Weka's naive Bayes classifier. A positive point about this toolbox is that the documentation is really good.
This software package can be purchased alongside the book 'Predictive Data Mining' by Weiss and Indurkhya. It implements the algorithms described in this book. The software is available for Unix and for Windows NT. It is downloadable from the web (normally for 25 pounds, but temporarily for free).
I have not installed this package yet, but from the description on the website it does not seem to be too great. It includes some features for data preprocessing: normalisation, discretisation, feature reduction and selection, sampling, and so on. The classification and regression techniques supported are quite limited: linear regression, neural nets (without specification of how they are implemented), kNN, decision trees and rules, and association rules. The meta-learning schemes bagging and boosting are also supported. The text preprocessing features are interesting: converting text to bag-of-words format is supported.
This is data visualisation software in Java. The description on the website is very promising, but I have not been able to make it work properly: it reads in the data but does not want to start working with it (Java throws a NullPointerException).
Look at http://www.mlnet.org/.
Also: consider Matlab as a first data exploration tool (and for preprocessing - sampling?).
Intelligent Miner for Text.
IBM offers free use of DB2 and Intelligent Miner for academic purposes: http://www.almaden.ibm.com/cs/quest/scholar.html. I think this can be ignored, because they are in fact aiming for a long-term cooperation with the university (they have certain conditions on the use of their software).
I am going to check out WizWhy, InfoZoom ...