Software for the data mining course

The following software packages are available on the inf system, and we recommend them for the data mining projects. Many more tools are available on the web, and you are free to use any of those if you find them more suitable. Bear in mind, however, that many of the publicly available tools contain a lot of bugs and are hard to install. It is also not advisable to use too many different tools together: each of them will need its own data format, and a lot of time and effort will be wasted on data conversions.


SAS:

SAS is a large-scale statistical analysis, data handling, and business intelligence package. It is used extensively in different business domains as a primary analysis tool. It has good data handling functions and is built around a comprehensive language. Base SAS provides the core tools that are needed. Visual extensions such as SAS Enterprise Miner provide more graphical interface tools for basic data mining. If you want a job in this area, you would do well to learn to use SAS.

Weka:

General Description:

Weka is open source data mining software. It supports not only machine learning algorithms, but also data preparation and meta-learners such as bagging and boosting. The whole suite is written in Java, so it can be run on any platform. The package has three different interfaces: a command line interface, an Explorer GUI interface (which allows you to try out different preparation, transformation and modelling algorithms on a dataset), and an Experimenter GUI interface (which allows you to run different algorithms in batch and to compare the results). A good introduction to Weka is the tutorial given in chapter 8 of Data Mining (2000) by I. H. Witten and E. Frank. This is an introduction to the command line interface. For the Explorer GUI interface there is no documentation, but the functionalities are more or less the same as for the command line interface. For the Experimenter interface a short, clear tutorial is available.

Functionalities:

The functionalities of Weka more or less boil down to the algorithms described in Witten and Frank's data mining book. A complete overview of the implemented algorithms is given in the on-line documentation (also, when using the command line interface, the option '-h' will list the possible options for any algorithm).

In short, besides the learning algorithms themselves, meta-learning schemes are also supported, and Weka includes a package of clustering algorithms, among them k-means.

Advantages:

The obvious advantage of a package like Weka is that a whole range of data preparation, feature selection and data mining algorithms are integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes really easy. The package also comes with a GUI, which should make it easier to use.
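To make the "only one data format" point concrete, here is a minimal sketch of writing a small table in that single format. It assumes the format in question is ARFF, Weka's native text format, which this page does not name. The class name `ArffSketch`, the relation name and the weather-style data are made up for illustration, and the sketch does not quote values that contain spaces or commas.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch: render a small in-memory table as ARFF text.
// Assumption: ARFF is the data format meant above; all names and data are illustrative.
class ArffSketch {

    /** Builds ARFF text from a relation name, {name, type} attribute pairs and data rows. */
    static String toArff(String relation, List<String[]> attributes, List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        // Header: one line per attribute, e.g. "@attribute temperature numeric"
        // or "@attribute play {yes,no}" for a nominal attribute.
        for (String[] attr : attributes) {
            sb.append("@attribute ").append(attr[0]).append(' ').append(attr[1]).append('\n');
        }
        // Data section: one comma-separated line per instance.
        sb.append("\n@data\n");
        for (String[] row : rows) {
            sb.append(String.join(",", row)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> attributes = Arrays.asList(
            new String[] {"outlook", "{sunny,overcast,rainy}"},
            new String[] {"temperature", "numeric"},
            new String[] {"play", "{yes,no}"});
        List<String[]> rows = Arrays.asList(
            new String[] {"sunny", "85", "no"},
            new String[] {"overcast", "83", "yes"});
        System.out.print(toArff("weather", attributes, rows));
    }
}
```

Once the data is in this form, the same file can be passed to the preparation, feature selection and modelling steps without further conversion.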

Disadvantages:

Probably the most important disadvantage of data mining suites like this is that they do not implement the newest techniques. For example, the MLP implementation has a very basic training algorithm (backprop with momentum), and the SVM only uses polynomial kernels and does not support numeric estimation. It will therefore be necessary to combine Weka with some of the other tools, such as Netlab or SVMTorch.

Another important disadvantage arises from the fact that the software is free: the documentation for the GUI is quite limited. Witten and Frank's data mining book is more or less a summary of the functionalities of the program, and chapter 8 is a tutorial for it, but it does not describe the GUI at all. As the software is constantly growing, the documentation is not up to date with everything either (the most up-to-date and complete information about algorithm options can be obtained with the '-h' option in the command line interface).

A third possible problem is scaling. For difficult tasks on large datasets, the running time can become quite long, and Java sometimes gives an OutOfMemory error. This problem can be reduced by raising the maximum heap size with the '-mx' option when calling java (spelled '-Xmx' on current JVMs), followed by the memory size (e.g. '-Xmx50m'). For large datasets it will always be necessary to reduce the size to be able to work within reasonable time limits.

A fourth problem is that the GUI does not implement all the possible options. Things that could be very useful, like scoring a test set, are not provided in the GUI but can be called from the command line interface. So sometimes it will be necessary to switch between the GUI and the command line.

Finally, the data preparation and visualisation techniques offered might not be enough. Most of them are very useful, but I think in most data mining tasks you will need more to get to know the data well and to get it into the right format.
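When experimenting with the memory option described above, it can help to confirm how much heap the JVM actually received. The following is a minimal sketch using the standard `Runtime.maxMemory()` call; the class name `HeapCheck` is made up for illustration, and the `-Xmx` spelling assumes a current JVM.

```java
// Minimal sketch: report the maximum heap size the JVM was started with.
// Run for example as:  java -Xmx50m HeapCheck   (assumes the -Xmx spelling of the option)
class HeapCheck {

    /** Converts a byte count to whole megabytes, rounding down. */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // maxMemory() returns the upper limit the JVM will try to use for the heap.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Maximum heap available to this JVM: "
            + toMegabytes(maxBytes) + " MB");
    }
}
```

The reported figure may differ slightly from the value requested on the command line, since the JVM rounds and reserves some space internally. If it is far below what a dataset needs, reducing the dataset is likely to be the more practical fix.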


Bow:


Netlab:


SVMTorch:


XGobi/GGobi:


c4.5:


cluster:


This page was written by Frederick Ducatelle (fredduc@dai.ed.ac.uk) and is maintained by Amos Storkey (a.storkey@ed.ac.uk)



Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh