January 2002. Frederick Ducatelle and Chris Williams. School of
Informatics, University of Edinburgh; revised Jan 2003.
In this tutorial we will first look at association rules, using the
APRIORI algorithm in Weka. After that we will investigate the use of
Matlab and its machine learning toolbox Netlab for data
mining.
- APRIORI works with categorical values only. Therefore we will use
a different dataset called "adult".
This dataset contains census data about
48,842 US adults. The aim is to predict whether their income exceeds
$50,000. The dataset is taken from the Delve website, and
originally came from the UCI Machine
Learning Repository. More information about it is available in
the original UCI Documentation.
Download a copy of
adult.arff and load it into Weka.
- This dataset is not immediately ready for use with
APRIORI. First, reduce its size by taking a random sample. You can do
this with the 'ResampleFilter' in the preprocess tab sheet: click on
the label under 'Filters', choose 'ResampleFilter' from the drop down
menu, set the 'sampleSizePercentage' (to 15, for example), click 'OK' and 'Add',
and click 'Apply Filters'. The 'Working relation' is now a subsample
of the original adult dataset. Now we have to get rid of the numerical
attributes. You can choose to discard them, or to discretise them. We
will discretise the first attribute ('age'): choose the
'DiscretizeFilter', set 'attributeIndices' to 'first', bins to a low
number, like 4 or 5, and the other options to 'False'. Then add this
new filter to the others. We will get rid of the
other numerical attributes: choose an 'AttributeFilter', set
'invertSelection' to 'False', and enter the indices of the remaining
numeric attributes (3,5,11-13). Apply all the filters together
now. Then click on 'Replace' to make the resulting 'Working relation'
the new 'Base relation'.
- Now go to the 'Associate' tab sheet and click under
'Associator'. Set 'numRules' to 25, and keep the other options on
their defaults. Click 'Start' and observe the results. What do you
think about these rules? Are they useful?
- From the previous results, it is obvious that some attributes should not
be examined simultaneously because they lead to trivial results. Go
back to the 'Preprocess' sheet. If you have replaced the original 'Base
relation' by the 'Working relation', you can include and exclude
attributes very easily: delete all filters from the 'Filters'
window, then remove the check mark next to the attributes you want
to get rid of and click 'Apply Filters'. You now have a new 'Working
relation'. Try to remove different combinations of the attributes that
lead to trivial association rules. Run APRIORI several times and look
for interesting rules. You will find that there is often a whole range
of rules which are all based on the same simpler rule. Also, you will
often get rules that don't include the target class. This is why in
most cases you would use APRIORI for dataset exploration rather than
for predictive modelling.
- Now, we will look at the use of Netlab for data mining. Using
Netlab has the advantage that you have all of Matlab's functionalities
at your disposal. Also, with Netlab all the machine learning algorithms
are up to date with the latest developments, and their workings are clear
(unlike in Weka, where you are often left guessing which learning
scheme is actually implemented). Start Matlab and check if Netlab is
in your path by typing 'help netlab'. If this gives an error message,
you will have to add Netlab to your path. You can do this by choosing
'Set Path' under 'File'. Click 'Add Folder' and type
/opt/matlab-7.0.4/toolbox/local/netlab . You can also
do this from the Matlab command line by typing
addpath /opt/matlab-7.0.4/toolbox/local/netlab
We will use the Landsat data again. Netlab requires yet another
format. Download the files
sattrn.txt
and
sattst.txt .
Load the training data into matrices x (the attribute values) and xt
(the class labels) and the test data into y and yt, by typing:
[x,xt,nin,nout,xndata] = datread('sattrn.txt');
[y,yt,nin,nout,yndata] = datread('sattst.txt');
The attribute values have been normalised to lie between -1 and 1, and
the class labels are coded as a 6 column matrix of zeros and ones. The
empty class 6 has been omitted.
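As a quick sanity check that the data loaded correctly (an optional
step of ours, not part of the tutorial files), you can inspect the
matrices:
size(x)      % number of training cases by number of attributes
sum(xt)      % cases per class: each column of xt is a 0/1 class indicator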
- Matlab has some useful visualisation functionalities. You can get
a histogram of a variable using hist(X,N), where X is an array
of values, and N is the number of bins in the histogram. Remember how
we said in the previous tutorial that the traditional Naive Bayes
classifier assumes that numeric attributes are normally distributed?
Try to check this assumption visually by making histograms of the
different attributes (a sketch follows at the end of this item).
Another visualisation possibility is to make
plots of two variables. As an exercise, run the script
pca_tut.m .
It projects the data onto the first two
principal components and then plots them, using a different color for
each class. Display the contents of the script by typing 'type
pca_tut' and make sure you understand how it works.
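For the histogram check mentioned above, a minimal sketch (the
attribute indices and bin count are arbitrary illustrative choices):
for i = 1:4
    subplot(2, 2, i);                  % 2-by-2 grid of plots
    hist(x(:, i), 20);                 % 20-bin histogram of attribute i
    title(sprintf('attribute %d', i));
end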
- Let's now move on to predictive modelling. Download the matlab
functions ccgmbuild.m
and
ccgmeval.m .
ccgmbuild builds a class conditional
Gaussian mixture model by fitting a single multivariate Gaussian
to the examples belonging to each class. The resulting model
can be evaluated on the test data using 'ccgmeval'. For each
test case x, the probability p(C=c|X=x) is calculated for each
class. The class that gives the highest probability is chosen. The
output of the function is a confusion matrix and a classification
rate. View the code ('type ccgmbuild' and 'type ccgmeval') and try to
understand it (a rough sketch of the underlying idea follows at the
end of this item).
Run 'ccgmbuild' on the landsat data with the 'naive' option
set to 1. This will build a naive Bayes model. The difference from
a full class-conditional Gaussian model is the assumption that all
attributes are conditionally independent given the class (so the
covariance matrix of the fitted Gaussians is diagonal).
M = ccgmbuild(x,xt,1);
[C,rate] = ccgmeval(y,yt,M)
This should give you the same result as the Naive Bayes functionality in
Weka. Compare this performance with the other predictors you used last
week. What do you think of the independence assumption with regard to
the Landsat data? Build a full Gaussian model (with the same commands,
but setting the 'naive' option to 0) and compare the results.
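For intuition, here is a rough sketch of the idea behind the naive
version (our own illustration, assuming x, xt and y as loaded above;
the actual ccgmbuild/ccgmeval code will differ in its details):
[N, K] = size(xt);                     % training cases and classes
D = size(x, 2);                        % number of attributes
mu = zeros(K, D);                      % per-class attribute means
s2 = zeros(K, D);                      % per-class variances (diagonal covariance)
prior = zeros(K, 1);                   % class priors
for k = 1:K
    idx = (xt(:, k) == 1);             % training cases of class k
    mu(k, :) = mean(x(idx, :));
    s2(k, :) = var(x(idx, :), 1);
    prior(k) = sum(idx) / N;
end
% Classify each test case by the largest log posterior:
% log p(c|x) = log p(c) + sum_d log N(x_d; mu_cd, s2_cd) + const
M = size(y, 1);
logpost = zeros(M, K);
for k = 1:K
    mk = repmat(mu(k, :), M, 1);
    vk = repmat(s2(k, :), M, 1);
    logpost(:, k) = log(prior(k)) ...
        - 0.5 * sum(log(2 * pi * vk) + ((y - mk).^2) ./ vk, 2);
end
[dummy, pred] = max(logpost, [], 2);   % predicted class per test case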
- Another predictive modelling functionality in Matlab is logistic
regression. You can make a logistic regression model with the 'glm'
function (generalised linear model). The model has one node per
output class; each node receives a linear combination of the inputs
and transforms it through an activation function to obtain the output
for that class. You have three possible choices for the
activation function: linear, logistic and softmax. Choosing the
logistic function comes down to making one logistic regression model
per class and then combining them by assigning the class with the
highest probability (something you can also do in Weka, by choosing
the 'MultiClassClassifier' as Classifier, and then choosing the
'Logistic' classifier). This however does not guarantee that the output
probabilities will add up to 1. Another approach is to
build a multiple logistic regression model, using the softmax
function. This model predicts P(class k|x) = exp(w_k .x) /
sum_j exp(w_j .x), where w_k is the weight vector associated with
class k. (Small exercise: show that in the two class case this
reduces to logistic regression.)
Try building models with both options and check which
gives the best results. This is how you do it:
First build the model data structure:
net = glm(nin,nout,'logistic')
Then train the model:
options = foptions;
options(1) = 1;
options(14) = 10;
net = glmtrain(net,options,x,xt);
options(1) = 1 means that error values are displayed during training.
options(14) sets the maximum number of iterations for the IRLS
algorithm used to fit the model to the data. Type help glmtrain
to see more details on options. To use the softmax function, replace
'logistic' with 'softmax'.
Run the model on the test data and evaluate the performance:
yf = glmfwd(net,y);
[C,rate] = confmat(yf,yt)
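To compare the two activations, a small loop like the following may
help (a suggestion of ours, not part of the tutorial files):
for act = {'logistic', 'softmax'}
    net = glm(nin, nout, act{1});      % build the model
    options = foptions;
    options(1) = 1;                    % display error values
    options(14) = 10;                  % maximum IRLS iterations
    net = glmtrain(net, options, x, xt);
    yf = glmfwd(net, y);               % forward propagate the test data
    [C, rate] = confmat(yf, yt);
    disp(act{1}); disp(rate);          % show the classification rate
end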
- Finally, we'll take a look at multi-layer perceptrons. In Weka,
only backpropagation with gradient descent is supported. Netlab
implements many of the newest training techniques, and leaves much more
choice to the user. Therefore it is more flexible and more
transparent. Train an MLP as follows:
First build the network structure. 'nin' is the number of input nodes,
10 is the number of hidden nodes, 'nout' is the number of output nodes
and 'logistic' is the type of the output activation function (you
can also use 'softmax'). Type help mlp for more details.
net = mlp(nin, 10, nout, 'logistic');
Then train the network using the scaled conjugate gradient method.
options = zeros(1,18);
options(1) = 1; % This provides display of error values
options(14) = 100; % Number of training cycles
[net,options] = netopt(net,options,x,xt,'scg');
Run the model on the test data and evaluate the performance:
yf = mlpfwd(net,y);
[C,rate] = confmat(yf,yt);
You can try to improve the performance of the algorithm by changing
the number of hidden units, number of iterations etc. You can
also try using
a different activation function ('softmax'), and different learning
algorithms (like 'conjgrad', 'quasinew' and 'graddesc'; you can
get online help on these). It may help to write a small script or
function which you can easily modify to carry out these runs (a
sketch of one is given below).
What is the best performance you can obtain?
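For instance, a helper function along these lines (our own sketch;
save it as mlprun.m, where the name and signature are hypothetical)
makes such experiments easy to repeat:
function rate = mlprun(x, xt, y, yt, nhidden, ncycles, alg)
% MLPRUN  Train an MLP and return its classification rate on test data.
nin = size(x, 2);                      % number of input nodes
nout = size(xt, 2);                    % number of output nodes
net = mlp(nin, nhidden, nout, 'logistic');
options = zeros(1, 18);
options(1) = 1;                        % display error values
options(14) = ncycles;                 % number of training cycles
[net, options] = netopt(net, options, x, xt, alg);
yf = mlpfwd(net, y);
[C, rate] = confmat(yf, yt);
Then each run is a one-liner, for example:
rate = mlprun(x, xt, y, yt, 20, 100, 'scg');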
- Netlab contains a lot of other useful functions. Type
help netlab to get an overview. RBFs, k-nearest neighbours,
Kohonen's SOM, GTM, mixture of Gaussians and k-means are all there.
To get hints on how to run these, look at the associated demo
files that are listed by the help netlab command, and/or
look at the book "NETLAB: Algorithms for Pattern Recognition" by
Ian T. Nabney (Springer, 2002).
You have now used a whole range of different classification algorithms
on the same dataset. Some of them gave great results, others did less
well. Some give clear explanations of their decisions, others act as
black boxes. Some are easy to optimise and have fast learning
times, others are more difficult and time-consuming. These are all
things to take into account when you choose classification algorithms
for your dataset. Often particular properties of the dataset and the
task will have a large influence on which algorithms are best to
use. In your project, don't hesitate to try out different algorithms
and to use the best elements of each.