January 2002. Frederick Ducatelle and Chris Williams. School of
Informatics, University of Edinburgh; revised Jan 2003.
In this tutorial we will first look at association rules, using the
APRIORI algorithm in Weka. After that we will investigate the use of
Matlab and its machine learning toolbox Netlab for data
mining.
- APRIORI works with categorical values only. Therefore we will use
a different dataset called "adult".
This dataset contains census data about
48,842 US adults. The aim is to predict whether their income exceeds
$50,000. The dataset is taken from the Delve website, and
originally came from the UCI Machine
Learning Repository. More information about it is available in
the original UCI Documentation.
Download a copy of
adult.arff and load it into Weka.
- This dataset is not immediately ready for use with
APRIORI. First, reduce its size by taking a random sample. You can do
this with the 'ResampleFilter' in the preprocess tab sheet: click on
the label under 'Filters', choose 'ResampleFilter' from the drop down
menu, set the 'sampleSizePercentage' (to 15, for example), click 'OK' and 'Add',
and click 'Apply Filters'. The 'Working relation' is now a subsample
of the original adult dataset. Now we have to get rid of the numerical
attributes. You can choose to discard them, or to discretise them. We
will discretise the first attribute ('age'): choose the
'DiscretizeFilter', set 'attributeIndices' to 'first', bins to a low
number, like 4 or 5, and the other options to 'False'. Then add this
new filter to the others. We will get rid of the
other numerical attributes: choose an 'AttributeFilter', set
'invertSelection' to 'False', and enter the indices of the remaining
numeric attributes (3,5,11-13). Apply all the filters together
now. Then click on 'Replace' to make the resulting 'Working relation'
the new 'Base relation'.
- Now go to the 'Associate' tab sheet and click under
'Associator'. Set 'numRules' to 25, and keep the other options on
their defaults. Click 'Start' and observe the results. What do you
think about these rules? Are they useful?
- From the previous results, it is obvious that some attributes should not
be examined simultaneously because they lead to trivial results. Go
back to the 'Preprocess' sheet. If you have replaced the original 'Base
relation' by the 'Working relation', you can include and exclude
attributes very easily: delete all filters from the 'Filters'
window, then remove the check mark next to the attributes you want
to get rid of and click 'Apply Filters'. You now have a new 'Working
relation'. Try to remove different combinations of the attributes that
lead to trivial association rules. Run APRIORI several times and look
for interesting rules. You will find that there is often a whole range
of rules which are all based on the same simpler rule. Also, you will
often get rules that don't include the target class. This is why in
most cases you would use APRIORI for dataset exploration rather than
for predictive modelling.
- Now, we will look at the use of Netlab for data mining. Using
Netlab has the advantage that you have all of Matlab's functionalities
at your disposal. Also, with Netlab all the machine learning algorithms
are up to date with the latest developments, and their workings are clear
(unlike in Weka, where you are often left guessing which learning
scheme is actually implemented). Start Matlab and check if Netlab is
in your path by typing 'help netlab'. If this gives an error message,
you will have to add Netlab to your path. You can do this by choosing
'Set Path' under 'File'. Click 'Add Folder' and type
/opt/matlab-7.0.4/toolbox/local/netlab . You can also
do this from the Matlab command line by typing
addpath /opt/matlab-7.0.4/toolbox/local/netlab
We will use the Landsat data again. Netlab requires yet another
format. Download the files
sattrn.txt
and
sattst.txt .
Load the training data into matrices x (the attribute values) and xt
(the class labels) and the test data into y and yt, by typing:
[x,xt,nin,nout,xndata] = datread('sattrn.txt');
[y,yt,nin,nout,yndata] = datread('sattst.txt');
The attribute values have been normalised to lie between -1 and 1, and
the class labels are coded as a 6 column matrix of zeros and ones. The
empty class 6 has been omitted.
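As a quick sanity check that the data loaded correctly (an optional
step of ours, not part of the tutorial files), you can inspect the
matrices:
size(x)      % number of training cases by number of attributes
sum(xt)      % cases per class: each column of xt is a 0/1 class indicator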
- Matlab has some useful visualisation functionalities. You can get
a histogram of a variable using hist(X,N), where X is an array
of values, and N is the number of bins in the histogram. Remember how
we said in the previous tutorial that the traditional Naive Bayes
classifier assumes that numeric attributes are normally distributed?
Try to check this assumption visually by making histograms of the
different attributes (a sketch follows at the end of this item).
Another visualisation possibility is to make
plots of two variables. As an exercise, run the script
pca_tut.m .
It projects the data onto the first two
principal components and then plots them, using a different color for
each class. Display the contents of the script by typing 'type
pca_tut' and make sure you understand how it works.
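For the histogram check mentioned above, a minimal sketch (the
attribute indices and bin count are arbitrary illustrative choices):
for i = 1:4
    subplot(2, 2, i);                  % 2-by-2 grid of plots
    hist(x(:, i), 20);                 % 20-bin histogram of attribute i
    title(sprintf('attribute %d', i));
end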
- Let's now move on to predictive modelling. Download the matlab
functions ccgmbuild.m
and
ccgmeval.m .
ccgmbuild builds a class conditional
Gaussian mixture model by fitting a single multivariate Gaussian
to the examples belonging to each class. The resulting model
can be evaluated on the test data using 'ccgmeval'. For each
test case x, the probability p(C=c|X=x) is calculated for each
class. The class that gives the highest probability is chosen. The
output of the function is a confusion matrix and a classification
rate. View the code ('type ccgmbuild' and 'type ccgmeval') and try to
understand it (a rough sketch of the underlying idea follows at the
end of this item).
Run 'ccgmbuild' on the landsat data with the 'naive' option
set to 1. This will build a naive Bayes model. The difference from
a full class-conditional Gaussian model is the assumption that all
attributes are conditionally independent given the class (so the
covariance matrix of the fitted Gaussians is diagonal).
M = ccgmbuild(x,xt,1);
[C,rate] = ccgmeval(y,yt,M)
This should give you the same result as the Naive Bayes functionality in
Weka. Compare this performance with the other predictors you used last
week. What do you think of the independence assumption with regard to
the Landsat data? Build a full Gaussian model (with the same commands,
but setting the 'naive' option to 0) and compare the results.
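For intuition, here is a rough sketch of the idea behind the naive
version (our own illustration, assuming x, xt and y as loaded above;
the actual ccgmbuild/ccgmeval code will differ in its details):
[N, K] = size(xt);                     % training cases and classes
D = size(x, 2);                        % number of attributes
mu = zeros(K, D);                      % per-class attribute means
s2 = zeros(K, D);                      % per-class variances (diagonal covariance)
prior = zeros(K, 1);                   % class priors
for k = 1:K
    idx = (xt(:, k) == 1);             % training cases of class k
    mu(k, :) = mean(x(idx, :));
    s2(k, :) = var(x(idx, :), 1);
    prior(k) = sum(idx) / N;
end
% Classify each test case by the largest log posterior:
% log p(c|x) = log p(c) + sum_d log N(x_d; mu_cd, s2_cd) + const
M = size(y, 1);
logpost = zeros(M, K);
for k = 1:K
    mk = repmat(mu(k, :), M, 1);
    vk = repmat(s2(k, :), M, 1);
    logpost(:, k) = log(prior(k)) ...
        - 0.5 * sum(log(2 * pi * vk) + ((y - mk).^2) ./ vk, 2);
end
[dummy, pred] = max(logpost, [], 2);   % predicted class per test case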
- Another predictive modelling functionality in Matlab is logistic
regression. You can make a logistic regression model with the 'glm'
function (generalised linear model). The model has one node per
output class; each node receives a linear combination of the inputs
and transforms it through an activation function to obtain the output
for that class. You have three possible choices for the
activation function: linear, logistic and softmax. Choosing the
logistic function comes down to making one logistic regression model
per class and then combining them by assigning the class with the
highest probability (something you can also do in Weka, by choosing
the 'MultiClassClassifier' as Classifier, and then choosing the
'Logistic' classifier). This however does not guarantee that the output
probabilities will add up to 1. Another approach is to
build a multiple logistic regression model, using the softmax
function. This model predicts P(class k|x) = exp(w_k .x) /
sum_j exp(w_j .x), where w_k is the weight vector associated with
class k. (Small exercise: show that in the two class case this
reduces to logistic regression.)
Try building models with both options and check which
gives the best results. This is how you do it:
First build the model data structure:
net = glm(nin,nout,'logistic')
Then train the model:
options = foptions;
options(1) = 1;
options(14) = 10;
net = glmtrain(net,options,x,xt);
options(1) = 1 means that error values are displayed during training.
options(14) sets the maximum number of iterations for the IRLS
algorithm used to fit the model to the data. Type help glmtrain
to see more details on options. To use the softmax function, replace
'logistic' with 'softmax'.
Run the model on the test data and evaluate the performance:
yf = glmfwd(net,y);
[C,rate] = confmat(yf,yt)
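To compare the two activations, a small loop like the following may
help (a suggestion of ours, not part of the tutorial files):
for act = {'logistic', 'softmax'}
    net = glm(nin, nout, act{1});      % build the model
    options = foptions;
    options(1) = 1;                    % display error values
    options(14) = 10;                  % maximum IRLS iterations
    net = glmtrain(net, options, x, xt);
    yf = glmfwd(net, y);               % forward propagate the test data
    [C, rate] = confmat(yf, yt);
    disp(act{1}); disp(rate);          % show the classification rate
end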
- Finally, we'll take a look at multi-layer perceptrons. In Weka,
only backpropagation with gradient descent is supported. Netlab
implements many of the newest training techniques, and leaves much more
choice to the user. Therefore it is more flexible and more
transparent. Train an MLP as follows:
First build the network structure. 'nin' is the number of input nodes,
10 is the number of hidden nodes, 'nout' is the number of output nodes
and 'logistic' is the type of the output activation function (you
can also use 'softmax'). Type help mlp for more details.
net = mlp(nin, 10, nout, 'logistic');
Then train the network using the scaled conjugate gradient method.
options = zeros(1,18);
options(1) = 1; % This provides display of error values
options(14) = 100; % Number of training cycles
[net,options] = netopt(net,options,x,xt,'scg');
Run the model on the test data and evaluate the performance:
yf = mlpfwd(net,y);
[C,rate] = confmat(yf,yt);
You can try to improve the performance of the algorithm by changing
the number of hidden units, number of iterations etc. You can
also try using
a different activation function ('softmax'), and different learning
algorithms (like 'conjgrad', 'quasinew' and 'graddesc'; you can
get online help on these). It may help to write a small script or
function which you can easily modify to carry out these runs (a
sketch of one is given below).
What is the best performance you can obtain?
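For instance, a helper function along these lines (our own sketch;
save it as mlprun.m, where the name and signature are hypothetical)
makes such experiments easy to repeat:
function rate = mlprun(x, xt, y, yt, nhidden, ncycles, alg)
% MLPRUN  Train an MLP and return its classification rate on test data.
nin = size(x, 2);                      % number of input nodes
nout = size(xt, 2);                    % number of output nodes
net = mlp(nin, nhidden, nout, 'logistic');
options = zeros(1, 18);
options(1) = 1;                        % display error values
options(14) = ncycles;                 % number of training cycles
[net, options] = netopt(net, options, x, xt, alg);
yf = mlpfwd(net, y);
[C, rate] = confmat(yf, yt);
Then each run is a one-liner, for example:
rate = mlprun(x, xt, y, yt, 20, 100, 'scg');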
- Netlab contains a lot of other useful functions. Type
help netlab to get an overview. RBFs, k-nearest neighbours,
Kohonen's SOM, GTM, mixture of Gaussians and k-means are all there.
To get hints on how to run these, look at the associated demo
files that are listed by the help netlab command, and/or
look at the book "NETLAB: Algorithms for Pattern Recognition" by
Ian T. Nabney (Springer, 2002).
You have now used a whole range of different classification algorithms
on the same dataset. Some of them gave great results, others did less
well. Some give clear explanations of their decisions, others act as
black boxes. Some are easy to optimise and have fast learning
times, others are more difficult and time-consuming. These are all
things to take into account when you choose classification algorithms
for your dataset. Often particular properties of the dataset and the
task will have a large influence on which algorithms are best to
use. In your project, don't hesitate to try out different algorithms
and to use the best elements of each.