January 2002. Frederick Ducatelle and Chris Williams. School of
Informatics, University of Edinburgh; revised 2003.
This tutorial is about classifiers: we will look at kNN, decision
trees, Naive Bayes and SVMs. We will mainly work in Weka, and we will
again use the Landsat data
sattrn.arff
and
sattst.arff .
- Start up Weka and read in the Landsat data. Then choose the
'Classify' tab sheet.
- We will first consider a decision tree classifier.
Click on the label under 'Classifier' and choose
'j48.J48' from the drop-down menu. This is the C4.5 decision tree
algorithm. The option 'minNumObj' defines the minimum number of
objects in the leaf nodes. There are two forms of pruning. The
traditional form uses post-pruning based on confidence intervals. This
is selected by setting the option 'unpruned' to 'False' and choosing a
confidenceFactor. The confidenceFactor should lie in (0,1).
The other form of pruning is
'reducedErrorPruning'. With this option, a validation set is set apart
before training, and the tree is evaluated on this set after
training. Then nodes are removed to maximally improve the performance
on this validation set.
Build a tree with the default pruning method
(i.e. using confidence intervals).
Under 'Test options' choose 'Supplied test set', click on
'Set...' and select 'sattst.arff'.
Try out different confidence factor values (e.g. 0.005, 0.05, 0.5).
Take a look at the size of the tree produced and the error rate, and
study the confusion matrix. Which classes cause the
difficulties in classification? Does this confirm what you expected
from the visualisation exercises we did in previous tutorials?
Record the best performance you obtained with j48.J48
and the parameter settings used to achieve it.
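The same experiment can also be scripted through Weka's Java API instead of
the GUI. The sketch below is only illustrative: it assumes a reasonably recent
Weka release, where the classifier lives in weka.classifiers.trees.J48 and
ARFF files are loaded with ConverterUtils.DataSource (the version this
tutorial was written for lists the class as 'j48.J48', so the package path may
differ for you).

// Illustrative sketch: train J48 on the Landsat training set and
// evaluate it on the supplied test set, as in the GUI steps above.
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Landsat {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("sattrn.arff").getDataSet();
        Instances test = new DataSource("sattst.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);  // class is the last attribute
        test.setClassIndex(test.numAttributes() - 1);

        J48 tree = new J48();
        tree.setConfidenceFactor(0.05f);  // try e.g. 0.005, 0.05, 0.5
        tree.setMinNumObj(2);             // minimum number of objects per leaf
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);
        System.out.println("Tree size: " + tree.measureTreeSize());
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());  // confusion matrix
    }
}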
- Now use a very small confidence factor such as
0.0005 (giving very strong pruning), and set minNumObj to something
quite large (e.g. 10).
This will leave you with a fairly simple tree which still has
acceptable performance. Such a tree illustrates one of the
strengths of decision tree algorithms: they produce classifiers which
are understandable to humans.
This can be an important asset in real-life applications (people
are seldom prepared to do what a computer program tells them if there
is no clear explanation). Also, remember that C4.5 builds the tree by
choosing the most discriminating attributes one by one. It judges
these attributes using the information gain measure, so in effect it
performs information-gain attribute selection.
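If you want to see this attribute ranking explicitly, Weka's attribute
selection classes can compute the information gain of each attribute with
respect to the class. A rough sketch, again assuming a recent Weka release
(the attribute selection package may be organised differently in older
versions):

// Illustrative sketch: rank the Landsat attributes by information gain.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InfoGainRanking {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("sattrn.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());  // information gain w.r.t. the class
        selector.setSearch(new Ranker());                     // rank all attributes
        selector.SelectAttributes(train);
        System.out.println(selector.toResultsString());
    }
}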
- Let's now have a look at Naive Bayes classification. This
classification scheme builds a Bayesian network, using two simplifying
assumptions (hence the term naive): it assumes that all predictive
attributes are conditionally independent given the class, and that no
hidden variables influence the prediction process. Choose 'NaiveBayes'
under 'Classifier'. You will see that you get to set one option,
'useKernelEstimator'. Leave this option at its default ('False') and
run the classifier.
Record the performance you obtained with Naive Bayes
and the parameter settings used to achieve it.
Compare this result with those from previous classifiers.
- Now we come back to the 'useKernelEstimator' option. In
traditional Naive Bayes classification, a numeric attribute is modelled
using a normal distribution. This is often a good choice (and an easy
one, as only the mean and the standard deviation have to be
calculated). However, sometimes the distribution is clearly
non-Gaussian. In that case it can be worthwhile to approximate the
more complex distribution shape using kernel estimation (the paper
Estimating Continuous Distributions
in Bayesian Classifiers (John and Langley, 1995)
was the source for this Weka
functionality). Try this option. Does this improve the classification?
Can you think of other reasons why Naive Bayes would not perform very
well on this particular data set?
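If you are scripting the experiments, toggling the kernel estimator
corresponds to a single setter on the classifier. A minimal sketch, assuming a
recent Weka release where the class lives in weka.classifiers.bayes.NaiveBayes:

// Illustrative sketch: evaluate Naive Bayes on the Landsat test set,
// once with the default normal-distribution model and once with
// kernel estimation for the numeric attributes.
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesLandsat {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("sattrn.arff").getDataSet();
        Instances test = new DataSource("sattst.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        for (boolean useKernel : new boolean[] {false, true}) {
            NaiveBayes nb = new NaiveBayes();
            nb.setUseKernelEstimator(useKernel);
            nb.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(nb, test);
            System.out.println("useKernelEstimator=" + useKernel + ": "
                    + eval.pctCorrect() + "% correct");
        }
    }
}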
- Now let's look at a kNN classifier. Click on the label under
'Classifier', and then choose 'IBk' from the drop-down menu. You will
get a range of possible options. 'KNN' is the most important: it
defines the number of neighbours. The
'distanceWeighting' option allows you to adapt the influence of the
neighbours according to their distance. 'noNormalization' means that
the attributes are not normalised before classification.
Try out different values for 'KNN' (using default values for the other
options). You might choose values of 1, 3 and 5.
Under 'Test options' choose 'Supplied test set', click on
'Set...' and select 'sattst.arff'. You should be able to get very good
results (certainly if you consider the fact that there are 6 different
classes, and always choosing the most common class would only
give you 24%). Can you think of any features of this dataset that make
it particularly appropriate for classification with kNN?
Record the best performance you obtained with kNN
and the parameter settings used to achieve it.
You can choose to let the program
find the best value for k automatically, by
cross-validation (setting 'crossValidate' to
'True'). 'KNN' is then the maximum number of neighbours to be tried
out. Obviously this option leads to longer running times.
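The kNN runs can be scripted in the same way; the sketch below (assuming a
recent Weka release, where the class lives in weka.classifiers.lazy.IBk) loops
over a few values of k with default settings for the other options:

// Illustrative sketch: evaluate IBk (kNN) on the Landsat test set
// for several values of k.
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnLandsat {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("sattrn.arff").getDataSet();
        Instances test = new DataSource("sattst.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        for (int k : new int[] {1, 3, 5}) {
            IBk knn = new IBk(k);  // k nearest neighbours
            knn.buildClassifier(train);
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(knn, test);
            System.out.println("k=" + k + ": " + eval.pctCorrect() + "% correct");
        }
    }
}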
- Weka has only limited support for SVMs: only polynomial kernels are
possible, and no regression is supported (only 2-class
classification). Therefore, if you want to try SVMs, it is better
to use the special-purpose program SVMTorch. Read through the short
introduction to SVMTorch in A small user
guide for SVMTorch. The program uses a different data format
from Weka and XGobi; see the files
sattrn.svm
and
sattst.svm .
In these files, I left out the empty
class 6. Also, the numbering of the classes starts at 0. Build a
classifier for the Landsat data by typing:
SVMTorch -multi -c 10 -t 2 -std 15 sattrn.svm model
In this line, the option -multi indicates that you have more
than two classes. The option -c is the parameter C in
the non-separable SVM optimisation formula (the trade-off between
maximising the margin and minimising the slack variables). The option -t
chooses the kernel function (2 is Gaussian), and the other options set
the parameters of the kernel function (for a Gaussian kernel, there is
only one parameter: the standard deviation). The rest of the command
line specifies
the training data file and the name for the files that will contain the
developed models.
Now test the model by typing:
SVMTest -multi model sattst.svm
How good is the result? How does it compare to the other
classification schemes? Try to improve the classification rate by
fine-tuning the parameters -c and -std [Hint: you might
explore making std larger]. As the
performance improves, you will also see that the classification goes
faster.
Record the best performance you obtained with SVMs
and the parameter settings used to achieve it.
- You have now trained and tested different classification schemes
by running them one by one and comparing them. When you are doing your
project, it could be useful to automate this process: to have
different algorithms (or the same algorithm with different parameter
values) trained and compared to each other in batch. Weka has a
special functionality for this, the 'Experimenter'. It can be started
from the GUI chooser (the little window with the bird you get at
the beginning). Follow the Weka
Experimenter tutorial, and make sure you understand how the
environment works. Try, for example, to compare different values of k
in the kNN algorithm: are the performance differences statistically
significant?