DME Lab Class 3

This tutorial is about classifiers: we will look at kNN, decision trees and Naive Bayes. Next week we will cover logistic regression and SVMs. We will work in Weka, again using the Landsat data sattrn.arff and sattst.arff.

  1. Start up Weka and read in the Landsat data (training set). Then choose the 'Classify' tab sheet.
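
    If you prefer to script these steps rather than use the Explorer GUI, the same data can be loaded through Weka's Java API. The sketch below is a minimal example, not part of the required exercise: it assumes Weka is on your classpath, that sattrn.arff and sattst.arff are in the working directory, and that the class label is the last attribute; the class name LoadLandsat is arbitrary.

        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class LoadLandsat {
            public static void main(String[] args) throws Exception {
                // Load the training and test sets (paths are assumed to be
                // relative to the current working directory).
                Instances train = new DataSource("sattrn.arff").getDataSet();
                Instances test  = new DataSource("sattst.arff").getDataSet();
                // The class attribute is assumed to be the last column in both files.
                train.setClassIndex(train.numAttributes() - 1);
                test.setClassIndex(test.numAttributes() - 1);
                System.out.println("Training instances: " + train.numInstances());
                System.out.println("Test instances:     " + test.numInstances());
            }
        }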

  2. We will first consider a decision tree classifier. Click on the 'Choose' button and select the J48 classifier under the 'trees' directory. This is the C4.5 decision tree algorithm. The option 'minNumObj' sets the minimum number of instances in the leaf nodes. There are two forms of pruning. The traditional form uses post-pruning based on confidence intervals: it is selected by setting the option 'unpruned' to false and choosing a confidenceFactor, which should lie in (0,1). Small confidence values produce more heavily pruned trees. The other form of pruning is 'reducedErrorPruning'. With this option, a validation set is held out before training and the tree is evaluated on it after training; nodes are then removed so as to maximally improve the performance on this validation set.

    Under 'Test options' choose 'Supplied test set', click on 'Set...' and select 'sattst.arff'. Build a tree with the default pruning method (i.e. using confidence intervals). Try out different confidence factor values (e.g. 0.005, 0.01, 0.05, 0.1, 0.5). Look at the size of the tree produced and the error rate, and study the confusion matrix. Which classes cause difficulties in classification? Does this confirm what you expected from the visualisation exercises in previous tutorials?

    Record the best performance you obtained with J48 and the parameter settings used to achieve it.
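
    The sweep over confidence factors can also be done from code. The sketch below (a non-authoritative example using the Weka Java API; data loading as in step 1, class name arbitrary) builds one J48 tree per confidence factor, evaluates it on the supplied test set, and prints the tree size, error rate and confusion matrix.

        import weka.classifiers.Evaluation;
        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class J48Sweep {
            public static void main(String[] args) throws Exception {
                Instances train = new DataSource("sattrn.arff").getDataSet();
                Instances test  = new DataSource("sattst.arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);
                test.setClassIndex(test.numAttributes() - 1);

                double[] confidenceFactors = {0.005, 0.01, 0.05, 0.1, 0.5};
                for (double c : confidenceFactors) {
                    J48 tree = new J48();
                    tree.setConfidenceFactor((float) c); // smaller value -> heavier pruning
                    tree.setMinNumObj(2);                // Weka's default minimum leaf size
                    tree.buildClassifier(train);

                    Evaluation eval = new Evaluation(train);
                    eval.evaluateModel(tree, test);
                    System.out.println("confidenceFactor = " + c
                            + "  tree size = " + tree.measureTreeSize()
                            + "  error rate = " + eval.errorRate());
                    System.out.println(eval.toMatrixString("Confusion matrix:"));
                }
            }
        }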

  3. Now use a very small confidence factor such as 0.0005 (giving very strong pruning), and set minNumObj to something quite large (e.g. 20). This will leave you with a fairly simple (heavily pruned) tree which still has acceptable performance. Such a tree highlights one of the strengths of decision tree algorithms: they produce classifiers that are understandable to humans. This can be an important asset in real-life applications (people are seldom prepared to act on what a computer program tells them if there is no clear explanation). Also, remember that C4.5 builds the tree by choosing the most discriminating attributes one by one, judging them with an information-gain-based measure. In effect, it performs information gain attribute selection, so you should be able to find the attributes ranked highest in previous labs among the inner nodes of the decision tree.
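
    To see this link explicitly, you can print the heavily pruned tree and an information gain ranking of the attributes side by side and compare the top-ranked attributes with the splits near the root. The sketch below is one possible way to do this with the Weka Java API (same loading assumptions as before; class name arbitrary).

        import weka.attributeSelection.AttributeSelection;
        import weka.attributeSelection.InfoGainAttributeEval;
        import weka.attributeSelection.Ranker;
        import weka.classifiers.trees.J48;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class PrunedTreeVsInfoGain {
            public static void main(String[] args) throws Exception {
                Instances train = new DataSource("sattrn.arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);

                // A heavily pruned, human-readable tree.
                J48 tree = new J48();
                tree.setConfidenceFactor(0.0005f);
                tree.setMinNumObj(20);
                tree.buildClassifier(train);
                System.out.println(tree);  // prints the tree structure as text

                // Rank attributes by information gain for comparison.
                AttributeSelection selector = new AttributeSelection();
                selector.setEvaluator(new InfoGainAttributeEval());
                selector.setSearch(new Ranker());
                selector.SelectAttributes(train);
                for (double[] entry : selector.rankedAttributes()) {
                    int attIndex = (int) entry[0];
                    System.out.println(train.attribute(attIndex).name()
                            + "  info gain = " + entry[1]);
                }
            }
        }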

  4. Let's now have a look at Naive Bayes classification. This classification scheme builds a Bayesian network under two simplifying assumptions (hence the term naive): all predictive attributes are conditionally independent given the class, and no hidden variables influence the prediction process. Choose 'NaiveBayes' under the 'bayes' node. You will see that there is one option of interest, 'useKernelEstimator'. Leave this option at its default ('False') and run the classifier.

    Record the performance you obtained with Naive Bayes. Compare this result with those from previous classifiers.
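
    The same run can be reproduced from code. The sketch below (again an illustrative use of the Weka Java API, with the same loading assumptions and an arbitrary class name) trains NaiveBayes with the kernel estimator left off and prints the accuracy and confusion matrix on the test set.

        import weka.classifiers.Evaluation;
        import weka.classifiers.bayes.NaiveBayes;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class NaiveBayesRun {
            public static void main(String[] args) throws Exception {
                Instances train = new DataSource("sattrn.arff").getDataSet();
                Instances test  = new DataSource("sattst.arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);
                test.setClassIndex(test.numAttributes() - 1);

                NaiveBayes nb = new NaiveBayes();
                nb.setUseKernelEstimator(false);   // the default: one Gaussian per numeric attribute
                nb.buildClassifier(train);

                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(nb, test);
                System.out.println("Accuracy: " + eval.pctCorrect() + "%");
                System.out.println(eval.toMatrixString("Confusion matrix:"));
            }
        }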

  5. Now we come back to the 'useKernelEstimator' option. In traditional Naive Bayes classification, a numeric attribute is modelled with a normal distribution. This is often a good choice (and an easy one, as only the mean and the standard deviation have to be estimated). Sometimes, however, the distribution is clearly non-Gaussian. In that case it can be worthwhile to approximate the more complex distribution shape using kernel density estimation (the paper Estimating Continuous Distributions in Bayesian Classifiers (John and Langley, 1995) is the source for this Weka functionality). Try this option. Does it improve the classification? Can you think of other reasons why Naive Bayes might not perform very well on this particular data set?
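
    To compare the Gaussian and kernel-density variants directly, you can toggle the option in a small loop. This fragment is meant to replace the single-run code inside the main method of the previous sketch, with 'train' and 'test' loaded as before.

        // Inside the main method, with 'train' and 'test' loaded as in step 1:
        for (boolean useKernel : new boolean[] {false, true}) {
            NaiveBayes nb = new NaiveBayes();
            nb.setUseKernelEstimator(useKernel);   // false = Gaussian, true = kernel density
            nb.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(nb, test);
            System.out.println("useKernelEstimator = " + useKernel
                    + "  accuracy = " + eval.pctCorrect() + "%");
        }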

  6. Now let's look at a kNN classifier. Click on the 'Choose' button and select 'IBk' under the 'lazy' node. You will get a range of possible options. 'KNN' is the most important: it defines the number of neighbours. The 'distanceWeighting' option lets you adapt the influence of the neighbours according to their distance, and 'noNormalization' means that the attributes are not normalised before classification.

    Try out different values for 'KNN' (using default values for the other options). You might choose values of 1, 3 and 5. You should be able to get very good results (especially considering that there are 6 different classes, so always choosing the most common class would only give you 24%). Can you think of any features of this dataset that make it particularly appropriate for classification with kNN?

    Record the best performance you obtained with kNN and the parameter settings used to achieve it.

    You can let the program find the best value for k automatically by cross-validation (setting 'crossValidate' to 'True'). 'KNN' is then the maximum number of neighbours to try. This option obviously increases the running time. What value of k gives the best results?
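
    Programmatically, the sweep over k and the automatic selection by cross-validation might look like the sketch below (same assumptions as the earlier sketches; class name arbitrary).

        import weka.classifiers.Evaluation;
        import weka.classifiers.lazy.IBk;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class KnnSweep {
            public static void main(String[] args) throws Exception {
                Instances train = new DataSource("sattrn.arff").getDataSet();
                Instances test  = new DataSource("sattst.arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);
                test.setClassIndex(test.numAttributes() - 1);

                // Try a few fixed values of k.
                for (int k : new int[] {1, 3, 5}) {
                    IBk knn = new IBk(k);
                    knn.buildClassifier(train);
                    Evaluation eval = new Evaluation(train);
                    eval.evaluateModel(knn, test);
                    System.out.println("k = " + k + "  accuracy = " + eval.pctCorrect() + "%");
                }

                // Let IBk pick k <= 20 by cross-validation on the training set.
                IBk auto = new IBk();
                auto.setKNN(20);
                auto.setCrossValidate(true);
                auto.buildClassifier(train);
                Evaluation evalAuto = new Evaluation(train);
                evalAuto.evaluateModel(auto, test);
                System.out.println(auto);  // the model description should report the selected k
                System.out.println("Auto-selected k accuracy = " + evalAuto.pctCorrect() + "%");
            }
        }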

  7. You have now trained and tested different classification schemes by running them one by one and comparing them. When you are doing your project, it can be useful to automate this process: to have different algorithms (or the same algorithm with different parameter values) trained and compared in batch. Weka has a dedicated tool for this, the 'Experimenter'. It can be started from the GUI chooser (the small window with the bird logo that appears at start-up). Follow the Weka Experimenter tutorial, and make sure you understand how the environment works. Try, for example, comparing different values of k in the kNN algorithm: are the performance differences statistically significant?

    The Experimenter tutorial is quite long. Feel free to go through it in your own time and contact me with any questions.
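
    The Experimenter itself is a GUI tool, but the core idea, running several configurations over cross-validation and comparing the results, can also be sketched in code. The fragment below (same assumptions as the earlier sketches; class name arbitrary) compares three values of k with 10-fold cross-validation on the training set. Note that it does not perform the corrected paired t-test that the Experimenter uses to judge statistical significance.

        import java.util.Random;
        import weka.classifiers.Evaluation;
        import weka.classifiers.lazy.IBk;
        import weka.core.Instances;
        import weka.core.converters.ConverterUtils.DataSource;

        public class BatchCompare {
            public static void main(String[] args) throws Exception {
                Instances train = new DataSource("sattrn.arff").getDataSet();
                train.setClassIndex(train.numAttributes() - 1);

                for (int k : new int[] {1, 3, 5}) {
                    Evaluation eval = new Evaluation(train);
                    // 10-fold cross-validation with a fixed random seed for repeatability.
                    eval.crossValidateModel(new IBk(k), train, 10, new Random(1));
                    System.out.println("k = " + k
                            + "  cross-validated accuracy = " + eval.pctCorrect() + "%");
                }
            }
        }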

