In this final lab we consider unsupervised learning in the form of clustering methods and principal component analysis (PCA), as well as more thorough performance evaluation of classifiers.
We first consider clustering of the Landsat data. You will need the following dataset for this task:
To refresh your memory about the Landsat data you can read this description. Since there are 6 classes in the data, it would be interesting to try clustering with k=6 centres. Load the Landsat data into Weka and go to the Cluster tab:
Select the SimpleKMeans clusterer, bring up its options window and set numClusters to 6.
In the Cluster mode panel, select Classes to clusters evaluation and hit Start.
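To make the clustering step less of a black box, here is a minimal sketch of the basic k-means procedure (Lloyd's algorithm) in plain Python. It is not Weka's SimpleKMeans implementation, which may differ in initialisation, attribute normalisation and stopping rule. The function name kmeans, its parameters and the toy points are assumptions made for illustration only.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Initialise the centres at k data points chosen at random.
    centres = rng.sample(points, k)
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centre
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centre if a cluster becomes empty
                centres[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centres, assignment


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, labels = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary: only which points share a number is meaningful, which is why the lab compares clusters to classes through a confusion matrix rather than by comparing labels directly.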
Ideally, we would hope to see all instances from a single class assigned to a single cluster, and no instances from different classes assigned to the same cluster. Look at the Classes to Clusters confusion matrix. Clearly, we don't have a perfect correspondence between classes and clusters, but:
Visualize the cluster assignments. To do this, right-click on the clusterer in the Result list panel and select Visualize cluster assignments. Plot Class against Cluster. All the data points will lie on top of each other, so move the Jitter slider to about half way to add random noise to each point. This lets us see more clearly where the bulk of the data points lie. In this scatter plot each row represents a class and each column a cluster.
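The Class against Cluster plot is a picture of a contingency table. As a rough sketch, the table can be built as below, with one row per class and one column per cluster, as in the scatter plot. The labels are made-up toy values, and the helper names classes_to_clusters and majority_class are hypothetical. Labelling each cluster with its majority class is a simplification and need not match the class-to-cluster assignment Weka reports.

```python
from collections import Counter


def classes_to_clusters(classes, clusters):
    """Count how many instances of each class fall into each cluster."""
    counts = Counter(zip(classes, clusters))
    class_names = sorted(set(classes))
    cluster_ids = sorted(set(clusters))
    # table[i][j] = number of instances of class i assigned to cluster j
    table = [[counts[(c, k)] for k in cluster_ids] for c in class_names]
    return class_names, cluster_ids, table


def majority_class(class_names, table, column):
    """Return the class with the most instances in the given cluster column."""
    best_row = max(range(len(class_names)), key=lambda i: table[i][column])
    return class_names[best_row]


# Hypothetical toy labels, not taken from the Landsat data.
classes = ["a", "a", "a", "b", "b", "c", "c", "c"]
clusters = [0, 0, 1, 1, 1, 2, 2, 0]

class_names, cluster_ids, table = classes_to_clusters(classes, clusters)
majorities = [majority_class(class_names, table, j)
              for j in range(len(cluster_ids))]
# Instances that do not belong to the majority class of their cluster.
mismatched = len(classes) - sum(max(row[j] for row in table)
                                for j in range(len(cluster_ids)))

for name, row in zip(class_names, table):
    print(name, row)
print(majorities, mismatched)
```

A perfect correspondence would give a table with exactly one non-zero entry in every row and every column.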
We now consider PCA. Instead of selecting a subset of attributes, PCA allows us to construct a new set of features, each of which is a linear combination of the original attributes. We will use the Landsat satellite imaging set. Acquaint yourself with this by reading this description. Load up the Landsat data.
On the Attribute Selection tab, choose the PrincipalComponents attribute evaluator. Use the Ranker search method, but set the numToSelect option to -1. Hit Start and observe the output.
On the Result list (at the bottom-left corner of the Attribute Selection tab), right-click on the PCA run and select Visualize transformed data.
Look at the scatter plots of the PCA attributes against each other.
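To see what "a linear combination of the attributes" means in practice, the following sketch finds the first principal component of a tiny data set using power iteration on the sample covariance matrix. It uses only the Python standard library and is not how Weka computes PCA. Weka's PrincipalComponents evaluator may also standardise the attributes first, which this sketch does not do. The function name and the toy data are assumptions for illustration.

```python
import math


def first_principal_component(rows, iterations=200):
    """Leading eigenvector of the sample covariance matrix by power iteration.

    Assumes the data are not constant, so the covariance matrix is non-zero.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centred = [[x - m for x, m in zip(row, means)] for row in rows]
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d  # arbitrary starting direction
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalise to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance explained by this direction (the Rayleigh quotient v' C v).
    variance = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                   for i in range(d))
    # New feature: projection of each centred instance onto v.
    scores = [sum(a * b for a, b in zip(r, v)) for r in centred]
    return v, variance, scores


# Hypothetical toy data lying exactly on the line y = 2x.
rows = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, variance, scores = first_principal_component(rows)
print(direction, variance)
```

Here the entries of direction are the weights of the linear combination, and scores is the new feature that replaces the two original attributes. Because the toy points lie on a line, a single component captures all of the variance.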
We can consider training a classifier with the PCA feature representation. Do you expect the resulting model to achieve better performance than when trained on the original set of features?
If you have time, go to the Preprocess tab and select Choose > filters > supervised > attribute > AttributeSelection. Bring up the AttributeSelection options window and set the evaluator field to PrincipalComponents and the search field to Ranker. Click OK and then Apply. This filter will replace the original features with the PCA features.
We will continue our performance assessment on the Splice data set which was originally introduced in Lab 3. As a reminder: the classification task is to identify intron and exon boundaries on gene sequences. Read the description at the link above for a brief overview of how this works. The class attribute can take on 3 values: N, IE and EI. Now download the data sets below, converted into ARFF for you, and load the training set into Weka:
Set KNN to the value that gave the greatest percentage of correctly classified instances in Lab 3, Question 5 and re-run the classifier on the held-out test data set splice_test.arff.
We wish to understand the output from Weka with respect to TP rate, FP rate, Precision, Recall etc. Note that these are output on a per-class basis. As the Splice data is a three-class problem, we can consider each of the classes (N, EI, IE) in turn as the "positive" class, and lump the remaining two together as the negative class. Thus to compute TP rate etc. we need to reduce the 3 x 3 confusion matrix to a 2 x 2 confusion matrix. So if the positive class is EI, this will mean lumping together the results for the N and IE classes.
Reduce the 3 x 3 confusion matrix to a 2 x 2 confusion matrix using EI as the positive class. To check your reasoning, compute TPR = TP/(TP+FN), FPR = FP/(FP+TN) and Precision = TP/(TP+FP) and check them against the results output by Weka. Is the proportion of points correctly classified a sufficient measure of performance? Hint: Consider the other data sets that we've looked at. What does a false positive mean in identifying spam? Is this as bad as a false negative?
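The reduction can be sketched in a few lines of Python. The counts below are invented purely for illustration and are not results from the Splice data. The function name one_vs_rest is hypothetical, and the matrix is assumed to have one row per actual class and one column per predicted class.

```python
def one_vs_rest(matrix, labels, positive):
    """Collapse a multi-class confusion matrix to TP, FN, FP, TN.

    matrix[i][j] is the number of instances whose actual class is labels[i]
    and whose predicted class is labels[j].
    """
    p = labels.index(positive)
    tp = matrix[p][p]
    fn = sum(matrix[p]) - tp                    # rest of the positive row
    fp = sum(row[p] for row in matrix) - tp     # rest of the positive column
    tn = sum(map(sum, matrix)) - tp - fn - fp   # everything else
    return tp, fn, fp, tn


# Hypothetical counts, NOT the output of any classifier on the Splice data.
labels = ["N", "EI", "IE"]
matrix = [
    [90, 5, 5],   # actual N
    [10, 40, 0],  # actual EI
    [8, 2, 40],   # actual IE
]

tp, fn, fp, tn = one_vs_rest(matrix, labels, positive="EI")
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
precision = tp / (tp + fp)
print(tp, fn, fp, tn)
print(tpr, fpr, precision)
```

Repeating the call with positive="N" and positive="IE" gives the other two rows of the per-class output.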
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh