We begin this lab by re-examining the spam filtering problem from Lab 1. This time, we train a Logistic Regression model and a linear Support Vector Machine for the spam or non-spam classification task. In the second part of the lab we examine both visualisation and more rigorous methods for feature selection.
Start up Weka, select the Explorer interface, load the preprocessed Spambase data set from Lab 1 (in which all attributes have been converted to Boolean), and randomize the instances.
Cheat: If you did not save this data set, download it here.
Now it's time to train our classifiers. The task is to classify e-mails as spam or non-spam, and we evaluate the performance of Logistic Regression and Support Vector Machines on this task. Go to the Classify tab and select Choose > functions > SimpleLogistic. In the Test options panel, select Percentage split and set it to 10%, so that only 10% of the instances are used for training and the rest for testing. This saves us from waiting while Weka works hard on a large data set.
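To make the effect of the percentage split concrete, here is a minimal Python sketch of the idea: shuffle the instances, keep the first 10% for training and evaluate on the remainder. The function name, the seed, and the rounding rule are assumptions made for illustration, not a description of Weka's internals. The figure of 4601 instances is the size of the original UCI Spambase data, and your Lab 1 file may differ.

```python
import random

def percentage_split(instances, train_percent, seed=1):
    """Shuffle a copy of the data, then cut it into train and test parts.

    train_percent is the share of instances used for TRAINING;
    everything else is held out for testing.
    """
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]

# Stand-in for the e-mails: instance indices only (4601 is an assumption).
instances = list(range(4601))
train, test = percentage_split(instances, 10)
print(len(train), "training instances,", len(test), "test instances")
```

With so little training data the model trains quickly, and the accuracy reported by Weka is measured on the much larger held-out part.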
Click Start to train the model. Examine the Classifier output frame to view information for the model you've just trained and try to answer the following questions:
What is the percentage of correctly classified instances?
How do the regression coefficients for class 1 relate to the ones for class 0? Can you derive this result from the form of the Logistic Regression model?
Write down the coefficients for class 1 for the attributes [word_freq_hp_binarized] and [char_freq_$_binarized]. Generally, we would expect the string $ to appear in spam, and the string hp to appear in non-spam e-mails, as the data was collected from HP Labs. Do the regression coefficients make sense given that class 1 is spam? Hint: Consider the sigmoid function and how it transforms values into a probability between 0 and 1. Since our attributes are boolean, a positive coefficient can only increase the total sum fed through the sigmoid and thus move the output of the sigmoid towards 1. What can happen if we have continuous, real-valued attributes?
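The hint about the sigmoid can be tried out in a few lines of Python. This is only a sketch: the coefficient and intercept values below are made up for illustration and are not taken from any Weka output, and `p_spam` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def p_spam(x, coef, intercept):
    """Probability of class 1 (spam) under a logistic regression model."""
    z = intercept + sum(coef[name] * value for name, value in x.items())
    return sigmoid(z)

# Hypothetical values, chosen only to illustrate the sign of each effect.
coef = {"char_freq_$_binarized": 1.5, "word_freq_hp_binarized": -2.0}
intercept = -0.5

neither = {"char_freq_$_binarized": 0, "word_freq_hp_binarized": 0}
dollar_only = {"char_freq_$_binarized": 1, "word_freq_hp_binarized": 0}
hp_only = {"char_freq_$_binarized": 0, "word_freq_hp_binarized": 1}

for label, x in [("neither", neither), ("$ only", dollar_only), ("hp only", hp_only)]:
    print(f"{label:8s} P(spam) = {p_spam(x, coef, intercept):.3f}")
```

Because a Boolean attribute is either 0 or 1, switching it on adds exactly its coefficient to the sum, so the sign of the coefficient decides the direction in which the probability moves. With a real-valued attribute the contribution is the coefficient times the attribute value, so a negative value reverses the direction.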
We will now train a Support Vector Machine (SVM) on our classification task. In the Classify tab, select Choose > functions > SMO (SMO stands for Sequential Minimal Optimization, which is an algorithm for training SVMs). Use the default parameters and click Start. This will train a linear SVM (which is quite similar to logistic regression). Again, examine the Classifier output frame and try answering the following:
What is the percentage of correctly classified instances? How does it compare to the result from Logistic Regression?
What are the coefficients for the attributes [word_freq_hp_binarized] and [char_freq_$_binarized]? Compare these to the ones you found with Logistic Regression.
How does a linear SVM relate to Logistic Regression? Hint: Consider the classification boundary learnt in each model.
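As a starting point for thinking about the boundary, the sketch below applies both decision rules to the same linear score. The weights and bias are hypothetical and shared between the two rules purely for illustration. In practice each model fits its own weights, using a different training objective.

```python
import math
from itertools import product

def linear_score(x, w, b):
    """The quantity both models threshold: b + w . x"""
    return b + sum(wi * xi for wi, xi in zip(w, x))

def svm_predict(x, w, b):
    # A linear SVM classifies by the sign of the score.
    return 1 if linear_score(x, w, b) >= 0 else 0

def logistic_predict(x, w, b):
    # Logistic regression turns the score into a probability first,
    # then predicts class 1 when that probability is at least 0.5.
    p = 1.0 / (1.0 + math.exp(-linear_score(x, w, b)))
    return 1 if p >= 0.5 else 0

w, b = [1.5, -2.0], -0.5  # hypothetical weights and bias

for x in product([0, 1], repeat=2):
    print(x, "SVM:", svm_predict(x, w, b), "LR:", logistic_predict(x, w, b))
```

Given identical weights, the two rules agree on every input, since the sigmoid is at least 0.5 exactly when the score is non-negative. Any difference between the two trained models therefore comes from the weights each one learns.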
We will now look at a few ways of assessing the performance of a classifier. To do so we will introduce a new data set, the Splice data set. The classification task is to identify intron and exon boundaries on gene sequences. Read the description at the link above for a brief overview of how this works. The class attribute can take on 3 values: N, IE and EI. Now download the data sets below, converted into ARFF for you, and load the training set into Weka:
We'll also use a new classifier. Under the Classify tab, select classifiers > lazy > IBk. This is a k-nearest neighbour classifier.
In the Test options panel, select Use training set and hit Start.
Observe the output of the classifier and consider the following:
Now evaluate the classifier on the test set and check your expectations. In the Test options panel, select Supplied test set and load the file splice_test.arff. In the Result list panel, right-click on the classifier and select Re-evaluate model on current test set. Observe the output and consider the following:
Now explore the effect of the k parameter. To do this, train the classifier multiple times, each time setting the KNN option to a different value. Try 5, 10, 100 and 1000, and test the classifier on the test set each time. Hint: To change the KNN option you need to bring up the options panel of the classifier.
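The experiment above can be mimicked on a toy scale with a small k-nearest neighbour sketch in Python. Everything here is an assumption made for illustration: the four-letter sequences and their labels are invented rather than taken from the Splice data, and the distance is a plain count of mismatching positions, which is not necessarily what IBk uses.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

def knn_predict(train, query, k):
    """Majority vote among the k training instances closest to query."""
    neighbours = sorted(train, key=lambda inst: hamming(inst[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def accuracy(train, data, k):
    """Fraction of instances in data whose predicted label is correct."""
    correct = sum(knn_predict(train, x, k) == y for x, y in data)
    return correct / len(data)

# Invented toy data: (sequence, class label)
train = [("ACGT", "N"), ("ACGA", "N"), ("ACCA", "N"),
         ("TTGA", "EI"), ("TTGC", "EI")]
test = [("TTGT", "EI"), ("ACCT", "N")]

for k in (1, 3, 5):
    print(f"k={k}: train accuracy {accuracy(train, train, k):.2f}, "
          f"test accuracy {accuracy(train, test, k):.2f}")
```

With k = 1, every training instance is its own nearest neighbour, so accuracy on the training set says little about how the classifier generalises. Once k reaches the size of the training set, every query receives the majority class, whatever its attributes.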