In this lab we first revisit the spam filtering problem from Lab 1. This time, we train a Logistic Regression model and a linear Support Vector Machine for the spam versus non-spam classification task. In the second part of the lab we examine both visualisation and more rigorous methods for feature selection.
Start up Weka, select the Explorer interface and load the preprocessed Spambase data set from Lab 1, where all attributes are converted to Boolean.
Cheat: If you did not save this data set, download it here.
To save us waiting while Weka works hard on a large data set, we first subsample the data points to reduce their number. First we need to rearrange the data points in a random order. This is important because we want to remove a proportionally equal number of data points from each class, and the default ordering has all the positive examples first. On the Preprocess tab select Choose > filters > unsupervised > instance > Randomize. Write down the random seed that is used to randomize the order of the data and click the Apply button on the right-hand side of the window.
Hint: Once you've selected the Randomize filter you can click on the command line Randomize -S 42 (next to Choose) to bring up the options menu for this filter.
We will now select a subset of the data, using the RemoveFolds filter. We want to keep 10% of the examples in the data set. After selecting the filter, click the command line next to the Choose button to view and change its parameters. The default options (create 10 folds and select the first) are what we need, so click OK, and then Apply. Now save this data set as spambase_binary_fold1.arff. It should contain 461 instances.
It makes sense to use the remaining data as a validation set. Reload the full binarized dataset spambase_binary.arff and again randomize it (using the same seed). Again select the RemoveFolds filter but this time set invertSelection to True in the filter's options menu and then click Apply. This produces 4140 instances. Now save this dataset as spambase_binary_fold2-10.arff.
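The reason for reusing the seed can be seen in a minimal sketch of the shuffle-then-split procedure. This is plain Python, not Weka's implementation: the function name `split_first_fold` is hypothetical, the seed 42 is taken from the hint above, and the total of 4601 instances is inferred from 461 + 4140. Python's shuffle will not reproduce Weka's ordering; only the sizes and the complementarity of the two subsets carry over.

```python
import random

def split_first_fold(instances, num_folds=10, seed=42):
    """Shuffle with a fixed seed, then split off the first of num_folds folds."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    # When the size is not a multiple of num_folds, give the first fold
    # one extra instance (assumption made to match the counts in the text).
    base, extra = divmod(len(shuffled), num_folds)
    fold_size = base + (1 if extra > 0 else 0)
    return shuffled[:fold_size], shuffled[fold_size:]

data = range(4601)  # stand-in instance ids
fold1, rest = split_first_fold(data)
print(len(fold1), len(rest))  # 461 4140

# Same seed: the two subsets never share an instance.
assert set(fold1).isdisjoint(rest)

# Different seed for the second pass: the "remaining" data now leaks
# instances that were already placed in fold 1.
_, rest_other_seed = split_first_fold(data, seed=7)
overlap = len(set(fold1) & set(rest_other_seed))
print(overlap)
```

If the second randomization used a different seed, some training instances would also appear in the held-out set, and the evaluation would be optimistic.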
Now it's time to train our classifiers. The task is to classify e-mails as spam or non-spam and we evaluate the performance of Logistic Regression and Support Vector Machines on this task. Reload the spambase_binary_fold1.arff data set, as this is what we will use during training. Go to the Classify tab and select Choose > functions > SimpleLogistic. In the Test options frame select Supplied test set, click the Set button and load spambase_binary_fold2-10.arff.
Click Start to train the model. Examine the Classifier output frame to view information for the model you've just trained and try to answer the following questions:
What is the percentage of correctly classified instances?
How do the regression coefficients for class 1 relate to the ones for class 0? Can you derive this result from the form of the Logistic Regression model?
Write down the coefficients for class 1 for the attributes [word_freq_hp_binarized] and [char_freq_$_binarized]. Generally, we would expect the string $ to appear in spam, and the string hp to appear in non-spam e-mails, as the data was collected from HP Labs. Do the regression coefficients make sense given that class 1 is spam? Hint: Consider the sigmoid function and how it transforms values into a probability between 0 and 1. Since our attributes are Boolean, a positive coefficient can only increase the total sum fed through the sigmoid and thus move the output of the sigmoid towards 1. What can happen if we have continuous, real-valued attributes?
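The hint about the sigmoid can be made concrete with a small sketch. The intercept and coefficient values below are hypothetical, chosen only for illustration; they are not Weka output, so compare them with the values you wrote down rather than treating them as the answer.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for illustration only (not taken from Weka).
intercept = -0.5
coefficients = {
    "char_freq_$_binarized": 1.8,
    "word_freq_hp_binarized": -2.3,
}

def p_spam(active_attributes):
    """P(class = 1) when the listed Boolean attributes are 1 and all others 0."""
    score = intercept + sum(coefficients[a] for a in active_attributes)
    return sigmoid(score)

baseline = p_spam([])
with_dollar = p_spam(["char_freq_$_binarized"])
with_hp = p_spam(["word_freq_hp_binarized"])

# Switching on an attribute with a positive coefficient raises the
# probability; a negative coefficient lowers it.
print(f"{baseline:.3f} {with_dollar:.3f} {with_hp:.3f}")
```

Because each attribute is either 0 or 1, its contribution to the score is either nothing or exactly its coefficient, which is what makes the sign of the coefficient easy to interpret here.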
We will now train a Support Vector Machine (SVM) on our classification task. In the Classify tab, select Choose > functions > SMO (SMO stands for Sequential Minimal Optimization, which is an algorithm for training SVMs). Use the default parameters and click Start. This will train a linear SVM (which is quite similar to logistic regression). Again, examine the Classifier output frame and try answering the following:
What is the percentage of correctly classified instances? How does it compare to the result from Logistic Regression?
What are the coefficients for the attributes [word_freq_hp_binarized] and [char_freq_$_binarized]? Compare these to the ones you found with Logistic Regression.
How does a linear SVM relate to Logistic Regression? Hint: Consider the classification boundary learnt in each model.
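As a starting point for the comparison, the sketch below places the two models side by side. Both compute the same kind of linear score; they differ in the loss that training penalises (the logistic loss for Logistic Regression, the hinge loss for a soft-margin linear SVM, each usually combined with a regulariser). The weights, bias and helper names are hypothetical and are not what Weka prints.

```python
import math

def linear_score(weights, bias, x):
    """Score shared by both models: bias plus weighted sum of attributes."""
    return bias + sum(w * xi for w, xi in zip(weights, x))

def logistic_loss(y, score):
    """Loss used by Logistic Regression; y is +1 or -1."""
    return math.log(1.0 + math.exp(-y * score))

def hinge_loss(y, score):
    """Loss used by a linear SVM; zero once the margin y * score reaches 1."""
    return max(0.0, 1.0 - y * score)

weights, bias = [1.5, -2.0], 0.25  # hypothetical parameters

for x in ([1, 0], [0, 1], [1, 1], [0, 0]):
    s = linear_score(weights, bias, x)
    predicted = 1 if s > 0 else -1  # same decision rule for both models
    print(x, round(s, 2), predicted,
          round(logistic_loss(1, s), 3), round(hinge_loss(1, s), 3))
```

Note that the hinge loss is exactly zero for confidently correct points, whereas the logistic loss only approaches zero, so the two models can weight the same training points quite differently.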
We will now look at a few ways of assessing the performance of a classifier. To do so we will introduce a new data set, the Splice data set. The classification task is to identify intron and exon boundaries on gene sequences. Read the description at the link above for a brief overview of how this works. The class attribute can take on 3 values: N, IE and EI. Now download the data sets below, converted into ARFF for you, and load the training set into Weka:
We'll also use a new classifier. Under the Classify tab, select Choose > lazy > IBk. This is a k-nearest neighbour classifier.
In the Test options panel, select Use training set and hit Start.
Observe the output of the classifier and consider the following:
Now evaluate the classifier on the test set and check your expectations. In the Test options panel, select Supplied test set and load the file splice_test.arff. In the Result list panel, right-click on the classifier and select Re-evaluate model on current test set. Observe the output and consider the following:
Now explore the effect of the k parameter. To do this, train the classifier multiple times, each time setting the KNN option to a different value. Try 10, 100, 1200 and 2900 and test the classifier on the test set. Hint: To change the KNN option you need to bring up the options panel of the classifier.
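To build intuition for what the KNN option controls, here is a minimal k-nearest-neighbour sketch in plain Python. It is not IBk: the Hamming distance is used as a simple stand-in for comparing nominal attributes, and the seven sequences are made-up toy data, not taken from the Splice set.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(train, query, k):
    """Majority vote among the k training examples closest to query."""
    neighbours = sorted(train, key=lambda ex: hamming(ex[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy stand-in data (hypothetical sequences and labels).
train = [
    ("GTAAGT", "EI"), ("GTGAGT", "EI"),
    ("TTTCAG", "IE"), ("CTTCAG", "IE"),
    ("ACGTAC", "N"), ("CATGCA", "N"), ("TGCATG", "N"),
]

def training_accuracy(k):
    """Fraction of training examples classified correctly with the given k."""
    hits = sum(knn_predict(train, seq, k) == label for seq, label in train)
    return hits / len(train)

# With k = 1, every training example is its own nearest neighbour.
print(training_accuracy(1))  # 1.0

# With k equal to the training set size, every query gets the majority class.
print(knn_predict(train, "GTAAGA", len(train)))  # N
```

The two extremes bracket the behaviour to look for in Weka: very small k follows the training data closely, while very large k drifts towards always predicting the most frequent class.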