DME Lab Class 1
The lab classes this term will focus on the data mining software
available on the Informatics (.inf) systems.
The intention of this lab class is that you get your first introduction
to Weka. In the coming lab classes you will explore the different
functionalities of this tool, and also use some of the other
software packages we have installed for use in this data mining module. It is
important that you get used to these packages, as you may need
to use them in your mini-project.
Note: This is an introductory lab session. If you have worked with Weka
before (e.g. for the IAML course), some of the following tasks may seem
trivial. If so, feel free to skip the first few of them. Steps 6, 7 and 8 should
still be useful.
Please execute the following tasks on a DICE computer.
- In the lab classes, we will often make use of the Landsat data. Read at least the short
description of this dataset.
- We have converted the Landsat dataset to the ARFF
data format, which is used by Weka. Read the short introduction to
this data format and have a look at the Landsat data. The dataset is
divided into a training file,
sattrn.arff,
and a test file,
sattst.arff.
You can download these files with shift-left-click (or with
the right mouse button and the "save link target as" option).
(These instructions work for Mozilla; similar options exist
in other browsers.)
For your
mini-project, if you use Weka,
you will have to convert your data to the ARFF format yourself.
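For reference, an ARFF file is plain text: a header declaring the relation name and the attributes, followed by a @data section of comma-separated rows, one per instance. A minimal sketch of the format (the attribute names and class labels below are illustrative, not the actual Landsat ones):

```
% Lines starting with % are comments.
@relation landsat_example

@attribute pixel1_1 numeric
@attribute pixel1_2 numeric
@attribute class {1,2,3}

@data
92,115,1
84,102,3
```

Note that the values in each data row must appear in the same order as the @attribute declarations, and nominal attributes (like the class) list their allowed values in braces.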
- Now read about Weka. Chapter 8
of Witten and Frank's book gives an introduction to the structure
and the functionality of the package. It is mainly focused on the
command-line interface, whereas we will mostly use the GUI
version. There is hardly any documentation for the GUI, but after
reading the explanation of the command-line interface it should be largely
self-explanatory. You do not have to read sections 8.4 and
8.5 about embedded machine learning.
- Start up Weka by typing weka on the command line.
Choose the 'Explorer'
option, which brings you to the normal GUI ('Simple CLI' brings you to
the command-line interface). Click 'Open file' and load the
sattrn.arff data.
- The dataset is now visible on the left-hand side of the
screen. When you click on a variable name, you get some basic
statistical information about the variable at the bottom right of the
screen. The top right is where you apply filters to the data. The
other tab sheets are the 'Classify' sheet (where classifiers and
predictors are run), the 'Cluster' sheet (where clustering algorithms
can be run), the 'Associate' sheet (where the Apriori algorithm is run),
the 'Select attributes' sheet (where attribute selection methods can be
run), and the 'Visualize' sheet (where attributes can be plotted against
each other).
- Look at different plots of the variables in the 'Visualize' sheet
to get a feel for this Weka functionality. In the central graph, two
attributes can be plotted against each other, with a third used as
the overlay colour. The bands on the right are 1-dimensional plots for each
attribute. Jitter is random noise that can be added to the data to
prevent points from overlapping.
You can select variables for the 2-d scatterplot using
either the pull-down menus for X:, Y: and Colour, or by clicking
in the 1-d plots for each attribute (left mouse button for X,
middle or right button for Y).
Try to find a 2-d scatterplot in which the classes are well separated.
[Hint: Concentrate on pixel 5, as
this is the central one, the one that actually has to be classified.]
Judging from the plots, do you think this is a difficult
classification task? Which classes do you think will be easiest to
predict, and which will be more difficult?
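The jitter idea above is simple enough to sketch directly. Here is a minimal illustration (not Weka's actual implementation): each coordinate is perturbed by a small uniform offset, so that instances with identical values no longer plot on top of each other.

```python
import random

def jitter(values, amount=0.5, seed=0):
    """Return a copy of values with uniform noise in [-amount, amount]
    added to each element, so repeated values no longer coincide."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

# Many identical coordinates would overlap in a scatterplot...
xs = [92, 92, 92, 84, 84]
jittered = jitter(xs, amount=0.5)

# ...after jittering they are distinct but stay close to the originals.
assert all(abs(j - x) <= 0.5 for j, x in zip(jittered, xs))
assert len(set(jittered)) == len(jittered)
```

Because the offsets are small relative to the pixel intensity range, the overall shape of the point cloud is preserved while individual instances become visible.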
- Weka has a whole range of functionalities for attribute
selection. Go to the 'Select attributes' tab sheet. Choose the Information
Gain attribute selection method from the Attribute Evaluator
box. Choose the Ranker search method in the Search Method box;
you can indicate the number of attributes to retain to be 10
or so. Information Gain is another name for the mutual
information between two variables, IG(C; X) = H(C) - H(C|X);
in attribute selection we are interested
in the mutual information between the class and each of the other attributes.
Which attributes obtain the highest score? Would you logically
expect pixel6_1 and pixel6_2 to be more important than pixel5_4? Go
back to visualisation and plot pixel6_1 against pixel5_1, or pixel6_2
against pixel5_2. Can you spot a problem with information gain
attribute selection here?
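The information gain score that Weka computes for each attribute can be worked out by hand: it is the entropy of the class distribution minus the class entropy that remains once the attribute's value is known. A minimal sketch on made-up toy data (not the Landsat values), assuming the attribute has already been discretised:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Information gain IG(C; X) = H(C) - H(C|X) for a discrete attribute."""
    n = len(labels)
    # Group the class labels by the value the attribute takes.
    groups = {}
    for x, c in zip(attr_values, labels):
        groups.setdefault(x, []).append(c)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

# Toy data: attribute a predicts the class perfectly, attribute b is useless.
labels = ['soil', 'soil', 'crop', 'crop']
a = ['lo', 'lo', 'hi', 'hi']   # value determines the class exactly
b = ['p', 'q', 'p', 'q']       # value carries no class information

assert abs(info_gain(a, labels) - 1.0) < 1e-9  # H(C) = 1 bit, fully explained
assert abs(info_gain(b, labels) - 0.0) < 1e-9
```

Ranking attributes by this score is exactly what the Ranker search method does; as the exercise above hints, a high score only means an attribute is individually informative, not that it adds information beyond the attributes already selected.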
- When you perform attribute selection in the 'Select attributes' tab
sheet, you get as output the list of selected attributes. The data
file that is used throughout the program is still the same, though. If
you want the attribute selection to have an effect in the other tab
sheets (that is, if you want to work on the reduced data set), you
have to perform the attribute selection in the 'Preprocess' tab
sheet. There on the right hand side you will find the 'Filters'. You
can select the 'AttributeSelectionFilter', and essentially do the same
as in the 'Select attributes' sheet. You then press 'Add', 'Apply
Filters' and 'Replace'.
You have now replaced the working relation in the Weka environment.
- To exit Weka, press Ctrl-C in the Linux terminal from which you started
Weka.