DME Lab Class 1
The lab classes this term will focus on the data mining software
available on the Informatics (.inf) systems.
The intention of this lab class is that you get your first introduction
to Weka. In the coming lab classes you will explore the different
functionalities of this tool, and also use some of the other
software packages we have installed for use in this data mining module. It is
important that you get used to these packages, as you may need
to use them in your mini-project.
Note: This is an introductory lab session. If you have worked with Weka
before (e.g. for the IAML course), some of the following tasks may seem
trivial. If so, feel free to skip the first few of them. Steps 6, 7 and 8 should
still be useful.
Please execute the following tasks on a DICE computer.
- In the lab classes, we will often make use of the Landsat data. Read at least the short
description of this dataset.
- We have converted the Landsat dataset to the ARFF
data format, which is used by Weka. Read the short introduction to
this data format and have a look at the Landsat data. The dataset is
divided into a training file,
sattrn.arff,
and a test file,
sattst.arff.
You can download these files with shift-left-click (or with
the right mouse button and the "save link target as" option).
(These instructions work for Mozilla; similar options exist
in other browsers.)
For your
mini-project, if you use Weka,
you will have to convert your data to the ARFF format yourself.
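For reference, an ARFF file is plain text: a header declaring the relation name and the attributes, followed by a @data section of comma-separated rows, one per instance. A minimal sketch of the format (the attribute names and class labels below are illustrative, not the actual Landsat ones):

```
% Lines starting with % are comments.
@relation landsat_example

@attribute pixel1_1 numeric
@attribute pixel1_2 numeric
@attribute class {1,2,3}

@data
92,115,1
84,102,3
```

Note that the values in each data row must appear in the same order as the @attribute declarations, and nominal attributes (like the class) list their allowed values in braces.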
- Now read about Weka. Chapter 8
of Witten and Frank's book gives an introduction to the structure
and the functionality of the package. It is mainly focused on the
command-line interface, whereas we will mostly use the GUI
version. There is hardly any documentation for the GUI, but after
reading the explanation of the command-line interface it should be largely
self-explanatory. You do not have to read sections 8.4 and
8.5 about embedded machine learning.
- Start up Weka by typing weka on the command line.
Choose the 'Explorer'
option, which brings you to the normal GUI ('Simple CLI' brings you to
the command-line interface). Click 'Open file' and load the
sattrn.arff data.
- The dataset is now visible on the left-hand side of the
screen. When you click on a variable name, you get some basic
statistical information about the variable at the bottom right of the
screen. The top right is where you apply filters to the data. The
other tab sheets are the 'Classify' sheet (where classifiers and
predictors are run), the 'Cluster' sheet (where clustering algorithms
can be run), the 'Associate' sheet (where the Apriori algorithm is run),
the 'Select attributes' sheet (where attribute selection methods can be
run), and the 'Visualize' sheet (where attributes can be plotted against
each other).
- Look at different plots of the variables in the 'Visualize' sheet
to get a feel for this Weka functionality. In the central graph, two
attributes can be plotted against each other, with a third used as
the overlay colour. The bands on the right are 1-dimensional plots for each
attribute. Jitter is random noise that can be added to the data to
prevent points from overlapping.
You can select variables for the 2-d scatterplot using
either the pull-down menus for X:, Y: and Colour, or by clicking
in the 1-d plots for each attribute (left mouse button for X,
middle or right button for Y).
Try to find a 2-d scatterplot in which the classes are well separated.
[Hint: Concentrate on pixel 5, as
this is the central one, the one that actually has to be classified.]
Judging from the plots, do you think this is a difficult
classification task? Which classes do you think will be easiest to
predict, and which will be more difficult?
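The jitter idea above is simple enough to sketch directly. Here is a minimal illustration (not Weka's actual implementation): each coordinate is perturbed by a small uniform offset, so that instances with identical values no longer plot on top of each other.

```python
import random

def jitter(values, amount=0.5, seed=0):
    """Return a copy of values with uniform noise in [-amount, amount]
    added to each element, so repeated values no longer coincide."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]

# Many identical coordinates would overlap in a scatterplot...
xs = [92, 92, 92, 84, 84]
jittered = jitter(xs, amount=0.5)

# ...after jittering they are distinct but stay close to the originals.
assert all(abs(j - x) <= 0.5 for j, x in zip(jittered, xs))
assert len(set(jittered)) == len(jittered)
```

Because the offsets are small relative to the pixel intensity range, the overall shape of the point cloud is preserved while individual instances become visible.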
- Weka has a whole range of functionalities for attribute
selection. Go to the 'Select attributes' tab sheet. Choose the Information
Gain attribute selection method from the Attribute Evaluator
box. Choose the Ranker search method in the Search Method box;
you can indicate the number of attributes to retain to be 10
or so. Information Gain is another name for the mutual
information between two variables, IG(C; X) = H(C) - H(C|X);
in attribute selection we are interested
in the mutual information between the class and each of the other attributes.
Which attributes obtain the highest score? Would you logically
expect pixel6_1 and pixel6_2 to be more important than pixel5_4? Go
back to visualisation and plot pixel6_1 against pixel5_1, or pixel6_2
against pixel5_2. Can you spot a problem with information gain
attribute selection here?
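The information gain score that Weka computes for each attribute can be worked out by hand: it is the entropy of the class distribution minus the class entropy that remains once the attribute's value is known. A minimal sketch on made-up toy data (not the Landsat values), assuming the attribute has already been discretised:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr_values, labels):
    """Information gain IG(C; X) = H(C) - H(C|X) for a discrete attribute."""
    n = len(labels)
    # Group the class labels by the value the attribute takes.
    groups = {}
    for x, c in zip(attr_values, labels):
        groups.setdefault(x, []).append(c)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - cond

# Toy data: attribute a predicts the class perfectly, attribute b is useless.
labels = ['soil', 'soil', 'crop', 'crop']
a = ['lo', 'lo', 'hi', 'hi']   # value determines the class exactly
b = ['p', 'q', 'p', 'q']       # value carries no class information

assert abs(info_gain(a, labels) - 1.0) < 1e-9  # H(C) = 1 bit, fully explained
assert abs(info_gain(b, labels) - 0.0) < 1e-9
```

Ranking attributes by this score is exactly what the Ranker search method does; as the exercise above hints, a high score only means an attribute is individually informative, not that it adds information beyond the attributes already selected.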
- When you perform attribute selection in the 'Select attributes' tab
sheet, you get as output the list of selected attributes. The data
file that is used throughout the program is still the same, though. If
you want the attribute selection to have an effect in the other tab
sheets (that is, if you want to work on the reduced data set), you
have to perform the attribute selection in the 'Preprocess' tab
sheet. There on the right hand side you will find the 'Filters'. You
can select the 'AttributeSelectionFilter', and essentially do the same
as in the 'Select attributes' sheet. You then press 'Add', 'Apply
Filters' and 'Replace'.
You have now replaced the working relation in the Weka environment.
- To exit Weka, press Ctrl-C in the Linux terminal from which you started
Weka.