DME Lab Class 2

In this tutorial we will look at exploratory data analysis. We will do data visualization in GGobi and clustering in Weka.

  1. GGobi is a data visualization tool which offers many more possibilities than the simple 2-D scatter plots in Weka. It is obviously not the aim of this tutorial to make you GGobi experts, but just to familiarise you with the tool so you can use it for your project. GGobi uses different data formats from Weka; in particular, both XML and CSV files can be used to import data into GGobi.

    Download the CSV version of the Landsat dataset here. You can download the file by shift-left-clicking (or by right-clicking and choosing the 'save link target as' option). Start GGobi by typing 'ggobi satgobi.csv' in the directory where the dataset is saved.
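
    If you want to peek at the dataset outside GGobi, the short Python sketch below does the same kind of sanity check. It is not part of the lab itself, and it assumes the satgobi.csv file you just downloaded has a header row of attribute names.

        # Load the Landsat CSV with pandas and inspect its shape and first rows.
        import pandas as pd

        data = pd.read_csv("satgobi.csv")   # the file downloaded above
        print(data.shape)                   # (number of instances, number of attributes)
        print(data.head())                  # the first few rows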

  2. When the program starts up, you get the traditional 2-D scatter plot which was also available in Weka (without the class-based colouring, though). You can select a variable as the X- or Y-axis of the plot by clicking the corresponding button next to each variable.

    You can also get 3-D plots of the data by selecting 'Rotation' under GGobi's 'View' menu. As with 2-D scatter plots, you assign a variable to each of the 3 axes. GGobi auto-rotates the 3-D plot so you can easily get a feel for the data's distribution across the 3 selected variables. You can change the rotation's speed or pause it completely using GGobi's interface. Play with this functionality a bit to get used to it, and try to discover what each element of GGobi's interface does in 'Rotation' mode. If you have any difficulty, just ask!
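
    For reference, a rough Python analogue of the 'Rotation' view is sketched below, using matplotlib's 3-D axes (which you rotate by dragging rather than automatically). The column names pixel5_1, pixel5_2 and pixel5_4 are assumed to match the headers in satgobi.csv.

        # Static 3-D scatter plot of three attributes, roughly mimicking
        # GGobi's Rotation view; drag the axes with the mouse to rotate.
        import pandas as pd
        import matplotlib.pyplot as plt

        data = pd.read_csv("satgobi.csv")
        ax = plt.figure().add_subplot(projection="3d")
        ax.scatter(data["pixel5_1"], data["pixel5_2"], data["pixel5_4"], s=5)
        ax.set_xlabel("pixel5_1")
        ax.set_ylabel("pixel5_2")
        ax.set_zlabel("pixel5_4")
        plt.show()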

  3. An interesting feature of GGobi is 'brushing'. This allows you to colour some data points and keep these colours active in other views of the data. We'll work through a simple example. Set up a rotation of three attributes (e.g. pixel5_1, pixel5_2 and pixel5_4) as described above. Pause the rotation when you can distinguish a group of data points which is more or less set apart from the rest, and select 'Brush' in the 'Interaction' menu. Select the desired colour for brushing by clicking the 'Choose color and glyph' button, and select the 'persistent' option. Colour the data and then investigate other plots to see where this group of data points is situated. If you plot the class variable on the Z-axis, you may find that your group of data points makes up a big part of one class. This depends on the selected attributes and on the 'cluster' of data points you brushed, so results will vary.

    Brushing can be used for outlier detection: you can mark points which are outliers in one view, and then investigate whether they are also outliers in other views. Outlier detection can be a way to detect fraud, or simply a means of cleaning faulty instances out of your dataset.
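
    The same idea can be mimicked outside GGobi. The sketch below (again assuming the column names used above) 'brushes' an arbitrary group of points, here the top decile of pixel5_1, and carries the same colours over to a second view.

        # Crude brushing analogue: colour a group of points and reuse the
        # same colours in a different 2-D view of the data.
        import pandas as pd
        import matplotlib.pyplot as plt

        data = pd.read_csv("satgobi.csv")
        brushed = data["pixel5_1"] > data["pixel5_1"].quantile(0.9)  # arbitrary brush
        colours = ["red" if b else "grey" for b in brushed]

        fig, (ax1, ax2) = plt.subplots(1, 2)
        ax1.scatter(data["pixel5_1"], data["pixel5_2"], c=colours, s=5)
        ax2.scatter(data["pixel5_4"], data["pixel6_1"], c=colours, s=5)
        plt.show()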

  4. Now, let us return to Weka. Start up Weka and load the sattrn.arff file. In the 'Cluster' tab sheet you have a choice of a few different clustering algorithms. Choose the EM clustering scheme, and indicate that you want 6 clusters. This will fit a 6-component multivariate Gaussian mixture to the data; each component is assumed to have a diagonal covariance matrix. The data contains six classes (numbered 1, 2, 3, 4, 5, 7). Then under 'Cluster mode' select the 'Classes to clusters evaluation' option. Start the algorithm and observe the results. As you can see, the clusters are compared to the classes; this gives you an idea of how well your classes cluster together naturally. The means and standard deviations of each component are also output.
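
    As a point of comparison, the sketch below runs the same kind of experiment with scikit-learn's GaussianMixture rather than Weka's EM (the two implementations are not identical). It assumes you work from the CSV version of the data, with the class label in the last column.

        # Fit a 6-component Gaussian mixture with diagonal covariances and
        # cross-tabulate the resulting clusters against the known classes.
        import pandas as pd
        from sklearn.mixture import GaussianMixture

        data = pd.read_csv("satgobi.csv")
        X, y = data.iloc[:, :-1], data.iloc[:, -1]

        gm = GaussianMixture(n_components=6, covariance_type="diag", random_state=0)
        clusters = gm.fit_predict(X)

        print(pd.crosstab(y, clusters))   # 'classes to clusters' style table
        print(gm.means_)                  # one row of attribute means per component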

    Repeat the exercise using the K-means clusterer.
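
    Under the same assumptions about the data layout, the K-means counterpart of the sketch above is a small change:

        # Cluster with K-means instead of a Gaussian mixture and compare
        # the cluster assignments against the classes as before.
        import pandas as pd
        from sklearn.cluster import KMeans

        data = pd.read_csv("satgobi.csv")
        X, y = data.iloc[:, :-1], data.iloc[:, -1]

        km = KMeans(n_clusters=6, n_init=10, random_state=0)
        clusters = km.fit_predict(X)
        print(pd.crosstab(y, clusters))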

  5. Try running both clustering algorithms with fewer than 6 clusters, 3 or 4 for example. Which classes cluster together and which are clearly set apart? Which classes are spread over different clusters? What could this mean, and how does it relate to the observations you made while visualising the data in GGobi and Weka?

  6. Now go to the 'Preprocess' sheet, and perform information gain attribute selection (see Lab Class 1 if you don't remember how to do this). Keep only the 5 best attributes and save the reduced dataset under a different filename for later use. If you run the K-means clustering algorithm again now (with 6 clusters), what do you think will happen? Will the class matching improve?
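
    If you want a scripted counterpart to this step, the sketch below ranks attributes with scikit-learn's mutual_info_classif, which is related to (though not the same as) Weka's information gain criterion, and writes out a reduced dataset. The class label is again assumed to be the last column, and the output filename is just an example.

        # Rank attributes by estimated mutual information with the class,
        # keep the 5 best, and save the reduced dataset for later use.
        import pandas as pd
        from sklearn.feature_selection import mutual_info_classif

        data = pd.read_csv("satgobi.csv")
        X, y = data.iloc[:, :-1], data.iloc[:, -1]

        scores = mutual_info_classif(X, y, random_state=0)
        ranked = sorted(zip(X.columns, scores), key=lambda p: p[1], reverse=True)
        best5 = [name for name, _ in ranked[:5]]
        print(best5)                                  # the 5 best-scoring attributes
        data[best5 + [data.columns[-1]]].to_csv("sat_reduced.csv", index=False)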

    And what happens if you replace pixel4_1 with pixel5_4 in the reduced dataset? You can do this easily: load the initial sattrn.arff dataset again. In the Preprocess tab, select the checkboxes next to attributes pixel5_1, pixel5_2, pixel5_4, pixel6_1, pixel6_2 and the class attribute. Click 'Invert', so that every attribute apart from those 6 is selected, and then click 'Remove'. Run K-means again (with 6 clusters) and observe the change in Incorrectly Clustered Instances. Does this confirm what we mentioned about the attribute selection method in the previous tutorial? If you want, use 'Undo' to return to the initial dataset and apply the same procedure again, this time keeping other combinations of 5 attributes.

  7. If you have time left, you can try out the projection pursuit functionality in GGobi. Choose the '2D Tour' option under 'View'. Select some of the attributes by clicking on their corresponding X's. Selecting 3 attributes mimics the functionality of the 'Rotation' view. Selecting 4 or more illustrates the differences between the two views.

    The 2D tour presents you with a continuous sequence of 2-dimensional projections of n-dimensional data (they are linear projections: both 2-D coordinates are linear combinations of all n attributes). These projections should be representative of all possible projections. When you see an interesting projection (where some classes are clearly set apart from others), you can pause the tour and, if you want, see the projection's coefficients (check 'Show Projection Vals' under the 'Tour2D' menu).
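
    To see what a single frame of the tour amounts to, the sketch below projects the (assumed numeric) attributes onto one random pair of orthonormal directions; the tour is essentially a smooth sequence of such frames.

        # One random 2-D linear projection of the n-dimensional data: both
        # plotted coordinates are linear combinations of all n attributes.
        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt

        data = pd.read_csv("satgobi.csv")
        X = data.iloc[:, :-1].to_numpy()             # class assumed to be last

        rng = np.random.default_rng(0)
        Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], 2)))
        Z = X @ Q                                    # n-D points -> one 2-D frame
        plt.scatter(Z[:, 0], Z[:, 1], s=5)
        plt.show()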

    GGobi offers one extra feature on top of this: projection pursuit, which looks for interesting 2-dimensional projections of the data. It does this by optimising a projection index (e.g. by maximising the 'empty' area between the data points). Click on 'Projection pursuit' and then on 'Optimize'. Different indices (under PP Index) will lead to different optimal projections. Also, bear in mind that these are local optima, so unselecting 'Optimize' and reselecting it later will give you a different view.
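
    GGobi optimises its projection indices properly; the toy sketch below only approximates the idea by scoring many random projections with a simple 'holes'-style index (high when the centre of the projection is empty) and keeping the best one.

        # Toy projection pursuit: random search over 2-D projections,
        # scored by how empty the centre of the projected data is.
        import numpy as np
        import pandas as pd
        import matplotlib.pyplot as plt

        def holes_index(Z):
            Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)   # standardise the frame
            return 1.0 - np.mean(np.exp(-0.5 * (Z ** 2).sum(axis=1)))

        data = pd.read_csv("satgobi.csv")
        X = data.iloc[:, :-1].to_numpy()               # class assumed to be last

        rng = np.random.default_rng(0)
        best_Z, best_score = None, -np.inf
        for _ in range(200):                           # crude random search
            Q, _ = np.linalg.qr(rng.standard_normal((X.shape[1], 2)))
            Z = X @ Q
            score = holes_index(Z)
            if score > best_score:
                best_Z, best_score = Z, score

        plt.scatter(best_Z[:, 0], best_Z[:, 1], s=5)
        plt.title(f"toy holes index = {best_score:.3f}")
        plt.show()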

