IRDS: Lab Session 1: MATLAB

MATLAB is a powerful system for numeric computation and visualization. It is widely used in machine learning and computer vision because of its strong language and library support for numerical linear algebra.

As this lab is optional and is not marked, you may use any materials that you wish to help you and collaborate freely with other students in the class.

Pre-lab work

If you have not used MATLAB before, you should go through a basic tutorial to learn how to use it. You should do this before the lab session, so that you can make the best use of your time with the tutor during the lab.

Here is a tutorial describing basic MATLAB commands.
Here is a more compact cheat sheet.
You can look at the MATLAB Onramp from MathWorks (the developers of MATLAB) at the MATLAB academy.

Lab work

In this lab, we will perform a simple image clustering task. We will use a small dataset of images of faces collected at AT&T Research in the early 90s. We will cluster the images and see if images of the same person tend to end up in the same cluster.

First we want to load the images from disk. Write code that loads all 400 images into a single 400 by 10304 matrix. In this matrix, each row will contain the pixels of one image as a vector. (Each image is 92 x 112 pixels, and 92 x 112 = 10304.) To do this, note that:
1. The image files live in a directory on DICE called /afs/inf.ed.ac.uk/group/ML/IRDS/lab1/orl_faces/. (They were downloaded from here.)
2. The imread function will load an image into a two-dimensional array. As with any MATLAB function, if you haven't used it before it's a good idea to try it out from the interpreter window first to make sure that its input and output are what you expect.
3. You will want to write a for loop to load all of the image files. One way to do this is to use the sprintf function to construct a file name, e.g. sprintf('s%d/%d.pgm', 1, 1) returns the string 's1/1.pgm', which is the path of one of the image files relative to the directory above.
4. You will also want to use the reshape function to shape a matrix into a row vector.
5. It's easiest to create a blank matrix of the size that you need:
```
       X = zeros(400, 10304);
```
  and then write a for loop that fills each row of the matrix with one of the images.
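The notes above can be put together roughly as follows. This is only a sketch under some assumptions: that the directory holds 40 subject folders named s1 to s40 with files 1.pgm to 10.pgm in each (which would account for the 400 images), and the variable names dataDir, nSubjects, nPerSubject and row are made up for illustration. If the directory cannot be found, the sketch just warns and leaves X as zeros.

```matlab
% Sketch: load the face images into the rows of X.
% Assumption: 40 subject directories s1..s40, each holding 1.pgm..10.pgm.
dataDir = '/afs/inf.ed.ac.uk/group/ML/IRDS/lab1/orl_faces/';
nSubjects = 40;
nPerSubject = 10;
X = zeros(nSubjects * nPerSubject, 10304);

if ~exist(dataDir, 'dir')
    warning('Directory %s not found; X is left as all zeros.', dataDir);
else
    row = 0;
    for s = 1:nSubjects
        for i = 1:nPerSubject
            fname = fullfile(dataDir, sprintf('s%d/%d.pgm', s, i));
            img = imread(fname);                      % two-dimensional array of pixels
            row = row + 1;
            X(row, :) = reshape(double(img), 1, []);  % flatten column by column
        end
    end
end
```

With this loop order, rows 1 to 10 of X would belong to the first subject, rows 11 to 20 to the second, and so on. It is worth checking size(img) for one file before trusting the 10304 in the preallocation.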
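Look at a few of the images. You can visualize an image using the imshow function. You can find good documentation for this using doc imshow. Remember that each row of the matrix is a flattened image, so you need to reshape it back into a two-dimensional array first. One caveat: for arrays of type double, imshow assumes the image values lie between 0 and 1, but the data contains numbers from 0 ... 255. To get around this, pass the display range [0 255] as the second argument of imshow.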
Using the subplot command along with imshow, view two images in the same plot window.
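As a rough illustration of viewing two images side by side, here is a sketch that runs on stand-in random data so that it does not depend on the files; swap in your own X. The variable imgSize is an assumption: it supposes that imread returned arrays with 112 rows and 92 columns, which you should confirm with size. The display range [0 255] matches the 0 ... 255 pixel values.

```matlab
% Stand-in data so the snippet runs on its own; use the real X from the lab.
X = randi([0 255], 400, 10304);
imgSize = [112 92];   % assumed [rows cols] of one image; check with size(img)

figure;
subplot(1, 2, 1);                                 % left panel
imshow(reshape(X(1, :), imgSize), [0 255]);
title('Row 1');
subplot(1, 2, 2);                                 % right panel
imshow(reshape(X(11, :), imgSize), [0 255]);
title('Row 11');
```

The reshape call undoes the flattening, so it only reproduces the original image if the same column-by-column order was used when the row was created.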
Now run $k$-means clustering on the images. The simplest way to do this is to use the kmeans function from the Statistics Toolbox. If you are experienced with MATLAB, you might prefer to implement the clustering algorithm yourself; it is not too hard.
Whichever implementation of $k$-means you use, you should end up with a vector of length 400 in which each element indicates the cluster assignment for one training instance. Choose a few subjects in the training set, and manually examine which clusters their images were assigned to.
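If you go with the toolbox function, the call might look like the sketch below. The choice k = 40 is just an assumption for illustration (one cluster per subject), the data is a random stand-in, and the row ranges assume that the matrix is ordered by subject with ten images each.

```matlab
% Stand-in data so the snippet runs on its own; use the real X from the lab.
X = randi([0 255], 400, 10304);

k = 40;      % illustrative choice, not prescribed by the lab
rng(0);      % fix the random initialisation so runs are repeatable
idx = kmeans(X, k, 'EmptyAction', 'singleton');   % idx is 400 x 1

% Cluster assignments of rows 1..10 and 11..20
% (subjects 1 and 2 if the rows are ordered by subject).
disp(idx(1:10)');
disp(idx(11:20)');
```

The cluster numbers themselves are arbitrary and change with the initialisation, so what matters is whether the ten images of one subject share a number, not which number it is.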
Use logical indexing to get a vector that contains the indices of all images that are assigned to cluster 3. How many images are assigned to cluster 3? (Hint: The sum function works well on logical vectors.) Use logical indexing to obtain an N by 10304 matrix that contains only the N images in cluster 3.
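A minimal sketch of the indexing involved, using stand-in values for both the data and the cluster assignments (the names inCluster, members and Xc are made up here):

```matlab
% Stand-ins so the snippet runs on its own; use your own X and idx.
X = randi([0 255], 400, 10304);
idx = randi(40, 400, 1);

inCluster = (idx == 3);      % logical vector, true where the image is in cluster 3
members = find(inCluster);   % row numbers of those images
N = sum(inCluster);          % number of images in cluster 3
Xc = X(inCluster, :);        % N x 10304 matrix with only those images
```

Indexing with the logical vector and indexing with the row numbers from find select the same rows, so X(members, :) would give the same matrix.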
Especially with image data, it is often good to visualize the clusters directly. Using the subplot function, create a plot with 5 columns and as many rows as you need that shows all of the images in cluster 3.
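One possible way to lay out such a grid is sketched below, again on stand-in data. The number of rows comes from rounding N / 5 up; imgSize is the same assumption about the image shape as before (112 rows by 92 columns), and the display range [0 255] matches the 0 ... 255 pixel values.

```matlab
% Stand-ins so the snippet runs on its own; use your own X and idx.
X = randi([0 255], 400, 10304);
idx = randi(40, 400, 1);
imgSize = [112 92];                  % assumed [rows cols] of one image

Xc = X(idx == 3, :);                 % images in cluster 3
N = size(Xc, 1);
nCols = 5;
nRows = max(1, ceil(N / nCols));     % enough rows to hold all N images

figure;
for i = 1:N
    subplot(nRows, nCols, i);        % panels are numbered row by row
    imshow(reshape(Xc(i, :), imgSize), [0 255]);
end
```

If the cluster happens to be empty, the loop simply does not run and the figure stays blank.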

Extra Credit

This lab contributes a total of zero points to your overall mark in the course. If you are really keen, you may also try some of the tasks below. If you succeed, the extra credit will be that we will double your mark for the lab. :-)

Rewrite the answer to your last question as a MATLAB function. That is, write a MATLAB function that takes as input (a) the data matrix, (b) the results of $k$-means, and (c) the cluster identity, and displays one plot with all of the images in the cluster.
It is often important to have a numerical measure of the quality of a clustering, for example, when you want to compare different clustering algorithms or different distance measures for clustering. The easiest way to do this is by comparing the clusters to a partition that is known to be meaningful. There are several measures that quantify the level of agreement between two partitions of the data, such as purity, precision and recall, normalized mutual information, and the Rand index.

The simplest is purity. Let $\Omega = (\omega_1, \omega_2, \ldots, \omega_K)$ be the partition of the data set $D$ that is returned by the clustering algorithm, i.e., each $\omega_k \subseteq D$, the $\omega_k$ are pairwise disjoint, and their union is $D$. Similarly, let $C = (c_1, c_2, \ldots, c_J)$ be the partition of the data set induced by the gold standard labels.

Purity says: What accuracy would we get at predicting the gold standard labels from the cluster ids, assuming that we matched up clusters to labels in the best possible way? More formally, this is $$\mbox{purity}(\Omega, C) = \frac{1}{N} \sum_{k=1}^K \max_{j} |\omega_k \cap c_j|,$$ where $N = |D|$ is the number of data points. Compute the purity of the clustering that you have obtained. You may find the mode function useful, which returns the most common element in a vector.
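To make the definition concrete, here is a small sketch on a toy example with six points, two clusters and two gold labels (all values invented for illustration). For each cluster it counts how often the most common gold label occurs, then divides the total by the number of points so that the result reads as an accuracy.

```matlab
% Toy example: six points, two clusters, two gold labels.
idx    = [1 1 1 2 2 2]';   % cluster assignments
labels = [1 1 2 2 2 1]';   % gold standard labels

correct = 0;
for k = 1:max(idx)
    members = labels(idx == k);        % gold labels of the points in cluster k
    if ~isempty(members)
        [~, count] = mode(members);    % count = frequency of the most common label
        correct = correct + count;
    end
end
purity = correct / numel(labels);      % 4 of the 6 points match, so purity = 2/3
```

For the face data, one way to build the gold labels would be ceil((1:400)' / 10), assuming that the rows of the data matrix are ordered by subject with ten images each.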
Plot the purity as a function of $k$, the number of clusters that you allow $k$-means. Because $k$-means takes a bit of time, choose only a few values of $k$, e.g., 1, 2, 5, 25, 50, 100. What happens as $k$ becomes large? Why?
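A sketch of how such a sweep over $k$ could be organised is given below. It runs on small random stand-in data, so the resulting curve says nothing about the faces. The gold labels assume rows ordered by subject with ten images each, and this time purity is computed from a table of counts built with accumarray rather than with mode.

```matlab
% Stand-ins so the snippet runs on its own; use the real X from the lab.
X = rand(400, 50);
labels = ceil((1:400)' / 10);     % assumed gold labels: 40 subjects, 10 rows each

ks = [1 2 5 25 50 100];
purities = zeros(size(ks));
rng(0);                           % repeatable initialisation
for t = 1:numel(ks)
    idx = kmeans(X, ks(t), 'EmptyAction', 'singleton');
    counts = accumarray([idx labels], 1);   % counts(k, j) = |omega_k intersect c_j|
    purities(t) = sum(max(counts, [], 2)) / numel(labels);
end

figure;
plot(ks, purities, 'o-');
xlabel('k');
ylabel('purity');
```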
Run principal component analysis on the images. (MATLAB has a princomp function for this, superseded by pca in newer releases, or alternatively you can call svd directly.) There are several fun things that you can do with the results: a) You can plot the principal components as if they were images (these are called eigenfaces, mostly because the name sounds cool). b) You can plot the embeddings of the images in a lower dimensional space. Color each point according to which human subject it shows. Do photos of the same subject tend to be embedded near each other?
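If you take the svd route, the computation might look like the following sketch, run here on random stand-in data. The names mu, Xc and Z are made up, imgSize is the same assumption about the image shape as elsewhere (112 rows by 92 columns), and the labels assume rows ordered by subject with ten images each.

```matlab
% Stand-in data so the snippet runs on its own; use the real X from the lab.
X = rand(400, 10304);
imgSize = [112 92];                   % assumed [rows cols] of one image
labels = ceil((1:400)' / 10);         % assumed subject of each row

mu = mean(X, 1);
Xc = bsxfun(@minus, X, mu);           % centre each pixel (column) at zero
[U, S, V] = svd(Xc, 'econ');          % columns of V are the principal components

% a) First principal component shown as an image; [] rescales to its own range.
figure;
imshow(reshape(V(:, 1), imgSize), []);

% b) Two-dimensional embedding, coloured by subject.
Z = U(:, 1:2) * S(1:2, 1:2);          % same as Xc * V(:, 1:2)
figure;
scatter(Z(:, 1), Z(:, 2), 20, labels, 'filled');
xlabel('PC 1');
ylabel('PC 2');
```

The economy-size decomposition keeps V at 10304 by 400 instead of 10304 by 10304, which is much cheaper when there are far more pixels than images.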