
IRDS Lab X1: MALLET and Topic Modelling

January 2011. Krzysztof Gorgolewski
February 2012. Victor Hernandez-Urbina
School of Informatics, University of Edinburgh.

Latent Dirichlet Allocation with MALLET

In the first part of this lab we are going to have a look at MALLET, a machine learning toolbox that implements (among other algorithms) latent Dirichlet allocation (LDA). Recall from the lectures that LDA can be seen as a probabilistic, Bayesian analogue of principal component analysis. In the second part, we will work through a small but instructive example of latent semantic analysis (LSA) in R. Let's start!

For some introductions to topic modelling, see

A less technical introduction: D. Blei. Probabilistic topic models. Communications of the ACM, 55(4): 77-84, 2012.
A more technical introduction: D. Blei and J. Lafferty. Topic Models. In A. Srivastava and M. Sahami, editors, Text Mining: Classification, Clustering, and Applications. Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, 2009.

First you need to download MALLET and create an alias to make your life easier:

```
wget http://mallet.cs.umass.edu/dist/mallet-2.0.6.tar.gz
tar zxvf mallet-2.0.6.tar.gz
alias mallet={write here the path where you extracted the file}/mallet-2.0.6/bin/mallet
```

Also have a look at the MALLET documentation related to topic analysis.

The data set you will be working on today can be found here. Remember to uncompress this file before moving on.
Next, import the dataset into a format that MALLET understands (this may take a while):
```
mallet import-file --input imdb-reviews.txt --output imdb-reviews.mallet --keep-sequence
```
Now we are ready to train our model. We will train 50 topics for 100 iterations. We will also save the final set of topics to imdb-reviews-topics.txt and the inferencer object to inferencer.mallet; the inferencer will be used later to infer topics for new documents. (This may also take a while.)
```
mallet train-topics --input imdb-reviews.mallet --num-topics 50 --inferencer-filename inferencer.mallet --num-iterations 100 --output-topic-keys imdb-reviews-topics.txt
```
Looking at the generated topics, you can see that something is clearly wrong: some of the words (very common ones) should not be included in the analysis. You can exclude common words during import either by setting the '--remove-stopwords' flag or by providing your own list with '--stoplist-file' (a file with one word per line). Import the file again with one of these options, and retrain the model, before continuing.
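As a small sketch of the custom stoplist option: the file name my-stoplist.txt and the words in it are made up for illustration, and the MALLET variable is an assumption standing in for the alias defined earlier (aliases are not expanded inside scripts). The import step only runs if that executable and the data file are actually present.

```shell
# Hypothetical stoplist: one word per line, as --stoplist-file expects.
printf '%s\n' the a an and of to in is it this that film movie > my-stoplist.txt

# Sanity check: the number of lines should equal the number of words written.
wc -l < my-stoplist.txt

# Assumption: MALLET points at the bin/mallet script (defaults to "mallet" on PATH).
MALLET=${MALLET:-mallet}

# Re-import with the custom list, but only if MALLET and the data file are available.
if command -v "$MALLET" >/dev/null 2>&1 && [ -f imdb-reviews.txt ]; then
    "$MALLET" import-file --input imdb-reviews.txt --output imdb-reviews.mallet \
        --keep-sequence --stoplist-file my-stoplist.txt
else
    echo "MALLET or imdb-reviews.txt not found; skipping import" >&2
fi
```

Whether domain words such as "film" or "movie" belong in the list is a judgement call; try training with and without them and compare the topics.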
Try playing with different numbers of topics and iterations. Do the topics make sense? Can you interpret them and assign labels?
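One possible way to organise such experiments is to generate one train-topics command per setting and give each run its own output file. In the sketch below the topic counts, iteration counts and output file names are arbitrary choices, and the MALLET variable is an assumption standing in for the alias; the flags are the ones already used in this lab. The loop only writes the commands to a file (a dry run), so nothing is trained until you execute that file.

```shell
# Assumption: MALLET points at the bin/mallet script (defaults to "mallet" on PATH).
MALLET=${MALLET:-mallet}

# Arbitrary example settings; adjust freely.
for k in 10 25 100; do
    for iters in 100 500; do
        echo "$MALLET train-topics --input imdb-reviews.mallet" \
             "--num-topics ${k} --num-iterations ${iters}" \
             "--output-topic-keys topics-k${k}-i${iters}.txt"
    done
done > sweep-commands.txt

# Dry run: inspect the generated commands before executing them.
cat sweep-commands.txt
```

Once the commands look right, they can be run one after another with `sh sweep-commands.txt`, and the resulting topic files compared side by side.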
Go to http://www.imdb.com and pick a movie synopsis. Save just the text of the synopsis to a text file (everything must be on one line) and import it into MALLET by typing:
```
mallet import-file --input new_sample_review.txt --output new_sample_review.mallet  --keep-sequence --remove-stopwords --use-pipe-from imdb-reviews.mallet
```
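The import above expects the whole synopsis on a single line. A minimal sketch of flattening pasted text follows; the file synopsis_raw.txt and its contents are placeholders invented here, so substitute the text you saved yourself.

```shell
# Placeholder input with line breaks, a blank line and a Windows line ending.
printf 'A retired detective\nreturns for one last case.\r\n\nNothing goes to plan.\n' > synopsis_raw.txt

# Drop carriage returns, join all lines with spaces, then squeeze repeated spaces.
tr -d '\r' < synopsis_raw.txt | paste -sd ' ' - | tr -s ' ' > new_sample_review.txt

# The result should be exactly one line.
wc -l < new_sample_review.txt
```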
Now you can try to infer which topics are prevalent in the review you have picked.
```
mallet infer-topics --input new_sample_review.mallet  --inferencer inferencer.mallet --output-doc-topics inferred_topics.txt
```
The results will be saved in inferred_topics.txt.

Latent Semantic Analysis with R

Now, let's turn our attention to LSA in R. Open the R console as you did in the previous lab. Once you are there, you need to install the package that provides LSA.

In the console, type the following commands to install and load the lsa package.
```
install.packages("lsa")
library(lsa)
```
When the system asks whether you would like to create a personal library for R, answer yes.
Next, download the documents on which you will perform LSA. Follow this link, then uncompress the file to a folder named "docs/" in your current working directory.
We will create a matrix containing the word frequencies in each of the documents of our data. Type:
```
matrix<-textmatrix("docs/", stopwords=c("the","a","an","in","of","for","to","and"))
```
Then inspect the contents of this new variable. (Don't forget to give, in the command above, the name of the folder in which you uncompressed the files!)
Now, create a latent semantic space by doing:
```
LSAspace<-lsa(matrix,dims=dimcalc_raw())
```
Have a look at what the function dimcalc_raw() does, and inspect the contents of the LSA space. Do you understand what you see? If not, consult the lecture notes and the help page for lsa() (type ?lsa). Remember that LSA performs a PCA-like transformation of the data.
The lsa() function is really a thin wrapper over a singular value decomposition. Let's try two ways to see this:
- Run:
```
svd(matrix)
```
  Compare this to the output of lsa(). What do you notice?
- The function for matrix multiplication in R is called %*%. If A and B are matrices, then A %*% B returns their product. Try:
```
round(LSAspace$tk %*% diag(LSAspace$sk) %*% t(LSAspace$dk))
```
  What do you notice here? What does this remind you of from the lecture?
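For reference, the product in the second item can be written in matrix form. The symbols below are chosen here for illustration (they follow the component names tk, sk and dk of the lsa() result) and may differ from the notation used in the lecture:

```latex
M \approx T_k \, S_k \, D_k^{\top},
\qquad S_k = \operatorname{diag}(s_1, \dots, s_k),
\qquad s_1 \ge s_2 \ge \dots \ge s_k \ge 0
```

Here M is the term-by-document matrix, the columns of T_k (tk) are the term vectors, the columns of D_k (dk) are the document vectors, the s_i (sk) are the singular values, and k is the number of dimensions kept. How close the product is to M depends on k.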
Next, try this slight change on the LSA space:
```
newLSAspace<-lsa(matrix, dims=2)
```
This command is the same as the previous lsa() call, except that we now fix the number of dimensions of the LSA space at 2.
Compare both LSA spaces. What are their differences? How many topics can we find in each of them?
Now, we will reconstruct the original space based on this second LSA space. Do:
```
newMatrix<-round(as.textmatrix(newLSAspace),2)
```
Take a look at it and compare it with the first matrix. Do you notice anything strange in this reconstruction?
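For reference, the reconstruction that as.textmatrix() produces from a two-dimensional space can be written as a sum of two rank-one terms. The notation is illustrative and not necessarily the lecture's:

```latex
\hat{M}_2 = \sum_{i=1}^{2} s_i \, \mathbf{t}_i \, \mathbf{d}_i^{\top}
```

Here s_i is the i-th singular value (from sk), and t_i and d_i are the i-th columns of tk and dk. Among all matrices of rank 2, this one is the closest to the original term-by-document matrix in the least-squares (Frobenius) sense.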
Next, we will find close terms in the textmatrix. The function associate() returns the terms whose closeness to the input term is above a threshold, sorted in descending order of closeness. Try:
```
associate(matrix,"computer")
```
And then,
```
associate(newMatrix,"computer")
```
What do you see? Try this with other terms.
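As far as we know, the closeness measure that associate() uses by default is the cosine between the rows of the matrix for two terms, with a default threshold of 0.7; treat this as an assumption and confirm it with ?associate. With x and y denoting the rows for two terms and j running over the documents, the cosine is:

```latex
\cos(\mathbf{x}, \mathbf{y})
= \frac{\mathbf{x} \cdot \mathbf{y}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{y} \rVert}
= \frac{\sum_{j} x_j y_j}{\sqrt{\sum_{j} x_j^{2}} \, \sqrt{\sum_{j} y_j^{2}}}
```

Two terms that occur in the same documents in similar proportions get a value close to 1, while two terms that never share a document in the raw count matrix get 0.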
Now, let's make a simple version of the plot that we saw in the lecture. Type:
```
t.locs<-newLSAspace$tk %*% diag(newLSAspace$sk)  # term coordinates in the 2-D LSA space
plot(t.locs)                                      # plain scatter plot of the terms
```
What do you see in this plot? If this is not clear, then try running the following two commands and try again to interpret this plot.
```
plot(t.locs,type="n")                          # set up empty axes
text(t.locs, labels=rownames(newLSAspace$tk))  # write each term at its coordinates
```
