Authors: Luke Shrimpton, Sharon Goldwater, Ida Szubert, Henry S. Thompson
Date: 2014-11-01, 2015-11-10, 2016-11-05, 2017-11-05, 2018-10-24
Copyright: This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License: you may re-use, redistribute, or modify this work for non-commercial purposes provided you retain attribution to any previous author(s).
This lab is available as a web page or pdf document.
The goals of this lab are two-fold:
As usual, create a directory for this lab inside your labs directory:
cd ~/anlp/labs
mkdir lab8
cd lab8
Download the files lab8.py and load_map.py into your lab8 directory: From the Lab 8 web page, right-click on the link and select Save link as..., then navigate to your lab8 directory to save.
Do the following section of the lab before starting to look at the code.
This lab is based on work by Turney et al. (2003) [1], who propose that the "semantic orientation" of a word (whether it has a positive or negative connotation) can be measured by looking at whether that word co-occurs more with clearly positive words (e.g., good, love) or clearly negative words (e.g., bad, hate). Here, we will measure co-occurrence strength using PMI (Pointwise Mutual Information) and average it over a set of positive/negative words to get a positive/negative sentiment (semantic orientation) score.
[1] Turney, Peter D., and Michael L. Littman. "Measuring praise and criticism: Inference of semantic orientation from association." ACM Transactions on Information Systems (TOIS) 21.4 (2003): 315-346.
As in last week's tutorial, we'll compute PMI here using MLE for the probability estimates. In the tutorial you should have found the expression for computing PMI from counts, and used it on some toy examples. We'll use this information below so if you can't remember it, please look back at the tutorial and solutions now.
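As a reminder of what "PMI from counts" looks like, here is a minimal sketch. It assumes MLE probability estimates, so PMI(x, y) = log2(P(x,y) / (P(x)P(y))) simplifies to a single expression over the raw counts. The function name and argument names are illustrative, not necessarily those used in lab8.py.

```python
from math import log2

def pmi(c_xy, c_x, c_y, n):
    # PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) )
    # With MLE estimates P(x) = c_x / n, P(y) = c_y / n, P(x, y) = c_xy / n,
    # this simplifies to log2( (c_xy * n) / (c_x * c_y) ).
    return log2((c_xy * n) / (c_x * c_y))

# Sanity check: if x and y co-occur exactly as often as independence
# predicts, PMI is 0; co-occurring twice as often gives PMI = 1.
print(pmi(10, 100, 100, 1000))  # → 0.0
print(pmi(20, 100, 100, 1000))  # → 1.0
```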
The data we will be using in this lab is a 1% sample of all tweets sent during 2011, about 100 million tweets. We have preprocessed this data for you using the following steps:
[2] Lui, Marco and Timothy Baldwin (2012). langid.py: An Off-the-shelf Language Identification Tool. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea.
Even after all this filtering, the data set is extremely large, and the amount of computer memory required to hold it all can be prohibitive. So, we will be using a trick to reduce the amount of memory we need: we will represent each word using a unique integer ID number, and store a mapping between the words and the numbers.
This trick works because a standard 32-bit integer can represent any number up to 2^31 - 1 [*], or about 2.1 billion distinct word IDs, while requiring only 32 bits (4 bytes) of memory. In contrast, each character in a word requires 1 byte (8 bits) of memory, so any word longer than four characters requires more memory to store than an integer ID. Using the integer mapping therefore saves space for any word longer than four characters (which is most words).
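The mapping itself is just a pair of dictionaries, one in each direction. Here is a minimal sketch of how such a mapping could be built (the real mapping is loaded from a file by load_map.py, so the helper function below is purely illustrative):

```python
# Sketch of the word <-> integer ID trick. The dictionary names match the
# ones used in this lab, but the helper function is illustrative only.
word2wid = {}   # word -> integer ID
wid2word = {}   # integer ID -> word

def get_wid(word):
    """Return the ID for word, assigning the next free ID if unseen."""
    if word not in word2wid:
        wid = len(word2wid)
        word2wid[word] = wid
        wid2word[wid] = word
    return word2wid[word]

print(get_wid("bieber"))  # → 0
print(get_wid("love"))    # → 1
print(get_wid("bieber"))  # → 0 (same word, same ID)
```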
[*] An unsigned integer (i.e., non-negative values only) can represent values up to 2^32 - 1, but a signed integer uses one bit to represent the sign.
Before moving on to the rest of the lab, make sure you understand what the data files look like. There are two data files, which are located in /afs/inf.ed.ac.uk/group/teaching/anlp/lab8/. Before doing anything with these files, please read the following important
WARNINGS:
Now that you have read the warnings, we will take a look at these files using our handy Unix commands, which do not require loading anything into a text editor. First:
cd /afs/inf.ed.ac.uk/group/teaching/anlp/lab8/
Then use less to take a look at the two files below, and make sure you understand what is in them:
wid_word: This file stores two tab separated columns listing each word ID and the word it represents.
counts: The first line of the file is the total number of tweets in the dataset (number of observations). The following lines contain more tab separated data, in the following format:
w0 <tab> c0 <tab> w1 <space> c1 <tab> w2 <space> c2 ... wn <space> cn
where w0 is a word ID and c0 is the number of tweets it occurs in. Then, each of the pairs (wi <space> ci) is some other word ID and the number of times it co-occurs in the same tweet with w0.
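To make the format concrete, here is a sketch of how one line of the counts file could be parsed. The provided code already does this for you, so this function is illustrative; it simply assumes the format described above (tab-separated fields, with space-separated ID/count pairs after the first two fields):

```python
def parse_counts_line(line):
    """Parse one counts-file line into (wid, count, co_counts_dict)."""
    fields = line.rstrip("\n").split("\t")
    w0 = int(fields[0])   # the target word's ID
    c0 = int(fields[1])   # number of tweets containing that word
    co = {}
    for pair in fields[2:]:
        wid, c = pair.split(" ")
        co[int(wid)] = int(c)  # tweets containing both w0 and wid
    return w0, c0, co

w0, c0, co = parse_counts_line("0\t500\t3 12\t7 4")
print(w0, c0, co)  # → 0 500 {3: 12, 7: 4}
```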
What word's cooccurrence statistics are listed on the second line of the counts file (the line that starts with 0)? You will need to grep for the right information in the wid_word file.
After all our preprocessing, how many distinct words are left in our data set? Hint: use wc. Try this on each file and notice the difference in how long it takes to get a result.
Get back to your lab directory:
cd ~/anlp/labs/lab8
Start up Spyder and open lab8.py in the editor. To run the lab, type these commands in the iPython console:
%run lab8.py # This file may take a few seconds to run, especially the first time.
Look at the code in lab8.py. Currently it does the following:
WARNING: Be careful what variables you try to print. For example, if you type print(wid2word) or just wid2word in the interpreter, it will take a very long time to format and print to the screen. You will probably want to cancel the command by pressing Ctrl-c.
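Instead of printing a huge dictionary in full, you can peek at just a few entries. The snippet below uses a small stand-in dictionary so it runs on its own; in the lab, apply the same patterns to the real wid2word:

```python
import itertools

# Stand-in for the real (very large) mapping loaded by the lab code.
wid2word = {0: "bieber", 1: "love", 2: "hate"}

print(len(wid2word))                                 # how many entries?
print(list(itertools.islice(wid2word.items(), 2)))   # just the first two
print(wid2word[0])                                   # one specific entry
```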
After running the initial command there are four variables loaded that will be very useful:
To make sure you understand how to access the information you will need in the following sections, answer the following questions (using Python now, not Unix shell commands):
To answer this question, you will need to do the following in lab8.py:
Fill in the correct definition of the PMI function. Before doing so, look back at the number you computed for PMI(y,z) in last week's tutorial. Notice that our error-checking code (defined right after the PMI function) is based on the same input numbers, so you should have gotten the same answer that the error check is looking for. If you didn't, and you are not sure why, please ask a demonstrator for help. If you did, then go ahead and implement the PMI function.
Note: When writing your own code, always think about how to put this kind of error check in! Checking once by hand is a good start, but if you later change the code it is easy to introduce bugs. Having the error check run automatically every time helps catch these.
Fill in the code in the two for loops at the bottom, which are intended to compute the PMI between each target word and each positive/negative word. The loops should build up the list of positive and negative PMI values for each target word (currently, the lists will only have one item each). Once you have the correct values in the lists, uncomment the last line of the file, which prints out the average value of each list for each target item.
Remember all the counts are stored using word ids so you will need to use word2wid to look up the id for the words.
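In other words, the lookup is a two-step pattern: word to ID, then ID into the count dictionaries. A minimal sketch with stand-in data (the co_counts structure shown here is an assumption about its shape; check the real variable in Spyder):

```python
# Stand-in data; in the lab these come from the loaded data files.
word2wid = {"bieber": 0, "love": 1}
co_counts = {0: {1: 40}}  # assumed shape: co_counts[x][y] = tweets with both

# Look up IDs first, then index the count dictionaries with the IDs.
x = word2wid["bieber"]
y = word2wid["love"]
print(co_counts[x][y])  # → 40
```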
When you have finished these tasks, you should be able to determine whether Justin Bieber is seen positively on Twitter, at least if we only consider two possible sentiment words. What is the answer?
Now add the words husband and wife to the target word list and re-run the file. Before considering their sentiment, answer the following questions:
Which of these two words occurs more in this dataset? By how much? What are some possible explanations for this difference? (There are quite a few!)
Now look at the positive vs. negative sentiment scores of these words. Are there noticeable differences in the sentiment of tweets in which people refer to husbands or wives? If so, what are they?
What about other family members like son and daughter, or some of your own choice? Do people like themselves? (Answering the last question isn't entirely straightforward. What target words might help?)
Words that are synonyms supposedly have the same literal meaning. However they often have different connotations (associated ideas or emotions). What would you predict about the sentiment of the synonymous words kid and child? Does the Twitter data support your prediction?
PMI gives us a very high-level overview of the data, but doesn't give us a lot of detail. If we're trying to explain why sentiment differs for different words, it helps to look at examples from the raw data. If you want to do that, here are two example queries to search the Twitter website for Tweets with the same date range as our corpus. The first brings back Tweets with the word husband, and the second looks for both husband and hate:
You can modify these to look for other words if you want--just change the relevant part of the URL. (Warning: reading too many random Tweets may melt your brain.)
You can also add more words to the target list to try to see what other opinions Twitter users have. Can you think of some words that you would expect to have negative connotations? What about words that have high (or low) PMI with both positive and negative sentiment words? What could this indicate?
Feel free to choose words that interest you, though if you are unsure try: facebook, school, @stephenfry, @comcast, @oprah, food, toothbrush, bf, gf . Do any of the results surprise you?
The current system is simplistic as it only uses one positive/negative word. Try adding some other positive and negative words. Do the results change? You may wish to consult some existing positive and negative word lists.
Note: you may find that some words you would expect to be in the dictionaries are not, and cause key errors. The most likely reason is stemming: for example, terrible is not in the dictionary but the stemmed version terribl is. You can check word2wid to see if each word is there, though in some cases you may still get an error from co_counts if the number of co-occurrences was too low and got filtered out.
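A simple way to avoid these KeyErrors is to check membership (or use dict.get) before looking a word up. Stand-in data below; the same pattern works on the real word2wid:

```python
# Guarding against words that were stemmed or filtered out of the data.
word2wid = {"terribl": 5}  # stand-in; the real mapping is much larger

for word in ["terrible", "terribl"]:
    wid = word2wid.get(word)  # returns None instead of raising KeyError
    if wid is None:
        print(word, "not in vocabulary (perhaps check its stemmed form)")
    else:
        print(word, "has ID", wid)
```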