Text Technologies Assessment 4: Link Analysis
Victor Lavrenko and Philipp Petrenz
November 7, 2011
The goal of this assessment is to discover the social structure of a
large corporation from its email graph. Please check the
Q/A section below before asking questions.
What to turn in
The assessment is due on
Monday, November 21, 16:00 . On a DICE
machine, create a directory called
tts4, and place the
following files into it:
- complete source code
- hubs.txt, auth.txt, pr.txt containing the results for task 1
- report.pdf, containing a 2-page summary for task 3 and a visualization of key connections for task 2
- please do not include the dataset in your submission
Once the files are in place, use the following DICE command to submit everything:
submit tts 4
tts4
Please make sure you name the files exactly as stated
above and use the correct formats. You may lose a significant fraction
of the marks if we cannot process your submission.
Dataset
The year is 2001 and you are helping the US Securities and Exchange
Commission investigate the
scandal
surrounding the Enron Corporation. The SEC has secured access to all
communications of the relevant persons, but they are having a hard
time understanding the flow of information within the company. Your
goal is to analyse the pattern of email communications and discover
who emails whom and why.
You decide to treat employees as nodes in the social graph and emails
as links between them. Any time person A sends an email to person B,
you will interpret this as a directed link from A to B. Emails that
have more than one recipient should count as several directed links,
one for each recipient. For example, if person A sends an email to
persons B and C, you have two directed links: A→B and
A→C. Repeated links should count multiple times. In other words,
if A emails B ten times, you should treat A→B as ten identical
links (or as a link with a weight of 10). Links of the form A→A
(i.e. emails sent to oneself) should be excluded from the graph.
The email graph is provided here:
graph.txt
(103MB).
.
Each email is represented by one or more sender-recipient lines.
Each line contains three fields: message id, sender and recipient.
For example:
0001049b71b185c9d9605b9eeb1c44f6 vince.kaminski@enron.com mark.palmer@enron.com
0001049b71b185c9d9605b9eeb1c44f6 vince.kaminski@enron.com vince.kaminski@enron.com
0001103de1123d6dd538aba7eabd6580 drew.fossum@enron.com martha.benner@enron.com
The above example contains two emails. The first
(id:0001049b71b185c9d9605b9eeb1c44f6) was sent by Vince
Kaminski and had two recipients: Mark Palmer and Vince Kaminski
himself. The second email
(id:0001103de1123d6dd538aba7eabd6580) was sent by Drew Fossum
to Martha Benner. The file contains a total 1,356,626 lines,
corresponding to 242,047 distinct emails from 19,802 distinct senders
to 77,660 distinct recipients. Some email addresses are malformed but
you should include them as nodes in the graph.
Tasks
1. Run PageRank and HITS
Implement the PageRank and the Hubs and Authorities
algorithms. The algorithms were discussed in lecture
12. For PageRank set λ=0.8. Run both algorithms for 10
iterations on the above graph. Save the results in three files:
- hubs.txt -- 10 persons with the highest hub score. Each
line should contain an email address, preceded by its hub score (show 8
significant digits), for example:
0.00100596 jeff.dasovich@enron.com
0.00401388 kate.symes@enron.com
- auth.txt -- 10 persons with the highest authority
score (same format as hubs).
- pr.txt -- 10 persons with the highest PageRank
score (same format as hubs).
2. Visualize key connections
Compare your results from task 1 to the following list of
Enron employees. Is PageRank useful for
identifying influential persons? What about the hub/authority
scores?
Select 10-20 key persons and visualize the flow of information
among them using
Graphviz.
 |
Create a file called graph.dot in the following format:
digraph G {
"kate.symes" -> "tana.jones" ;
"kate.symes" -> "kysa.alport" [label = "trading"];
"kysa.alport" -> "kate.symes" ;
}
And then use the following command on DICE:
dot -Tpng graph.dot > graph.png
If you don't like the layout, try man dot.
|
It is up to you how you select the influential persons,
which connections you include and what labels (if any)
you assign to the connections. The complete text of all
emails is available here:
enron.xml.bz2
(
500MB uncompressed, download only if you feel you need it).
3. Report
Provide a 2-page report. For task 1 outline the major
implementation decisions you made and any interesting
observations. For task 2, discuss the implications of
using PageRank or HITS for identifying influential
figures in a company. Also, state precisely what
criteria you used to select the key persons and to
decide which links to include in the visualization.
Include the graph visualization in your report.
Please do not produce reports longer than 2 pages.
Any material beyond the first two pages will not be
considered and you may lose marks.
Marking
Remember that plagiarism is a university offence. Please read the
policy
[
here].
The assignment is worth
7.5% of your total course mark and points will be given as follows:
4 points for correct implementation of
PageRank and
HITS algorithms (task 1)
2 points for a creative visualization of key connections (task 2)
1.5 points for clear and detailed 2-page report (task 3)
Note that anything beyond 5 points can be regarded excellent work.
Sanity checks: a correct implementation should produce the following scores:
hub score: 0.00100596 jeff.dasovich@enron.com
authority: 0.00021004 jeff.dasovich@enron.com
pagerank: 0.00060603 jeff.dasovich@enron.com
pagerank: 0.00373347 jeff.dasovich@enron.com -- if you re-normalize PageRanks to sum to 1 after each iteration
Either normalised or unnormalised pageranks will be accepted as a solution, as long as they match one of the sanity checks.
Restrictions
You must use
Python as the programming
language for this assignment. If you don't know
Python, please
consider the following tutorials:
[part1],
[part2]
.
You can use other libraries and packages subject to the following conditions:
- You should not use any software packages or libraries that provide out-of-the-box network/graph analysis.
- The core logic of the algorithms should be clearly and identifiably your own work.
- You can reuse any of the code in the TTS lab exercises as long as you cite the origin.
- You are free to use any libraries or external programs that perform generic tasks such as reading / writing of files, manipulation of lists, vectors, matrices, hash-tables and priority queues, as well as sorting and merging methods. Please cite the libraries you use.
- If you are uncertain as to whether a package qualifies as "generic" or not -- please ask.
- If you consult external sources, please cite them (this includes personal communication).
- Q: Should we ignore links of the form: A→A?
A: Yes, it's an artefact of email from that era: some
people would cc themselves on every message. Just think about what
self-links mean in terms of hubs and authorities and it should be
obvious why they are not useful.
- Q: How do we handle duplicate entries: A→B?
A:There should be no duplicate lines in the graph, in
the sense of the same message with the same sender and same
recipient appearing twice in graph.txt. However, you can have two
different messages from A to B. The way to treat these is
spelled out in the first paragraph of task 1.
- Q: Should we run the algorithm iteratively?
A: Yes, this was discussed in the lecture.
- Q: My hub/authority scores jump up and down over iterations.
A: Look for a bug in your implementation, they should converge.
Please visit the
discussion forum for more answers.