Text Technologies Assessment 1
Victor Lavrenko and Philipp Petrenz
September 24, 2012
The goal of this assessment is to develop a web crawler capable of harvesting a set of hyper-linked news stories from a web-server.
What to turn in
- Please submit a paper copy of the report (task 2) to the
ITO (Appleton Tower room 4.02).
- Put your complete source code (task 1) in a folder called tts1 on a DICE
machine and run:
submit tts 1 tts1
For help, run: man submit
Both the report and the source code are due at 16:00 on Monday, 8th October.
- Using Python, implement a web crawler with the agent name TTS. The general algorithm was outlined in
lecture 2 and is covered in chapter 3 of the textbook. Your
crawler should only fetch local links that occur in the
body of the webpage after
the <!-- CONTENT --> tag and
before the <!-- /CONTENT --> tag.
Your crawler will start with a single seed URL (see the note on MatricNo below). As a quick reminder, your implementation should involve the following steps (a minimal sketch appears after the list):
- Find out which areas on the server are accessible to
crawlers, and what restrictions are in place.
- Fetch web pages from the server using the HTTP 1.0/1.1 protocol.
- Extract the set of outgoing hyperlinks that occur between <!-- CONTENT --> and <!-- /CONTENT -->
- Discard any non-local links (i.e. the ones that point to any hosts other than ir.inf.ed.ac.uk)
- Detect which of the extracted links have already been encountered, so that the same page is not fetched twice.
- Assign priorities to the links and insert them into the
frontier queue. In this case, document names are numeric, and you should give priority to high numbers.
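The core loop must be your own work (see the library conditions further down), but the following minimal Python 3 sketch illustrates one way the steps above could fit together using only standard-library modules. The seed URL placeholder, the priority() heuristic, and the extract_links() helper (sketched after the Q&A section) are illustrative assumptions, not the required design.

    import heapq
    import re
    import urllib.error
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    SEED_URL = "..."      # substitute your own seed URL (with your MatricNo) here
    USER_AGENT = "TTS"    # agent name required by the assignment

    # Find out which areas of the server are off-limits to crawlers (robots.txt).
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(SEED_URL, "/robots.txt"))
    robots.read()

    def priority(url):
        """Document names are numeric; higher numbers get higher priority."""
        match = re.search(r"(\d+)\.html?$", url)
        return int(match.group(1)) if match else 0

    frontier = [(-priority(SEED_URL), SEED_URL)]   # max-heap via negated priorities
    seen = {SEED_URL}                              # URLs already added to the frontier

    while frontier:
        _, url = heapq.heappop(frontier)
        if not robots.can_fetch(USER_AGENT, url):
            continue                               # respect crawling restrictions
        request = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        try:
            with urllib.request.urlopen(request) as response:
                html = response.read().decode("utf-8", errors="replace")
        except urllib.error.URLError:
            continue                               # record the error for your statistics
        for link in extract_links(html, base=url): # hypothetical helper, sketched later
            if urlparse(link).hostname != "ir.inf.ed.ac.uk":
                continue                           # discard non-local links
            if link not in seen:                   # skip links already encountered
                seen.add(link)
                heapq.heappush(frontier, (-priority(link), link))

A set (or dictionary) gives constant-time membership tests for the "already encountered" check, while heapq keeps the highest-numbered page at the front of the frontier.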
Please substitute your 7-digit matriculation number in place of MatricNo. For example, if your matriculation number were 0123456, you would start your crawl at the following URL:
It is important to use your own matriculation number, as this will
affect the correctness of your results. If you get an error when
accessing your seed URL, please contact the
TA as soon as possible.
- Provide a short (max 2-page) report that outlines the decisions
you made in designing your crawler. For example: how did you
implement the frontier? How did you keep track of the links already
processed? What patterns did you use to identify URLs? What software
packages did you use in your implementation?
Note that only the first two pages of the report will be
assessed: any material beyond this will be ignored, and you may lose points.
Your report must contain a short statistical summary of
your crawl, including, but not limited to, the points below (a small bookkeeping sketch appears after the list):
- How long did your crawler run?
- How many links did you extract in total?
- How many distinct URLs did you encounter during the crawl?
- How many pages have you fetched?
- Did you encounter any links pointing outside of the domain?
- Did you get any errors during fetching? What kind?
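One simple way to gather these numbers is to keep plain counters that the crawl loop updates as it goes; the names below are illustrative, not required.

    import time

    stats = {"pages_fetched": 0, "links_extracted": 0,
             "fetch_errors": 0, "non_local_links": 0}
    start_time = time.time()

    # Inside the crawl loop: increment stats["pages_fetched"] after each
    # successful fetch, stats["links_extracted"] after parsing a page,
    # stats["fetch_errors"] in the except branch, and so on.  The number of
    # distinct URLs is simply len(seen), and the total running time is:
    elapsed_seconds = time.time() - start_time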
The assignment is worth 7.5%
of your total course mark and points will be given as follows:
(60%) for correct crawler implementation and statistics.
(20%) for a clear report answering the questions above.
(20%) for particularly creative solutions or work beyond what is expected.
Note that anything beyond 5 points can be regarded as excellent work.
The crawler must be implemented in Python. If you don't know the language, please consider the following tutorials: [part 1], [part 2].
You may find the following Python packages particularly helpful in completing this assignment:
You can use other libraries and packages subject to the following conditions:
- You should not use any software packages or libraries that provide out-of-the-box web crawling functionality.
- The core logic of the algorithm (i.e. the loop that contains steps listed under task 1) should be your own work.
- You are free to use any libraries or external programs that perform generic tasks such as client/server networking, HTTP protocol handling, URL manipulation, HTML parsing, hashtables, generic priority queues etc. Please cite the libraries you use.
- If you are uncertain as to whether a package qualifies as "generic" or not -- please ask.
- If you consult external sources, please cite them (this includes personal communication).
Questions and Answers
- The following regular expression may be useful for matching tags: <[^>]*> (a small extraction sketch appears after this Q&A list)
- You should only follow anchor tags of the form: <a ...> ... </a>
- You should not follow any links pointing outside of the Informatics network.
- Q: As we are fetching all the pages from the same server, is it important how we assign the priorities in the priority queue?
A: Please set the priority to the numeric value of the page name.
- Q: I'd like to know if BeautifulSoup (Python) qualifies as generic.
A: Using BeautifulSoup as an HTML parser is fine. Please
don't use extensions that combine it with URL fetching / link
traversal. Please keep in mind: if BeautifulSoup fails to parse a page, you should attempt to extract links from that page by other means.
- Q: Is it possible to have a zip or rar archive, since it is faster to parse web pages locally?
A: No. The whole point of the assessment is to work in a live environment. You should parse the pages as you crawl.
- Q: Will you disclose the expected statistical details (number of links etc.) about the example pages so that we can test our crawlers against those?
A: Uncertainty is a big part of this assessment, so no, we can't disclose the stats.
- Q: Could you please give me some idea how many web pages should be fetched?
A: No. Part of the assessment is working with an unknown target.
- Q: Are we allowed to use Python's htmllib?
A: Yes, as long as no link-traversal extensions are used. Please keep in mind: if htmllib fails to parse a page, you should attempt to extract links from that page by other means.
- Q: Should we follow the links that have the attribute "rel=nofollow"?
A: Yes. The attribute is not intended to prevent crawlers from gathering the content. [details]
Please visit the discussion for more answers.
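As an illustration of the hints above, the sketch below combines the suggested tag pattern with an href pattern to pull links out of the region between the CONTENT comments. The exact regular expressions are assumptions about the page markup, and an HTML parser such as BeautifulSoup or html.parser is an equally acceptable choice; this is simply the shape of the hypothetical extract_links() helper used in the earlier crawl-loop sketch.

    import re
    from urllib.parse import urljoin

    # Only the region between the CONTENT comments should be searched for links.
    CONTENT_RE = re.compile(r"<!--\s*CONTENT\s*-->(.*?)<!--\s*/CONTENT\s*-->", re.S)
    # Anchor tags only; the href value may be quoted with ", ' or not at all.
    ANCHOR_RE = re.compile(r"<a\s[^>]*?href\s*=\s*[\"']?([^\"'\s>]+)", re.I)

    def extract_links(html, base):
        """Return absolute URLs of anchors found between the CONTENT markers."""
        region = CONTENT_RE.search(html)
        if region is None:
            return []          # fall back to other extraction means if this fails
        return [urljoin(base, href) for href in ANCHOR_RE.findall(region.group(1))]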