Text Technologies Assessment 1

Victor Lavrenko and Philipp Petrenz

September 24, 2012

The goal of this assessment is to develop a web crawler capable of harvesting a set of hyper-linked news stories from a web-server.

What to turn in

  1. Please submit a paper copy of the report (task 2) to the ITO (Appleton Tower room 4.02).
  2. Put your complete source code (task 1) in a folder called tts1 on a DICE machine and run:
    submit tts 1 tts1
    For help, run: man submit
Both the report and the source are due at 16:00 on Monday, 8th October.

Tasks

  1. Using Python, implement a web crawler with the agent name TTS. The general algorithm was outlined in lecture 2 and is covered in chapter 3 of the textbook. Your crawler should only fetch local links that occur in the body of the webpage after the <!-- CONTENT --> tag and before the <!-- /CONTENT --> tag. As a quick reminder, your implementation should involve the following steps:
    1. Find out which areas on the server are accessible to crawlers, and what restrictions are in place.
    2. Fetch web pages from the server using the HTTP 1.0/1.1 protocol.
    3. Extract a set of outgoing hyperlinks that occur between <!-- CONTENT --> and <!-- /CONTENT -->
    4. Discard any non-local links (i.e. the ones that point to any hosts other than ir.inf.ed.ac.uk)
    5. Detect which of the extracted links have already been processed.
    6. Assign priorities to the links and insert them into the frontier queue. In this case, document names are numeric, and you should give priority to high numbers.
    Your crawler will start with a single seed URL:

    http://ir.inf.ed.ac.uk/tts/A1/MatricNo/MatricNo.html

    Please substitute your 7-digit matriculation number in place of MatricNo. For example, if your matriculation number were 0123456, you would start your crawl at the following URL: http://ir.inf.ed.ac.uk/tts/A1/0123456/0123456.html
    It is important to use your own matriculation number, as this will affect the correctness of your results. If you get an error when accessing your seed URL, please contact the TA as soon as possible.
  2. Provide a short (max 2-page) report that outlines the decisions you made in designing your crawler. For example, how did you implement the frontier? How do you keep track of the links already processed? What patterns do you use to define URLs? What software packages did you use in your implementation?
    Your report must contain a short statistical summary of your crawl, including, but not limited to:
    1. How long did your crawler run?
    2. How many links did you extract in total?
    3. How many distinct URLs did you encounter during the crawl?
    4. How many pages have you fetched?
    5. Did you encounter any links pointing outside of the domain?
    6. Did you get any errors during fetching? What kind?
    Note that only the first two pages of the report will be assessed: any material beyond this will be ignored, and you may lose points.
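
As an illustration of steps 3 and 4 of task 1, the following is a minimal sketch, not a reference solution. It assumes Python 3 and the standard html.parser and urllib.parse modules. The names AnchorCollector, content_region and local_links are hypothetical, and two behaviours are assumptions rather than requirements stated above: a page without the opening marker yields no links, and a page without the closing marker is read to its end.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse

LOCAL_HOST = "ir.inf.ed.ac.uk"
START, END = "<!-- CONTENT -->", "<!-- /CONTENT -->"


class AnchorCollector(HTMLParser):
    """Collect the href values of <a> start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def content_region(html):
    """Return the text between the CONTENT markers.

    Assumption: no opening marker means no links; no closing marker
    means the region runs to the end of the page.
    """
    start = html.find(START)
    if start == -1:
        return ""
    start += len(START)
    end = html.find(END, start)
    return html[start:] if end == -1 else html[start:end]


def local_links(page_url, html):
    """Absolute, fragment-free links in the content region on LOCAL_HOST."""
    parser = AnchorCollector()
    parser.feed(content_region(html))
    parser.close()
    links = []
    for href in parser.hrefs:
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).hostname == LOCAL_HOST:
            links.append(absolute)
    return links
```

Resolving each href against the URL of the page it came from before comparing hosts means that relative links count as local, while links such as mailto: addresses, which have no host, are dropped.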

Grading

The assignment is worth 7.5% of your total course mark and points will be given as follows:

4.5 points (60%) for correct crawler implementation and statistics.
1.5 points (20%) for a clear report answering the questions above.
1.5 points (20%) for particularly creative solutions or work beyond what is expected.

Note that anything beyond 5 points can be regarded as excellent work.

Restrictions:

The crawler must be implemented in Python. If you don't know the language, please consider the following tutorials: [part 1], [part 2]. You may find the following Python packages particularly helpful in completing this assignment: urllib, robotparser, re, heapq. You can use other libraries and packages subject to the following conditions:
  1. You should not use any software packages or libraries that provide out-of-the-box web crawling.
  2. The core logic of the algorithm (i.e. the loop that contains steps listed under task 1) should be your own work.
  3. You are free to use any libraries or external programs that perform generic tasks such as client/server networking, HTTP protocol handling, URL manipulation, HTML parsing, hashtables, generic priority queues etc. Please cite the libraries you use.
  4. If you are uncertain as to whether a package qualifies as "generic" or not -- please ask.
  5. If you consult external sources, please cite them (this includes personal communication).
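
To show how two of the packages named above might fit together, here is a hedged sketch of a robots.txt check and a frontier queue. It assumes Python 3, where the robotparser module lives at urllib.robotparser. The names make_robot_rules, page_number and Frontier are hypothetical, the robots.txt text in the usage example is invented for illustration, and the pattern in page_number assumes page names of the form NUMBER.html, as in the seed URL.

```python
import heapq
import re
from urllib.robotparser import RobotFileParser

AGENT = "TTS"


def make_robot_rules(robots_txt):
    """Build a rule set from robots.txt text that has already been fetched."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def page_number(url):
    """Numeric value of the page name; 0 if the name is not numeric.

    Assumption: page names look like NUMBER.html.
    """
    match = re.search(r"(\d+)\.html?$", url)
    return int(match.group(1)) if match else 0


class Frontier:
    """Priority queue of URLs that returns high page numbers first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def add(self, url):
        """Queue url unless it was added before; return True if queued."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate the number to pop the largest first.
        heapq.heappush(self._heap, (-page_number(url), url))
        return True

    def pop(self):
        return heapq.heappop(self._heap)[1]

    def __len__(self):
        return len(self._heap)
```

A possible usage, with made-up rules and URLs:

```python
rules = make_robot_rules("User-agent: TTS\nDisallow: /private/\n")
frontier = Frontier()
for url in ["http://ir.inf.ed.ac.uk/a/3.html", "http://ir.inf.ed.ac.uk/a/12.html"]:
    if rules.can_fetch(AGENT, url):
        frontier.add(url)
next_url = frontier.pop()  # the page named 12.html
```

Keeping the set of seen URLs inside the frontier is one design choice among several; it ensures a URL is queued at most once, whether or not it has been fetched yet.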

Questions and Answers

  1. The following regular expression may be useful for matching tags: <[^>]*>
  2. You should only follow anchor tags of the form: <a ...> ... </a>
  3. You should not follow any links pointing outside of the Informatics network.
  4. Q: As we are fetching all the pages from the same server, does it matter how we assign priorities in the priority queue?
    A: Please set the priority to the numeric value of the page name.
  5. Q: I'd like to know if the BeautifulSoup (python) qualifies as generic.
    A: Using BeautifulSoup as an HTML parser is fine. Please don't use extensions that combine it with URL fetching / link traversal. Please keep in mind: if BeautifulSoup fails to parse a page, you should attempt to extract links from that page by other means.
  6. Q: Is it possible to get a zip or rar archive of the pages, since it is faster to parse web pages locally?
    A: No. The whole point of the assessment is to work in a live environment. You should parse the pages as you crawl.
  7. Q: Will you disclose the expected statistical details (number of links etc.) about the example pages so that we can test our crawlers against those?
    A: Uncertainty is a big part of this assessment, so no, we can't disclose the stats.
  8. Q: Could you please give me some idea how many web pages should be fetched?
    A: No. Part of the assessment is working with an unknown target.
  9. Q: Are we allowed to use Python's htmllib?
    A: Yes as long as no link-traversal extensions are used. Please keep in mind: if htmllib fails to parse a page, you should attempt to extract links from that page by other means.
  10. Q: Should we follow the links that have the attribute "rel=nofollow"?
    A: Yes. The attribute is not intended to prevent crawlers from gathering the content. [details]
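
Several answers above mention extracting links "by other means" when an HTML parser fails. One possible fallback, sketched here under assumptions, reuses the tag pattern from answer 1 and then looks for an href attribute inside each anchor tag. The names fallback_hrefs, TAG, ANCHOR_OPEN and HREF are hypothetical, and the href pattern is a simplification that will not cope with every malformed page. The rel attribute is deliberately not inspected, so links marked nofollow are still returned, in line with answer 10.

```python
import re

# Tag pattern suggested in answer 1.
TAG = re.compile(r"<[^>]*>")
# An opening anchor tag: "<a" followed by whitespace (so <area>, <abbr> are skipped).
ANCHOR_OPEN = re.compile(r"<a\s", re.IGNORECASE)
# href value in double quotes, single quotes, or unquoted.
HREF = re.compile(
    r"""href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def fallback_hrefs(html):
    """Return raw href values of <a ...> tags using regular expressions only."""
    hrefs = []
    for tag in TAG.findall(html):
        if not ANCHOR_OPEN.match(tag):
            continue
        match = HREF.search(tag)
        if match:
            value = next(g for g in match.groups() if g is not None)
            if value:
                hrefs.append(value)
    return hrefs
```

The values returned are raw attribute strings; they would still need to be resolved against the page URL and filtered by host before entering the frontier.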
Please visit the discussion forum for more answers.

