Text Technologies for Data Science

The course deals with retrieval technologies behind search engines, such as Google.

Lecturer: Dr. Victor Lavrenko
TA: Dominik Wurzer
Lectures: 12:10 Mondays and Thursdays in David Hume Tower, Faculty Room South

Python tutorial: Appleton Tower room 4.12, 4-5:30pm Thursday 18/09 or 4-5:30pm Friday 19/09 [part 1] [part 2]

Lab sessions: Appleton Tower room 4.12, please sign up

Assessment: final exam, 70% of the mark; coursework, 30%. See the policy on late submissions and plagiarism.

  1. Ranking algorithms, due 4pm Monday 6th October [questions] [results]
  2. News Clustering, due 4pm Monday 20th October [questions]
  3. Plagiarism Detection, due 4pm Monday 3rd November [questions]

Readings:

Discussion forum: use this to ask questions about lectures, coursework, etc. Please sign up here.

Syllabus

  1. [Sep.15] Introduction: documents, queries, bag-of-words trick [slides] [notes] Readings: SE Ch. 1 and 2
  2. [Sep.18] Laws of text: Zipf, Heaps, clumping, index size [slides] [pdf] Readings: SE Ch. 4 [pdf]
  3. [Sep.22] Vector space: term weighting, similarity functions [slides] [pdf] Readings: SE Ch. 7.1
  4. [Sep.25] Vocabulary mismatch 1: tokenization [slides 1-6] [pdf] Readings: SE Ch. 5 (except 5.7)
  5. [Sep.29] Vocabulary mismatch 2: stemming, synonyms [slides 7-14, 22-27]
  6. [Oct.2] Vocabulary mismatch 3: relevance feedback, pseudo-feedback [slides 28-38] Readings: SE Ch. 6 and 7.3.2
  7. [Oct.6] Indexing 1: Inverted lists, compression [slides 1-19] [pdf] Readings: SE Ch. 5.2, 5.3, 5.4
  8. [Oct.9] Indexing 2: Query execution, optimisation [slides 21-32]
  9. [Oct.13] Indexing 3: Index construction, MapReduce [slides 33-44] Readings: SE Ch. 5.6, 5.7
  10. [Oct.16] Web crawling: XML feeds, crawling, expected age [slides] [pdf] Readings: SE Ch. 3
  11. [Oct.20] Content Extraction: Finn's method, Adler32 [slides 1-15] [pdf] Readings: SE Ch. 3.7
  12. [Oct.23] Duplicate detection: Simhash/LSH, Error bounds [slides 16-22] [Simhash] Readings: SE Ch. 3.8
  13. [Oct.27] Evaluation 1: Cranfield paradigm, Recall, Precision, F1 [slides 1-18] [pdf] Readings: SE Ch. 8 [pdf]
  14. [Oct.30] Evaluation 2: R/P plots, MAP, nDCG, query logs, significance testing [slides 19-33]

This page is maintained by Victor Lavrenko.



Informatics Forum, 10 Crichton Street, Edinburgh, EH8 9AB, Scotland, UK
Tel: +44 131 651 5661, Fax: +44 131 651 1426, E-mail: school-office@inf.ed.ac.uk
Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh