Text Technologies for Data Science

This course covers the retrieval technologies behind search engines such as Google.

Lecturer: Dr. Victor Lavrenko
TA: Dominik Wurzer
Lectures: 12:10 Mondays and Thursdays in David Hume Tower, Faculty Room South

Python tutorial: Appleton Tower room 4.12, 4-5:30pm Thursday 18/09 or 4-5:30pm Friday 19/09 [part 1] [part 2]

Lab sessions: Appleton Tower room 4.12; please sign up

Assessment: Final exam: 70% of the mark. Coursework: 30%. Policy on late submissions and plagiarism.

  1. Ranking algorithms, due 4pm Monday 6th October [questions] [results]
  2. News Clustering, due 4pm Monday 20th October [questions] [results]
  3. Plagiarism Detection, due 4pm Monday 3rd November [questions]
  4. Link Analysis, due 4pm Monday 17th November [questions]



  1. [Sep.15] Introduction: documents, queries, bag-of-words trick [slides] [notes] Readings: SE Ch. 1 and 2
  2. [Sep.18] Laws of text: Zipf, Heaps, clumping, index size [slides] [pdf] Readings: SE Ch. 4 [pdf] [video]
  3. [Sep.22] Vector space: term weighting, similarity functions [slides] [pdf] Readings: SE Ch. 7.1
  4. [Sep.25] Vocabulary mismatch 1: tokenization [slides 1-6] [pdf] Readings: SE Ch. 5 (except 5.7)
  5. [Sep.29] Vocabulary mismatch 2: stemming, synonyms [slides 7-14, 22-27] Readings: SE Ch. 6 and 7.3.2
  6. [Oct.2] Vocabulary mismatch 3: relevance feedback, pseudo-feedback [slides 28-38] [video]
  7. [Oct.6] Indexing 1: Inverted lists, compression [slides 1-19] [pdf] Readings: SE Ch. 5.2, 5.3, 5.4
  8. [Oct.9] Indexing 2: Query execution, optimisation [slides 21-32] Readings: SE Ch. 5.6, 5.7
  9. [Oct.13] Indexing 3: Index construction, MapReduce [slides 33-44] [video]
  10. [Oct.16] Web crawling: XML feeds, crawling, expected age [slides] [pdf] Readings: SE Ch. 3 [video]
  11. [Oct.20] Content Extraction: Finn's method, Adler32 [slides 1-15] [pdf] Readings: SE Ch. 3.7
  12. [Oct.23] Duplicate detection: Simhash/LSH, Error bounds [slides 16-22] [Simhash] Readings: SE Ch. 3.8
  13. [Oct.27] Evaluation 1: Cranfield paradigm, Recall, Precision, F1 [slides 1-18] [pdf] Readings: SE Ch. 8 [pdf]
  14. [Oct.30] Evaluation 2: R/P plots, MAP, nDCG, query logs, significance testing [slides 19-33] [video]
  15. [Nov.3] Web search 1: massive data, Question Answering [slides 1-4] Readings: SE Ch. 10.3, 4.5 [pdf]
  16. [Nov.6] Web search 2: PageRank, Hubs and Authorities, Link spam [slides 5-15] [pdf] [video]
  17. [Nov.10] Probabilistic model 1: Probability Ranking Principle, derivation [slides 1-11] Readings: SE Ch. 7.2
  18. [Nov.13] Probabilistic model 2: word independence, 2-Poisson model, BM25 [slides 12-26] [pdf] [video]
  19. [Nov.17] Language modeling for IR: intuition, estimation, smoothing as idf [slides] [pdf] Readings: SE Ch. 7.3
  20. [Nov.20] Machine learning in IR: PA algorithm, SVM, SMO algorithm [slides] [pdf] Readings: SE Ch. 7.1, 7.6 [video]
[video] links are from 2013. Topics may differ.
This page is maintained by Victor Lavrenko.

Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh