Text Technologies

The course deals with the retrieval technologies behind search engines such as Google. It aims to strike a balance between the theoretical and system-related aspects of Information Retrieval. The course descriptor is here.

Lecturer: Dr. Victor Lavrenko
TA: Dominik Wurzer
Lectures: Monday and Thursday, 12:10-13:00, 7 George Square, room F.21 [map]

Lab sessions: Tuesdays 5pm or Fridays 1-3pm in AT5.05.

Assessment: the final exam counts for 70% of the mark and coursework for 30%. See the policy on late submissions and plagiarism.

  1. Ranking algorithms, due 4pm Monday 7th October [notes] [results]
  2. Fast search engine, due 4pm Monday 21st October [notes]
  3. Plagiarism detection, due 4pm Monday 4th November [notes]
  4. Link Analysis, due 4pm Monday 18th November [notes]

Readings:

Syllabus

  1. [Sep.16] Introduction: documents, queries, bag-of-words trick [notes] [pdf] Readings: SE Ch. 1 and 2
  2. [Sep.19] Laws of text: Zipf, Heaps, clumping, index size [notes] [pdf] Readings: SE Ch. 4 [pdf]
  3. [Sep.23] Vector space model: term weighting, similarity functions. [notes] [pdf] Readings: SE Ch. 7.1
  4. [Sep.26] Vocabulary mismatch 1: tokenization, stemming, stopping (slides 1-14) [notes] [pdf] Readings: SE Ch. 5 (except 5.7)
  5. [Sep.30] Vocabulary mismatch 2: synonyms, relevance feedback (slides 21-37) [notes] [pdf]
  6. [Oct.3] Vocabulary mismatch 3: LSI, spelling, Soundex (slides 38-46, 14-20) [notes] [pdf] Readings: SE Ch. 6 and 7.3.2
  7. [Oct.7] Indexing 1: Inverted lists, compression (slides 1-19) [notes] [pdf] Readings: SE Ch. 5.2, 5.3, 5.4
  8. [Oct.10] Indexing 2: Query execution, optimisation (slides 21-32) [notes] [pdf]
  9. [Oct.14] Indexing 3: Index construction, MapReduce (slides 33-44) [notes] [pdf] Readings: SE Ch. 5.6, 5.7
  10. [Oct.17] Getting text: XML feeds, web-crawling, expected age [notes] [pdf] Readings: SE Ch. 3
  11. [Oct.21] Duplicate detection: Finn's method, Simhash/LSH (slides 1-16) [notes] [pdf] Readings: SE Ch. 3.7-3.8
  12. [Oct.24] Duplicate detection: hashing and error bounds (slides 17-22) [notes] [pdf]
  13. [Oct.28] Evaluation 1: Cranfield paradigm, Recall, Precision, F1 (slides 1-16) [notes] [pdf] Readings: SE Ch. 8 [pdf]
  14. [Oct.31] Evaluation 2: R/P plots, MAP, nDCG, BPREF, significance (slides 17-33) [notes] [pdf] Readings: SE Ch. 8 [pdf]
  15. [Nov.4] Web search 1: massive data, Question Answering (slides 1-4) [notes] [pdf]
  16. [Nov.7] Web search 2: PageRank, Hubs and Authorities, Link spam (slides 5-15) [notes] [pdf] Readings: SE Ch. 10.3, 4.5 [pdf]
  17. [Nov.11] Probabilistic model 1: Probability Ranking Principle, derivation (slides 1-14) [notes] [pdf] Readings: SE Ch. 7.2
  18. [Nov.14] Probabilistic model 2: word independence, 2-Poisson model, BM25 (slides 15-26) [notes] [pdf] Readings: SE Ch. 7.2
  19. [Nov.18] Relevance models: small-sample estimation, cross-language search. [notes] [pdf] Readings: SE Ch. 7.3
  20. [Nov.21] Machine learning in IR: classification (PA, SVM), learning to rank [pdf] Readings: SE Ch. 7.1, 7.6

This page is maintained by Victor Lavrenko.



Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh