Text Technologies

This course covers the retrieval technologies behind search engines such as Google, and aims to balance the theoretical and system-related aspects of Information Retrieval. Course descriptor is here.

Lecturer: Dr. Victor Lavrenko
TA: Philipp Petrenz
Lectures: 12:10-13:00 Mondays and Thursdays, 7 George Square, room F.21 [map]

Lab sessions: Tuesdays 5pm or Thursdays 9am or 10am in AT5.05: [times and groups]

Assessment: The final exam is worth 70% of the mark. Four programming assignments are worth 30% in aggregate:

  1. Web Crawling, due 4pm Monday 8th October
  2. Image Search, due 4pm Monday 22nd October
  3. Plagiarism Detection, due 4pm Monday 5th November
  4. Link Analysis, due 4pm Monday 19th November

Please read the Informatics policy on late submissions and plagiarism.

Questions: Please ask all questions in the discussion forum.

Readings:

Syllabus

  1. [Sep.17] Introduction: documents, queries, bag-of-words trick [slides] Readings: SE Ch. 1 and 2
  2. [Sep.20] Getting text: XML feeds, web-crawling, expected age [slides] Readings: SE Ch. 3
  3. [Sep.24] Laws of text: Zipf, Heaps, clumping, index size [slides] Readings: SE Ch. 4 [pdf]
  4. [Sep.27] Vector space model: term weighting, similarity functions. [slides] Readings: SE Ch. 7.1
  5. [Oct.1] Vocabulary mismatch 1: tokenization, stemming, stopping, n-grams. Readings: SE Ch. 5 (except 5.7) [slides 1-13]
  6. [Oct.4] Vocabulary mismatch 2: spelling, Soundex, synonyms. Readings: SE Ch. 6 and 7.3.2 [slides 14-24]
  7. [Oct.8] Vocabulary mismatch 3: statistical synonyms, relevance feedback. [slides 25-37]
  8. [Oct.11] Vocabulary mismatch 4: latent semantic indexing. [slides 38-46]
                  Evaluation 1: Cranfield paradigm. [slides 1-9]
  9. [Oct.15] Evaluation 2: Relevance, Recall, Precision, F1, R/P plots. Readings: SE Ch. 8 [slides 10-22]
  10. [Oct.18] Evaluation 3: MAP, nDCG, ROC, BPREF, significance. Readings: SE Ch. 8 [pdf] [slides 23-33]
  11. [Oct.22] Duplicate detection: Finn's method, Simhash/LSH. Readings: SE Ch. 3.7-3.8 [slides 1-16]
  12. [Oct.25] Duplicate detection: hashing and error bounds [slides 17-20]
  13. [Oct.29] Web search: PageRank, Hubs and Authorities, Link spam. [slides] Readings: SE Ch. 10.3, 4.5 [pdf]
  14. [Nov.1] Indexing 1: Inverted lists, proximity, structure, compression. [slides 1-19] Readings: SE Ch. 5.2, 5.3, 5.4
  15. [Nov.5] Indexing 2: construction, MapReduce, query execution. [slides 20-34] Readings: SE Ch. 5.6, 5.7
  16. [Nov.8] Indexing 3: complexity, structured queries, optimisation. [slides 35-42]
  17. [Nov.12] Probabilistic model 1: Probability Ranking Principle, derivation, estimation. [slides 1-14] Readings: SE Ch. 7.2
  18. [Nov.15] Probabilistic model 2: word independence, 2-Poisson model, BM25 [slides 15-24] Readings: SE Ch. 7.2
  19. [Nov.19] Relevance models: Small-sample estimation, cross-language search. [slides] Readings: SE Ch. 7.3
  20. [Nov.22] Learning to rank: Boolean model, LETOR, Inference Network model. [slides] Readings: SE Ch. 7.1, 7.6

This page is maintained by Victor Lavrenko.



Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh