Text Technologies

This course covers the retrieval technologies behind search engines such as Google, and aims to balance the theoretical and system-related aspects of Information Retrieval. Course descriptor is here. No tutorials are planned for this course.

Lecturer: Dr. Victor Lavrenko
TA: Philipp Petrenz
Lectures:
12:10-1:00 Monday: Old College, Lecture Theatre 183 [map]
12:10-1:00 Thursday: Hugh Robson Building, Lecture Theatre [map]

Lab sessions: Tuesdays 3pm or 4pm in AT5.04 or Fridays 1pm in AT5.05 [lab groups]

Assessment: Final exam is 70% of the mark. Four programming assignments are worth an aggregate 30%:

  1. Web Crawling, due 4pm Monday 10th October
  2. Image Search, due 4pm Monday 24th October
  3. Plagiarism Detection, due 4pm Monday 7th November
  4. Link Analysis, due 4pm Monday 21st November

Questions: Please ask all questions in the discussion forum.

Readings:

Syllabus

  1. [Sep.19] Introduction: documents, queries, bag-of-words trick [slides] Readings: SE Ch. 1 and 2
  2. [Sep.23] Getting text: XML feeds, web-crawling, expected age [slides] Readings: SE Ch. 3
  3. [Sep.26] Laws of text: Zipf, Heaps, clumping, content extraction [slides] Readings: SE Ch.4 [pdf]
  4. [Sep.29] Vector space model: term weighting, similarity functions. [slides] Readings: SE Ch. 7.1
  5. [Oct.3] Vocabulary mismatch 1: tokenization, stemming, stopping, n-grams. Readings: SE Ch. 5 (except 5.7)
  6. [Oct.6] Vocabulary mismatch 2: spelling, Soundex, synonyms. Readings: SE Ch. 6 and 7.3.2
  7. [Oct.10] Vocabulary mismatch 3: relevance feedback, semantic indexing. [slides]
  8. [Oct.13] Evaluation 1: Cranfield paradigm, Recall, Precision, F1. Readings: SE Ch.8 [pdf]
  9. [Oct.17] Evaluation 2: R/P plots, MAP.
  10. [Oct.21] Evaluation 3: R/P ROC, BPREF, significance. [slides]
  11. [Oct.27] Duplicate detection: Finn's method, Simhash/LSH. [slides]
  12. [Oct.31] Web search: PageRank, Hubs and Authorities, Link spam. [slides] Readings: SE Ch. 10.3, 4.5 [pdf]
  13. [Nov.7] Indexing 1: Inverted lists, word proximity, document structure. [slides] Readings: SE Ch. 5.2, 5.3
  14. [Nov.10] Indexing 2: compression, construction, MapReduce. [slides] Readings: SE Ch. 5.4, 5.6
  15. [Nov.14] Indexing 3: query execution, structured queries, optimization. [slides] Readings: SE Ch. 5.7
  16. [Nov.17] Boolean retrieval: Retrieval models, Westlaw, Medline. [slides] Readings: SE Ch. 7.1
  17. [Nov.21] Probabilistic model: Probability Ranking Principle, BM25. [slides] Readings: SE Ch. 7.2
  18. [Nov.24] Relevance models: Small-sample estimation, cross-language search. [slides] Readings: SE Ch. 7.3

This page is maintained by Victor Lavrenko.


Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh