Text Technologies
This course covers the retrieval technologies behind search engines
such as Google. It aims to strike a balance between the
theoretical and system-related aspects of Information Retrieval.
Course descriptor is here.
No tutorials are planned for this course.
Lecturer: Dr. Victor Lavrenko
TA: Philipp Petrenz
Lectures:
12:10-1:00 Monday: Old College, Lecture Theatre 183
[map]
12:10-1:00 Thursday: Hugh Robson Building, Lecture Theatre
[map]
Lab sessions: Tuesdays 3pm or 4pm in AT5.04, or Fridays 1pm in AT5.05
[lab groups]
Assessment: The final exam is worth 70% of the mark. Four programming assignments are worth 30% in aggregate:
- Web Crawling, due 4pm Monday 10th October
- Image Search, due 4pm Monday 24th October
- Plagiarism Detection, due 4pm Monday 7th November
- Link Analysis, due 4pm Monday 21st November
Questions: Please ask all questions in the discussion forum.
Readings:
Syllabus
- [Sep.19] Introduction: documents, queries, bag-of-words trick
[slides]
Readings: SE Ch. 1 and 2
- [Sep.23] Getting text: XML feeds, web-crawling, expected age
[slides]
Readings: SE Ch. 3
- [Sep.26] Laws of text: Zipf, Heaps, clumping, content extraction
[slides]
Readings: SE Ch. 4
[pdf]
- [Sep.29] Vector space model: term weighting, similarity functions.
[slides]
Readings: SE Ch. 7.1
- [Oct.3] Vocabulary mismatch 1: tokenization, stemming, stopping, n-grams.
Readings: SE Ch. 5 (except 5.7)
- [Oct.6] Vocabulary mismatch 2: spelling, Soundex, synonyms.
Readings: SE Ch. 6 and 7.3.2
- [Oct.10] Vocabulary mismatch 3: relevance feedback, semantic indexing.
[slides]
- [Oct.13] Evaluation 1: Cranfield paradigm, Recall, Precision, F1.
Readings: SE Ch. 8
[pdf]
- [Oct.17] Evaluation 2: R/P plots, MAP.
- [Oct.21] Evaluation 3: R/P, ROC, BPREF, significance.
[slides]
- [Oct.27] Duplicate detection: Finn's method, Simhash/LSH.
[slides]
- [Oct.31] Web search: PageRank, Hubs and Authorities, Link spam.
[slides]
Readings: SE Ch. 10.3, 4.5 [pdf]
- [Nov.7] Indexing 1: Inverted lists, word proximity, document structure.
[slides]
Readings: SE Ch. 5.2, 5.3
- [Nov.10] Indexing 2: compression, construction, MapReduce.
[slides]
Readings: SE Ch. 5.4, 5.6
- [Nov.14] Indexing 3: query execution, structured queries, optimization.
[slides]
Readings: SE Ch. 5.7
- [Nov.17] Boolean retrieval: Retrieval models, Westlaw, Medline.
[slides]
Readings: SE Ch. 7.1
- [Nov.21] Probabilistic model: Probability Ranking Principle, BM25.
[slides]
Readings: SE Ch. 7.2
- [Nov.24] Relevance models: Small-sample estimation, cross-language search.
[slides]
Readings: SE Ch. 7.3
This page is maintained by Victor Lavrenko.