Text Technologies
This course covers the retrieval technologies behind search engines
such as Google. It aims to balance the theoretical and
system-related aspects of Information Retrieval.
The course descriptor is here.
Lecturer: Dr. Victor Lavrenko
TA: Philipp Petrenz
Lectures:
12:10-13:00 Monday, Thursday: 7 George Square, room F.21
[map]
Lab sessions: Tuesdays 5pm, or Thursdays 9am or 10am, in AT5.05
[times and groups]
Assessment: The final exam is worth 70% of the mark. Four programming assignments are worth 30% in aggregate:
- Web Crawling, due 4pm Monday 8th October
- Image Search, due 4pm Monday 22nd October
- Plagiarism Detection, due 4pm Monday 5th November
- Link Analysis, due 4pm Monday 19th November
Please read the Informatics policy on late submissions and plagiarism.
Questions: Please ask all questions in the discussion forum.
Readings:
Syllabus
- [Sep.17]
Introduction: documents, queries, bag-of-words trick
[slides]
Readings: SE Ch. 1 and 2
- [Sep.20]
Getting text: XML feeds, web-crawling, expected age
[slides]
Readings: SE Ch. 3
- [Sep.24]
Laws of text: Zipf, Heaps, clumping, index size
[slides]
Readings: SE Ch. 4
[pdf]
- [Sep.27]
Vector space model: term weighting, similarity functions.
[slides]
Readings: SE Ch. 7.1
- [Oct.1]
Vocabulary mismatch 1: tokenization, stemming, stopping, n-grams.
Readings: SE Ch. 5 (except 5.7)
[slides 1-13]
- [Oct.4]
Vocabulary mismatch 2: spelling, Soundex, synonyms.
Readings: SE Ch. 6 and 7.3.2
[slides 14-24]
- [Oct.8]
Vocabulary mismatch 3: statistical synonyms, relevance feedback.
[slides 25-37]
- [Oct.11]
Vocabulary mismatch 4: latent semantic indexing.
[slides 38-46]
Evaluation 1: Cranfield paradigm.
[slides 1-9]
- [Oct.15]
Evaluation 2: Relevance, Recall, Precision, F1, R/P plots.
Readings: SE Ch. 8
[slides 10-22]
- [Oct.18]
Evaluation 3: MAP, nDCG, ROC, BPREF, significance.
Readings: SE Ch. 8
[pdf]
[slides 23-33]
- [Oct.22]
Duplicate detection: Finn's method, Simhash/LSH.
Readings: SE Ch. 3.7-3.8
[slides 1-16]
- [Oct.25]
Duplicate detection: hashing and error bounds
[slides 17-20]
- [Oct.29] Web search: PageRank, Hubs and Authorities, Link spam.
[slides]
Readings: SE Ch. 10.3, 4.5 [pdf]
- [Nov.1] Indexing 1: Inverted lists, proximity, structure, compression.
[slides 1-19]
Readings: SE Ch. 5.2, 5.3, 5.4
- [Nov.5] Indexing 2: construction, MapReduce, query execution.
[slides 20-34]
Readings: SE Ch. 5.6, 5.7
- [Nov.8] Indexing 3: complexity, structured queries, optimisation.
[slides 35-42]
- [Nov.12] Probabilistic model 1: Probability Ranking Principle, derivation, estimation.
[slides 1-14]
Readings: SE Ch. 7.2
- [Nov.15] Probabilistic model 2: word independence, 2-Poisson model, BM25
[slides 15-24]
Readings: SE Ch. 7.2
- [Nov.19] Relevance models: Small-sample estimation, cross-language search.
[slides]
Readings: SE Ch. 7.3
- [Nov.22] Learning to rank: Boolean model, LETOR, Inference Network model.
[slides]
Readings: SE Ch. 7.1, 7.6
This page is maintained by Victor Lavrenko.