Text Technologies for Data Science

The course is delivered as a series of video lectures and do-it-yourself practicals.



  1. [notes] Introduction: documents, queries, bag-of-words trick. [SE chapters 1 & 2] [python tutorials: part 1, part 2]
  2. [lecture] [topics] [notes] Laws of text: Zipf, Heaps, clumping, index size. [SE 4] [practical]
  3. [lecture] [topics] [notes] Vector space: term weighting, similarity functions. [SE 7.1] [practical]
  4. [lecture] [topics] [notes] Vocabulary mismatch: tokenization, stemming, synonyms. [SE 5, 6, 7.3.2] [practical]
  5. [lecture] [topics] [notes] Indexing: inverted lists, compression, query execution. [SE 5] [practical]
  6. [lecture] [topics] [notes] Web crawling: XML feeds, crawling, expected age. [SE 3] [practical]
  7. [lecture] [topics] [notes] Content Extraction: XML tags, DOM, Finn's method. [SE 3.7]
  8. [lecture] [topics] [notes] Locality Sensitive Hashing: duplicates, Simhash. [SE 3.8] [practical]
  9. [lecture] [topics] [notes] Evaluation: recall, precision, F1, MAP, nDCG, query logs. [SE 8]
  10. [lecture] [topics] [notes] Web search: PageRank, hubs and authorities, link spam. [SE 4.5, 10.3] [practical]
  11. [lecture] [topics] [notes] Probabilistic model: probability ranking principle, BM25. [SE 7.2] [practical]
  12. [lecture] [topics] [notes] Relevance models: exchangeability, cross-language search. [SE 7.3]
  13. [lecture] [topics] [notes] Language models for IR: query likelihood, smoothing. [SE 7.3]
  14. [lecture] [topics] [notes] Machine learning in IR: PA, SVM, SMO algorithms, LeToR. [SE 7.1, 7.6]

This page was maintained by Victor Lavrenko.


Unless explicitly stated otherwise, all material is copyright © The University of Edinburgh