Data Mining and Exploration, Spring 2017


In this course, we will discuss modern techniques for analysing, interpreting, visualising and exploiting the data that are captured in scientific and commercial environments. The course builds on the ideas taught in other machine learning courses and discusses the issues that arise when applying them to real-world datasets.

The course consists of lectures, supporting computer labs, student presentations on research papers, and a practical mini-project on a real-world dataset.

Together, the presentations and the mini-project contribute 50% of the grade. The exam accounts for the remaining 50% (see here).

Lecturer: Michael Gutmann
Teaching Assistant: Agamemnon Krasoulis
DME catalogue pages: DRPS | Informatics | Timetable


Semester week   Date        Activity   Date        Activity
wk1                                    Thu 19/01   lecture 1
wk2             Tue 24/01   lab 1      Thu 26/01   lecture 2
wk3             Tue 31/01   lab 2      Thu 02/02   lecture 3
wk4             Tue 07/02   lab 3      Thu 09/02   lecture 4
wk5             Tue 14/02   lab 4      Thu 16/02   lecture 5
wk6                                    Thu 02/03   student presentations
wk7                                    Thu 09/03   student presentations
wk8                                    Thu 16/03   student presentations
wk9                                    Thu 23/03   student presentations
wk10                                   Thu 30/03   student presentations
wk11                                   Thu 06/04   student presentations

15:10 - 17:00
Forrest Hill, room 3.D01

15:10 - 17:00
Medical School, BLT (Basement Lecture Theatre) - Doorway 6

Important dates

Mini-project interim report deadline   Fri 3 March 2017, 4pm
Mini-project final report deadline     Fri 7 April 2017, 4pm
Exam                                   see here


Lectures

Lecture notes are here (they will be updated as we progress).

  • Lecture 1
    Introduction to the data analysis process, simple descriptions and preprocessing of data
    opening slides
    Chapter 1 in the lecture notes
  • Lecture 2
    Principal component analysis by variance maximisation, by minimisation of approximation error, by matrix approximation
    Chapter 2 in the lecture notes
  • Lecture 3
    Dimensionality reduction by PCA, by kernel PCA, by multidimensional scaling
    Chapter 3 in the lecture notes
  • Lecture 4
    (Preliminary) Evaluating the performance of classification and regression algorithms, techniques for choosing hyperparameters
  • Lecture 5
    (Preliminary) Handling missing data

Computer labs

The course has four computer labs on topics introduced in the lectures. The labs let you experiment with different methods to build intuition, and they provide practical tools for the mini-project. The GitHub repository for the labs is here.

  • Lab 1 on simple data descriptions and preprocessing
  • Lab 2 on principal component analysis
  • Lab 3 on dimensionality reduction
  • Lab 4 on hyperparameter selection (preliminary)

Student presentations

In the second half of the course, we will have presentations on some of the papers listed here. Feel free to propose papers yourself, but check their suitability with the lecturer first.

For each paper, we have presenters and enquirers. The presenters present the paper, and the enquirers prepare questions about it. Both will be assessed.

Group work will be possible. More detailed instructions to come.


Mini-project

The goal of the project is to apply machine learning methods to a real-world dataset. A list of potential datasets is available here (same as for the IRDS course). For each dataset, the web page describes the task to be undertaken. You will produce a project report that will be assessed.

Group work will be possible. More detailed instructions to come.

Author: Michael Gutmann

Created: 2017-01-19 Thu 12:15