$$\notag \newcommand{\D}{\mathcal{D}} \newcommand{\I}{\mathbb{I}} \newcommand{\N}{\mathcal{N}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\g}{\,|\,} \newcommand{\nth}{^{(n)}} \newcommand{\te}{\!=\!} $$

MLPR Tutorial1 Sheet 3

Reminders: Attempt the tutorial questions, and ideally discuss them before your tutorial — for instance at ML-Base. You can seek clarifications and hints on Hypothesis. Full answers will be released.


  1. A Gaussian classifier:

    A training set consists of one-dimensional examples from two classes. The training examples from class 1 are \(\{0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25\}\) and the examples from class 2 are \(\{0.9, 0.8, 0.75, 1.0\}\).

    1. Fit a one-dimensional Gaussian to each class by matching the mean and variance. Also estimate the class probabilities \(\pi_1\) and \(\pi_2\) by matching the observed class fractions. (This procedure fits the model with maximum likelihood: it selects the parameters that give the training data the highest probability.) Sketch a plot of the scores \(p(x,y) \te P(y)\,p(x\g y)\) for each class \(y\), as functions of input location \(x\). (A short code sketch of this fitting step appears after this question.)

    2. What is the probability that the test point \(x \te 0.6\) belongs to class 1? Mark the decision boundary (or boundaries) on your sketch: the location(s) where \(P(\text{class 1}\g x) = P(\text{class 2}\g x) = 0.5\). You are not required to calculate the location(s) exactly.

    3. Are the decisions that the model makes reasonable for very negative \(x\) and very positive \(x\)? What changes could we consider making to the model if we wanted different asymptotic behaviour?
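
    A minimal NumPy sketch of the fitting in part a) and the posterior calculation in part b) (the grid range and variable names are our own choices; check the results against your hand calculations):

    ```python
    import numpy as np

    x1 = np.array([0.5, 0.1, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.35, 0.25])  # class 1
    x2 = np.array([0.9, 0.8, 0.75, 1.0])                                 # class 2

    # Maximum likelihood fit: match the mean and the (1/N) variance of each class.
    mu1, var1 = x1.mean(), x1.var()
    mu2, var2 = x2.mean(), x2.var()

    # Class probabilities from the observed class fractions.
    N1, N2 = x1.size, x2.size
    pi1, pi2 = N1 / (N1 + N2), N2 / (N1 + N2)

    def gauss_pdf(x, mu, var):
        return np.exp(-0.5 * (x - mu)**2 / var) / np.sqrt(2 * np.pi * var)

    # Scores p(x, y) = P(y) p(x|y) on a grid of input locations, for the sketch.
    xx = np.linspace(-0.5, 1.5, 401)
    score1 = pi1 * gauss_pdf(xx, mu1, var1)
    score2 = pi2 * gauss_pdf(xx, mu2, var2)

    # Posterior probability of class 1 at the test point x = 0.6 (part b).
    s1 = pi1 * gauss_pdf(0.6, mu1, var1)
    s2 = pi2 * gauss_pdf(0.6, mu2, var2)
    print('P(class 1 | x = 0.6) =', s1 / (s1 + s2))
    ```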

  2. More practice with Gaussians:

    \(N\) noisy independent observations are made of an unknown scalar quantity \(m\): \[\notag x\nth \sim \N(m, \sigma^2). \]

    1. We don’t give you the raw data, \(\{x\nth\}\), but tell you the mean of the observations: \[\notag \bar{x} = \frac{1}{N} \sum_{n=1}^N x\nth. \] What is the likelihood2 of \(m\) given only this mean \(\bar{x}\)? That is, what is \(p(\bar{x}\g m)\)?3 (A short simulation sketch for checking your answer appears at the end of this question.)

    2. A sufficient statistic is a summary of some data that contains all of the information about a parameter.

      1. Show that \(\bar{x}\) is a sufficient statistic of the observations for \(m\), assuming we know the noise variance \(\sigma^2\). That is, show that \(p(m\g \bar{x}) \te p(m\g \{x\nth\}_{n=1}^N)\).

      2. Optional part: If we don’t know the noise variance \(\sigma^2\) or the mean, is \(\bar{x}\) still a sufficient statistic in the sense that \(p(m\g \bar{x}) \te p(m\g \{x\nth\}_{n=1}^N)\)? Explain your reasoning.

        Note: In i) we were implicitly conditioning on \(\sigma^2\): \(p(m\g \bar{x}) = p(m\g \bar{x}, \sigma^2)\). In this part, \(\sigma^2\) is unknown, so \(p(m\g \bar{x}) = \int p(m,\sigma^2\g \bar{x})\,\mathrm{d}\sigma^2\). However, no detailed mathematical manipulation (or solving of integrals) is required.
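
    A quick numerical sanity check for part a): simulate many datasets, record the mean of each, and compare the empirical mean and variance of \(\bar{x}\) with whatever \(p(\bar{x}\g m)\) you derived. The particular values of \(m\), \(\sigma\) and \(N\) below are arbitrary choices for illustration.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    m, sigma, N = 1.5, 0.4, 10   # arbitrary example values, not given in the question

    # Simulate many datasets of N noisy observations and record each dataset's mean.
    xbars = rng.normal(m, sigma, size=(100_000, N)).mean(axis=1)

    # Compare these with the mean and variance of your derived p(xbar | m).
    print('empirical mean of xbar:    ', xbars.mean())
    print('empirical variance of xbar:', xbars.var())
    ```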

  3. Bayesian regression with multiple data chunks:

    This question involves simulations on a computer. You will generate datasets and calculate posterior distributions. If you have trouble, at least try to work out roughly what should happen for each part.

    We will use the probabilistic model for regression from the lectures: \[\notag p(y\g \bx,\bw) = \N(y;\,f(\bx;\bw),\,\sigma_y^2). \] In this question, we set \(f(\bx;\bw) = \bw^\top \bx\), where \(\bw = [w_1 ~~ w_2]^\top\) and \(\bx = [x_1 ~~ x_2]^\top\), and \(\sigma_y^2 = 1^2 = 1\). We assume the following prior distribution: \[\notag p(\bw) = \N(\bw;\,[-5 ~~ 0]^\top,\,\sigma_{\bw}^2\I), \] where \(\sigma_{\bw}^2 = 2^2 = 4\).

    Generating data: Generate a synthetic dataset as described below. Only generate the data once and do not change it while working on the parts of this question.

    First, draw a \(\tilde{\bw}\) from the prior distribution (i.e. draw a sample from the Gaussian). Draw this \(\tilde{\bw}\) only once and keep it fixed throughout the whole data-generation process.

    We will generate two chunks of data \(\D_1\) and \(\D_2\). For chunk \(\D_1 = \{\bx^{(n)},y^{(n)}\}_{n=1}^{15}\), generate 15 input-output pairs where each \(\bx^{(n)}\) is drawn from \(\N(\bx^{(n)};\,\mathbf{0},0.5^2\I)\). For chunk \(\D_2 = \{\bx^{(n)},y^{(n)}\}_{n=16}^{45}\), generate 30 input-output pairs where each \(\bx^{(n)}\) is drawn from \(\N(\bx^{(n)};\,[0.5 ~~ 0.5]^\top,0.1^2\I)\). For both chunks, draw the \(y^{(n)}\) outputs using the \(p(y\g \bx,\bw)\) observation model above.
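
    A minimal NumPy sketch of this data-generation process (the seed and variable names are our own choices; fixing a seed is one way to make sure the data are generated only once):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    sigma_w, sigma_y = 2.0, 1.0
    w0 = np.array([-5.0, 0.0])   # prior mean

    # Draw w_tilde from the prior p(w) = N(w; w0, sigma_w^2 I), once only.
    w_tilde = w0 + sigma_w * rng.standard_normal(2)

    # Chunk D1: 15 inputs from N(0, 0.5^2 I), outputs from the observation model.
    X1 = 0.5 * rng.standard_normal((15, 2))
    y1 = X1 @ w_tilde + sigma_y * rng.standard_normal(15)

    # Chunk D2: 30 inputs from N([0.5, 0.5], 0.1^2 I).
    X2 = np.array([0.5, 0.5]) + 0.1 * rng.standard_normal((30, 2))
    y2 = X2 @ w_tilde + sigma_y * rng.standard_normal(30)
    ```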

    Visualising distributions on the weights: For each part that asks you to visualise a distribution, draw 400 weight vectors from the distribution and show the samples in a 2D scatter plot. In each plot, show \(\tilde{\bw}\) as a bold cross.
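
    One possible way to produce these plots, sketched with matplotlib (the helper name, marker styles and seed are our own choices; it assumes the variables defined in the generation sketch above):

    ```python
    import matplotlib.pyplot as plt
    import numpy as np

    def plot_weight_samples(mean, cov, w_tilde, title, n_samples=400, seed=2):
        """Scatter-plot samples from N(mean, cov); mark w_tilde with a bold cross."""
        rng = np.random.default_rng(seed)
        W = rng.multivariate_normal(mean, cov, size=n_samples)
        plt.figure()
        plt.scatter(W[:, 0], W[:, 1], s=5, alpha=0.5)
        plt.plot(w_tilde[0], w_tilde[1], 'kx', markersize=12, markeredgewidth=3)
        plt.xlabel('$w_1$')
        plt.ylabel('$w_2$')
        plt.title(title)

    # For example, the prior in part a):
    # plot_weight_samples(w0, sigma_w**2 * np.eye(2), w_tilde, 'prior p(w)')
    # plt.show()
    ```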

    1. Visualise the prior distribution and discuss the relationship between the parameters and the belief expressed by the prior.

    2. Using equations from the lecture notes (a code sketch of the corresponding posterior update appears after this question), calculate and visualise the posterior distributions that you get after observing:

      1. dataset \(\D_1\), i.e. \(p(\bw\g \D_1)\)
      2. dataset \(\D_2\) but not \(\D_1\), i.e. \(p(\bw\g \D_2)\)
      3. datasets \(\D_1\) and \(\D_2\), i.e. \(p(\bw\g \D_1,\D_2)\)
      4. no observations, i.e. \(p(\bw\g \{\})\)
    3. Take the posterior that you got in b) for \(p(\bw\g \D_1)\) and now use it as a prior. Calculate the new posterior that you get after observing \(\D_2\). Numerically compare the mean and covariance with those of \(p(\bw\g \D_1,\D_2)\).

    4. Is it important that the inputs were drawn from Gaussian distributions? What would happen if the \(\bx^{(n)}\) in \(\D_2\) were drawn from a uniform distribution on the unit square?

    5. What distribution does \(p(\bw\g \D_1)\) approach as we let \(\sigma_{\bw}^2\) tend to infinity?

    6. Now assume that our observations \(y\) are corrupted by additional Gaussian noise: \[\notag p(z\g \bx,\bw) = \N(z;\, y,\sigma_z^2). \] So we now observe datasets \(\D_z = \{\bx^{(n)},z^{(n)}\}_{n=1}^{N}\). What is the new posterior distribution \(p(\bw \g \D_z)\)? Hint: Recall how you did Question 2.a) of this sheet.
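
    For parts b) and c): a sketch of the standard Gaussian posterior update for Bayesian linear regression, written with the notation of the generation sketch above. Treat it as a cross-check against the equations in your notes rather than as given.

    ```python
    import numpy as np

    def posterior(X, y, w0, V0, sigma_y):
        """Posterior N(w; wN, VN) for y = X w + Gaussian noise, with prior N(w; w0, V0)."""
        V0_inv = np.linalg.inv(V0)
        VN = np.linalg.inv(V0_inv + X.T @ X / sigma_y**2)
        wN = VN @ (V0_inv @ w0 + X.T @ y / sigma_y**2)
        return wN, VN

    # e.g. p(w | D1), starting from the prior:
    # wN1, VN1 = posterior(X1, y1, w0, sigma_w**2 * np.eye(2), sigma_y)
    # For part c), feed (wN1, VN1) back in as the prior together with the D2 data.
    ```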


  1. Parts of this tutorial sheet are based on previous versions by Amos Storkey, Charles Sutton, and Chris Williams.

  2. We’re using the traditional statistics usage of the word “likelihood”: it’s a function of parameters given data, equal to the probability of the data given the parameters. You should avoid saying “likelihood of the data” (cf. p. 29 of MacKay’s textbook), although you’ll see that usage too.

  3. The sum of Gaussian outcomes is Gaussian distributed; you only need to identify a mean and variance.