
# Bayesian inference and prediction

The previous note introduced a probabilistic model for regression. Nothing changed when we did maximum likelihood: we still did least squares. However, Bayesian inference on this model reveals how uncertain we are about the underlying model. We no longer fit a single model exactly; instead we construct a distribution over all possible models. Now we’ll look at how to use that distribution. But first, a simple concrete prediction problem as a warm-up exercise.

## A toy prediction problem

In lectures I’ll show you 3 cards, each with two sides: Card 1 has one white and one black side, card 2 has two black sides, and card 3 has two white sides.

I shuffle the cards and turn them over randomly. I select a card and a way-up uniformly at random and place it on a table.

Question¹: You see a black side. What is the probability that the other side of the same card is white? $P(x_2 \te \texttt{W} \g x_1 \te \texttt{B}) =~~ \noo{3}, ~~\noo{2}, ~~ \frac{2}{3}, ~~\text{other?}$ Hopefully the process by which a card is selected should be clear: $$P(c) = 1/3$$ for $$c = 1,2,3$$, and the side you see first is chosen at random: e.g., $$P(x_1\te\texttt{B}\g c\te1) = 0.5$$.

Many people get this puzzle wrong on first viewing (it’s easy to mess up). But regardless of whether the answer is obvious to you, it’s a simple example on which to demonstrate some formalism.

When I first tried to use probabilities for prediction problems, I often found myself going in circles using Bayes’ rule: $P(x_2\te\texttt{W} \g x_1\te\texttt{B}) ~=~ \frac{{\fbox{P(x_1\te\texttt{B}\g x_2\te\texttt{W})}}\,P(x_2\te\texttt{W})}{P(x_1\te\texttt{B})}$ The $$\fbox{boxed}$$ term is no more obvious than the answer! Bayes’ rule can’t be used blindly to obtain any conditional probability. It is used specifically to ‘invert’ processes that we understand. The first step in solving inference problems is to write down a model of the data.

A model of the card game states that we first picked one of the three cards uniformly at random: $P(c) = \begin{cases} \noo{3} & c = 1,2,3\\ 0 & \text{otherwise}.\end{cases}$ Then given the card, we picked one of the faces at random: $P(x_1\te\texttt{B}\g c) = \begin{cases} \noo{2} & c = 1\\ 1 & c = 2\\ 0 & c = 3.\end{cases}$ Bayes’ rule can ‘invert’ this process to tell us $$P(c\g x_1\te\texttt{B})$$. It infers a posterior distribution over explanations of how we got the data that we observed.

Inferring the card: \begin{aligned} P(c\g x_1\te\texttt{B}) &= \frac{P(x_1\te\texttt{B}\g c)\,P(c)}{P(x_1\te\texttt{B})}\\ &\propto\begin{cases} \noo{2}\cdot\noo{3}=\noo{6} & c = 1\\ 1\cdot\noo{3}=\noo{3} & c = 2\\ 0 & c = 3\end{cases}\\ &= \begin{cases} \noo{3} & c = 1\\ \frac{2}{3} & c = 2\\ 0 & c = 3.\end{cases} \end{aligned}
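As a sanity check, this posterior is easy to compute numerically. A minimal sketch in Python/NumPy (rather than the Octave used later in this note):

```python
import numpy as np

# Posterior over cards after seeing a black side, via Bayes' rule.
# Cards as in the text: 1 = one white/one black, 2 = two black, 3 = two white.
prior = np.array([1/3, 1/3, 1/3])   # P(c)
like = np.array([1/2, 1, 0])        # P(x1 = B | c)
joint = like * prior                # [1/6, 1/3, 0]
posterior = joint / joint.sum()     # normalize
print(posterior)                    # [1/3, 2/3, 0]
```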

This unbalanced posterior distribution over cards doesn’t match some people’s intuition. “Aren’t there two possible cards given a black side, so it’s 50–50?” While there are two possible cards, the likelihood for one of these options is $$2\times$$ larger. Analogy: I pick a die at random from a 6-sided die, a 10-sided die and a 100-sided die. I roll it and tell you I got an 8. You know it’s not the 6-sided die, so there are two options: the 10-sided die or the 100-sided die. But the 10-sided die is more likely, because it produces an 8 with probability $$\noo{10}$$ rather than $$\noo{100}$$.
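The same arithmetic applies to the dice analogy. A quick numerical version, assuming a uniform prior over the three dice:

```python
import numpy as np

# Posterior over {6-, 10-, 100-sided} dice after being told the roll was an 8.
prior = np.array([1/3, 1/3, 1/3])
like = np.array([0, 1/10, 1/100])   # P(roll = 8 | die)
posterior = like * prior / np.sum(like * prior)
print(posterior)                    # [0, 10/11, 1/11]
```

The 10-sided die ends up ten times more probable than the 100-sided one, matching its ten-times-larger likelihood.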

We can now spot the answer to the card problem. The other side will be white only if it’s card 1, and we have now computed the probability representing our belief that we’re looking at card 1. For more complex problems we want a more mechanical approach.

The probability for predicting the next outcome that we want is $$P(x_2\g x_1\te\texttt{B})$$. We need to introduce the card $$c$$ into the maths, so that we can use our model and its posterior distribution. We use the sum rule to introduce the card choice $$c$$ as a dummy variable, and then split up the joint probability with the product rule: \begin{aligned} P(x_2\g x_1\te\texttt{B}) &= \sum_{c\in\{1,2,3\}} P(x_2,c\g x_1\te\texttt{B})\\ &= \sum_{c\in\{1,2,3\}} P(x_2\g x_1\te\texttt{B}, c)\,P(c\g x_1\te\texttt{B}) \end{aligned} This equation says that we consider each of the predictions that we would make if we knew the card, and weight it by the posterior probability of that card. (Adding up the options, $$P(x_2\te\texttt{W}\g x_1\te\texttt{B}) = 1\cdot\noo{3} + 0\cdot\frac{2}{3} = \noo{3}$$.)
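That weighted sum is a one-liner given the posterior over cards (hard-coded below from the earlier calculation):

```python
import numpy as np

posterior = np.array([1/3, 2/3, 0])   # P(c | x1=B), computed earlier
pred = np.array([1, 0, 0])            # P(x2=W | x1=B, c): only card 1's other side is white
p_white = float(pred @ posterior)     # sum over c of P(x2=W | x1=B, c) P(c | x1=B)
print(p_white)                        # 1/3
```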

### General strategy for solving prediction problems:

When we want to predict some quantity $$y$$, we often find that we can’t immediately write down mathematical expressions for $$P(y \g \text{data})$$.

So we introduce stuff (model parameters and/or latent variables), $$z$$, that helps us define the problem by using the sum rule: $P(y \g \text{data}) = \sum_z P(y,z \g \text{data}).$ If the stuff is real-valued, we’ll have an integral over a PDF instead of a sum over probabilities. Either way, we split up the joint probability using the product rule: $P(y \g \text{data}) = \sum_z P(y\g z, \text{data})\,P(z\g\text{data}).$ If we would know how to predict $$y$$ if we were told the extra stuff $$z$$, we can find the predictive distribution: weight the predictions for each possible stuff, $$P(y\g z, \text{data})$$, by the posterior probability of the stuff, $$P(z\g\text{data})$$. We get the posterior from Bayes’ rule.

Sometimes the extra stuff summarizes everything we need to know to make a prediction: $$P(y\g z, \text{data}) = P(y\g z)$$. That wasn’t the case in the card game example above: even given the card, the prediction for $$x_2$$ still depended on which side we saw first.

### Not convinced?

Not everyone believes the answer to the card game question. Sometimes probabilities are counter-intuitive. Here is an Octave/Matlab simulator I wrote for the card game question:

```matlab
cards = [1 1;   % two white sides
         0 0;   % two black sides
         1 0];  % one white, one black (1 = white, 0 = black)
num_cards = size(cards, 1);

N = 0;  % Number of times first side is black
kk = 0; % Out of those, how many times the other side is white

for trial = 1:1e6
    card = ceil(num_cards * rand());
    side = 1 + (rand() < 0.5);
    other_side = (side == 1) + 1;
    x1 = cards(card, side);
    x2 = cards(card, other_side);

    if x1 == 0
        N = N + 1;           % just seen another black face
        kk = kk + (x2 == 1); % count if other side was white
    end
end

approx_probability = kk / N
```

An implementation without the for loop would probably be faster, but here I wanted to convince you, so I didn’t want the code to be at all cryptic.
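For the curious, here is what a loop-free version might look like, translated to Python/NumPy (an illustrative sketch, not part of the original simulator):

```python
import numpy as np

rng = np.random.default_rng(0)
cards = np.array([[1, 1],    # two white sides
                  [0, 0],    # two black sides
                  [1, 0]])   # one white, one black (1 = white, 0 = black)
T = 10**6
card = rng.integers(0, 3, size=T)   # pick a card uniformly, all trials at once
side = rng.integers(0, 2, size=T)   # pick which face shows uniformly
x1 = cards[card, side]              # the face we see
x2 = cards[card, 1 - side]          # the hidden face
black_first = x1 == 0
approx_probability = np.mean(x2[black_first] == 1)
print(approx_probability)           # close to 1/3
```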

# Predictions for Bayesian linear regression

Compared to the statistics literature, the machine learning literature places a greater emphasis on prediction than inferring unknown parameters. Here we explore how to make a Bayesian prediction given a posterior over the parameters.

The prediction of an unknown output $$y$$ at a test location $$\bx$$ is expressed as a probability. We condition on training data $$\D = \{\bx\nth, y\nth\}$$. We want to refer to our model, in particular its regression weights, so we introduce them using the sum rule: $p(y\g \bx, \D) = \int p(y, \bw\g \bx, \D)\;\mathrm{d}\bw.$ Then we split up the joint probability using the product rule: $p(y\g \bx, \D) = \int p(y\g \bx, \bw)\,p(\bw\g\D) \;\mathrm{d}\bw.$ Here the first term isn’t conditioned on the data, because if we know the parameters, the data doesn’t help make the prediction. The second term isn’t conditioned on the test input location, because we assume the test location doesn’t tell us anything about the weights.²

What are the probabilities in this integral? As explained in the previous note, the posterior over the weights is Gaussian for a linear regression model with Gaussian noise. Following Murphy’s notation, I’ll write: $p(\bw\g\D) = \N(\bw; \bw_N, V_N),$ where $$\bw_N$$ is the posterior mean after observing $$N$$ datapoints, and $$V_N$$ is the posterior covariance. The other term in the integrand is the predictive distribution for known parameters. That is, a Gaussian with noise variance $$\sigma^2$$ centred around the function value for the test input:³ $p(y\g \bx, \bw) = \N(y;\, \bw^\top\bx, \sigma^2).$ So now we ‘just’ need to do the integral and we have our answer.

Only a little probability theory is needed for Bayesian inference and prediction in this course: just the sum and product rule. We follow the rules in the general strategy above. However, actually solving the sums or integrals required is usually hard.

Linear and Gaussian models are one of the rare cases where we can do the integrals in closed form. Multiplying the terms in the integrand together, and carefully tracking terms we can factor out a Gaussian in $$\bw$$ that integrates to 1. A Gaussian in $$y$$ remains. It’s some work, but can be done.

Noticing that our predictive distribution is Gaussian, we can look for an easier way to identify its mean and variance. The test output is a noisy version of the underlying function: $y = f(\bx) + \nu = \bx^\top\bw + \nu, \quad \nu\sim \N(0, \sigma^2).$ The function value is a linear transformation of the weights, so our beliefs about the function value are Gaussian distributed with mean: $\mu_f = \E[\bx^\top\bw] = \bx^\top\bw_N$ and variance⁴ $\text{var}[f] = \bx^\top V_N \bx.$ So our belief about the function value is: $p(f\g \D, \bx) = \N(f;\, \bx^\top\bw_N,\, \bx^\top V_N \bx).$ Our belief about the output is also Gaussian. The independent zero mean noise doesn’t change the mean, but adds $$\sigma^2$$ to the variance: $p(y\g \D, \bx) = \N(y;\, \bx^\top\bw_N,\, \bx^\top V_N \bx + \sigma^2).$ This answer matches the answer quoted for the integral in Murphy.
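Here is the whole pipeline on synthetic data, as a Python/NumPy sketch. Everything in it (the 1D dataset, unit Gaussian prior, noise level, test input) is made up for illustration; the posterior update is the standard one for a zero-mean Gaussian prior with Gaussian noise:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1D data with basis functions phi(x) = [1, x] (bias + slope).
sigma2 = 0.1**2                          # assumed known noise variance
w_true = np.array([0.5, -1.0])
X = rng.uniform(-1, 1, size=20)
Phi = np.stack([np.ones_like(X), X], axis=1)
y = Phi @ w_true + rng.normal(0, np.sqrt(sigma2), size=20)

# Posterior p(w | D) = N(w; w_N, V_N), with prior N(w; 0, V_0):
V0 = np.eye(2)
VN = np.linalg.inv(np.linalg.inv(V0) + Phi.T @ Phi / sigma2)
wN = VN @ (Phi.T @ y / sigma2)

# Predictive distribution at a test input:
x_test = np.array([1.0, 0.3])            # [1, x] for x = 0.3
mu = x_test @ wN                         # predictive mean  x^T w_N
var = x_test @ VN @ x_test + sigma2      # predictive variance  x^T V_N x + sigma^2
print(mu, var)
```

Note the predictive variance is always at least $$\sigma^2$$; the $$\bx^\top V_N \bx$$ term adds the remaining uncertainty about the weights.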

Murphy covers prediction for Bayesian linear regression: Section 7.6.2 and Figure 7.12.

MacKay’s book has more toy prediction exercises like the card game: Ex. 3.12 and 3.14, p58. You might have intuitive answers, but try to solve them mechanically by writing down an explicit model and applying rules of probability.

What happens to the uncertainty of predictions as $$\bx$$ grows large? Why?

# More detail for those who love linear algebra

A brute force way to get the predictive distribution for linear regression is to write the integrand as a joint distribution, as a standard multivariate Gaussian using block matrices: $p(\bw, y\g \bx, \D) \,=\, \N\left( % variable \left[ \begin{array}{cc} \bw \\ y \\ \end{array} \right] \!;\; % mu \left[ \begin{array}{cc} \bw_N \\ \mu_y \\ \end{array} \right] \!,\; % Sigma \left[ \begin{array}{cc} V_N & \Sigma_{\bw,y} \\ \Sigma_{y,\bw} & \sigma^2_y \\ \end{array} \right] \right).$ Written in this form, the marginal mean and variance of $$y$$ are immediately available, and we’d write $$p(y\g \bx, \D) = \N(y;\,\mu_y,\,\sigma^2_y)$$. But if you find that intimidating, I agree! Examining the quadratic form involving $$y$$ and $$\bw$$ in the joint distribution gives us the joint inverse covariance, and we need to do quite a few lines of linear algebra to identify $$\mu_y = \bx^\top \bw_N$$ and $$\sigma^2_y = \bx^\top V_N \bx + \sigma^2$$. That seems like a lot of work. Sources like the matrix cookbook can help. But in this case there was an easier way.
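If you want to check the block-matrix claims without the linear algebra, a Monte Carlo sketch works: sample $$\bw$$ from its posterior, generate $$y$$, and compare the empirical moments against the formulas. The values of $$\bw_N$$, $$V_N$$, $$\bx$$ and $$\sigma^2$$ below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

wN = np.array([0.4, -0.8])                    # illustrative posterior mean
VN = np.array([[0.05, 0.01], [0.01, 0.08]])   # illustrative posterior covariance
x = np.array([1.0, 0.5])
sigma2 = 0.04

S = 200_000
w = rng.multivariate_normal(wN, VN, size=S)   # w ~ N(w_N, V_N)
y = w @ x + rng.normal(0, np.sqrt(sigma2), size=S)

print(np.mean(y), x @ wN)                     # mu_y = x^T w_N
print(np.var(y), x @ VN @ x + sigma2)         # sigma_y^2 = x^T V_N x + sigma^2
print(np.cov(w[:, 0], y)[0, 1], (VN @ x)[0])  # first entry of Sigma_{w,y} = V_N x
```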

1. This card problem is Ex. 8.10a), MacKay, p142. It is not the same as the famous “Monty Hall” or “three doors” puzzle (MacKay Ex. 3.8–9). The Monty Hall problem is also worth understanding. Although the card problem is (hopefully) less controversial and more straightforward. Also fewer students have seen it before and memorized the answer.

2. Non-examinable: There are transductive learning systems that do base inferences on the locations of test inputs.

3. I’m assuming that the noise level $$\sigma^2$$ is part of our general background knowledge. I won’t consider different values or infer $$\sigma^2$$ in this note, so I haven’t bothered to explicitly condition on it in any of the probabilities.

4. $$\text{var}[f] = \E[(f-\mu_f)(f-\mu_f)] = \E[\bx^\top(\bw - \bw_N)(\bw - \bw_N)^\top\bx] = \bx^\top \text{cov}[\bw] \bx$$