$$ \newcommand{\D}{\mathcal{D}} \newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathcal{N}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\g}{\,|\,} \newcommand{\intd}{\;\mathrm{d}} \newcommand{\noo}[1]{\frac{1}{#1}} \newcommand{\nth}{^{(n)}} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} $$

Bayesian inference and prediction

The previous note introduced a probabilistic model for regression. Nothing changed when we did maximum likelihood: we still did least squares. However, Bayesian inference on this model reveals how uncertain we are about the underlying model. We no longer fit a single model, instead we constructed a distribution over all possible models. Now we’ll look at how to use that distribution. But first, a simple concrete prediction problem as a work-up exercise.

A toy prediction problem

In lectures I’ll show you 3 cards, each with two sides: Card 1 has one white and one black side, card 2 has two black sides, and card 3 has two white sides.

I shuffle the cards and turn them over randomly. I select a card and way-up uniformly at random and place it on a table.

Question1: You see a black side. What is the probability that the other side of the same card is white? \[P(x_2 \te \texttt{W} \g x_1 \te \texttt{B}) =~~ \noo{3}, ~~\noo{2}, ~~ \frac{2}{3}, ~~\text{other?}\] Hopefully the process by which a card is selected should be clear: \(P(c) = 1/3\) for \(c = 1,2,3\), and the side you see first is chosen at random: e.g., \(P(x_1\te\texttt{B}\g c\te1) = 0.5\).

Many people get this puzzle wrong on first viewing (it’s easy to mess up). Although regardless of whether the answer is obvious or not, it’s a simple example on which to demonstrate some formalism.

When I first tried to use probabilities for prediction problems, I often found myself going in circles using Bayes’ rule: \[ P(x_2\te\texttt{W} \g x_1\te\texttt{B}) ~=~ \frac{{\fbox{$P(x_1\te\texttt{B}\g x_2\te\texttt{W})$}}\,P(x_2\te\texttt{W})}{P(x_1\te\texttt{B})} \] The \(\fbox{boxed}\) term is no more obvious than the answer! Bayes’ rule can’t be used blindly to obtain any conditional probability. It is used specifically to `invert’ processes that we understand. The first step to solve inference problems is to write down a model of the data.

A model of the card game states that we first picked one of the three cards uniformly at random: \[ P(c) = \begin{cases} \noo{3} & c = 1,2,3\\ 0 & \text{otherwise}.\end{cases} \] Then given the card, we picked one of the faces at random: \[ P(x_1\te\texttt{B}\g c) = \begin{cases} \noo{2} & c = 1\\ 1 & c = 2\\ 0 & c = 3\end{cases} \] Bayes rule can ‘invert’ this process to tell us \(P(c\g x_1\te\texttt{B})\). It infers a posterior distribution over explanations of how we got the data that we observed.

Inferring the card: \[\begin{aligned} P(c\g x_1\te\texttt{B}) &= \frac{P(x_1\te\texttt{B}\g c)\,P(c)}{P(x_1\te\texttt{B})}\\ &\propto\begin{cases} \noo{2}\cdot\noo{3}=\noo{6} & c = 1\\ 1\cdot\noo{3}=\noo{3} & c = 2\\ 0 & c = 3\end{cases}\\ &= \begin{cases} \noo{3} & c = 1\\ \frac{2}{3} & c = 2.\end{cases} \end{aligned} \]

This unbalanced posterior distribution over cards doesn’t match some people’s intuition. “Aren’t there two possible cards given a black side, so it’s 50–50?”. While there are two possible cards, the likelihood for one of these options is \(2\times\) larger. Analogy: I pick a dice at random from a 6-sided dice, a 10-sided dice and a 100-sided dice. I roll the dice and tell you I got an 8. You know it’s not the 6-sided dice, so there are two options, the 10-sided dice or the 100-sided dice. But the 10-sided dice is more likely.

We can now spot the answer to the card problem. The other side will be white only if it’s card 1, and we have now computed the probability representing our belief that we’re looking at card 1. For more complex problems we want a more mechanical approach.

The probability for predicting the next outcome that we want is \(P(x_2\g x_1\te\texttt{B})\). We need to introduce the card \(c\) into the maths, so that we can use our model and its posterior distribution. We use the sum rule to introduce the card choice \(c\) as a dummy variable, and then split up the joint probability with the product rule: \[\begin{aligned} P(x_2\g x_1\te\texttt{B}) &= \sum_{c\in{1,2,3}} P(x_2,c\g x_1\te\texttt{B})\\ &= \sum_{c\in{1,2,3}} P(x_2\g x_1\te\texttt{B}, c)\,P(c\g x_1\te\texttt{B}). \end{aligned} \] This equation says that we consider each of the predictions that we would make if we knew the card, and weight each prediction by the posterior probability of that card. (And adding up the options we get \(\noo{3}\).)

General strategy for solving prediction problems:

When we want to predict some quantity \(y\), we often find that we can’t immediately write down mathematical expressions for \(P(y \g \text{data})\).

So we introduce “stuff” \(z\), model parameters and/or latent variables, that helps us define the problem, by using the sum rule: \[ P(y \g \text{data}) = \sum_z P(y,z \g \text{data}). \] If the stuff is real-valued, we’ll have an integral over a PDF instead of a sum over probabilities. Either way, we split up the joint probability using the product rule: \[ P(y \g \text{data}) = \sum_z P(y\g z, \text{data})\,P(z\g\text{data}). \] If we would know how to predict \(y\) if we were told the extra stuff \(z\), we can find the predictive distribution: weight the predictions for each possible stuff, \(P(y\g z, \text{data})\), by the posterior probability of the stuff, \(P(z\g\text{data})\). We get the posterior from Bayes’ rule.

Sometimes the extra stuff summarizes everything we need to know to make a prediction: \(P(y\g z, \text{data}) = P(y\g z)\). Although not in the card game example above.

Not convinced?

Not everyone believes the answer to the card game question. Sometimes probabilities are counter-intuitive. Here is an Octave/Matlab simulator I wrote for the card game question:

cards = [1 1;
         0 0;
         1 0];
num_cards = size(cards, 1);

N = 0; % Number of times first side is black
kk = 0; % Out of those, how many times the other side is white

for trial = 1:1e6
    card = ceil(num_cards * rand());
    side = 1 + (rand < 0.5);
    other_side = (side==1) + 1;
    x1 = cards(card, side);
    x2 = cards(card, other_side);

    if x1 == 0
        N = N + 1; % just seen another black face
        kk = kk + (x2 == 1); % count if other side was white

approx_probability = kk / N

An implementation without the for loop would probably be faster, but here I wanted to convince you, so didn’t want the code to be at all cryptic.

Predictions for Bayesian linear regression

Compared to the statistics literature, the machine learning literature places a greater emphasis on prediction than inferring unknown parameters. Here we explore how to make a Bayesian prediction given a posterior over the parameters.

The prediction of an unknown output \(y\) at a test location \(\bx\) is expressed as a probability distribution. We condition on training data \(\D = \{\bx\nth, y\nth\}\). We want to refer to our model, in particular its regression weights, so we introduce them using the sum rule: \[ p(y\g \bx, \D) = \int p(y, \bw\g \bx, \D)\;\mathrm{d}\bw. \] Then we split up the joint probability using the product rule: \[ p(y\g \bx, \D) = \int p(y\g \bx, \bw)\,p(\bw\g\D) \;\mathrm{d}\bw. \] Here the first term isn’t conditioned on the data, because if we know the parameters, the data doesn’t help make the prediction. The second term isn’t conditioned on the test input location, because we assume the test location doesn’t tell us anything about the weights2.

What are the probabilities in this integral? As explained in the previous note, the posterior over the weights is Gaussian for a linear regression model with Gaussian noise. Following Murphy’s notation, I’ll write: \[ p(\bw\g\D) = \N(\bw; \bw_N, V_N), \] where \(\bw_N\) is the posterior mean after observing \(N\) datapoints, and \(V_N\) is the posterior covariance. The other term in the integrand is the predictive distribution for known parameters. That is, a Gaussian with noise variance \(\sigma_y^2\) centred around the function value for the test input:3 \[ p(y\g \bx, \bw) = \N(y;\, \bw^\top\bx, \sigma_y^2). \] So now we ‘just’ need to do the integral and we have our answer.

Only a little probability theory is needed for Bayesian inference and prediction in this course: just the sum and product rule. We follow the rules in the general strategy above. However, actually solving the sums or integrals required is usually hard.

Linear and Gaussian models are one of the rare cases where we can do the integrals in closed form. Multiplying the terms in the integrand together, and carefully tracking terms we can factor out a Gaussian in \(\bw\) that integrates to 1. A Gaussian in \(y\) remains. It’s some work, but can be done.

Anticipating that our predictive distribution is Gaussian, we can look for an easier way to identify its mean and variance. The test output is a noisy version of the underlying function: \[ y = f(\bx) + \nu = \bx^\top\bw + \nu, \quad \nu\sim \N(0, \sigma_y^2). \] The function value is a linear transformation of the weights, so our beliefs about the function value are Gaussian distributed with mean: \[ \mu_f = \E[\bx^\top\bw] = \bx^\top\bw_N \] and variance4 \[ \text{var}[f] = \bx^\top V_N \bx. \] So our belief about the function value is: \[ p(f\g \D, \bx) = \N(f;\, \bx^\top\bw_N,\, \bx^\top V_N \bx). \] Adding Gaussian noise, our belief about a new output is also Gaussian. The independent zero mean noise doesn’t change the mean, but adds \(\sigma_y^2\) to the variance: \[ p(y\g \D, \bx) = \N(y;\, \bx^\top\bw_N,\, \bx^\top V_N \bx + \sigma_y^2). \] This answer matches the answer quoted for the integral in Murphy.

Decision making

Given a test input \(\bx\), the Bayesian approach says that we don’t know what the output \(y\) will be, and so doesn’t return a single value. However, in most prediction competitions (such as a Kaggle competition, but not always5) we need to provide a point-estimate or guess of the output, \(\hat{y}\). In real-world applications, we need to make decisions. Usually each guess or decision should be chosen to minimize our expected loss, then over many guesses our average performance will be good.

For a given application we need to write down a loss function \(L(y, \hat{y})\). The loss says how bad it is to guess \(\hat{y}\) if the actual output is \(y\). Our expected loss given training data \(\D\) give us our cost function: \[ c = \E_{p(y\g\D)}[L(y,\hat{y})] = \int L(y,\hat{y})\,p(y\g\D)\intd{y}. \] For square loss, \(L(y,\hat{y})=(y-\hat{y})^2\), we can differentiate the cost with respect to our point-estimate: \[ \pdd{c}{\hat{y}} = \E_{p(y\g\D)}[2(y-\hat{y})] = 2(\E_{p(y\g\D)}[y] - \hat{y}). \] Setting the derivative to zero, the optimal guess is the posterior mean, \(\hat{y} = \E_{p(y\g\D)}[y]\). For Bayesian linear regression with a Gaussian prior and noise model, it can be shown that the posterior mean corresponds to using an L2 regularized model. If all we care about is mean squared error, the Bayesian approach doesn’t change what we do.

However, the uncertainty can still be useful. If we identify predictions that are uncertain, it could warn us to manually intervene, or gather data relevant to those cases. That is, making a single point estimate is often not the only decisions we make based on a model. Moreover, in real-world applications, sensible loss functions are often asymmetric.

Imagine that a bakery makes an estimate \(\hat{y}\) of tomorrow’s bread sales \(y\). If it were possible to make perfect predictions, so we knew that \(\hat{y}\te y\), then we would bake exactly \(\hat{y}\) loaves of bread. We’d sell all our bread, and send no customers away. Sadly we can’t make perfect predictions. My local baker sells out nearly every day. Presumably they really don’t like throwing bread away, so set \(\hat{y}\) smaller than the average possible value of \(y\). Other bakers seem to attach a different loss to waste, and make decisions that regularly result in throwing bread away6.

As another example, a business-to-business supplier of non-perishable goods will pay warehouse fees to keep excess stock, if failing to fulfill orders will lose customers in the long term. For them, the cost of underestimating sales is much higher than the cost of overestimating.

The Bayesian approach separates modelling data from the application-specific loss function. Multiple decisions, with different losses, can be made from the same model. An alternative and popular approach, “empirical risk minimization”, fits a model function to the training data to directly optimize an application-specific loss function.7

Further Reading

Murphy covers prediction for Bayesian linear regression: Section 7.6.2 and Figure 7.12.

Murphy Section 5.7–, and MacKay Chapter 36 (p451), have more detail on Bayesian decision making.

Check your understanding

MacKay’s book has more toy prediction exercises like the card game: Ex. 3.12 and 3.14, p58. You might have intuitive answers, but try to solve them mechanically by writing down an explicit model and applying rules of probability.

What happens to the uncertainty of predictions as \(\bx\) grows large. Why?

More detail for those who love linear algebra

A brute force way to get the predictive distribution for linear regression is to write the integrand as a joint distribution, as a standard multivariate Gaussian using block matrices: \[ p(\bw, y\g \bx, \D) \,=\, \N\left( % variable \left[ \begin{array}{cc} \bw \\ y \\ \end{array} \right] \!;\; % mu \left[ \begin{array}{cc} \bw_N \\ m \\ \end{array} \right] \!,\; % Sigma \left[ \begin{array}{cc} V_N & \Sigma_{\bw,y} \\ \Sigma_{y,\bw} & r^2 \\ \end{array} \right] \right). \] Written in this form, the marginal mean and variance of \(y\) are immediately available, and we’d write \(p(y\g \bx, \D) = \N(y;\,m,r^2)\). But if you find that intimidating, I agree! Examining the quadratic form involving \(y\) and \(\bw\) in the joint distribution gives us the joint inverse covariance, and we need to do quite a few lines of linear algebra to identify \(m = \bx^\top \bw_N\) and \(r^2 = \bx^\top V_N \bx + \sigma_y^2\). That seems like a lot of work. Sources like the matrix cookbook can help. But in this case there was an easier way.

  1. This card problem is Ex. 8.10a), MacKay, p142. It is not the same as the famous “Monty Hall” or “three doors” puzzle (MacKay Ex. 3.8–9). The Monty Hall problem is also worth understanding. Although the card problem is (hopefully) less controversial and more straightforward. Also fewer students have seen it before and memorized the answer.

  2. Non-examinable: There are transductive learning systems that do base inferences on the locations of test inputs.

  3. I’m assuming that the noise level \(\sigma_y^2\) is part of our general background knowledge. I won’t consider different values or infer \(\sigma_y^2\) in this note, so I haven’t bothered to explicitly condition on it in any of the probabilities.

  4. \(\text{var}[f] = \E[(f-\mu_f)(f-\mu_f)] = \E[\bx^\top(\bw - \bw_N)(\bw - \bw_N)^\top\bx] = \bx^\top \text{cov}[\bw] \bx\)

  5. https://link.springer.com/chapter/10.1007/11736790_1

  6. Or if more responsible, they find someone to take the bread who wouldn’t purchase it anyway.

  7. Warning: You will find dogmatic advocacy for each of these approaches over the other. Personally, I’ve used both approaches. As usual, the optimal decision depends upon the circumstances.