$$ \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\g}{\,|\,} \newcommand{\intd}{\;\mathrm{d}} \newcommand{\te}{\!=\!} $$

Bayesian model choice

Fully Bayesian procedures can’t suffer from “overfitting”, because parameters aren’t fitted: Bayesian statistics only involves integrating or summing over uncertain parameters, not maximizing. The predictions can depend heavily on the model, and choice of prior however.

Previously we chose models — and parameters that controlled the complexity of models — by cross-validation. The Bayesian framework offers an alternative, using marginal likelihoods.

The problems we have instead of overfitting

In the Bayesian framework, we should use any knowledge we have. So if we knew (for some reason) that some points were observations of a 9th order polynomial, we would ideally use that model regardless of how much data we had. For example, given only 5 data points, we can’t know where the underlying function is. However, predictions use a weighted average over all possible fits: our predictive distribution would be broad/uncertain, and centred around a sensible regularized interpolant.

That’s not to say we have no problems when using Bayesian methods.

If the model is too simple, we saw in a previous lecture that the posterior distribution becomes sharply peaked around the least bad fit. As a result, we can be very confident about properties of a model, even when running some checks (such as looking at residuals) would show that the model is obviously in strong disagreement with the data.

We can also have problems when we use a model that’s too complicated. As an extreme example for illustration, we could imagine fitting a function on \(x\in[0,1]\) with a million RBFs spaced evenly over that range with bandwidths \(\sim10^{-6}\). We can closely represent any reasonable function with this representation. However, given (say) 20 observations, most of the basis functions will be many bandwidths away from all of the observations. Thus, the posterior distribution over most of the coefficients will be similar to the prior (check: can you see why?). Except at locations that are nearly on top of the observed data, our predictions will be nearly the same as under the prior. We will learn very slowly with this model.

Simple cases of probabilistic model choice

We have already used a simple form of probabilistic model comparison in Bayes classifiers. Given two fixed models \(p(\bx\g y\te1)\) and \(p(\bx\g y\te0)\), we could evaluate a feature vector under each and use Bayes’ rule to express our beliefs about which model it came from, \(P(y\g\bx)\).

With Gaussian class models we have seen that there are different ways that a model can win a comparison. If a model has a tight distribution, then it will usually be the most probable model when observations are close to its mean, even if those observations could also have come from a broad distribution centred nearby. On the other hand, broad distributions become the most probable model for extreme feature vectors or outliers. Finally, the prior probabilities of the class models have some effect, but the likelihoods of the models often dominate.

My other favourite example of comparing broad and narrow models is dice with different numbers of sides. If I said that I chose a dice at random from a million-sided dice and a 10-sided dice and got a 5, you’d be pretty sure I’d rolled the 10-sided dice. That’s partly because of priors: you don’t think I could possibly own a million sided dice. But even if you thought that million-sided dice were just as common, you’d have the same view. For example, I could implement this game on a computer and show you the code:

sides = [10, 1e6];
dice = (0.5<rand()) + 1;
outcome = floor(rand()*sides(dice)) + 1

Every time you see this code output a number between 1 and 10, you’ll assume that it came from dice=1. While it’s rare to get a small outcome like 5 with a million sided dice, it’s no rarer than any other outcome, such as 63,823. But the small outcomes are more easily generated under the alternative narrow model, so it wins the comparison.

Application to regression models

Why do we usually favour a simple fit over the model with a million narrow basis functions described above? It could be because of priors: we might favour simple models. But actually the world is complicated, and my prior beliefs are that many functions have many degrees of freedom. What would happen if we gave half of our prior mass to the model with a million narrow basis functions? It would still usually lose a model comparison.

A regression model has to distribute its prior mass over all of the possible regression surfaces that it can represent, or over all of its parameters. If a model can represent many different regression surfaces, only some of these will match the data. The probability that a model \(\M\) assigns to some observations in a training set is: \[ p(\by\g X, \M) = \int p(\by, \bw\g X, \M) \intd\bw = \int p(\by, \g X, \bw, \M)\,p(\bw\g\M) \intd\bw. \] Narrowly focussed models, where the mass of the prior distribution \(p(\bw\g\M)\) is concentrated on simple curves, will assign higher density to the outputs \(\by\) observed in many natural datasets than the million narrow basis function model. The narrow basis function model can model smooth functions, but can also fit highly oscillating data — it’s a broader model that can explain outliers, but that will usually lose to simpler models for well-behaved data.

The marginal probability of the data under the model above is the model’s marginal likelihood, and can be used to score different models, instead of a cross-validation score. For Gaussian models, or models we can approximate with Gaussians, we can compute the integral.

Application to hyper-parameters

Our main challenge is often selecting real-valued parameters like the noise level, the typical size of the weights (their prior standard deviation), and the widths of some radial basis functions. These are harder to cross-validate than simple discrete choices, and if we have too many of these parameters, we can’t cross-validate them all.

Incidentally, in the million narrow RBFs example, the main problem wasn’t that there were a million RBFs, it was that they were narrow. Linear regression will make reasonable predictions with many RBFs and only a few datapoints if the bandwidth parameter is broad and we regularize. (When we cover Gaussian processes, you will see we can have an infinite number of basis functions and still make good predictions!) So we often don’t worry about picking a precise number of basis functions, or hidden units in a neural network.

The full Bayesian approach to prediction was described in the previous note. We integrate over all parameters we don’t know. In an honest model that integral would include noise levels, the standard deviation of the weights in the prior, and the widths of basis functions. However, computing these integrals can be difficult (we will return to computational methods after studying Gaussian processes).

A simpler approach is to maximize some parameters, according to their marginal likelihood, the likelihood with most of the parameters integrated out. For example, in a linear regression model with prior \[ p(\bw\g \alpha) = \N(\bw; \mathbf{0}, \alpha^{-1}I), \] where the prior standard deviation of the weights is \(\sigma_\text{prior}\te1/\sqrt{\alpha}\), and likelihood \[ p(y\g \bx, \bw, \sigma) = \N(y; \bw^\top\bx, \sigma^2), \] we can fit the hyperparameters \(\alpha\) and \(\sigma\), parameters which specify the model, to maximize their marginal likelihood: \[ p(\by\g X, \alpha, \sigma) = \int p(\by, \bw\g X, \alpha, \sigma) \intd\bw = \int p(\by \g X, \bw, \sigma)\,p(\bw\g\alpha) \intd\bw. \] No held-out validation set is required.

Fitting a small number of parameters to the marginal likelihood is less prone to overfitting than fitting everything to the joint likelihood. However, overfitting is still quite possible.

Further Reading

Murphy 7.6.4 is on Bayesian model selection, with earlier sections on mathematical detail of Bayesian treatment of the noise variance.

For keen students: Chapter 28 of MacKay’s book has a lengthier discussion of Bayesian model comparison. Some time ago, I wrote a note discussing one of the figures in that chapter.

For very keen students: It can be difficult to put sensible priors on models with many parameters. In these situations it can sometimes be better to start out with a model class that we know is too simple, and only swap to a complex model when we have a lot of data. Bayesian model comparison can fail to tell us the best time to switch to a more complex model. The paper Catching up faster by switching sooner (Erven et al., 2012) has a nice language modelling example, and my thoughts are in the discussion of the paper.

For very keen students, Gelman et al.’s Bayesian Data Analysis book is a good starting point for reading about model checking and criticism. All models are wrong, but we want to improve parts of a model that are most strongly in disagreement with the data.

Coming up, GPs

The next topic will be Gaussian processes, which are Bayesian kernel methods. Reading: Murphy Chapter 15, to section 15.2.4 inclusive. Alternatively, Barber Chapter 19 to section 19.3 inclusive. Or the dedicated Rasmussen and Williams book1 up to section 2.5.

  1. http://gaussianprocess.org/gpml/