In the previous note we approximated the logistic regression posterior with a Gaussian distribution. By comparing to the joint probability, we immediately obtained an approximation for the marginal likelihood \(P(\D)\) or \(P(\D\g\M)\), which can be used to choose between alternative model settings \(\M\).

Now we return to the question of how to make Bayesian predictions (all implicitly conditioned on a set of model choices \(\M\)): \[\begin{aligned} P(y\g \bx, \D) &= \int p(y,\bw\g \bx, \D)\intd\bw\\ &= \int P(y\g \bx,\bw)\,p(\bw\g\D)\intd\bw. \end{aligned}\] If we approximate the posterior with a Gaussian, \(p(\bw\g\D)\approx \N(\bw;\,\bm, V)\), we still have an integral with no closed-form solution: \[\begin{aligned} P(y\te1\g \bx, \D) &\approx \int \sigma(\bw^\top\bx)\,\N(\bw;\,\bm, V)\intd\bw\\ &= \E_{\N(\bw;\,\bm, V)}\left[ \sigma(\bw^\top\bx)\right].\\ \end{aligned}\] However, this expectation can be simplified. Only the inner product \(a\te\bw^\top\bx\) matters, so we can take the average over this scalar quantity instead. The activation \(a\) is a linear combination of the weights, and our beliefs about the weights are Gaussian, so our beliefs about \(a\) are also Gaussian. By now you should be able to show that \[ p(a) = \N(a;\; \bm^\top\bx,\; \bx^\top V\bx). \] Therefore, the predictions under the approximate posterior are given by a one-dimensional integral: \[\begin{aligned} P(y\te1\g \bx, \D) &\approx \E_{\N(a;\;\bm^\top\bx,\; \bx^\top V\bx)}\left[ \sigma(a)\right]\\ &= \int \sigma(a)\,\N(a;\;\bm^\top\bx,\; \bx^\top V\bx)\intd{a}.\\ \end{aligned}\] One-dimensional integrals can be computed numerically to high precision.
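
As an illustration of computing the one-dimensional integral numerically, here is a minimal sketch using Gauss–Hermite quadrature from NumPy. The function names (`sigmoid`, `predict_quadrature`), the number of quadrature points, and the example values of \(\bm\), \(V\) and \(\bx\) are assumptions made for this sketch; they are not part of the notes, and other numerical integration schemes would work equally well.

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid written via tanh, which avoids overflow for large |a|.
    return 0.5 * (1.0 + np.tanh(0.5 * a))

def predict_quadrature(x, m, V, n_points=50):
    """Approximate E[sigmoid(a)] for a ~ N(m^T x, x^T V x).

    Uses Gauss-Hermite quadrature, which approximates integrals of the
    form int exp(-t^2) f(t) dt by a weighted sum over nodes t_i.
    """
    mu = m @ x        # mean of the activation a
    var = x @ V @ x   # variance of the activation a
    t, w = np.polynomial.hermite.hermgauss(n_points)
    # Change of variables a = mu + sqrt(2 var) t turns the Gaussian
    # average into the Gauss-Hermite form, up to a factor 1/sqrt(pi).
    a = mu + np.sqrt(2.0 * var) * t
    return np.sum(w * sigmoid(a)) / np.sqrt(np.pi)

# Hypothetical approximate posterior and test input, for illustration only.
m = np.array([2.0, 0.0])
V = 4.0 * np.eye(2)
x = np.array([1.0, 0.0])
print(predict_quadrature(x, m, V))  # Bayesian prediction
print(sigmoid(m @ x))               # plug-in prediction using the mean weights
```

In this example the Bayesian prediction is closer to 0.5 than the plug-in prediction, because averaging over the uncertain activation pulls the probability towards 0.5.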

Murphy Section 8.4.4.2 reviews a further approximation (derivation non-examinable), which results in a closed-form expression: \[ P(y\te1\g \bx, \D) \approx \sigma(\kappa\, \bm^\top\bx), \qquad \kappa = \frac{1}{\sqrt{1+\frac{\pi}{8}\bx^\top V\bx}}. \] These predictions use the mean weights \(\bm\) under the Gaussian approximation. If the Gaussian comes from the Laplace approximation, \(\bm\) is the most probable or MAP setting of the weights. However, the activation is scaled down (by \(\kappa\le1\)) when it is uncertain, so that predictions will be less confident far from the data (as they should be).
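
The closed-form expression is cheap to evaluate. The sketch below implements it and, as a rough sanity check, compares it with a simple Monte Carlo estimate of the same expectation. The function names, the sample size, the random seed, and the example values of \(\bm\), \(V\) and \(\bx\) are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(a):
    # Logistic sigmoid written via tanh, which avoids overflow for large |a|.
    return 0.5 * (1.0 + np.tanh(0.5 * a))

def predict_closed_form(x, m, V):
    """Closed-form approximation sigmoid(kappa * m^T x)."""
    mu = m @ x        # mean of the activation
    var = x @ V @ x   # variance of the activation
    kappa = 1.0 / np.sqrt(1.0 + (np.pi / 8.0) * var)
    return sigmoid(kappa * mu)

def predict_monte_carlo(x, m, V, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[sigmoid(a)], a ~ N(m^T x, x^T V x)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(m @ x, np.sqrt(x @ V @ x), size=n_samples)
    return sigmoid(a).mean()

# Hypothetical approximate posterior, for illustration only.
m = np.array([2.0, 0.0])
V = 4.0 * np.eye(2)

# Scaling x up moves the input "further from the data" in this toy
# setting: the activation variance x^T V x grows with the scale.
for scale in [0.5, 1.0, 2.0, 4.0]:
    x = scale * np.array([1.0, 0.0])
    print(scale,
          sigmoid(m @ x),                # plug-in prediction with mean weights
          predict_closed_form(x, m, V),  # closed-form approximation
          predict_monte_carlo(x, m, V))  # sampling-based reference
```

When \(\bx^\top V\bx\) is zero, \(\kappa\te1\) and the closed form reduces to the plug-in prediction \(\sigma(\bm^\top\bx)\); as the variance grows, \(\kappa\) shrinks and the prediction moves towards 0.5.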