$$\newcommand{\D}{\mathcal{D}} \newcommand{\E}{\mathbb{E}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\bm}{\mathbf{m}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\g}{\,|\,} \newcommand{\intd}{\;\mathrm{d}} \newcommand{\te}{\!=\!}$$

Computing logistic regression predictions

In the previous note we approximated the logistic regression posterior with a Gaussian distribution. By comparing to the joint probability, we immediately obtained an approximation for the marginal likelihood $$P(\D)$$ or $$P(\D\g\M)$$, which can be used to choose between alternative model settings $$\M$$.

Now we return to the question of how to make Bayesian predictions (all implicitly conditioned on a set of model choices $$\M$$): \begin{aligned} P(y\g \bx, \D) &= \int p(y,\bw\g \bx, \D)\intd\bw\\ &= \int P(y\g \bx,\bw)\,p(\bw\g\D)\intd\bw. \end{aligned} If we approximate the posterior with a Gaussian, $$p(\bw\g\D)\approx \N(\bw;\,\bm, V)$$, we still have an integral with no closed-form solution: \begin{aligned} P(y\te1\g \bx, \D) &\approx \int \sigma(\bw^\top\bx)\,\N(\bw;\,\bm, V)\intd\bw\\ &= \E_{\N(\bw;\,\bm, V)}\left[ \sigma(\bw^\top\bx)\right].\\ \end{aligned} However, this expectation can be simplified. Only the inner product $$a\te\bw^\top\bx$$ matters, so we can take the average over this scalar quantity instead. The activation $$a$$ is a linear combination of the weights, about which our beliefs are Gaussian, so our beliefs about $$a$$ are also Gaussian. By now you should be able to show that $p(a) = \N(a;\; \bm^\top\bx,\; \bx^\top V\bx).$ Therefore, the predictions given the approximate posterior are given by a one-dimensional integral: \begin{aligned} P(y\te1\g \bx, \D) &\approx \E_{\N(a;\;\bm^\top\bx,\; \bx^\top V\bx)}\left[ \sigma(a)\right]\\ &= \int \sigma(a)\,\N(a;\;\bm^\top\bx,\; \bx^\top V\bx)\intd{a}.\\ \end{aligned} One-dimensional integrals can be computed numerically to high precision.
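As a minimal sketch (not part of the notes themselves, and the function name is made up for illustration), the one-dimensional integral can be evaluated on a grid with NumPy: compute the mean and variance of the activation, then sum $$\sigma(a)\,\N(a;\,\bm^\top\bx,\,\bx^\top V\bx)$$ over a grid wide enough to cover the Gaussian's mass:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_1d_quadrature(m, V, x, num_points=2001, num_std=8):
    """Approximate P(y=1 | x, D) = E[sigma(a)] under a ~ N(m'x, x'Vx),
    using a simple grid-based (rectangle rule) quadrature."""
    mu = m @ x            # mean of the activation a = w'x
    var = x @ V @ x       # variance of the activation
    std = np.sqrt(var)
    # Grid covering +/- num_std standard deviations of the Gaussian
    a = np.linspace(mu - num_std * std, mu + num_std * std, num_points)
    pdf = np.exp(-0.5 * (a - mu)**2 / var) / np.sqrt(2 * np.pi * var)
    da = a[1] - a[0]
    return np.sum(sigmoid(a) * pdf) * da
```

With a posterior mean of zero the answer is exactly 0.5 by symmetry, which gives a quick sanity check; as the posterior variance shrinks, the prediction approaches the plug-in value $$\sigma(\bm^\top\bx)$$.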

Murphy Section 8.4.4.2 reviews a further approximation (derivation non-examinable), which results in a closed-form expression: $P(y\te1\g \bx, \D) \approx \sigma(\kappa\, \bm^\top\bx), \qquad \kappa = \frac{1}{\sqrt{1+\frac{\pi}{8}\bx^\top V\bx}}.$ These predictions use the mean weights $$\bm$$ of the Gaussian approximation; under the Laplace approximation these are the most probable or MAP weights. However, the activation $$\bm^\top\bx$$ is scaled down by $$\kappa$$ when the activation is uncertain, so predictions will be less confident far from the data (as they should be).
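A quick numerical sanity check (a sketch, with arbitrary example numbers) compares this closed-form expression against a Monte Carlo estimate of the original expectation $$\E_{\N(\bw;\,\bm,V)}[\sigma(\bw^\top\bx)]$$; the two typically agree to a couple of decimal places:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def kappa_approx(m, V, x):
    """Closed-form approximation sigma(kappa * m'x) from Murphy 8.4.4.2."""
    mu = m @ x
    var = x @ V @ x
    kappa = 1.0 / np.sqrt(1.0 + (np.pi / 8.0) * var)
    return sigmoid(kappa * mu)

# Monte Carlo estimate of the same expectation (example numbers are arbitrary)
rng = np.random.default_rng(0)
m = np.array([1.0, -0.5])
V = np.array([[0.5, 0.1],
              [0.1, 0.3]])
x = np.array([2.0, 1.0])
ws = rng.multivariate_normal(m, V, size=100_000)  # samples w ~ N(m, V)
mc_estimate = sigmoid(ws @ x).mean()
```

Note that the approximate prediction is pulled towards 0.5 relative to the plug-in prediction $$\sigma(\bm^\top\bx)$$ whenever $$\bx^\top V\bx > 0$$, since $$\kappa < 1$$.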