$$ \newcommand{\D}{\mathcal{D}} \newcommand{\E}{\mathbb{E}} \newcommand{\M}{\mathcal{M}} \newcommand{\N}{\mathcal{N}} \newcommand{\bm}{\mathbf{m}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\g}{\,|\,} \newcommand{\intd}{\;\mathrm{d}} \newcommand{\te}{\!=\!} $$

Computing logistic regression predictions

In the previous note we approximated the logistic regression posterior with a Gaussian distribution. By comparing to the joint probability, we immediately obtained an approximation for the marginal likelihood \(P(\D)\) or \(P(\D\g\M)\), which can be used to choose between alternative model settings \(\M\).

Now we return to the question of how to make Bayesian predictions (all implicitly conditioned on a set of model choices \(\M\)): \[\begin{aligned} P(y\g \bx, \D) &= \int p(y,\bw\g \bx, \D)\intd\bw\\ &= \int P(y\g \bx,\bw)\,p(\bw\g\D)\intd\bw. \end{aligned}\] If we approximate the posterior with a Gaussian, \(p(\bw\g\D)\approx \N(\bw;\,\bm, V)\), we still have an integral with no closed form solution: \[\begin{aligned} P(y\te1\g \bx, \D) &\approx \int \sigma(\bw^\top\bx)\,\N(\bw;\,\bm, V)\intd\bw\\ &= \E_{\N(\bw;\,\bm, V)}\left[ \sigma(\bw^\top\bx)\right].\\ \end{aligned}\] However, this expectation can be simplified. Only the inner product \(a\te\bw^\top\bx\) matters, so we can take the average over this scalar quantity instead. The activation \(a\) is a linear combination of the weights, about which our beliefs are Gaussian, so our beliefs about \(a\) are also Gaussian. By now you should be able to show that \[ p(a) = \N(a;\; \bm^\top\bx,\; \bx^\top V\bx). \] Therefore, the predictions under the approximate posterior reduce to a one-dimensional integral: \[\begin{aligned} P(y\te1\g \bx, \D) &\approx \E_{\N(a;\;\bm^\top\bx,\; \bx^\top V\bx)}\left[ \sigma(a)\right]\\ &= \int \sigma(a)\,\N(a;\;\bm^\top\bx,\; \bx^\top V\bx)\intd{a}.\\ \end{aligned}\] One-dimensional integrals can be computed numerically to high precision.
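For concreteness, here is a minimal sketch (not part of the notes) of evaluating this one-dimensional integral numerically with SciPy's `quad` routine. The posterior mean `m`, covariance `V`, and input `x` below are made-up placeholder values.

```python
import numpy as np
from scipy.integrate import quad

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def predict_prob(x, m, V):
    """P(y=1 | x, D) ~= E_{N(a; mu, s2)}[sigmoid(a)], computed by 1D quadrature."""
    mu = m @ x                      # mean of the activation a = w^T x
    s2 = x @ V @ x                  # variance of the activation
    s = np.sqrt(s2)
    gauss_pdf = lambda a: np.exp(-0.5 * (a - mu)**2 / s2) / (s * np.sqrt(2 * np.pi))
    prob, _ = quad(lambda a: sigmoid(a) * gauss_pdf(a), mu - 10*s, mu + 10*s)
    return prob

# Example with made-up posterior parameters (assumed, for illustration only):
rng = np.random.default_rng(0)
D = 3
m = rng.standard_normal(D)
V = 0.5 * np.eye(D)
x = rng.standard_normal(D)
print(predict_prob(x, m, V))
```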

Murphy Section 8.4.4.2 reviews a further approximation (derivation non-examinable), which results in a closed form expression: \[ P(y\te1\g \bx, \D) \approx \sigma(\kappa\, \bm^\top\bx), \qquad \kappa = \frac{1}{\sqrt{1+\frac{\pi}{8}\bx^\top V\bx}}. \] These predictions use the mean weights \(\bm\) under the Gaussian approximation; if we use the Laplace approximation, these are the most probable or MAP weights. However, the activation is scaled down (by \(\kappa\)) when it is uncertain, so that predictions will be less confident far from the data (as they should be).
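As a quick sanity check (a sketch with made-up numbers, not from Murphy or these notes), one can compare this closed-form expression against a Monte Carlo estimate of the one-dimensional expectation above:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

mu, s2 = 1.5, 4.0                          # assumed activation mean m^T x and variance x^T V x
kappa = 1.0 / np.sqrt(1.0 + (np.pi / 8) * s2)
closed_form = sigmoid(kappa * mu)          # sigma(kappa * m^T x)

# Monte Carlo estimate of E[sigma(a)] with a ~ N(mu, s2):
a_samples = np.random.default_rng(1).normal(mu, np.sqrt(s2), size=100_000)
monte_carlo = sigmoid(a_samples).mean()

print(closed_form, monte_carlo)            # the two estimates should roughly agree
```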