Lecture 13, Tuesday w7, 2014-10-28
==================================

Markov chain Monte Carlo (MCMC)
-------------------------------

We finished last time by noticing that picking good step sizes is important for Metropolis methods.

MCMC methods can apply a series of different transition operators in sequence, or at random. For example, Gibbs sampling uses a sequence (or random choice) of transition operators, each of which updates only one particular variable. We can also use a random choice of step size.

Another way to make a method more robust to a fixed choice of step size is slice sampling. See one of the short readings below for a detailed explanation.

**Running MCMC methods:** It can be a good idea to do a few separate runs: 1) If chains get trapped in different local modes of the posterior, we might notice that the runs disagree. 2) If we have independent estimates from different runs, it's easy to compute a standard error on the mean of those estimates. (For more sophisticated estimates, see the R-CODA R package.)

Look at trace plots of parameters or predictions over time to spot any problems. Some practitioners discard any transient "burn-in" period they see before the Markov chain reached equilibrium. Strictly speaking this procedure isn't necessary: estimates using the whole run are still consistent.

Knowing whether our Markov chain has adequately explored the space of plausible parameters in an inference problem is usually hard. A lot of hope is involved in many applications. As a minimum, we can run our code on synthetic data, to check that it works correctly when we know the true underlying explanation of the data.

**Examples (pp1--19 of "Deterministic approximations" slides):** I showed you an example of MCMC (slice sampling) working in practice. This example is non-examinable. Take-home message: for complicated posteriors, it may be hard to summarize the distribution in any simpler way than a set of posterior samples.
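To make the slice-sampling idea concrete, here is a minimal sketch of one univariate update using the standard "stepping-out" and "shrinkage" procedures. The function name and interface are my own for illustration; the readings below give the full details and correctness argument.

```python
import random


def slice_sample_1d(logp, x0, w=1.0, rng=random):
    """One univariate slice-sampling update of x0, given an (unnormalized)
    log-density logp. w is only an initial guess at a sensible scale:
    the interval adapts, which is why slice sampling is more robust to a
    fixed step-size choice than plain Metropolis."""
    # Draw the slice height under the density at the current point:
    # log u = log p(x0) - e, where e ~ Exponential(1).
    log_y = logp(x0) - rng.expovariate(1.0)
    # Stepping out: randomly position an interval of width w around x0,
    # then grow it until both ends are outside the slice.
    left = x0 - rng.uniform(0.0, w)
    right = left + w
    while logp(left) > log_y:
        left -= w
    while logp(right) > log_y:
        right += w
    # Shrinkage: propose uniformly within the interval; on rejection,
    # shrink the interval towards the current point and try again.
    while True:
        x1 = rng.uniform(left, right)
        if logp(x1) > log_y:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1
```

Repeatedly calling this update gives a Markov chain whose stationary distribution matches `logp`, even when `w` is badly chosen (a poor `w` only costs extra density evaluations, not correctness).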
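The standard-error trick for multiple runs is just the usual standard error of the mean, applied to the per-run estimates. A small sketch (the helper name is my own; for anything more sophisticated, see the R-CODA package mentioned above):

```python
import statistics


def standard_error_of_runs(run_means):
    """Combine per-run estimates into an overall mean and a standard
    error, treating each run's estimate as an independent draw
    (reasonable if the runs were started independently)."""
    n = len(run_means)
    overall = sum(run_means) / n
    # Sample standard deviation across runs, divided by sqrt(n).
    return overall, statistics.stdev(run_means) / n ** 0.5
```

For example, five runs estimating some posterior expectation as `[1.0, 1.2, 0.8, 1.1, 0.9]` give an overall estimate of 1.0 with a standard error of about 0.07.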
For simple posteriors, a maximum-likelihood or MAP estimate of the parameters may be pretty good, and an optimizer would return it much more quickly than MCMC. We'd still like an error bar on any estimates we make, though: we should represent at least the width of our posterior. Many posteriors are approximately Gaussian (but not all! I showed a couple of plots of exceptions). An optimizer can find the mode of that approximate Gaussian. We then also want the width of the Gaussian, which we can estimate much more cheaply than with MCMC (next lecture!).

Recommended reading
-------------------

For slice sampling, read one of:
- My PhD thesis, pp39--40, attaches more words to the same diagrams as seen in lectures, but has little other detail:
  http://homepages.inf.ed.ac.uk/imurray2/pub/07thesis/
- MacKay's textbook: pp365--377 (down to "Computer-friendly slice sampling").
- Murphy's textbook: Section 24.5.2, pp864--866.

Optional extras
---------------

Previous years have covered "Hamiltonian Monte Carlo" (HMC); see e.g., MacKay p387. This method uses gradients of the log posterior to construct a dynamical system that makes larger updates than are typically possible with the methods discussed in lectures. I thought it was a bit much to include, so it's non-examinable. If you wanted to run MCMC on the parameter posterior of a neural network of moderate size, though, this method is what you'd need. The method isn't hard to implement, but making some of the choices required to run it can need some experience. Software that makes HMC work well on Bayesian neural networks is:
http://www.cs.toronto.edu/~radford/fbm.software.html
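To give a flavour of the cheap width estimate trailed for next lecture: if the posterior is roughly Gaussian near its mode, its width follows from the curvature of the log posterior there, sigma ≈ 1/sqrt(-d²(log p)/dθ² at the mode). A one-dimensional sketch (the function name and the finite-difference step are my own illustrative choices):

```python
def gaussian_width_at_mode(logp, mode, h=1e-4):
    """Width (standard deviation) of a Gaussian fitted at a mode of
    logp, from the curvature of the log posterior estimated with a
    central finite difference. Assumes `mode` really is a mode."""
    d2 = (logp(mode + h) - 2.0 * logp(mode) + logp(mode - h)) / h ** 2
    return (-d2) ** -0.5
```

One optimization plus one curvature evaluation replaces a whole MCMC run, at the cost of trusting the Gaussian approximation.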