Extra questions for MLPR
========================

Lecturers are often asked to provide more questions. At this level you are supposed to be moving towards doing independent research: you need to be able to come up with questions, *and answers*, yourself.

A strategy for 4th/5th-year level courses: first, go over all the material. For each part, imagine trying to explain it to someone else. Can you say how it relates to other parts of the course? Where there are explanatory diagrams, you should be able to label, explain, and reproduce them. In general, try to create a small example, application, or extreme case, and play with it. If you have trouble, find the material in another reference, or ask a friend. If/when you understand the material, imagine what questions could be asked about it, and ask yourself how you might check the answers to those questions yourself. (Your dissertations should demonstrate these skills!)

That said, some extra questions that may be useful are given below. We will *not* provide detailed worked answers to these questions. Talk them over with your classmates and, if necessary, ask specific questions on NB. Iain can give feedback on any written work that you send him before January (after which he is on leave).

Other textbook questions may of course be interesting. There are also questions in the lecture materials. Surprisingly few answers to those questions, or questions about them, get posted to NB; you can get our feedback there.

Links to books
--------------

* [Murphy](https://www.cs.ubc.ca/~murphyk/MLbook/): free to view online via the University library web site. Search there, or *maybe* [this link](http://search.ebscohost.com.ezproxy.is.ed.ac.uk/login.aspx?direct=true&db=nlebk&AN=480968&site=ehost-live) will work.
* MacKay ([free pdf online](http://www.inference.phy.cam.ac.uk/mackay/itila/book.html)).
* Barber ([free pdf online](http://web4.cs.ucl.ac.uk/staff/D.Barber/pmwiki/pmwiki.php?n=Brml.HomePage?from=Main.Textbook); page references won't match the hard copy).
* [Bishop](http://research.microsoft.com/en-us/um/people/cmbishop/prml/): not freely available in electronic form. Contains many informative but mathy exercises.

Monte Carlo
-----------

Murphy Ex 23.2, p835. Further questions I'd ask myself include:

1. Would other proposal distributions work (e.g., a Gaussian)?
2. Why is this exercise possible, and what is the barrier to constructing rejection samplers in other cases?
3. If I were just interested in some expectation under the Gamma distribution, such as $E[x^3]$, what else might I do, and would it be better or worse?

For simplicity, assume we have a situation where we can normalize both our target and proposal distributions, and we know $c = \max_x P(x)/Q(x)$. Describe how rejection sampling works, and derive the probability of accepting a proposal. If proposals from $Q$ don't work well with importance sampling, will it necessarily be bad for rejection sampling, and why?

MacKay Ex 29.2, p363: a question on importance sampling (with an answer).

MacKay Ex 29.3, p367: on the speed of progress of Metropolis as a function of step size. Which is worse, a step size that's too big, or one that's too small?

(MacKay Ex 29.8 and 29.9, p374 are good for people wondering why I introduced the $R$ reverse transition operator rather than just talking about detailed balance, although this is getting into technical minutiae.)

MacKay Ex 29.10, p377 on slice sampling.

Murphy Ex 24.1, p873 on Gibbs sampling. If you're prepared to do some work to understand a model, you could further check your understanding on Ex 24.2 and Ex 24.3, or MacKay Ex 29.17, p383. If keen, there's a [Summer School lab](http://homepages.inf.ed.ac.uk/imurray2/teaching/09mlss/) where you can compare MH and slice sampling on a more complicated example.

Some small code sketches relating to these questions follow below.
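The sketches below are my own illustrations, not taken from the course notes or textbooks; the targets, proposals, and all constants are made-up choices. First, rejection sampling in the simple setting above, where both distributions are normalized and $c$ is known: a Beta(2, 2) target with a uniform proposal. The overall acceptance rate should come out at $1/c$, which is one way to check the derivation asked for above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: Beta(2,2), p(x) = 6x(1-x) on [0,1].  Proposal: Uniform(0,1), q(x) = 1.
# c = max_x p(x)/q(x) = p(0.5) = 1.5
def p(x):
    return 6 * x * (1 - x)

c = 1.5
n = 100_000
x = rng.uniform(0, 1, n)       # proposals drawn from q
u = rng.uniform(0, 1, n)
accept = u < p(x) / c          # accept with probability p(x) / (c q(x))
samples = x[accept]

print("acceptance rate:", accept.mean(), "(theory: 1/c =", 1 / c, ")")
print("sample mean:", samples.mean(), "(true mean of Beta(2,2): 0.5)")
```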
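Next, the "what else might I do" option from question 3: importance sampling for $E[x^3]$ under a Gamma distribution. The exponential proposal here is an arbitrary choice; part of the exercise is to think about what would happen with a lighter-tailed proposal.

```python
import numpy as np

rng = np.random.default_rng(1)

# Target: Gamma(shape=3, scale=1), p(x) = x^2 exp(-x) / 2.
# Proposal: Exponential with mean 5, q(x) = 0.2 exp(-0.2 x).
n = 100_000
x = rng.exponential(scale=5.0, size=n)
log_p = 2 * np.log(x) - x - np.log(2.0)
log_q = np.log(0.2) - 0.2 * x
w = np.exp(log_p - log_q)              # importance weights p(x)/q(x)

print("E[x^3] estimate:", np.mean(w * x**3), "(exact: Gamma(6)/Gamma(3) = 60)")
print("effective sample size:", w.sum()**2 / (w**2).sum(), "out of", n)
```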
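Finally, the step-size question from MacKay Ex 29.3: random-walk Metropolis targeting a standard Gaussian, with step sizes that are too small, moderate, and too big. Acceptance rate alone doesn't settle the question, so a lag-1 autocorrelation is printed as a crude measure of mixing.

```python
import numpy as np

rng = np.random.default_rng(2)

def metropolis(step_size, n_steps=50_000):
    """Random-walk Metropolis with a N(0,1) target."""
    x = 0.0
    chain = np.empty(n_steps)
    n_accept = 0
    for t in range(n_steps):
        prop = x + step_size * rng.standard_normal()
        # log acceptance ratio for the standard Gaussian target
        if np.log(rng.uniform()) < 0.5 * (x**2 - prop**2):
            x = prop
            n_accept += 1
        chain[t] = x
    return chain, n_accept / n_steps

for step in [0.1, 2.4, 50.0]:
    chain, rate = metropolis(step)
    ac = np.corrcoef(chain[:-1], chain[1:])[0, 1]  # crude mixing diagnostic
    print(f"step={step:5.1f}  acceptance={rate:.2f}  lag-1 autocorrelation={ac:.3f}")
```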
Gaussian approximations
-----------------------

Most of the textbook questions are far too mathy or involved for this course. Have you done the extra Laplace approximation question from tutorial 6?

Why does the second derivative of the 'energy' with respect to each variable have to be positive to apply the Laplace approximation? Why does the Hessian matrix have to be positive definite (or at least positive semi-definite)? Might these constraints be violated for some target posterior distribution? If so, how? (A small code sketch of the one-dimensional Laplace recipe appears at the end of this page.)

The Monte Carlo methods rejection sampling, importance sampling, and Metropolis--Hastings all contained distributions called $Q$. Which, if any, of the methods we've seen for fitting Gaussian approximations (Laplace, and KL-divergence each way around) might be useful for each of these algorithms? For each of the combinations, why or why not? (Murphy Ex 21.2, p764 has a multivariate Laplace approximation if you want to do more maths.)

Imagine we want to choose between two models. Example 1: two logistic regression models, where one has more features than the other. Example 2: logistic regression models with different priors on the parameters, $p(w) = N(w;0,1)$ and $p(w) = N(w;0,100)$. Explain why we can't compare training-set performance for the most probable (MAP) weights to pick between models in these cases. How could we use the Laplace approximation to modify this idea? (Answer: Eq 28.10, p350 of MacKay. A sketch of this calculation also appears at the end of the page.) What other ideas have you seen in this course that you could use to pick between the models?

Gaussian processes (GPs)
------------------------

(I'm not the only one who thinks coming up with your own questions is useful: compare MacKay Ex 45.1, p534! I'm afraid neither MacKay nor Murphy has any useful GP exercises.) Barber Chapter 19 and Bishop (p320 onwards) have a lot of quite mathematical exercises related to kernels and GPs. The MLPR exam is likely to ask questions based on a more intuitive understanding of the model and how inference works, rather than requiring some complex manipulation under exam conditions. For questions more aligned with what is expected for MLPR, see the lecture log and tutorial self-study sheet 8.

PCA
---

See tutorial self-study sheet 8, which uses PCA as a vehicle to revisit topics from earlier in the course.

Imagine a dataset containing the heights of children in nanometres (billionths of a metre) and their masses in kg. If we were to reduce this data to one dimension with PCA, what would happen and why? What could we do to fix the problem? Fred points out that if we could scale up a child uniformly in all directions to be twice as tall, they'd have eight times the volume and mass. Would it be sensible to apply a non-linear transformation to the features before applying PCA? (A small sketch of the scaling issue also appears below.)
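The sketches below, like those in the Monte Carlo section, are my own made-up illustrations. First, the Laplace approximation recipe in one dimension: find the mode of the energy, measure the curvature there, and fit a Gaussian. The Gamma target is chosen so the answers can be checked analytically, and it also shows the approximation isn't exact (the true variance is 5, not 4).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Unnormalized target: Gamma(shape=5, scale=1), p*(x) = x^4 exp(-x).
def energy(x):                     # E(x) = -log p*(x)
    return x - 4 * np.log(x)

# 1) Find the mode by minimizing the energy.
res = minimize_scalar(energy, bounds=(1e-6, 50), method="bounded")
mode = res.x

# 2) Curvature at the mode, by central finite differences.
h = 1e-4
curvature = (energy(mode + h) - 2 * energy(mode) + energy(mode - h)) / h**2

# 3) Laplace approximation: N(mode, 1/curvature).
print("mode:", mode, "(exact: 4)")
print("Laplace variance:", 1 / curvature, "(exact: 4; true Gamma variance: 5)")
```

If the curvature came out negative or zero, step 3 would not give a valid Gaussian, which is the point of the positive-second-derivative question above.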
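Next, the model-comparison idea for Example 2 above, using the one-dimensional version of MacKay's Eq 28.10: $\ln P(D) \approx \ln[P(D\,|\,w_{\mathrm{MAP}})\,p(w_{\mathrm{MAP}})] + \tfrac{1}{2}\ln 2\pi - \tfrac{1}{2}\ln A$, where $A$ is the curvature of the energy at the MAP weight. The synthetic data and the restriction to a single weight are my simplifications, just to keep the sketch short.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(4)
x = rng.normal(size=40)                       # synthetic 1D inputs
y = (rng.uniform(size=40) < 1 / (1 + np.exp(-1.5 * x))).astype(float)

def neg_log_joint(w, prior_var):
    """Energy: -[log likelihood + log prior] for one-weight logistic regression."""
    log_lik = np.sum(y * (w * x) - np.log1p(np.exp(w * x)))
    log_prior = -0.5 * w**2 / prior_var - 0.5 * np.log(2 * np.pi * prior_var)
    return -(log_lik + log_prior)

for prior_var in [1.0, 100.0]:
    res = minimize_scalar(lambda w: neg_log_joint(w, prior_var),
                          bounds=(-10, 10), method="bounded")
    h = 1e-4                                   # curvature A at the MAP weight
    A = (neg_log_joint(res.x + h, prior_var) - 2 * res.fun
         + neg_log_joint(res.x - h, prior_var)) / h**2
    log_evidence = -res.fun + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(A)
    print(f"prior var {prior_var:5.1f}: w_MAP = {res.x:.2f}, "
          f"approx log evidence = {log_evidence:.2f}")
```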
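Finally, the PCA scaling question: fake children's data with heights in nanometres sitting next to masses in kg. The made-up, roughly cubic mass-height relationship is a nod to Fred's point. With the raw units, the first principal component is essentially the height axis; standardizing each feature lets both contribute.

```python
import numpy as np

rng = np.random.default_rng(3)

# Fake data: heights and roughly cubic masses for 200 children.
height_m = rng.uniform(1.0, 1.6, 200)
mass_kg = 15 * height_m**3 + rng.normal(0, 2, 200)
X = np.column_stack([height_m * 1e9, mass_kg])   # heights in nanometres!

def top_pc(X):
    """First principal component direction and variance fractions."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0], s**2 / (s**2).sum()

pc, frac = top_pc(X)
print("raw:          PC1 =", np.round(pc, 6), " var fractions =", np.round(frac, 6))

X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize each feature
pc, frac = top_pc(X_std)
print("standardized: PC1 =", np.round(pc, 3), " var fractions =", np.round(frac, 3))
```

Taking logs of both features before PCA is one way to act on Fred's cubic-scaling observation, since log mass is then roughly linear in log height.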