MLPR → Tutorials → tut2 PDF of this page
Previous: MLPR Tutorial Sheet 1 Next: MLPR Tutorial Sheet 3

MLPR Tutorial¹ Sheet 2

Reminders: Attempt the tutorial questions before your tutorial. Many of the best-performing students discuss the class and tutorial work with their peers during the week. You can seek clarifications and hints on the class forum. Full answers will be released after the tutorial.

This tutorial is largely just maths. While I ask for a small numerical experiment in the middle, there’s no real data, and no machine learning. However, throughout the course we will derive models and algorithms that use multivariate Gaussian distributions. And other machine learning methods share some of the same maths. I’ve put this material on the tutorial, because it’s useful stuff, and you need to work through it at your own pace. You’ll get an assignment soon that involves some data!

Warm-up exercise:

If \(\ba\) and \(\bb\) are \(D\ttimes1\) column vectors and \(M\) is a \(D\ttimes D\) symmetric matrix, show that \[\ba^\top M \bb = \bb^\top M \ba.\] You wouldn’t need to show this result in an exam. In some working (like for the next question) you could just state that it’s true for symmetric \(M\).
Identifying a Gaussian:

As part of a derivation, we may need to identify the probability density function of a vector up to a constant. For example: \[ p(\bx) \propto \exp\left( -\bx^\top A \bx - \bx^\top\bc\right), \] where \(A\) is a symmetric invertible matrix. As this distribution is proportional to the exponential of a quadratic in \(\bx\), it is a Gaussian: \(p(\bx) = \N(\bx; \bmu, \Sigma)\).

Identify which Gaussian \(\bx\) comes from by identifying the mean \(\bmu\) and covariance \(\Sigma\) in terms of \(A\) and \(\bc\). The easiest method is to compare \(p(\bx)\) to the standard form for the multivariate Gaussian PDF (given in class).

The answer you should be able to show is: \[ \Sigma = \frac{1}{2}A^{-1}, ~~ \bmu = -\frac{1}{2}A^{-1}\bc. \]
Creating a 2D multivariate Gaussian, and a simple experiment:

The first element of a vector has \(p(x_1) = \N(x_1; m, \sigma^2)\).

A second element is generated independently according to the following process: \[ x_2 = \alpha x_1 + \nu, \quad \nu\sim \N(0, n^2). \]

Recall that a linear combination of Gaussian values is Gaussian distributed.
1. The joint distribution of the vector \(\bx\te[x_1~x_2]^\top\) is Gaussian, and so takes the form \(p(\bx)=\N(\bx;\bmu,\Sigma)\). Identify \(\bmu\) and \(\Sigma\).
2. Turning to a computer: pick a setting for each of the parameters \(m\), \(\sigma^2\), \(\alpha\), and \(n\), and simulate samples from the above process. Estimate the mean and covariance from the samples. Do your estimates and the theoretical values agree?
  
  Putting a standard error on your estimates of the means should be straightforward. You may have to use some creativity to put error bars on your estimates of the covariances.
Sampling Gaussians, and matrix decompositions:

[This question has quite a lot of detail on computation and linear algebra, which you can skim over on first reading. You don’t actually need to understand the decompositions in detail, or know how to call them in Matlab, to be able to answer some of the questions.]

In lectures we saw that we can sample from a multivariate Gaussian \(\bx\sim\N(\mathbf{0},\Sigma)\) by drawing a vector of standard normals, \(\bnu\sim\N(\mathbf{0},\I)\), and setting \(\bx\te A\bnu\), for a matrix \(A\) where \(AA^\top \te \Sigma\).

The lower-triangular Cholesky decomposition will decompose a positive-definite covariance into \(\Sigma = LL^\top\).
Matlab/Octave: L = chol(Sigma, 'lower');
Numpy: L = np.linalg.cholesky(Sigma)
This decomposition can be used to draw samples, with \(A\te L\) above.

A triangular decomposition makes computing most things we might want to know about a covariance quick and easy. Cholesky decompositions are widely used. We can quickly find the determinant: \(|L| = \prod_d L_{dd}\) where \(|\Sigma|=|L|^2\), or more frequently \(\log |L| = \sum_d \log L_{dd}\). We can also solve linear systems²: \(L^{-1}\bb\) takes similar time to a matrix-vector multiply \(L\bb\). In Matlab/Octave replace inv(L)*b with L\b and inv(Sigma)*b with L'\(L\b). In Python use scipy.linalg.solve_triangular, and scipy.linalg.cho_solve.
1. Sometimes instead of decomposing the covariance matrix, we have the Cholesky decomposition of the precision matrix, \(\Sigma^{-1} = CC^\top\), where \(C\) is lower-triangular. How would we use \(C\) to sample from \(\N(\mathbf{0},\Sigma)\)?
2. Real symmetric matrices, like covariance matrices, also have a decomposition of the following form³: \[ \Sigma = Q\Lambda Q^\top, \] where \(\Lambda\) is a diagonal matrix of eigenvalues, and the columns of \(Q\) are the eigenvectors of \(\Sigma\).
  1. Describe how to sample from \(\N(\mathbf{0},\Sigma)\) using this decomposition.
  2. \(Q\) is an orthogonal matrix, corresponding to a rigid rotation (and possibly a reflection). Describe geometrically (perhaps in 2D) how your sampling process transforms a cloud of points drawn from a standard normal.
3. Yet another possible decomposition is the principal square root⁴: \(\Sigma = \Sigma^{1/2}\Sigma^{1/2}\), where \(\Sigma^{1/2}\) is symmetric. None of the decompositions discussed so far are the same. In this part we’ll try to understand how they’re related.
  1. Consider two different decompositions \(\Sigma = AA^\top = BB^\top\). We’ll assume the matrices are full rank so that we can write \(B = AU\). Show that \(UU^\top \te \I\), the identity matrix, which means that \(U\) is an orthogonal matrix.
  2. Explain geometrically why if computing \(A\bnu\) from \(\bnu\sim N(\mathbf{0},\I)\) is a way to sample from \(N(\mathbf{0},\Sigma)\), computing \(B\bnu \te AU\bnu\) will be as well.

Parts of this tutorial sheet are based on previous versions by Amos Storkey, Charles Sutton, and Chris Williams↩
We do not usually evaluate an expression \(A^{-1}\bc\) by inverting \(A\) and then multiplying \(\bc\) by \(A^{-1}\). There are faster and more numerically stable way to solve \(A^{-1}\bc\). The method you should use depends on the properties of \(A\). In common situations, Matlab’s A\c does something sensible and should be preferred to inv(A)*c. But if you’ve cached a decomposition of A, you should probably make use of it.↩
https://en.wikipedia.org/wiki/Eigendecomposition_of_a_matrix#Real_symmetric_matrices ↩
Non-examinable: \(\Sigma^{1/2} \te Q\Lambda^{1/2}Q^\top\), using \(Q\) and \(\Lambda\) from the eigendecomposition, and \(\Lambda^{1/2}\) simply replaces each eigenvalue on the diagonal with its square root. However, it would be better to compute it with sqrtm, and you are unlikely to use it at all. I have only once found it useful.↩

MLPR → Tutorials → tut2 PDF of this page
Previous: MLPR Tutorial Sheet 1 Next: MLPR Tutorial Sheet 3

Notes by Iain Murray

MLPR Tutorial1 Sheet 2

MLPR Tutorial¹ Sheet 2