MLPR w2e - Machine Learning and Pattern Recognition

$$ \newcommand{\E}{\mathbb{E}} \newcommand{\I}{\mathbb{I}} \newcommand{\N}{\mathcal{N}} \newcommand{\bm}{\mathbf{m}} \newcommand{\bmu}{{\boldsymbol{\mu}}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} $$

Multivariate Gaussians

We’re going to use Gaussian distributions as parts of models of data, and to represent beliefs about models. Most models and algorithms in machine learning involve more than one scalar variable however. (A scalar meaning a single number, rather a vector of values.) Multivariate Gaussians generalize the univariate Gaussian distribution to multiple variables, which can be dependent.

[This note assumes that you know the background material on expectations of random variables.]

Independent Standard Normals

We could sample a vector $\bx$ by independently sampling each element from a standard normal distribution, $x_d\sim\N(0,1)$. Because the variables are independent, the joint probability is the product of the individual or marginal probabilities: \[ p(\bx) = \prod_{d=1}^D p(x_d) = \prod_{d=1}^D \N(x; 0,1). \] Usually I recommend that you write any Gaussian PDFs in your maths using the $\N(x;\mu,\sigma^2)$ notation unless you have to expand them. It will be less writing, and clearer. Here, I want to combine the PDFs, so will substitute in the standard equation: \[\begin{aligned} p(\bx) &= \prod_{d=1}^D \frac{1}{\sqrt{2\pi}}\, e^{-x_d^2/2} = \frac{1}{(2\pi)^{D/2}}\, e^{-\frac{1}{2}\sum_d x_d^2}\\ &= \frac{1}{(2\pi)^{D/2}}\, e^{-\frac{1}{2}\bx^\top\bx}.\\ \end{aligned}\] The PDF is proportional to the Radial Basis Functions (RBFs) we’ve used previously. Here the normalizer $1/(2\pi)^{D/2}$ means that the PDF integrates to one.

Like an RBF centred at the origin, this density function only depends on the (square) distance or radius of $\bx$ from the origin. Any point in a spherical shell (or a circular shell in 2-dimensions) is equally probable. Therefore if we simulate points in 2-dimensions and draw a scatter plot:

    % Matlab/Octave
    N = 1e4; D = 2;
    X = randn(N, D);
    plot(X(:,1), X(:,2), '.');
    axis('square');

    # Python
    N = int(1e4); D = 2
    X = np.random.randn(N, D)
    plt.plot(X[:,0], X[:,1], '.')
    plt.axis('square')

we will see a diffuse circular spray of points. The spherical symmetry is a special property of Gaussians. If you were to draw independent samples from, say, a Laplace distribution you would see a non-circular distribution, which has more density close to the axes.

Covariance

The multivariate generalization of variance, is covariance, which is represented with a matrix. While a variance is often denoted $\sigma^2$, a covariance matrix is often denoted $\Sigma$ — not to be confused with a summation $\sum_{d=1}^D \ldots$

The elements of the covariance matrix for a random vector $\bx$ are: \[ \mathrm{cov}[\bx]_{ij} = \E[x_ix_j] - \E[x_i]\E[x_j]. \] On the diagonal, where $i\te j$, you will see that this definition gives the (marginal) variance $\mathrm{var}[x_i]$ of the elements of the vector. We can write the whole matrix with a linear algebra expression: \[ \mathrm{cov}[\bx] = \E[\bx\bx^\top] - \E[\bx]\E[\bx]^\top. \]

Question: What is the covariance $S$ of the spherical distribution of the previous section? (I will reserve $\Sigma$ for the covariance of the general Gaussian in the next section.)

Answer: The first term is $\E[x_ix_j] = \E[x_i]\E[x_j]$ if $x_i$ and $x_j$ are independent, which they are if $i\!\ne\!j$. Thus $S_{i\ne j} = 0$. The diagonal elements $S_ii$ are equal to the variances of the individual variables, which are all equal to one. Therefore, $S_{ij} = \delta_{ij}$, where $\delta_{ij}$ is a Kroneker delta. Or as a matrix, $S = \I$, the identity matrix.

Transforming and Rotating: general Gaussians

As with one-dimensional Gaussians we can generalize the standard zero-mean, unit-variance Gaussian by a linear transformation and a shift.

If we generated $\bx$ from independent $\N(0,1)$ draws as above, we could form a linear combination of these outcomes: \[ \by = A\bx. \] To keep the discussion simpler, I will assume that $A$ is square and invertible, so $\by$ has the same dimensionality as $\bx$.

Answer: Simply substitute $\by$ into the definition: \[\begin{aligned} \mathrm{cov}[\by] &= \E[\by\by^\top] - \E[\by]\E[\by]^\top\\ &= \E[A\bx\bx^\top A^\top] - \E[A\bx]\E[A\bx]^\top\\ &= A\E[\bx\bx^\top]A^\top - A\E[\bx](A\E[\bx])^\top. \end{aligned}\] Because $\E[\bx]$ is zero, the second term is zero, and the expectation in the first term is equal to $\mathrm{cov}[\bx] = \I$. Therefore, \[ \mathrm{cov}[\by] = \Sigma = AA^\top. \]

As we’re assuming $A$ is invertible, we can compute the original vector from the transformed one: $\bx = A^{-1}\by$. Substituting that expression into the PDF for $\bx$ we can see the shape of the new PDF: \[\begin{aligned} p(\by) &\propto e^{-\frac{1}{2} (A^{-1}\by)^\top(A^{-1}\by)}\\ &\propto e^{-\frac{1}{2} \by^\top A^{-\top}A^{-1}\by}. \end{aligned}\] However, if we have stretched out the PDF we must scale it down to maintain probability mass. If we apply a linear transformation $A$ to a volume of points, then the volume is multiplied by $|A|$, the determinant of the matrix.¹ Therefore, \[ p(\by) = \frac{1}{|A|(2\pi)^{D/2}}\, e^{-\frac{1}{2} \by^\top A^{-\top}A^{-1}\by}. \] Usually this expression is re-written in terms of the covariance of the vector. Noticing that \[\begin{aligned} \Sigma^{-1} &= A^{-\top}A^{-1}, ~\text{and}\\ |\Sigma| &= |AA^\top| = |A||A^\top| = |A|^2, \end{aligned}\] we can write: \[ p(\by) = \frac{1}{|\Sigma|^{1/2}(2\pi)^{D/2}}\, e^{-\frac{1}{2} \by^\top \Sigma^{-1}\by} = |2\pi\Sigma|^{-1/2}\, e^{-\frac{1}{2} \by^\top \Sigma^{-1}\by}. \] As demonstrated above, there are different equivalent ways to write the normalizing constant, and different books will choose different forms.

Finally, we can shift the distribution to have non-zero mean: \[ \bz = \by + \bmu. \] Shifting the PDF does not change its normalization, so we can simply substitute $\by=\bz\tm\bmu$ into the PDF for $\by$: \[ \fbox{$p(\bz) = \N(\bz; \bmu, \Sigma) = \frac{1}{|\Sigma|^{1/2}(2\pi)^{D/2}} \, e^{-\frac{1}{2} (\bz-\bmu)^\top \Sigma^{-1}(\bz-\bmu)}.$} \] Here we’ve generalized the $\N$ notation for Gaussian distributions to take a mean vector and a matrix of covariances. In one-dimension, these still correspond to the scalar mean, and the variance.

It’s a common mistake to forget the matrix inverse inside the exponential. The inverse covariance matrix $\Sigma^{-1}$, is also known as the precision matrix.

Check your understanding

A special case of a general transformation $A$, is a diagonal matrix, that simply stretches each variable independently: $\Lambda_{ij} = \delta_{ij}\sigma_i$. For three dimensions the transformation would be: \[ \Lambda = \left[ \begin{array}{ccc} \sigma_1 & 0 & 0 \\ 0 & \sigma_2 & 0 \\ 0 & 0 & \sigma_3 \\ \end{array} \right]. \] What is the covariance matrix $\Sigma$? If $\bx$ is generated with standard normals, and $\bz = \Lambda \bx$, can you write $p(\bz)$ as a product of univariate Gaussian distributions? Are the $z_i$ variables all independent?

You could use Matlab/Octave or Python to sample some points from different multivariate Gaussians, and see how the covariance affects the cloud of points.

For example, you could use a family of transformations parameterized by $a$: \[ A = \left[ \begin{array}{cc} 1 & 0 \\ a & 1\tm a \\ \end{array} \right], \] What does this transformation do? Is it clear why the variables are dependent for $a\!\ne\!0$? When are the variables maximally dependent? What happens to the PDF as $a\rightarrow1$ and why? Does the covariance matrix have an inverse when $a\te1$?

The shape of a two-dimensional Gaussian is often sketched using a contour of its PDF. The contours of the radially symmetric Gaussian are circular. So if you compute the $(x_1,x_2)$ coordinates on a circle, these can be joined up to plot a contour of the Gaussian with identity covariance. You can then transform these points, just like sample positions, with a matrix $A$, to plot a contour of $N(\bx;\mathbf{0},AA^\top)$.

A puzzle: Given a covariance matrix $\Sigma$, the transformation $A$ is not uniquely defined. For example \[ A = \left[ \begin{array}{cc} 2 & 0 \\ 0 & 2 \\ \end{array} \right], \quad\text{and}\quad A = \left[ \begin{array}{cc} \sqrt{3} & 1 \\ -1 & \sqrt{3} \\ \end{array} \right], \] both give the same product $AA^\top$. Can you explain how to generate an arbitrary number of transformations $A$ that share the same $\Sigma=AA^\top$?

Multivariate Gaussians

Independent Standard Normals

Covariance

Transforming and Rotating: general Gaussians

Check your understanding

Further reading