Variational methods used to be complicated. After writing down the standard KL-divergence objective (previous note), researchers would try to derive clever fixed-point update equations to optimize it. For some models, including simple ones like logistic regression, this strategy didn’t work out. Special-case variational objectives would be crafted for particular models. As a result, most text-book treatments of the applications of variational methods are fairly complicated, and beyond what’s required for this course.

Fortunately the stochastic variational inference (SVI) methods developed in the last few years are simpler to understand, more general, and scale to enormous datasets. This note will outline enough of the idea to explain the demonstration code on the website. The demonstration is applied to logistic regression, but in principle its log-likelihood and gradients could be replaced with any other model with real-valued parameters.

As a reminder, we wish to minimize \[
J(\bm,V) = \mathbb{E}_q[\log q(\bw)] -\mathbb{E}_q[\log p(\D\g\bw)] -\mathbb{E}_q[\log p(\bw)],
\] with respect to our variational parameters \(\{\bm, V\}\), the mean and covariance of our Gaussian approximate posterior. For any \(\{\bm, V\}\) we have a lower bound on the log-marginal likelihood, or the model evidence^{1}: \[
\log p(\D)\ge -J(\bm,V).
\] We would like to maximize the model likelihood \(p(\D)\) with respect to any hyperparameters, such as the prior variance on the weights, \(\sigma_w^2\). We can’t do that exactly, but we can instead minimize \(J\) with respect to these parameters. So, we will jointly minimize \(J\) with respect to the variational distribution and model hyperparameters, aiming for a tight bound^{2} and a large model likelihood.

Stochastic gradient descent on parameters \(V\) and \(\sigma_w^2\) will sometimes set negative variances and covariances that aren’t positive definite. Instead we should optimize unconstrained quantities, such as \(\log \sigma_w\).

To optimize a covariance matrix, we can first write it in terms of its Cholesky decomposition: \(V = LL^\top\). The diagonal elements of \(L\) are positive^{3}, the other elements are unconstrained. We take the log of the diagonal elements of the Cholesky decomposition, leaving the other elements equal to the values in the Cholesky decomposition, and optimize that unconstrained matrix.

The negative entropy term \(\mathbb{E}_q[\log q(\bw)]\) and cross-entropy term \(\mathbb{E}_q[\log p(\bw)]\) can both be computed in closed form, if the prior is Gaussian. Substituting in the definition of a general multivariate Gaussian we get: \[ \E_{\N(\bw;\, \bm, V)}\left[ \log\N(\bw;\,\mu,\,\Sigma)\right] =\E_{\N(\bw;\, \bm, V)}\left[ -\frac{1}{2}(\bw-\mu)^\top \Sigma^{-1}(\bw-\mu)\right] -\frac{1}{2} \log\left|2\pi\Sigma\right|. \] The remaining expectation is of a quadratic form, and so can be expressed in terms of \(\bm\) and \(V\). For the negative entropy term, things simplify considerably: \[ \E_{\N(\bw;\, \bm, V)}\left[ \log\N(\bw;\,\bm,V)\right] = -\frac{D}{2} - \frac{1}{2}\log|2\pi V|. \] For a spherical Gaussian prior, the cross entropy term turns out to be: \[ -\E_{\N(\bw;\, \bm, V)}\left[ \log\N(\bw;\,\mathbf{0},\,\sigma_w^2 I)\right] = \frac{1}{2\sigma_w^2} (\text{Trace}(V) + \bm^\top\bm) + \frac{D}{2}\log(2\pi\sigma_w^2). \] We can evaluate both entropy terms, and they are both differentiable. The terms involving \(V\) are simple functions of the Cholesky decomposition: \(|V| = \sum_i \log L_{ii}\), and \(\text{Trace}(V) = \sum_{ij} L_{ij}^2\), making it easy to get the gradients we need.

The final term, the average negative log-likelihood under the variational posterior, \[ -\E_{\N(\bw;\, \bm, V)}\left[\log p(\D\g\bw) \right] = -\E_{\N(\bw;\, \bm, V)}\left[\sum_{n=1}^N \log p(y\nth\g\bx\nth,\bw) \right] \] cannot be computed in closed form. We could convert the integral of each term into a 1D integral and compute it numerically. However, that is expensive.

We only need unbiased estimates of the gradients to perform stochastic gradient descent. We can get a get a simple “Monte Carlo” unbiased estimate by sampling a random weight from the variational posterior: \[ -\E_{\N(\bw;\, \bm, V)}\left[\log p(\D\g\bw) \right] \approx -\sum_{n=1}^N \log p(y\nth\g\bx\nth,\bw), \quad \bw\sim \N(\bm, V). \] We can also replace the sum by scaling up the contribution of a random example, and still get an unbiased estimate: \[ -\E_{\N(\bw;\, \bm, V)}\left[\log p(\D\g\bw) \right] \approx -N\log p(y\nth\g\bx\nth,\bw), \quad \bw\sim \N(\bm, V), n \sim \text{Uniform}\{1..N\}. \] Alternatively we could take the average log-likelihood for a random minibatch of \(M\) examples, and scale it up by \(N\). The demonstration code just uses all of the data in each update, as the number of datapoints was small.

We could obtain gradients of the log-likelihood term by differentiating the expectation symbolically, and then finding a Monte Carlo approximation of the result. However, it’s easier, and lower-variance to use a “reparameterization trick”.

The standard way to sample a random weight \(\bw\) from the variational posterior is to sample a vector of standard normals \(\bnu\sim\N(\mathbf{0}, I)\) and transform it: \(\bw = \bm + L\bnu\). Similarly, the expectation can be rewritten: \[ \E_{\N(\bw;\, \bm, V)}\left[ f(\bw) \right] = \E_{\N(\bnu;\, \mathbf{0}, I)}\left[ f(\bm + L\bnu) \right]. \] The expectation is now under a constant distribution, so it’s easy to write down derivatives with respect to the variational parameters: \[\begin{aligned} \nabla_{\bm} \E_{\N(\bw;\, \bm, V)}\left[ f(\bw) \right] &= \E_{\N(\bnu;\, \mathbf{0}, I)}\left[ \nabla_{\bm} f(\bm + L\bnu) \right]\\ &\approx \nabla_{\bm} f(\bm + L\bnu), \quad \bnu\sim\N(\mathbf{0},I), \end{aligned}\] and \[\begin{aligned} \nabla_L \E_{\N(\bw;\, \bm, V)}\left[ f(\bw) \right] &\approx \nabla_{L} f(\bm + L\bnu), \quad \bnu\sim\N(\mathbf{0},I)\\ & = [\nabla_\bw f(\bw)]\bnu^\top, \quad \bw = \bm + L\bnu,~~\bnu\sim\N(\mathbf{0},I). \end{aligned}\] To estimate both gradients, we just need to be able to evaluate gradients of the log-likelihood function with respect to the weights, which we already know how to do if we can do maximum likelihood fitting.

Unbiased estimates of gradients of “the ELBO”, the variational lower bound on the marginal likelihood, don’t require any new derivations for new models. Tricks were required to make the parameterization of the Gaussian unconstrained, and to derive the working above. However, all of that work only needs doing once. In the demonstration code these details are put into one standard function. This routine can be used as a “black box” — like a generic optimizer. We pass the routine a function that evaluates our log likelihood and its gradients, and we can perform variational inference in the new model.

If we have derivatives of a cost with respect to the prior variance \(\sigma_w^2\), how do we convert them into derivatives with respect to \(s_w = \log\sigma_w\)? (Converting derivatives for the unconstrained version of the Cholesky factor is similar, just more details to keep track of.)

When attempting to fit \(\bm\) and \(V\) by stochastic gradient descent, you might find that the updates are quite noisy, and the variational parameters don’t converge. What are two ways that you might fix this issue?

Do you recall how to use \(\bm\) and \(V\) to make probabilistic predictions for logistic regression? If you just wanted to make hard decisions, reporting whether \(P(y\te1\g\bx,\D)>0.5\), explain why you don’t need \(V\). Is there any point doing variational inference in this case?

Shakir Mohamed has recent tutorial slides on variational inference. The final slide has a reading list of both the classic and modern variational inference papers that discovered the theory in this note.

Black-box stochastic variational inference in five lines of Python, David Duvenaud, Ryan P. Adams, NIPS Workshop on Black-box Learning and Inference, 2015. The associated python mode and neural net demo require autograd.

Where this “Evidence Lower BOund” is often called the ELBO.↩

The bound won’t become tight in the sense of exact unless the true posterior is Gaussian.↩

The diagonal elements could be negative in the same sense that we could set \(\sigma_w\) negative and get a positive \(\sigma_w^2\). However, optimizing the Cholesky decomposition \(L\) directly, allowing negative diagonal values, is not a good idea: the gradients become extreme when a diagonal element approaches zero.↩