$$\notag \newcommand{\bff}{\mathbf{f}} \newcommand{\br}{\mathbf{r}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\g}{\,|\,} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} $$

Regression and Gradients

So far we have managed to adapt our models to data using only least squares linear regression and manipulating Gaussian distributions. However, as we built more interesting models, you’ve probably had several questions that we haven’t been able to answer.

To answer these questions, we’ll need to solve non-linear optimization problems. These won’t have closed-form solutions, and so we will have to approach them numerically. Many, but not all, optimization problems are solved by gradient-based methods, which we begin to look at in this note. We start by showing how linear least squares regression can be solved with gradient-based methods, so that we can then generalize the approach to other cases.

Where we’re going: In the next note we’ll fit a better regression model for classification. In later notes we’ll cover models for other types of data, and models that learn representations rather than using fixed basis functions. At first we will be fitting parameters, rather than taking the Bayesian approach of computing a posterior distribution. Towards the end of the course, we will return to the Bayesian approach, and apply it to non-Gaussian models.

1 Gradients for linear least squares regression

We can create a vector of residuals (differences between observed values and function values) for a linear regression model as follows: \[ \br = \by - X\bw. \] And previously we noticed that Matlab and NumPy know how to minimize the sum of the square residuals: \[\begin{align} \br^\top\br &= (\by - X\bw)^\top(\by - X\bw)\\ &= \by^\top\by - 2\bw^\top X^\top\by + \bw^\top X^\top X \bw. \end{align}\] We’ll now look at different ways that this can be done.
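
As a quick concrete check (a NumPy sketch using made-up toy data; the sizes and random seed are arbitrary illustrative choices), the residual vector and both forms of the cost above can be computed and compared directly:

    # NumPy sketch: residuals and sum of square residuals for toy data
    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 5, 3                        # arbitrary toy sizes
    X = rng.standard_normal((N, D))    # design matrix
    yy = rng.standard_normal(N)        # observed outputs
    ww = rng.standard_normal(D)        # some weight vector

    rr = yy - X @ ww                   # residuals r = y - Xw
    cost = rr @ rr                     # r'r, the sum of square residuals

    # Expanded form y'y - 2 w'X'y + w'X'X w gives the same number
    cost2 = yy @ yy - 2*(ww @ X.T @ yy) + ww @ X.T @ X @ ww
    assert np.isclose(cost, cost2)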

Our first task is to find the gradient of this cost function with respect to the weights. That is, the vector of partial derivatives: \(\nabla_\bw \br^\top\br\). The gradient vector is a function of the weights. At a given position in weight-space, it points in the direction in which a small movement will increase the cost the most. You were asked to show this fact in the background self-test, question 6ii), which has an answer available if you need to review this material.

We can differentiate small matrix/vector expressions by writing them as sums, and using the elementary differentiation rules for scalars. For example: \[ \pdd{\bx^\top\by}{x_i} = \pdd{\sum_j x_j y_j}{x_i} = y_i, \quad \Rightarrow \quad \nabla_\bx [\bx^\top\by] = \by. \] and \[\begin{align} \pdd{\bx^\top A \bx}{x_i} &= \pdd{\sum_{jk} x_j A_{jk} x_k}{x_i} = \sum_{k} A_{ik} x_k + \sum_{j} x_j A_{ji},\\ \Rightarrow \quad \nabla_\bx [\bx^\top A \bx] &= A\bx + A^\top \bx, \quad \text{or $2A\bx$ if $A$ is symmetric}. \end{align}\] After some experience, you might remember some of these matrix/vector rules. Other such rules can be found in references like The Matrix Cookbook. I will discuss a more systematic approach to differentiate large functions in a later note.
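
These rules are easy to check numerically. The sketch below (my own toy check in NumPy; the matrix \(A\) is deliberately not symmetric) compares each rule against central finite differences:

    # Numerical check of the two gradient rules (NumPy sketch)
    import numpy as np

    rng = np.random.default_rng(1)
    D = 4
    x = rng.standard_normal(D)
    y = rng.standard_normal(D)
    A = rng.standard_normal((D, D))    # deliberately not symmetric

    def fd_grad(f, x, eps=1e-6):
        # central finite-difference estimate of the gradient of scalar f at x
        g = np.zeros_like(x)
        for i in range(len(x)):
            xp, xm = x.copy(), x.copy()
            xp[i] += eps
            xm[i] -= eps
            g[i] = (f(xp) - f(xm)) / (2*eps)
        return g

    assert np.allclose(fd_grad(lambda x: x @ y, x), y)        # grad x'y = y
    assert np.allclose(fd_grad(lambda x: x @ A @ x, x),
                       A @ x + A.T @ x)                       # grad x'Ax = Ax + A'x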

For now, we can use the two rules we’ve derived to differentiate the cost function above: \[ \nabla_\bw [\br^\top\br] = -2 X^\top\by + 2X^\top X \bw. \]
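
It is easy to make sign or transpose errors in derivations like this, so it is worth comparing the result to a finite-difference estimate. One way to do that (a sketch with toy data of my choosing) uses scipy’s check_grad helper:

    # Finite-difference check of the least squares gradient (sketch)
    import numpy as np
    from scipy.optimize import check_grad

    rng = np.random.default_rng(2)
    N, D = 20, 4
    X = rng.standard_normal((N, D))
    yy = rng.standard_normal(N)
    ww = rng.standard_normal(D)

    def cost(w):
        rr = yy - X @ w
        return rr @ rr                         # r'r

    def grad(w):
        return -2*X.T @ yy + 2*X.T @ X @ w     # gradient derived above

    # check_grad returns the norm of the difference between grad(ww) and a
    # finite-difference estimate; it should be very small (not exactly zero)
    print(check_grad(cost, grad, ww))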

2 A closed-form solution

The partial derivatives are all zero at an optimum weight vector. If there is a unique best setting of the weights, we can solve for where that happens: \[\begin{align} -2 X^\top\by + 2X^\top X \bw &= \mathbf{0}\\ \Rightarrow\quad X^\top X \bw &= X^\top\by\\ \Rightarrow\quad \bw &= (X^\top X)^{-1}X^\top\by. \end{align}\] If (and only if) there isn’t a unique solution for the weights that gives the minimum square error, then \((X^\top X)\) is not invertible and the equation isn’t defined. The above expression is known as the normal equation solution of the least squares problem. You could implement this solution as follows:

    % Matlab/Octave
    ww = (X'*X)\(X'*yy)

    # NumPy (or use the Cholesky solvers in scipy):
    import numpy as np
    ww = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, yy))

I have attempted to avoid numerical problems by solving the linear system directly rather than inverting the matrix \(X^\top X\). However, this code is not the most accurate way to solve least squares problems: greater accuracy comes from calling a dedicated routine (through \ or lstsq, as previously demonstrated).1
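
For comparison (a sketch with toy data; scipy.linalg.lstsq behaves similarly), a dedicated routine in NumPy looks like this, and on a well-conditioned toy problem it agrees with the normal-equations answer:

    # Dedicated least squares routine (NumPy sketch)
    import numpy as np

    rng = np.random.default_rng(3)
    X = rng.standard_normal((20, 4))
    yy = rng.standard_normal(20)

    # lstsq works on X directly, without forming X'X
    ww_lstsq = np.linalg.lstsq(X, yy, rcond=None)[0]

    # Normal equations answer, for comparison on this easy toy problem
    ww_normal = np.linalg.solve(X.T @ X, X.T @ yy)
    assert np.allclose(ww_lstsq, ww_normal)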

3 Iterative methods

There are also generic algorithms that iteratively improve an initial guess of some model parameters using a cost function and its gradient vectors. While linear regression has specialist solvers, we could apply these generic algorithms anyway. If they don’t work in simple cases, they’re not likely to be useful more generally!

A naive way to use the gradient \(\nabla_\bw [\br^\top\br]\) is the steepest-descent method: \[ \bw \leftarrow \bw - \eta \nabla_\bw [\br^\top\br], \] which uses a small step size \(\eta\). The parameters are moved a small amount in the direction that gives the biggest immediate reduction in the cost. The rule is applied repeatedly, with the gradient re-evaluated before each update. There are several methods (found in Matlab’s optimization toolbox and scipy) that will converge faster. Sometimes much faster.
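
To make the update rule concrete, here is a minimal batch steepest-descent loop (a NumPy sketch: the toy data, step size and iteration count are arbitrary choices for illustration, not recommendations):

    # Batch steepest descent for least squares (NumPy sketch)
    import numpy as np

    rng = np.random.default_rng(4)
    N, D = 100, 3
    X = rng.standard_normal((N, D))
    w_true = np.array([0.5, -1.0, 2.0])            # made-up "true" weights
    yy = X @ w_true + 0.1*rng.standard_normal(N)   # noisy toy targets

    ww = np.zeros(D)      # initial guess
    eta = 1e-3            # small step size, picked by hand
    for it in range(2000):
        grad = 2*X.T @ (X @ ww - yy)   # gradient of r'r at the current ww
        ww = ww - eta*grad             # step against the gradient

    # Should be close to the normal-equations solution
    ww_exact = np.linalg.solve(X.T @ X, X.T @ yy)
    print(np.max(np.abs(ww - ww_exact)))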

It may help to rewrite the gradient to understand what the steepest descent rule is doing. \[\begin{align} \nabla_\bw [\br^\top\br] &= 2X^\top (X \bw) -2 X^\top\by\\ &= 2X^\top(\bff -\by)\\ &= 2\sum_n \bx^{(n)} (f^{(n)} - y^{(n)}). \end{align}\] For each example we look at the ‘prediction error’, \((f-y)\). The weights are pulled most in the direction of inputs that had large prediction errors, to reduce misfit in those directions.
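
A short check (another NumPy sketch with toy data) confirms that the sum over examples really is the same quantity as the matrix expression:

    # Per-example sum vs. matrix form of the gradient (NumPy sketch)
    import numpy as np

    rng = np.random.default_rng(5)
    N, D = 50, 3
    X = rng.standard_normal((N, D))
    yy = rng.standard_normal(N)
    ww = rng.standard_normal(D)

    ff = X @ ww                             # predictions f = Xw
    grad_matrix = 2*X.T @ (ff - yy)         # 2 X'(f - y)

    grad_sum = np.zeros(D)
    for n in range(N):
        grad_sum += 2*X[n]*(ff[n] - yy[n])  # 2 x^(n) (f^(n) - y^(n))

    assert np.allclose(grad_matrix, grad_sum)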

In Stochastic Gradient Descent (SGD) we take just one example at a time (perhaps picked at random, or visiting the examples in order). Each example gives a crude one-sample Monte Carlo approximation of the average gradient, the sum over examples divided by \(N\): \[ \frac{1}{N}\nabla_\bw [\br^\top\br] \approx 2 \bx^{(n)} (f^{(n)} - y^{(n)}). \] We take a small step in minus this direction, then pick another example and repeat. Each time we see an example, we move the weights in a direction proportional to just one of the input vectors.

If we have 100,000 datapoints, we perform 100,000 updates in one sweep through the dataset (a ‘training epoch’). In the traditional (batch) steepest gradient descent method, we only perform one update after looking at the whole dataset once. In the limit of an infinite stream of data, SGD can fit a model as the data comes in. A traditional batch method never gets started, as it can never compute the gradient.
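
A minimal SGD loop might look as follows (a NumPy sketch: the toy data, fixed step size and number of epochs are illustrative choices; in practice the step size is usually decayed over time):

    # Stochastic gradient descent for least squares (NumPy sketch)
    import numpy as np

    rng = np.random.default_rng(6)
    N, D = 1000, 3
    X = rng.standard_normal((N, D))
    w_true = np.array([0.5, -1.0, 2.0])            # made-up "true" weights
    yy = X @ w_true + 0.1*rng.standard_normal(N)   # noisy toy targets

    ww = np.zeros(D)
    eta = 0.01                        # step size picked by hand
    for epoch in range(10):           # one epoch = one sweep through the data
        for n in rng.permutation(N):  # visit the examples in a random order
            f_n = X[n] @ ww                          # prediction for example n
            ww = ww - eta * 2*X[n]*(f_n - yy[n])     # one-example update

    print(ww)   # should be close to w_true on this easy problem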

Once we have a working gradient-based optimization procedure2, we can apply it to problems beyond linear regression. We need to identify a suitable cost function, which no longer needs to be least squares, and then obtain its gradients.

4 Check your understanding

5 Further Reading

Bishop covers the least squares solution in Section 3.1.1. His equations include a basis function expansion (replacing \(X\) with \(\Phi\)), and he motivates the least squares cost from the log-likelihood of a Gaussian model.

Murphy derives the least squares solution in Chapter 7 slightly more slowly. This description also uses a Gaussian model to motivate the least squares cost function.

Barber Chapter 17 starts with linear regression.

Numerical Recipes (Press et al.) will tell you that steepest gradient descent is a bad algorithm, and describes more sophisticated alternatives. See the section on Conjugate Gradient Methods in Multidimensions (section 10.6 of the second edition or 10.8 of the third edition). The books have some nice descriptions, but I would steer clear of the code (better alternatives with free licenses are available). However, they only describe ‘batch’ methods, which look at an entire dataset before making each update. Stochastic (steepest) gradient descent is not such a bad algorithm, especially for machine learning. https://arxiv.org/abs/1606.04838 is one recent survey.


  1. See Solving Least Squares Problems, Lawson and Hanson (1974), Chapter 19, which argues that it’s better to use a QR decomposition of the data matrix than to lose precision by forming the summary \(X^\top X\). Murphy’s textbook also champions the QR approach. I must say I’m not totally convinced. For noisy and regularized machine learning problems I’ve tried, the normal equations approach seems fine, and is \(\sim\!10\times\) faster on my machine. However, I still frequently use the QR solvers: they’re still quite fast, convenient, and I don’t have to keep checking whether the normal equations approach will be accurate.

  2. And we might find one slightly more sophisticated than described above.