$$ \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\bdelta}{{\boldsymbol{\delta}}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bh}{\mathbf{h}} \newcommand{\bu}{\mathbf{u}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} \newcommand{\ttimes}{\!\times\!} $$

Backpropagation of Derivatives

Derivatives for neural networks, and other functions with multiple parameters and stages of computation, can be expressed by mechanical application of the chain rule. Computing these derivatives efficiently requires ordering the computation carefully, and expressing each step using matrix computations.

[The first parts of this note cover general material on differentiation. The symbols used (\(f, x, y, \ldots\)) do not carry their machine learning meanings — such as model outputs, inputs, or targets — as they do in the rest of the course notes.]

A quick review of the chain rule

Pre-requisite: given a simple multivariate function, you should be able to write down its partial derivatives. For example: \[ f(x,y) = x^2\,y, \quad \text{means that} ~~\pdd{f}{x} = 2xy, ~~\pdd{f}{y} = x^2. \] You should also know how to use these derivatives to predict how much a function will change if its inputs are perturbed. Then you can run a check:

    fn = @(x, y) (x.^2) * y; % Python: fn = lambda x, y: (x**2) * y
    xx = randn(); yy = randn(); hh = 1e-5;
    2*xx*yy                                  % analytic df/dx
    (fn(xx+hh, yy) - fn(xx-hh, yy)) / (2*hh) % approximate df/dx

 
A function might be defined in terms of a series of computations. For example, the variables \(x\) and \(y\) might be defined in terms of other quantities: \(x = r\cos\theta\), and \(y = r\sin\theta\). The chain rule of differentiation gives the derivatives with respect to the earlier quantities: \[ \pdd{f}{r} = \pdd{f}{x}\,\pdd{x}{r} + \pdd{f}{y}\,\pdd{y}{r}, \quad\text{and}\quad \pdd{f}{\theta} = \pdd{f}{x}\,\pdd{x}{\theta} + \pdd{f}{y}\,\pdd{y}{\theta}. \] A small change \(\delta_r\) in \(r\) causes the function to change by about \(\pdd{f}{r}\delta_r\). That change is caused by small changes in \(x\) and \(y\) of \(\delta_x \approx \pdd{x}{r}\delta_r\) and \(\delta_y \approx \pdd{y}{r}\delta_r\).

You could write code for \(f(\theta,r)\) and find its derivatives by evaluating the expressions above. You don’t need answers from me: you can check your derivatives by finite differences.
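For example, extending the check above, here is a minimal finite-difference sketch (the function handle and variable names are just for illustration); the analytic chain-rule values are left for you to derive and compare:

    fn = @(r, th) (r*cos(th))^2 * (r*sin(th)); % f(x,y) with x = r*cos(th), y = r*sin(th)
    rr = randn(); tt = randn(); hh = 1e-5;
    (fn(rr+hh, tt) - fn(rr-hh, tt)) / (2*hh)   % approximate df/dr
    (fn(rr, tt+hh) - fn(rr, tt-hh)) / (2*hh)   % approximate df/dtheta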

In the example above, you could also substitute the expressions for \(x(\theta,r)\) and \(y(\theta,r)\) into the equation for \(f(x,y)\), and then differentiate directly with respect to \(\theta\) and \(r\). You should get the same answers. You might have found eliminating intermediate quantities to be an easier approach in this example: there was no need to reason about the chain rule. However, the chain rule approach is better for larger computations, such as neural network functions, because we can apply it mechanically.

In general, given a function \(f(\bx)\), where the vector of inputs is computed from another vector \(\bu\), the chain rule states: \[ \pdd{f}{u_i} ~=~ \sum_{d=1}^D \pdd{f}{x_d}\,\pdd{x_d}{u_i}. \] When a quantity \(u_i\) is used multiple times in a function computation, we sum over the effect it has through each of the quantities that we compute directly from it.

Application to graphs of scalar computations

A function expressed as a sequence of operations in code can be represented as a Directed Acyclic Graph (DAG)1, for example:

\[\begin{array}{rcccl} & & \!\!w\!\! & & \\ &\!\!\nearrow & & \searrow & \\ u \rightarrow v & & & & \!\!y \rightarrow z\\ &\!\!\searrow & & \nearrow & \\ & & \!\!x\!\! & & \\ \end{array}\]

Each child variable is computed as a function of its parents. As a running example, the function \(z = \exp(\sin(u^2)\log(u^2))\) can be written as a series of elementary steps following the graph above. We list the local functions below, along with the corresponding local derivatives of each child variable with respect to its parents: \begin{align*} &v = u^2, & &w = \sin(v), & &x = \log(v), & &y = wx, & &z = \exp(y). \\[1em] &\pdd{v}{u} = 2u, & &\pdd{w}{v} = \cos(v), & &\pdd{x}{v} = 1/v, & &\pdd{y}{w} = x,~~\pdd{y}{x} = w, & &\pdd{z}{y} = \exp(y) = z. \end{align*} Forward-mode differentiation computes the derivative of every variable in the graph with respect to a scalar input. As the name suggests, we accumulate these derivatives in a forward pass through the graph. Here we differentiate with respect to the input \(u\), and notate these derivatives with a dot: \(\dot\theta = \pdd{\theta}{u}\), where \(\theta\) is any intermediate quantity. The chain rule of differentiation gives us each derivative in terms of a local derivative, and the derivatives that we have already computed for the parent quantities: \begin{align*} \dot{v} &= \pdd{v}{u} & \dot{w} &= \pdd{w}{u} & \dot{x} &= \pdd{x}{u} & \dot{y} &= \pdd{y}{u} & \dot{z} &= \pdd{z}{u} \\[1ex] &= 2u~~~ & &= \pdd{w}{v}\pdd{v}{u} & &= \pdd{x}{v}\pdd{v}{u} & &= \pdd{y}{w}\pdd{w}{u} + \pdd{y}{x}\pdd{x}{u}~~~ & &= \pdd{z}{y}\pdd{y}{u} \\ & & &= \cos(v)\dot{v}~~~ & &= (1/v)\dot{v}~~~ & &= x\dot{w} + w\dot{x} & &= z\dot{y} \end{align*}

To compute the numerical value of each \(\dot\theta\), we only need the derivatives of the elementary function used at that stage, and the numerical values of the parents’ derivative signals.
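As a concrete illustration, here is a minimal sketch of the forward-mode sweep for the running example, checked against finite differences at the end. The variable names (such as v_dot for \(\dot{v}\)) are just illustrative:

    % Forward-mode sweep for z = exp(sin(u^2)*log(u^2)).
    uu = 1 + rand();                          % keep u^2 > 0 so log() is defined
    v = uu^2;     v_dot = 2*uu;
    w = sin(v);   w_dot = cos(v)*v_dot;
    x = log(v);   x_dot = (1/v)*v_dot;
    y = w*x;      y_dot = x*w_dot + w*x_dot;
    z = exp(y);   z_dot = z*y_dot;
    fn = @(u) exp(sin(u.^2).*log(u.^2));
    hh = 1e-5;
    z_dot                                     % forward-mode dz/du
    (fn(uu+hh) - fn(uu-hh)) / (2*hh)          % finite-difference check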

A stage of the derivative computation can be computationally cheaper than computing the function in the corresponding stage. For example, \(\dot{z} = z\dot{y}\) requires one floating-point multiply operation, whereas \(z = \exp(y)\) usually has the cost of many floating-point operations. Propagating derivatives can also be more expensive: \(y\) requires one multiply, whereas \(\dot{y}\) requires two multiplies and an addition. However, for the elementary mathematical functions we use in code, the derivatives are never much more expensive.

The computation of the derivatives can be done alongside the original function computation. Only a constant factor of extra memory is required: we need to track a \(\dot{\theta}\) derivative for every \(\theta\) intermediate quantity currently in memory. Because only derivatives of elementary functions are needed, the process can be completely automated, and is then called forwards-mode automatic differentiation (AD).2

Reverse-mode differentiation computes the derivative of a scalar output variable with respect to every other variable in the graph. As the name suggests, we accumulate these derivatives in a reverse pass through the graph. Here we differentiate the output \(z\), and notate its derivatives with a bar: \(\bar\theta = \pdd{z}{\theta}\), where \(\theta\) is any intermediate quantity. The chain rule of differentiation gives us each derivative in terms of a local derivative, and the derivatives that we have already computed for the child quantities. The quantities are computed starting in the right column below, and then moving to the left: \begin{align*} \bar{u} &= \pdd{z}{u} & \bar{v} &= \pdd{z}{v} & \bar{w} &= \pdd{z}{w} & \bar{x} &= \pdd{z}{x} & \bar{y} &= \pdd{z}{y} \\[1ex] &= \pdd{z}{v}\pdd{v}{u}~~ & &= \pdd{z}{w}\pdd{w}{v} + \pdd{z}{x}\pdd{x}{v}& &= \pdd{z}{y}\pdd{y}{w}~~ & &= \pdd{z}{y}\pdd{y}{x}~~ & &= \exp(y) & \\ &= \bar{v}(2u)& &= \bar{w}\cos(v) + \bar{x}(1/v)~~& &= \bar{y}x& &= \bar{y}w& &= z \end{align*}

To compute the numerical value of each \(\bar\theta\), we only need the derivatives of the elementary function used at that stage, and the numerical values of the children’s derivative signals. Reverse-mode automatic differentiation tools also exist, which can construct a graph and compute reverse-mode derivatives automatically from your code for a function.
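A matching sketch of the reverse-mode sweep through the same example (again with illustrative names): the forward pass stores the intermediate values, then the bar quantities are accumulated from \(z\) back to \(u\):

    % Reverse-mode sweep for z = exp(sin(u^2)*log(u^2)).
    uu = 1 + rand();                  % keep u^2 > 0 so log() is defined
    v = uu^2;  w = sin(v);  x = log(v);  y = w*x;  z = exp(y);  % forward pass
    y_bar = z;                        % dz/dy = exp(y) = z
    w_bar = y_bar*x;                  % dz/dw
    x_bar = y_bar*w;                  % dz/dx
    v_bar = w_bar*cos(v) + x_bar/v;   % dz/dv, summing over both children of v
    u_bar = v_bar*2*uu                % dz/du; matches z_dot from the forward sweep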

Again the number of floating-point operations is at most a small constant factor times the number required for one function evaluation. However, the computation can no longer be done in parallel with the original function computation. We need to traverse the graph in reverse after computing the function, using stored values of the intermediate quantities. The memory cost of reverse-mode differentiation can be much higher than for the function, if we wouldn’t normally keep the whole computation graph in memory.3

Comparison: At the end of the forwards-mode computation we had accumulated \(\dot{z} = \pdd{z}{u}\). At the end of the reverse-mode computation we accumulated the same quantity: \(\bar{u} = \pdd{z}{u}\). For this example: \[ \dot{z} = \bar{u} = \pdd{z}{u} = \pdd{v}{u}\left(\pdd{w}{v}\pdd{y}{w} + \pdd{x}{v}\pdd{y}{x}\right)\pdd{z}{y}, \] with forwards-mode running left-to-right, and reverse-mode running right-to-left.

Reverse-mode tools are harder to implement, and use more memory, but have a major advantage for machine learning applications. One reverse-mode sweep simultaneously accumulates the derivatives of the output with respect to every quantity in the computation graph. If we give the function multiple inputs, we can still obtain all of the derivatives at the cost of a small constant factor times the number of operations used by one function evaluation. However, the forwards-mode procedure, like finite differences, would require multiple sweeps through the graph, one sweep for each partial derivative. For a neural network with millions of parameters, reverse-mode differentiation (or backpropagation4) is millions of times faster than forwards-mode differentiation and finite differences!

It’s easy to dismiss differentiation as “just the chain rule”. But reverse-mode differentiation is amazing. We can get a million partial derivatives, which describe the whole tangent hyperplane of a function, usually for the cost of only a couple of evaluations of the scalar function!

Application to graphs of array-based operations

Cost functions in machine learning are usually based on matrix computations. In theory, matrix computations can be broken down into primitive scalar operations. Then reverse-mode automatic differentiation could be applied. This theory tells us that there is always an algorithm for computing derivatives of a cost function that uses at most a few times as many operations as the cost function itself.

However, in practice, machine learning code is not usually written at the level of primitive scalar operations. Where possible, code is written using high-level matrix operations, which are provided by a library like BLAS or LAPACK, often optimized for the hardware we’re using. Our derivatives need to use these optimized libraries too, or they will be slow.5

As before, we consider a computation graph that ends with the computation of a scalar \(z\). An intermediate computation creates a matrix \(C\) from two parent matrices \(A\) and \(B\). \[\begin{array}{cccl} \cdots \rightarrow & \!\!A\!\! & & \\ & & \searrow & \\ & & & \!\!C \rightarrow \cdots \rightarrow z\\ & & \nearrow & \\ \cdots\rightarrow & \!\!B\!\! & & \\ \end{array}\] In reverse-mode differentiation we will compute \(\bar{C}\), a matrix the same size as \(C\), where \(\bar{C}_{ij} = \pdd{z}{C_{ij}}\). We then need to backpropagate this derivative signal to compute \(\bar{A}\) and \(\bar{B}\), where \(\bar{A}_{ij} = \pdd{z}{A_{ij}}\) and \(\bar{B}_{ij} = \pdd{z}{B_{ij}}\).

The chain rule gives a general equation6 for backpropagating matrix derivatives: \[ \bar{A}_{ij} = \pdd{z}{A_{ij}} = \sum_{k,l} \pdd{z}{C_{kl}}\pdd{C_{kl}}{A_{ij}} = \sum_{k,l} \bar{C}_{kl}\pdd{C_{kl}}{A_{ij}}. \] However, we shouldn’t evaluate all of the terms in this equation. If \(A\) and \(C\) are \(N\ttimes N\) matrices, there are \(N^4\) elements \(\left\{\pdd{C_{kl}}{A_{ij}}\right\}\), when considering all combinations of \(i\), \(j\), \(k\), and \(l\). We’ve argued that derivatives can have the same computational scaling as the cost function, but most matrix functions scale better than \(O(N^4)\). Therefore, it is usually possible to evaluate the sum above without explicitly computing all of its terms.

Given a standard matrix function, we don’t want to differentiate the function! That is, we usually won’t compute the partial derivatives of all the outputs with respect to all the inputs. Instead we derive a propagation rule that takes the derivatives of a scalar cost function with respect to the output, \(\bar{C}\), and returns the derivatives with respect to parent quantities: in the above example \(\bar{A}\), and \(\bar{B}\). These reverse-mode propagation rules are the building blocks for differentiating larger matrix functions. By chaining them together we can differentiate any function built up of primitives for which we know the propagation rules.

Standard results: The general backpropagation rule above simplifies for some standard matrix operations as follows. For matrix multiplication, \(C \te AB\), the propagation rule is \(\bar{A} = \bar{C}B^\top\) and \(\bar{B} = A^\top\bar{C}\). For matrix transposition, \(C \te A^\top\), it is \(\bar{A} = \bar{C}^\top\). For matrix addition, \(C \te A + B\), it is \(\bar{A} = \bar{C}\) and \(\bar{B} = \bar{C}\).

Matrix-matrix multiplication is an expensive operation — \(O(LMN)\) for \(L\ttimes M\) and \(M\ttimes N\) matrices7 — it’s often a computational bottleneck, and so this operation is also heavily optimized in library routines. The derivative propagation rule for matrix multiplication above can use the same optimized routines, and so has similar cost to the corresponding stage of the original function.
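As a quick sanity check of the matrix multiplication rule, we can pick a simple scalar function of \(C\) for which \(\bar{C}\) is known, and compare the claimed \(\bar{A}\) against finite differences. A sketch, with arbitrary sizes and names:

    % Check the C = A*B rule using z = sum(sum(R .* C)), so that Cbar = R.
    A = randn(3, 4); B = randn(4, 5); R = randn(3, 5);
    Abar = R*B';                  % claimed dz/dA = Cbar*B'
    Bbar = A'*R;                  % claimed dz/dB = A'*Cbar
    hh = 1e-5; ii = 2; jj = 3;    % check one element of Abar by finite differences
    Ap = A; Ap(ii,jj) = Ap(ii,jj) + hh;
    Am = A; Am(ii,jj) = Am(ii,jj) - hh;
    fd = (sum(sum(R.*(Ap*B))) - sum(sum(R.*(Am*B)))) / (2*hh);
    [Abar(ii,jj), fd]             % the two numbers should closely agree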

You could break down any matrix operation into a sequence of scalar operations, apply backpropagation to those, and then try to implement the result using matrix operations. However, there is also an elegant framework for deriving matrix-based backpropagation rules, described by Giles (2008) — full reference below.

Tools like Theano (developed 2008–2017) made automated application of reverse-mode differentiation popular in machine learning. Theano knows the backpropagation rules for many standard matrix operations, and can thus automatically differentiate matrix-based functions built from these primitives, including most neural networks. Recently a large number of alternative tools have emerged. If you want to add a new matrix function primitive to one of these tools, you often need to add the corresponding gradient propagation rule by hand.

Application to straightforward feedforward networks

The computation of a feedforward neural network’s error on a training example is easily expressed as a graph, starting at the input features and moving through the layers of the neural network. Each layer introduces more parameters, which are additional inputs to the computation. In one reverse mode sweep through the network we can work out the derivative signals for all of the parameters in all of the layers.

In each layer of a standard feedforward neural network we form an activation using the weights and biases for the layer and the values of the units in the layer below: \[ \ba^{(l)} = W^{(l)}\bh^{(l-1)} + \bb^{(l)} \qquad\text{or}\qquad A^{(l)} = W^{(l)}H^{(l-1)} + \bb^{(l)}\mathbf{1}_B^\top. \] The equation on the left is for a single \(K^{(l)}\ttimes1\) vector of activations, using a \(K^{(l)}\ttimes K^{(l-1)}\) weight matrix, and a \(K^{(l)}\ttimes1\) vector of biases. On the right we compute a \(K^{(l)}\ttimes B\) matrix of activations \(A^{(l)}\) from a matrix of hidden unit values in the previous layer, for a minibatch of \(B\) examples.8 The biases are added to the activation for each example with the help of a \(B\ttimes 1\) vector of ones, \(\mathbf{1}_B\).9 Computing the layer for a whole minibatch at once lets us use a matrix-matrix product, which in most frameworks is heavily optimized.

These activations then have an element-wise function applied: \[ \bh^{(l)} = g^{(l)}(\ba^{(l)}) \qquad\text{or}\qquad H^{(l)} = g^{(l)}(A^{(l)}), \] to give hidden unit values. These values are inputs into the next layer, or in the final layer give the network’s function value \(f = h^{(L)}\) or values \(\bff = \bh^{(L)}\). We could also create a matrix of function values for a minibatch \(F = H^{(L)\top}\). I’ve added a transpose to make the output \(B\ttimes K\), because in this example the neural net internally has examples in columns, but data is often stored with examples in rows.10

Finally, we compare the network’s function values to the training targets, to compute a scalar training cost \[ \text{for a single example:}~ c = L(f, y), ~~\text{or}~~ c = L(\bff, \by), \qquad \text{or a minibatch:}~ c = L(F, Y). \] We find derivatives of this scalar cost with respect to everything in the network by reverse-mode differentiation, or backpropagation. For example, \(\bar{W}^{(l)}_{ij} = \pdd{c}{W^{(l)}_{ij}}\).

Traditional explanations of backpropagation (e.g., Murphy 16.5.4) refer to “error signals” \(\delta\) (not to be confused with small perturbations \(\delta\) as used in the opening section of this note). These error signals are the reverse-mode derivative signals for the activations of each layer: \(\delta^{(l)}_{nk} = \pdd{c}{A^{(l)}_{kn}}\).

We can backpropagate these \(\bdelta \te \bar{\ba}\) error signals through the layer equations above by deriving the updates from scratch, or by combining the standard backpropagation rules for matrix operations: \[\begin{aligned} \bar{\bb}^{(l)} &= \bar{\ba}^{(l)} & \bar{\bb}^{(l)} &= \bar{A}^{(l)}\mathbf{1}_B = {\textstyle \sum_{b=1}^B \bar{A}^{(l)}_{:,b}}\\ \bar{W}^{(l)} &= \bar{\ba}^{(l)} \bh^{(l-1)\top} & \bar{W}^{(l)} &= \bar{A}^{(l)} H^{(l-1)\top}\\ \bar{\bh}^{(l-1)} &= W^{(l)\top}\bar{\ba}^{(l)} & \bar{H}^{(l-1)} &= W^{(l)\top}\bar{A}^{(l)}\\ \bar{\ba}^{(l-1)} &= g^{(l-1)'}(\ba^{(l-1)}) \odot \bar{\bh}^{(l-1)}~~ & ~~\bar{A}^{(l-1)} &= g^{(l-1)'}(A^{(l-1)}) \odot \bar{H}^{(l-1)}. \end{aligned}\] We need to derive or look up \(g^{(l)'}\), the derivative of the non-linearity in the \(l\)th layer. We obtain the gradients for all of the parameters in the layer, and a new error signal to backpropagate to the previous layer.
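Putting this section together, here is a minimal sketch of one forward and backward pass for a network with one hidden layer, a logistic non-linearity, a linear output layer, and square loss, keeping examples in columns as above. All names (W1, A2_bar, and so on) are illustrative rather than taken from any framework:

    % Forward pass for a minibatch (examples in columns), then backpropagation.
    B = 5; K0 = 3; K1 = 4; K2 = 2;            % minibatch size and layer sizes
    X = randn(K0, B); Y = randn(K2, B);       % inputs and targets
    W1 = 0.1*randn(K1, K0); b1 = zeros(K1, 1);
    W2 = 0.1*randn(K2, K1); b2 = zeros(K2, 1);
    sigmoid = @(A) 1 ./ (1 + exp(-A));

    A1 = W1*X + b1*ones(1, B);  H1 = sigmoid(A1);
    A2 = W2*H1 + b2*ones(1, B); F = A2;       % linear output layer
    c = 0.5*sum(sum((F - Y).^2));             % scalar cost for the minibatch

    A2_bar = F - Y;                           % dc/dA2 starts the backward pass
    b2_bar = A2_bar*ones(B, 1);
    W2_bar = A2_bar*H1';
    H1_bar = W2'*A2_bar;
    A1_bar = (H1.*(1 - H1)) .* H1_bar;        % logistic: g'(a) = h.*(1-h)
    b1_bar = A1_bar*ones(B, 1);
    W1_bar = A1_bar*X';

    % Finite-difference check of one weight derivative:
    cost = @(W) 0.5*sum(sum((W2*sigmoid(W*X + b1*ones(1,B)) + b2*ones(1,B) - Y).^2));
    hh = 1e-5; Wp = W1; Wp(2,3) = Wp(2,3) + hh; Wm = W1; Wm(2,3) = Wm(2,3) - hh;
    [W1_bar(2,3), (cost(Wp) - cost(Wm))/(2*hh)]  % should closely agree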

Check your understanding

Do you understand the scalar example of forward and reverse propagation well enough to numerically check that the maths in that section is correct for some random settings of \(u\)?

The updates are given for each layer of a neural network once backpropagation has begun. Can you see how to use the derivatives of the training loss function to start the backpropagation procedure?

You want to backpropagate through the linear algebra operation \(C = A^\top B\). You recall that \(\bar{A}\) involves \(B\) and \(\bar{C}\) somehow. It’s something like \(\bar{A} = B \bar{C} , ~ \bar{A} = B^\top\bar{C} , ~ \bar{A} = B \bar{C}^\top, \bar{A} = B^\top\bar{C}^\top, \bar{A} = \bar{C} B, ~ \bar{A} = \bar{C} B^\top, ~ \bar{A} = \bar{C}^\top B, ~ \text{or} ~ \bar{A} = \bar{C}^\top B^\top\). How can you immediately know which of these is the only one that could be correct in general? Could you check it numerically?

Given a rule to compute \(\bar{B}\) from \(\bar{C}\), how could you compute a derivative of the local function \(\pdd{C_{ij}}{B_{kl}}\) if you wanted to know it? You should know why we don’t usually compute all of these derivatives.

By combining the propagation rules for \(C\te AB\) and matrix transposition, can you backpropagate through the quadratic form \(c = \bx^\top A\by\)? You could check your answer by deriving the expression a different way, and/or numerically. Can you then write down the \(\bar{\bx}\) and \(\bar{A}\) rules for \(c = \bx^\top A\bx\)?

Why study this material if derivatives can be done by software?

The code for the derivatives in many methods can be short and simple. You won’t always want to bring in Theano or TensorFlow as a dependency.

You need to understand how backpropagation works to structure your software correctly, or provide new derivative propagation operators, even if you lean on automatic differentiation in parts. Currently several extensible “Gaussian process” frameworks are slower than they should be, because they have structured part of their derivative computations incorrectly. GPML updated its implementation of gradients in September 2016 to have the correct scaling. Widespread understanding of how to do differentiation efficiently is surprisingly recent!

Some learning procedures involve modifying (hacking) parts of the gradient computation. Implementing such procedures probably requires understanding how the derivatives are computed.

Some neural network code still resorts to computing derivatives by hand. The automatic differentiation engines in machine learning frameworks aren’t yet clever enough to save memory in all of the creative ways people can implement by hand.

Moving outside neural networks, software support for automatic differentiation still isn’t perfect. There are highly regarded machine learning papers, even with authors who are early adopters of backpropagation, that contain inefficient expressions for derivatives11. You might avoid these mistakes if you study reverse mode differentiation more generally than backpropagation for standard feedforward neural networks.

Further Reading

For keen students:

Reverse mode differentiation and its automatic application to code is an old idea:

Who invented the reverse mode of differentiation?, Andreas Griewank, Optimization Stories, Documenta Mathematica, Extra Volume ISMP (2012), pp. 389–400.

Automatic differentiation in machine learning: a survey. Atilim Gunes Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind (2015). An overview of automatic differentiation and some common misconceptions.

In my opinion, the Baydin et al. survey misses the real reasons that adoption in machine learning has been so slow. Software tools that work easily with the sort of code written by machine learning researchers haven’t existed. Theano wasn’t as general as traditional AD compilers, but it did most of what people want, and was easy to use, so it revolutionized how machine learning code was written. Now the gap between what’s possible and what’s easy needs closing further.

What if we want machine learning tools to automatically pass derivatives through a useful function of a matrix, like the Cholesky decomposition, that isn’t commonly used in neural networks? I recently wrote a fairly tutorial note https://arxiv.org/abs/1602.07527 discussing the options. Matrix-based approaches from this note were rapidly adopted by several tools, including TensorFlow, Autograd, and Theano. However, there are still mainstream machine learning packages that can’t differentiate some reasonably common linear algebra operations. I’m sure the situation will improve.

As the Cholesky note explains, I think many of the existing guides on differentiating matrix computations are misleading. The note I found most useful was An extended collection of matrix derivative results for forward and reverse mode automatic differentiation, Mike B Giles, 2008. The introduction is succinct, so requires some careful reading. However, the note focusses on the right primitive task: given a matrix function \(C(A,B)\), backpropagate the derivative signal \(\bar{C}\) to obtain \(\bar{A}\) and \(\bar{B}\), without creating large intermediate objects. Giles also provides example Matlab/Octave code, and demonstrates how to check everything. This is the note that finally made me believe I could differentiate any reasonable matrix-based code without it being a huge chore.


  1. Caveats: 1) Computing functions does not normally require keeping the whole DAG in memory. Computing derivatives as advocated in this note can sometimes have prohibitive memory costs. 2) If you take the PMR course, you will see DAGs defining probability distributions rather than deterministic functions as here. The diagram here is not a probabilistic graphical model.

  2. The (non-examinable) complex step trick mentioned in the previous note is a hacky proof of concept: for some cases (analytic functions of real-valued inputs) it tracks a small multiple of the \(\dot{\theta}\) derivative quantities in the complex part of each number. More general AD tools exist.

  3. A literature on reducing the memory consumption of reverse-mode differentiation exists. There are tricks to avoid storing everything, at the cost of more computation, including “check-pointing” and recomputing the inputs of reversible operations from their outputs.

  4. Backpropagation is the term from the neural network literature for reverse-mode differentiation. Backpropagation is also sometimes used to mean the whole procedure of training a network with a gradient descent method, where the gradients come from reverse-mode differentiation.

  5. How slow? It’s hardware and compiler dependent. The last time I compared my naive C code for a matrix operation to a hardware-optimized BLAS implementation, the difference in speed was about \(50{\times}\).

  6. If \(A\) is a parent to more than one child in the graph, then \(\bar{A}\) is the sum over this equation for all its children \(C\).

  7. In practice: there are algorithms with better scaling, but they have large constant factors.

  8. Apologies: I’ve put the activations and hiddens for each example in columns rather than rows in this section. The single example and mini-batch cases look more similar this way, but it’s probably less standard, and doesn’t match the orientation of the design matrix, so I’d have to say \(H^{(0)} = X^\top\).

  9. The vector of ones isn’t required in NumPy code, or recent Matlab, because these languages broadcast the addition operator automatically. However, using the vector of ones makes the “\(+\)” a standard matrix addition, and helps us get the derivatives correct.

  10. Earlier in the notes (and later), when considering \(N\) scalar targets, I used \(\by\) and \(\bff\) as \(N\ttimes1\) vectors. Whereas here I just used \(\bff\) for a single \(K\)-dimensional network output. Another option is to use \(B\ttimes K\) matrices \(F\) and \(Y\) for predictions and targets for \(B\) examples, even if \(K\te1\). I’m afraid keeping track of the orientation and meaning of matrices is generally a pain.

  11. e.g., https://papers.nips.cc/paper/2566-neighbourhood-components-analysis