$$\notag \newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathcal{N}} \newcommand{\ba}{\mathbf{a}} \newcommand{\bb}{\mathbf{b}} \newcommand{\be}{\mathbf{e}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bh}{\mathbf{h}} \newcommand{\bphi}{{\boldsymbol{\phi}}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\g}{\,|\,} \newcommand{\nth}{^{(n)}} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} \newcommand{\ttimes}{\!\times\!} $$

Notation

Textbooks, papers, code, and other courses will all use different names and notations for the things covered in this course. While you are learning a subject, these differences can be confusing. However, dealing with different notations is a necessary research skill. There will always be differences in presentation in different sources, due to different trade-offs and priorities.

We try to make the notation fairly consistent within the course. This note lists some of the conventions that we’ve chosen to follow that might be unfamiliar.

You can probably skip over this note at the start of the class. Most notation is introduced in lectures as we use it. However, everything mentioned here is something that has been queried by previous students of this class. So please refer back to this note if you find any unfamiliar notation later on.

1 Vectors, matrices, and indexing

Indexing:  Where possible, these notes use lower-case letters for indexing, with the corresponding capital letters for the numbers of settings. Thus the \(D\) ‘dimensions’ or elements of a vector are indexed with \(d = 1\ldots D\), and \(N\) training data-points are indexed with \(n = 1\ldots N\).

Vectors in this course are all column-vectors — these are just columns of numbers, you don’t need to know about more abstract vector-spaces in this course. We use bold-face lower-case letters to denote these column vectors in the typeset notes. For example, we write a \(D\)-dimensional vector as: \[ \bx = \left[\begin{array}{c} x_1\\x_2\\\vdots\\x_D \end{array}\right] = [x_1 ~~ x_2 ~\cdots~ x_D]^\top. \] If we need to show the contents of a column vector, we often create it from a row-vector with the transpose \({}^\top\) to save vertical space in the notes. We use subscripts to index into a vector (or matrix).

Depending on your background, you might be more familiar with \(\vec{x}\) than \(\bx\).

In handwriting during lectures we underline vectors, \(\underline{x}\), because it’s difficult to handwrite in bold! We recommend you use the same notation in the exam. But as long as it’s clear from context what you are doing, it’s not critical. It’s likely we’ll miss out the odd underline accidentally while writing in lectures, so we understand.
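If it helps to connect the notation to code, here is a small NumPy sketch of a column vector, its transpose, and subscript indexing. This is just an aside, not part of the course notation; the variable names are made up for illustration.

```python
import numpy as np

# A D-dimensional vector (D = 3 here), stored as a 1-D NumPy array.
x = np.array([1.0, 2.0, 3.0])

# If an explicit D x 1 column is needed, add a trailing axis;
# transposing that then gives a 1 x D row.
x_col = x[:, None]      # shape (3, 1)
x_row = x_col.T         # shape (1, 3)

# Subscripts index elements. NumPy counts from 0, so the maths' x_1 is x[0].
print(x[0], x_col.shape, x_row.shape)
```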

Matrices (for us rectangular arrays of numbers) are upper-case letters, like \(A\) and \(\Sigma\). We’ve chosen not to bold them, even though there are sometimes numbers represented by upper-case letters floating around (such as \(D\) for number of dimensions). It should usually be clear from context which quantities are matrices, and what size they are. See the cribsheet for details on indexing matrices, and how sizes match in matrix-vector multiplication:
https://homepages.inf.ed.ac.uk/imurray2/pub/cribsheet.pdf

Addition: normally sizes of vectors or matrices should match when adding quantities: \(\ba + \bb\), or \(A + B\). However, to add a scalar to every element, we’ll just write \(\ba + c\) or \(A + c\).
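This convention matches “broadcasting” in array libraries: adding a scalar really does add it to every element. A tiny NumPy illustration (an aside, with made-up values):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
A = np.eye(2)
c = 5.0

print(a + c)   # [6. 7. 8.] -- the scalar is added to every element
print(A + c)   # [[6. 5.]
               #  [5. 6.]]  -- likewise for every element of a matrix
```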

Indexing items:  Sometimes we use superscripts to identify items, such as \(\bx\nth\) for the \(n\)th \(D\)-dimensional input vector in a training set with \(N\) examples. We can (and often do) stack these vectors into an \(N\ttimes D\) matrix \(X\), so we could use a notation such as \(X_{n,:}\) to access the \(n\)th row. In this case we chose to introduce the extra superscript notation instead. In the past, many students have found it hard to follow the distinction between datapoints and feature dimensions when studying “kernel methods” such as Gaussian processes. The hope was that familiarizing you with a notation where datapoint identity and feature dimensions look different would help avoid confusion later.
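As a concrete sketch of the two notations side by side, here is an aside in NumPy; the names and numbers are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5, 3                        # N datapoints, each D-dimensional

X = rng.standard_normal((N, D))    # datapoints stacked into an N x D matrix

n = 2
x_n = X[n]        # the vector x^(n), i.e. the row X_{n,:} (0-based in code)
x_n_d = X[n, 1]   # one feature of that datapoint (remember the 0-based offset)
print(x_n.shape, x_n_d)
```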

2 Probabilities

The probability mass of a discrete outcome \(x\) is written \(P(x)\).

When it doesn’t seem necessary (which is nearly always), we don’t introduce notation for a corresponding random variable \(X\), or write more explicit expressions like \(P_X(x)\) or \(P(X\te x)\). Notation is a trade-off, and more explicit notation can be more of a burden to work with.

Joint probabilities: \(P(x,y)\). Conditional probabilities: \(P(x\g y)\). Conditional probabilities are sometimes written in the literature as \(P(x;\,y)\) — especially in frequentist statistics rather than Bayesian statistics. The ‘\(|\)’ symbol, introduced by Jeffreys, is historically associated with Bayesian reasoning. Hence for arbitrary functions, like \(f(\bx;\, \bw)\), where we want to emphasize that it’s primarily a function of \(\bx\) controlled by parameters \(\bw\), we’ve chosen not to use a ‘\(|\)’.

The probability density of a real-valued outcome \(x\) is written with a lower-case \(p(x)\), such that \(P(a < X < b) = \int_a^b p(x)\,\mathrm{d}x\). We tend not to introduce new symbols for density functions over different variables, but again overload the notation: we call them all “\(p\)” and infer which density we are talking about from the argument.
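As a hedged numerical check of that relationship (an aside using SciPy; the standard Gaussian here is just an example density):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

p = stats.norm(loc=0.0, scale=1.0)   # example density p(x): standard Gaussian
a, b = -1.0, 2.0

area, _ = quad(p.pdf, a, b)          # the definite integral of p(x) from a to b
print(area)                          # ~0.819
print(p.cdf(b) - p.cdf(a))           # the same probability P(a < X < b)
```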

Gaussian distributions are reviewed later in the notes. We will write that an outcome \(x\) was sampled from a Gaussian or normal distribution as \(x\sim\N(\mu, \Sigma)\). We write the probability density associated with that outcome as \(\N(x;\,\mu, \Sigma)\). We could also have chosen to write \(\N(x\g \mu, \Sigma)\), as Bishop and Murphy do. Using ‘\(;\)’ is force of habit, because the Gaussian (outside of this course) is used in many contexts, not just Bayesian reasoning.
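The sampling statement and the density evaluation also look different in code. A small NumPy/SciPy sketch (an aside; note that SciPy’s `scale` is a standard deviation, not a variance, which is a common source of bugs):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
mu, sigma = 1.0, 2.0

x = rng.normal(mu, sigma)                         # x ~ N(mu, sigma^2): draw a sample
density = stats.norm.pdf(x, loc=mu, scale=sigma)  # N(x; mu, sigma^2): a density value
print(x, density)
```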

3 Integrals and expectations

All of the integrals in this course are definite integrals corresponding to expectations. For a real-valued quantity \(x\), we write: \[ \E_{p(x)}[f(x)] = \E[f(x)] = \int f(x)\,p(x)\,\mathrm{d}x. \] This is a definite integral over the whole range of the variable \(x\). We might have written \(\int_{-\infty}^\infty\ldots\) or \(\int_X\ldots\), but because our integrals are always over the whole range of the variable, we don’t bother to specify the limits.

The expectation notation is often quicker to work with than writing out the integral. As above, we sometimes don’t specify the distribution (especially during lectures when writing quickly), if it can be inferred from context.
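To see the notation in action, here is a hedged numerical sketch (an aside) that evaluates \(\E_{p(x)}[f(x)]\) as a definite integral for an example Gaussian \(p\) and \(f(x)\te x^2\), where the answer is known to be \(\mu^2+\sigma^2\):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, sigma = 1.0, 2.0
p = stats.norm(loc=mu, scale=sigma)   # example distribution p(x)
f = lambda x: x**2                    # example function f(x)

# E_{p(x)}[f(x)] = integral of f(x) p(x) dx over the whole range of x.
E_f, _ = quad(lambda x: f(x) * p.pdf(x), -np.inf, np.inf)
print(E_f)                  # ~5.0
print(mu**2 + sigma**2)     # known closed form for a Gaussian
```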

Please do review the background note on expectations and sums of random variables. Throughout the course we will see generalizations of those results to real-valued variables (as above) and expressions with matrices and vectors. You need to have worked through the basics.

You may have seen multi-dimensional integrals written with multiple integral signs, for example for a 2-dimensional vector: \[ \int \int f(\bx)\,\mathrm{d}x_1\,\mathrm{d}x_2. \] Our integrals are often over high-dimensional vectors, so we simply write: \[ \int f(\bx)\,\mathrm{d}\bx. \] And if we are integrating over multiple vectors we still might only write one integral sign: \[ \int f(\bx,\bz)\,\mathrm{d}\bx\,\mathrm{d}\bz. \]
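Written out in code, the single-integral-sign shorthand over a 2-dimensional vector is still a nested double integral. A sketch with SciPy (an aside; the integrand \(f\) here is a made-up example, the product of two standard Gaussian densities, so the answer should be close to 1):

```python
import numpy as np
from scipy import stats
from scipy.integrate import dblquad

def f(x1, x2):
    # Example integrand: product of two standard Gaussian densities.
    return stats.norm.pdf(x1) * stats.norm.pdf(x2)

# Integral of f(x) dx over the 2-D vector x = [x1, x2]^T, as a double integral.
total, _ = dblquad(lambda x2, x1: f(x1, x2),         # dblquad's inner variable comes first
                   -8.0, 8.0,                        # (effectively infinite) limits for x1
                   lambda x1: -8.0, lambda x1: 8.0)  # limits for x2
print(total)   # ~1.0
```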

4 Derivatives

Partial derivative of a scalar with respect to another scalar: \(\pdd{f}{x}\). Example: \(\pdd{\sin(yx)}{x} = y\cos(yx)\).

Column vector of partial derivatives: \(\nabla_\bw f = \left[ \pdd{f}{w_1} ~~ \pdd{f}{w_2} ~~ \cdots ~~ \pdd{f}{w_D} \right]^\top\).
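A common sanity check for both kinds of derivative is finite differences. Here is a hedged NumPy sketch (an aside; the vector-input function below is a made-up example) checking the \(\sin\) example above and then a gradient vector, element by element:

```python
import numpy as np

# Check the scalar example: d/dx sin(y x) = y cos(y x).
x, y, eps = 0.7, 1.3, 1e-6
fd = (np.sin(y * (x + eps)) - np.sin(y * (x - eps))) / (2 * eps)
print(fd, y * np.cos(y * x))    # should agree to several decimal places

# The same idea estimates the column vector of partial derivatives, nabla_w f.
# Made-up example: f(w) = sum(sin(w)) + w.w, with gradient cos(w) + 2w.
def f(w):
    return np.sum(np.sin(w)) + w @ w

def numerical_grad(f, w, eps=1e-6):
    grad = np.zeros_like(w)
    for d in range(w.size):
        e = np.zeros_like(w)
        e[d] = eps
        grad[d] = (f(w + e) - f(w - e)) / (2 * eps)
    return grad

w = np.array([0.5, -1.0, 2.0])
print(numerical_grad(f, w))
print(np.cos(w) + 2 * w)        # analytic gradient for this example f
```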

These notes avoid writing derivatives involving vectors as \(\pdd{\by}{\bz}\). Usually this expression would be a matrix with \(\left(\pdd{\by}{\bz}\right)_{ij} = \pdd{y_i}{z_j}\). Thus the derivative of a scalar \(\pdd{f}{\bw}\) is a row vector, \((\nabla_\bw f)^\top\).

We also don’t write \(\pdd{A}{B}\) for matrices \(A\) and \(B\). While we could put all of the partial derivatives \(\left\{ \pdd{A_{ij}}{B_{kl}}\right\}\) into a 4-dimensional array, that’s often not a good idea in machine learning.

Instead we review a notation more suitable for computing derivatives, where derivative quantities are stored in arrays of the same size and shape as the original variables. All will be explained in the note on Backpropagation of Derivatives.

5 Frequently used symbols

We try to reserve some letters to mean the same thing throughout the course, so you can recognize what the terms in equations are at a glance.