MLPR w0f - Machine Learning and Pattern Recognition

$$ \newcommand{\E}{\mathbb{E}} \newcommand{\la}{\!\leftarrow\!} \newcommand{\noo}[1]{\frac{1}{#1}} \newcommand{\ra}{\!\rightarrow\!} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} \newcommand{\ttimes}{\!\times\!} $$

Expectations and sums of variables

You are expected to know some probability theory, including expectations/averages. This sheet reviews some of that background.

Probability Distributions / Notation

An outcome, $x$, comes from a discrete set or ‘alphabet’ $\mathcal{A}_X = \{a_1,a_2,\dots, a_I\}$, with corresponding probabilities $\mathcal{P}_X = \{p_1,p_2, \dots, p_I\}$.

Examples

A standard six-sided die has $\mathcal{A}_X = \{1,2,3,4,5,6\}$ with corresponding probabilities $\mathcal{P}_X = \{\noo{6},\noo{6},\noo{6},\noo{6},\noo{6},\noo{6}\}$.

A Bernoulli distribution, which has probability distribution \[ P(x) = \begin{cases}1 & p,\\0 & \text{otherwise,}\end{cases} \] has alphabet $\mathcal{A}_X = \{1,0\}$ with $\mathcal{P}_X = \{p,1\tm p\}$.

Expectations

An expectation is a property of a probability distribution, defined by a probability-weighted sum. The expectation of some function, $f$, of an outcome, $x$, is: \[ \E_{P(x)}[f(x)] = \sum_{i=1}^I p_i f(a_i). \] Often the subscript $P(x)$ is dropped from the notation because the reader knows under which distribution the expectation is being taken. Notation can vary considerably, and details are often dropped. You might also see $E[f]$, $\mathcal{E}[f]$, or $\langle f\rangle$, which all mean the same thing.

The expectation is sometimes a useful representative value of a random function value. The expectation of the identity function, $f(x)\te x$, is the ‘mean’, which is one measure of the centre of a distribution.

The expectation is a linear operator:

\[ \E[f(x) + g(x)] = \E[f(x)] + \E[g(x)] \quad \text{and}\quad \E[cf(x)] = c\E[f(x)]. \] These properties are apparent if you explicitly write out the summations.

The expectation of a constant with respect to $x$ is the constant:

\[ \E[c] = c\sum_{i=1}^I p_i = c, \] because probability distributions sum to one (‘probabilities are normalized’).

The expectation of independent outcomes separate:

\[ \E[f(x)g(y)] = \E[f(x)]\,\E[g(y)]. \] True if $x$ and $y$ are independent.

The mean

The mean of a distribution over a number, is simply the ‘expected’ value of the numerical outcome. \[ \text{`Expected Value'} = \text{`mean'} = \mu = \E[x] = \sum_{i=1}^I p_i a_i. \] For a six-sided die: \[ \E[x] = \frac{1}{6}\ttimes1 + \frac{1}{6}\ttimes2 + \frac{1}{6}\ttimes3 + \frac{1}{6}\ttimes4 + \frac{1}{6}\ttimes5 + \frac{1}{6}\ttimes6 = 3.5. \] In every day language I wouldn’t say that I ‘expect’ to see 3.5 as the outcome of throwing a die… I expect to see an integer! However, 3.5 is the expected value as defined. Similarly a single Bernoulli outcome will be a zero or a one, but its ‘expected’ value is a fraction, \[ \E[x] = p\ttimes1 + (1\tm p)\ttimes0 = p, \] the probability of getting a one.

Change of units: I might have a distribution over heights measured in metres, for which I have computed the mean. If I multiply the heights by 100 to obtain heights in centimetres, the mean in centimetres can be obtained by multiplying the mean in metres by 100. Formally: $\E[100\,x] = 100\,\E[x]$.

The variance

The variance is also an expectation, measuring the average squared distance from the mean: \[ \text{var}[x] = \sigma^2 = \E[(x-\mu)^2] = \E[x^2] - \E[x]^2, \] where $\mu\te\E[x]$ is the mean.

Exercise 4: show that $\text{var}[x + y] = \text{var}[x] + \text{var}[y]$, for independent outcomes $x$ and $y$.

Exercise 5: Given outcomes distributed with mean $\mu$ and variance $\sigma^2$, how could you shift and scale them to have mean zero and variance one?

Change of units: If the outcome $x$ is a height measured in metres, then $x^2$ has units $\mathrm{m}^2$; $x^2$ is an area. The variance also has units $\mathrm{m}^2$, it cannot be represented on the same scale as the outcome, because it has different units. If you multiply all heights by 100 to convert to centimetres, the variance is multiplied by $100^2$. Therefore, the relative size of the mean and the variance depends on the units you use, and so often isn’t meaningful.

Standard deviation: The standard deviation $\sigma$, the square root of the variance, does have the same units as the mean. Therefore it is a meaningful number to use as a typical distance from the mean. Often variances are used in intermediate calculations because they are easier to deal with: it is variances that add, not standard deviations.

Sums of independent variables: “random walks”

A drunkard starts at the centre of an alleyway, with exits at each end. He takes a sequence of random staggers either to the left or right along the alleyway. His position after $N$ steps is $k_N = \sum_{n=1}^N x_n$, where the outcomes, $\{x_n\}$, the staggering motions, are drawn from some ensemble with zero mean and finite variance $\sigma^2$. For example $\mathcal{A}_X = \{-1,+1\}$ with $\mathcal{P}_X = \{\noo{2},\noo{2}\}$, which has $\E[x_n]\te0$ and $\text{var}[x_n]\te1$.

If the drunkard started in the centre of the alleyway, will he ever escape? If so, roughly how long will it take? (If you don’t already know, have a think…)

The expected, or mean position after $N$ steps is $\E[k_N] = N\E[x_n] = 0$. This doesn’t mean we don’t think he’ll escape. There are ways of escaping both left and right, and ‘on average’ he’ll stay in the middle.

The variance of the position is $\text{var}[k_N] = N\text{var}[x_n] = N\sigma^2$. The standard deviation of the position is then $\text{std}[k_N] = \sqrt{N}\sigma$, and is a measure of the width of the distribution over the displacement from the centre of the alleyway. If we double the length of the alley, then it will typically take four times the number of random steps to escape.

Worthwhile remembering: the typical magnitude of the sum of $N$ independent zero-mean variables scales with $\sqrt{N}$. The invidual variables need to have finite variance, and ‘typical magnitude’ is measured by standard deviation. Sometimes you might have to work out the $\sigma$ for your problem, or do other detailed calculations. But sometimes the scaling of the width of the distribution is all that really matters.

Corollary: the typical magnitude of the mean of $N$ independent zero-mean variables with finite variance scales with $1/\sqrt{N}$.

Solutions

As always, you are strongly recommended to work hard on a problem yourself before looking at the solutions. As you transition into doing research, there won’t be any answers, and you have to build confidence in getting and checking your own answers.

Exercise 1: For independent outcomes $x$ and $y$, $p(x,y)\te p(x)p(y)$ and so
$\E[f(x)g(y)] = \sum_x\sum_y p(x)p(y)f(x)g(y) = \sum_x p(x)f(x) \sum_y p(y)g(y) = \E[f(x)]\E[g(y)]$.

Exercise 2: $\E[(x-\mu)^2] = \E[x^2 + \mu^2 - 2x\mu] = \E[x^2] + \mu^2 - 2\mu\E[x] = \E[x^2] - \mu^2$.

Exercise 3: $\text{var}[cx] = \E[(cx)^2] -\E[cx]^2 = \E[c^2x^2] - (c\E[x])^2 = c^2(\E[x^2] - \E[x]^2) = c^2\text{var}[x]$.

Exercise 4: $\text{var}[x + y] = \E[(x+y)^2] - \E[x+y]^2 = \E[x^2] + \E[y^2] + 2\E[xy] - (\E[x]^2 + \E[y]^2 + 2\E[x]\E[y]) = \text{var}[x] + \text{var}[y]$, if $\E[xy]\te\E[x]\E[y]$, true if $x$ and $y$ are independent variables.

Exercise 5: $z = (x-\mu)/\sigma$ has mean 0 and variance 1. The division is by the standard deviation, not the variance. You should now be able to prove this result for yourself.

What to remember: using the expectation notation where possible, rather than writing out the summations or integrals explicitly, makes the mathematics concise.