$$ \newcommand{\E}{\mathbb{E}} \newcommand{\g}{\,|\,} \newcommand{\intd}{\;\mathrm{d}} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} \newcommand{\tp}{\!+\!} \newcommand{\ttimes}{\!\times\!} $$

Self-test answers

Make sure to have a proper attempt at the questions before looking at these answers. If necessary, work through tutorials on probability, linear algebra, and calculus first. If you look at the answers before you’ve had an honest attempt, it will be too easy to trick yourself into thinking you have no problems.

I expect some of the class will have no trouble with any of these questions. If you couldn’t do everything, that’s also ok. However, to do well in this course, this material needs to become second nature. So if you’re not nearly there, there might not be time to develop the mathematical experience required.

  1. The point of this question is to check that you know some of the basic rules of probability: the product rule, the sum rule, and independence. You should know the formulae, and how to apply them to special cases.

    1. The conditional probability distribution is \[P(x\g y\te1) = P(x, y\te1)/P(y\te1) \propto P(x, y\te1).\] You can find the marginal by the sum rule, \(P(y\te1) = \sum_x P(x, y\te1)\). Or write down what the conditional distribution is proportional to, and spot the normalization (which is equivalent). \[ P(x\g y\te1) \propto P(x, y\te1) \propto \begin{cases} 1 & x=1\\ 3 & x=2 \end{cases} ~=~ \begin{cases} \frac{1}{4} & x=1\\ \frac{3}{4} & x=2. \end{cases} \] Your conditional distribution, like any distribution, should be normalized (add up to one). You should have noticed if it did not!
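
      To check the normalization numerically, a quick sketch in Python (the unnormalized values 1 and 3 are taken from above):

      import numpy as np

      unnorm = np.array([1.0, 3.0])    # values proportional to P(x, y=1), for x = 1, 2
      print(unnorm / unnorm.sum())     # P(x | y=1) = [0.25 0.75]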

    2. Two variables are independent if and only if \(P(x, y) = P(x)P(y)\) for all \(x\) and \(y\). Equivalently you can check whether it’s always the case that \(P(y\g x)=P(y)\) or \(P(x\g y)=P(x)\).

      Here \(P(x\te1) = 0.3 \neq P(x\te1\g y\te1)\), so the variables are not independent. Computing other values would be a waste of time once you have already disproved independence. However, if the values had matched, you would have had to keep on checking.
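
      The whole check can also be done mechanically on a computer. A sketch in Python, using a joint table chosen to be consistent with the numbers quoted in these answers (not necessarily the table in the question):

      import numpy as np

      # Hypothetical joint P(x, y): rows are x = 1, 2; columns are y = 0, 1.
      P = np.array([[0.2, 0.1],
                    [0.4, 0.3]])
      Px = P.sum(axis=1)     # marginal P(x) = [0.3, 0.7]
      Py = P.sum(axis=0)     # marginal P(y) = [0.6, 0.4]
      print(np.allclose(P, np.outer(Px, Py)))   # False: not independent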

  2. Means and variances (not standard deviations) add for independent variables. (Means add for dependent variables too, but variances need not.) Scaling an outcome by a constant scales the variance by the square of that constant. Review the note on expectations for more detail; a quick simulation check is also sketched below.

    Mean of Z: \(\mu_x + 3\mu_y\)

    Standard deviation of Z: \(\sqrt{\sigma_x^2 + 9\sigma_y^2}\)
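
    The simulation check: the sketch below assumes \(Z = X \tp 3Y\) (as implied by the answers above), and picks arbitrary Gaussian distributions for \(X\) and \(Y\); any independent choices would do.

    import numpy as np

    rng = np.random.default_rng(0)
    mu_x, sigma_x = 1.0, 2.0      # arbitrary example values
    mu_y, sigma_y = -3.0, 5.0
    x = rng.normal(mu_x, sigma_x, size=10**6)
    y = rng.normal(mu_y, sigma_y, size=10**6)
    z = x + 3*y
    print(z.mean(), mu_x + 3*mu_y)                       # both close to -8
    print(z.std(), np.sqrt(sigma_x**2 + 9*sigma_y**2))   # both close to 15.13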

  3. This question was meant to test whether you are able to use mathematical knowledge to answer questions that aren’t just mechanical evaluation of equations. Learning some rules isn’t enough: you’ll need the mathematical experience to be able to select relevant results, without being told which ones to use.

    1. The product rule states: \[ P(X\te x, Y\te y) = P(X\te x\g Y\te y)\, P(Y\te y). \] As \(P(X\te x\g Y\te y)\) is a probability, it is between 0 and 1. Therefore, removing it would increase the value of the right-hand side (unless it is one): \[ P(X\te x, Y\te y) \le P(Y\te y), \] with equality when \(P(X\te x\g Y\te y) = 1\).

      To be thorough: the conditional probability is not defined when \(P(Y\te y)=0\). In that case \(P(X\te x, Y\te y)=0\) as well, so the inequality still holds, with equality. (I forgot this case on first writing these answers.)

      Alternatively you could write down the sum rule, and argue that the left-hand side is just one term in a positive sum that gives the right-hand side. Therefore the joint probability can’t possibly be more than the marginal.
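
      Written out with the sum rule, that argument is: \[ P(Y\te y) = \sum_{x'} P(X\te x', Y\te y) = P(X\te x, Y\te y) + \sum_{x'\neq x} P(X\te x', Y\te y) \;\ge\; P(X\te x, Y\te y), \] because every term in the remaining sum is non-negative.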

    2. In many situations, probability density functions (PDFs) can be treated as if they were probabilities. However, PDFs are not probabilities, and this question demonstrates that sometimes care is required.

      As before, the product rule is: \[ p(x, y) = p(x\g y)\, p(y). \] However, \(p(x\g y)\) is not bounded above: it’s infinite if \(x\) is completely determined given \(y\). This conditional density could also be zero. Thus \(p(x,y)\) can be more or less than \(p(y)\).
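
      A concrete sketch in Python (my own example, not from the question): take \(p(y)\te1\) on \([0,1]\), and let \(x\g y\) be Gaussian with a tiny standard deviation, so that \(x\) is nearly determined by \(y\):

      import numpy as np

      sigma = 0.01   # x | y ~ N(y, sigma^2): x nearly determined by y
      def gauss_pdf(x, mu, sigma):
          return np.exp(-0.5*((x - mu)/sigma)**2) / (sigma*np.sqrt(2*np.pi))

      p_y = 1.0                                 # p(y=0.5), as y ~ Uniform[0,1]
      p_x_given_y = gauss_pdf(0.5, 0.5, sigma)  # ~39.9: densities can exceed 1
      print(p_x_given_y * p_y)                  # p(x=0.5, y=0.5) > p(y=0.5)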

      If you didn’t spot the difference from i), please make sure you understand how PDFs relate to probabilities.

  4. You can easily check this question for yourself on a computer.

    Matlab/Octave:

    A = [1 3 5; 4 6 0];        % 2x3
    B = [0 5 2; 7 9 1; 8 2 3]; % 3x3
    C = A*B;                   % matrix product, a 2x3 matrix
    size(C)                    % prints: 2 3
    C(2,3)                     % prints: 14

    Python:

    import numpy as np
    
    A = np.array([[1, 3, 5], [4, 6, 0]])             # 2x3
    B = np.array([[0, 5, 2], [7, 9, 1], [8, 2, 3]])  # 3x3
    C = np.dot(A, B)                                 # matrix product, shape (2, 3)
    print(C.shape)       # (2, 3)
    print(C[2-1, 3-1])   # 14; the -1's because Python uses 0-based indexes

  5. The previous question tested a mechanical knowledge of how matrix multiplication works. You are also expected to have (or rapidly develop) some understanding of how matrices can correspond to transformations.

    The question said \(A\) is a \(2\ttimes3\) matrix, and \(X\) is an \(N\ttimes3\) matrix.

    1. A new data matrix \(Y = X A^\top A\) is created.

      \(Z = X A^\top\) is an \(N\ttimes2\) matrix, so each row is a point in two dimensions. Applying this transformation has projected the 3-dimensional data down into two dimensions.

      \(Y = Z A\) projects the 2-dimensional data \(Z\) back into three dimensions; however, all the points will still lie within a 2-dimensional plane contained within the three-dimensional space. It’s not possible to recover the original positions \(X\), as different values for a row of \(X\) would have been projected to the same 2-dimensional point in \(X A^\top\), and then to the same 3-dimensional point in \(X A^\top A\). Information was lost.

    2. Because applying the matrix \(A^\top A\) is an irreversible transformation, this matrix cannot be invertible.

      More technical answer: \(A\) has at most rank 2, so the product \(A^\top A\) has at most rank 2, and has at most 2 non-zero eigenvalues. As the matrix has size \(3\ttimes3\), it is low-rank and is not invertible.

    3. The matrix \(A A^\top\) could be invertible. It transforms two-dimensional data, via a temporary projection into 3D, which doesn’t necessarily lose any information. For most \(2\ttimes3\) matrices \(A\), the \(2\ttimes2\) matrix \(A A^\top\) will be full rank.

      Pathological case: if the columns of \(A\) are all multiples of each other, then \(A A^\top\) will be low rank and not invertible.

      For square matrices \(B\) and \(C\), \((BC)^{-1} = C^{-1}B^{-1}\). However, we can’t write \(A^{-\top} A^{-1}\) here, because only square matrices have inverses, and \(A\) is not square.
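
      A quick numerical check of the last two parts (my own sketch; a random matrix stands in for a generic \(A\)):

      import numpy as np

      rng = np.random.default_rng(1)
      A = rng.normal(size=(2, 3))         # a generic 2x3 matrix
      AtA = A.T @ A                       # 3x3
      AAt = A @ A.T                       # 2x2
      print(np.linalg.matrix_rank(AtA))   # 2: rank-deficient, so not invertible
      print(np.linalg.matrix_rank(AAt))   # 2: full rank
      print(np.linalg.inv(AAt))           # fine
      # Inverting AtA would be meaningless: it is singular (up to rounding error).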

  6. Given the function \(f(x,y) = (2x^2 + 3xy)^2\), the vector of partial derivatives is: \[ \nabla f = \left[ \begin{array}{c} \pdd{f}{x} \\[1ex] \pdd{f}{y} \end{array} \right] = \left[ \begin{array}{c} 2(2x^2 + 3xy)(4x + 3y) \\[1ex] 6x(2x^2 + 3xy) \end{array} \right]. \] You might have written your answer in a different form, for example by multiplying out the brackets.

    1. For a small vector \(\delta \te [\delta_x~\delta_y]^\top\), the inner-product with the partial derivatives or gradient vector is \[ (\nabla f)^\top\delta = \pdd{f}{x}\delta_x + \pdd{f}{y}\delta_y. \] For infinitesimal changes \(\delta_x\te\mathrm{d}x\) and \(\delta_y\te\mathrm{d}y\) this expression is the chain rule of partial derivatives, giving the infinitesimal change \(\mathrm{d}f\) in the function. For small finite changes, it will still give the approximate change in the function value: \[ f(x\tp\delta_x,\, y\tp\delta_y) - f(x, y) \;\approx\; \pdd{f}{x}\delta_x + \pdd{f}{y}\delta_y. \] This approximation can be used with small \(\delta\) to numerically check the analytically-derived derivatives. I did such a check to reduce the chance of making an embarrassing mistake in these answers. In research or a job, where you won’t have a lecturer providing answers, you’ll need ways to check things too.
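
      For example, a finite-difference check of the gradient above might look like this sketch (it uses central differences, a slightly more accurate variant of the approximation just described; the test point is arbitrary):

      import numpy as np

      f = lambda x, y: (2*x**2 + 3*x*y)**2
      grad_f = lambda x, y: np.array([2*(2*x**2 + 3*x*y)*(4*x + 3*y),
                                      6*x*(2*x**2 + 3*x*y)])

      x, y = 1.3, -0.4   # arbitrary test point
      eps = 1e-5
      num_grad = np.array([(f(x + eps, y) - f(x - eps, y)) / (2*eps),
                           (f(x, y + eps) - f(x, y - eps)) / (2*eps)])
      print(num_grad)       # should closely match...
      print(grad_f(x, y))   # ...the analytic gradient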

    2. We just saw that for moves from some position \((x,y)\) to \((x\tp\delta_x,y\tp\delta_y)\), the inner product \((\nabla f)^\top\delta\) gives the increase in the function value (in the limit of small \(\delta\)).

      \((\nabla f)^\top\delta = |\nabla f||\delta|\cos\theta\), where \(\theta\) is the angle between direction of the change and the gradient vector. Considering moves of the same length \(|\delta|\), in different directions \(\theta\), the change in function value is maximized by \(\cos\theta\te1\) or \(\theta\te0\). The function is increased the most by moving in the direction of the gradient vector.
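
      This claim is also easy to check numerically. A sketch (the point and step size are my own arbitrary choices):

      import numpy as np

      f = lambda p: (2*p[0]**2 + 3*p[0]*p[1])**2
      p = np.array([1.0, 2.0])
      g = np.array([2*(2*p[0]**2 + 3*p[0]*p[1])*(4*p[0] + 3*p[1]),
                    6*p[0]*(2*p[0]**2 + 3*p[0]*p[1])])   # gradient at p: [160, 48]
      eps = 1e-4
      u_grad = g / np.linalg.norm(g)   # unit vector along the gradient
      for u in [u_grad, np.array([1.0, 0.0]), np.array([0.0, 1.0]), -u_grad]:
          print(f(p + eps*u) - f(p))   # largest for the first, gradient, direction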

  7. The Uniform\([0,1]\) distribution has PDF \[ p(x) = \begin{cases} 1 & 0\le x\le 1 \\ 0 & \text{otherwise}. \end{cases} \] Its mean is \(\E[x] = \int_0^1 x\intd{x} = \frac{1}{2}\), so the variance is \[ \E[x^2] - \E[x]^2 = \int_0^1 x^2\intd{x} - \big(\tfrac{1}{2}\big)^2 = \frac{1}{3} - \frac{1}{4} = \frac{1}{12}. \] You might have computed \(\E[(x-\mu)^2] = \E[(x-\frac{1}{2})^2]\) instead, which should have led to the same answer. (An exercise on the expectations sheet is to prove that these two expressions for variance are equivalent.)

    This variance, like those of other standard distributions, can be looked up online: https://en.wikipedia.org/wiki/Uniform_distribution_(continuous)

    It’s also easy to check numerically: var(rand(1e7,1)) in Matlab/Octave or in Python: np.var(np.random.rand(int(1e7))). These should give a number close to 1/12.

  8. You might not have programmed in Matlab or Python before. You should have used a language that looks sufficiently similar that you can follow these answers.

    A Python function to find the minimum value in a list could look like this:

    def min_of_list(my_list):
        """minimum value in sequence my_list"""
        cur_min = my_list[0]
        for item in my_list[1:]:
            if item < cur_min:
                cur_min = item
        return cur_min

    We would actually use Python’s min() function, or np.min() on a NumPy array. Matlab/Octave also has a built-in min function, which you should normally use. However, a Matlab/Octave implementation could look like this:

    function the_min = min_of_list(my_list)
    %MIN_OF_LIST smallest number in input array
    
    cur_min = my_list(1);
    for ii = 2:numel(my_list)
        item = my_list(ii);
        if item < cur_min
            cur_min = item;
        end
    end
    the_min = cur_min;

    Both functions above will work when the list has only one item. However, if I had tried to access my_list[1] in Python or my_list(2) in Matlab, the code would crash given a list of only one item. Would your function have been ok?

    Both of my functions will throw errors if the list contains no items. Crashing may be desirable, as arguably the minimum item isn’t defined. Alternatively, you may wish to define some non-error behaviour for empty lists, such as (in Python):

    if len(my_list) == 0:
        return None

    That code will still throw an error if my_list isn’t a sequence that len() understands. One way to prevent those errors is by brute force:

    try:
        cur_min = my_list[0]
    except:
        return None

    However, blanket catching of all errors (including ones you haven’t thought of) is usually bad practice. It will make the program harder to debug later.

    cur_min has to be defined before we can start comparing to it. I have previously seen answers that declare cur_min = 0 at the start, which would give the wrong answer if every item in the list was positive. You could initialize with \(\infty\) if you assume the list contains numbers, and your programming language knows how to deal with \(\infty\). However, I thought using one of the data items was easier and more general.
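
    For reference, a sketch of that alternative in Python (it assumes the items are numbers):

    import math

    def min_of_list_inf(my_list):
        """Minimum of a sequence of numbers, starting from infinity."""
        cur_min = math.inf
        for item in my_list:
            if item < cur_min:
                cur_min = item
        return cur_min   # note: returns inf for an empty list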

    Even a trivial function can require a little bit of thought, and a little bit of experience to get right! Those of you with some programming experience should have no trouble. However, if you haven’t done any programming before, expect to do a serious amount of work to catch up.

    Maths background is more important than programming background. The assessed assignment is only worth 20% of the course. Also, the programs you write for this course will be fairly simple but mathematically dense.