$$\notag \newcommand{\ba}{\mathbf{a}} \newcommand{\bphi}{{\boldsymbol{\phi}}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\g}{\,|\,} \newcommand{\la}{\!\leftarrow\!} \newcommand{\te}{\!=\!} \newcommand{\tp}{\!+\!} $$

MLPR Tutorial Sheet 1

You should attempt the tutorial questions before your tutorial.

We strongly recommend you discuss these questions — and the course in general — with your peers. You could work at ML-Base, or arrange to meet up with people from your tutorial group (e.g., through Learn) or other people you have met in lectures.

If you can’t do a part, skip it at first and move on! After attempting what you can, try to meet up with friend from the class and pool your understanding. Tutorials run more smoothly if you have agreed with someone what you want to talk about.

Your tutorial session usually won’t discuss every part. The point of tutorials isn’t to give you the answers: detailed answers will be made available after the week’s tutorials, and can be discussed further on Hypothesis. The tutorial sessions are mainly useful for giving you practice at explaining your thinking, and discussing particular points that the group might find helpful.


  1. Linear Regression and linear transformations:

    Alice fits a function \(f(\bx) = \bw^\top\bx\) to a training set of \(N\) datapoints \(\{\bx^{(n)},y^{(n)}\}\) by least squares. The inputs \(\bx\) are \(D\)-dimensional column vectors. You can assume a unique setting of the weights \(\bw\) minimizes the square error on the training set.

    Bob has heard that by transforming the inputs \(\bx\) with a vector-valued function \(\bphi\), he can fit an alternative function, \(g(\bx) = \bv^\top\bphi(\bx)\), with the same fitting code. He decides to use a linear transformation \(\bphi(\bx) = A\bx\), where \(A\) is an invertible matrix.

    1. Show that Bob’s procedure will fit the same function as Alice’s original procedure. 
       
      NB You don’t have to do any extensive mathematical manipulation. You also don’t need a mathematical expression for the least squares weights. Hint: reason about the sets of functions that Alice and Bob are choosing their functions from.

    2. Could Bob’s procedure be better than Alice’s if the matrix \(A\) is not invertible? 
       
      [If you need a hint, it may help to remind yourself of the discussion involving invertible matrices in the pre-test answers.]

    3. Alice becomes worried about overfitting, adds a regularizer \(\lambda \bw^\top\bw\) to the least-squares error function, and refits the model. Assuming \(A\) is invertible, can Bob choose a regularizer so that he will still always obtain the same function as Alice?

    4. Bonus part: Only do this part this week if you have time. Otherwise review it later. 
      Suppose we wish to find the vector \(\bv\) that minimizes the function \[\notag (\by - \Phi\bv)^\top(\by - \Phi\bv) + \bv^\top M \bv. \]

      1. Show that \(\bv^\top M \bv = \bv^\top(\frac{1}{2}M + \frac{1}{2}M^\top) \bv\), and hence that we can assume without loss of generality that \(M\) is symmetric.

      2. Why would we usually choose \(M\) to be positive semi-definite in a regularizer, meaning that \(\ba^\top M \ba \ge 0\) for all vectors \(\ba\)?

      3. Assume we can find a factorization \(M = AA^\top\). Can we minimize the function above using a standard routine that can minimize \((\bz-X\bw)^\top(\bz-X\bw)\) with respect to \(\bw\)?

  2. Logistic Sigmoids:

    1. Sketch — with pen and paper — a contour plot of the sigmoidal function \[\notag\phi(\bx) = \sigma(\bv^\top\bx + b),\] for \(\bv \te [1~2]^\top\) and \(b \te 5\), where \(\sigma(a) \te 1/(1\tp\exp(-a))\)
       
      Indicate the precise location of the \(\phi\te 0.5\) contour on your sketch, and give at least a rough indication of some other contours. Also mark the vector \(\bv\) on your diagram, and indicate how its direction is related to the contours. 
       
      Hints: What happens to \(\phi\) as \(\bx\) moves orthogonal (perpendicular) to \(\bv\)? What happens to \(\phi\) as \(\bx\) moves parallel to \(\bv\)? To draw the \(\phi\te0.5\) contour, it may help to identify special places where it is easy to show that \(\phi\te0.5\).

    2. If \(\bx\) and \(\bv\) were three-dimensional, what would the contours of \(\phi\) look like, and how would they relate to \(\bv\)? (A sketch is not expected.)

  3. Radial Basis Functions (RBFs):

    In this question we form a linear regression model for one-dimensional inputs: \(f(x) = \bw^\top\bphi(x;h)\), where \(\bphi(x;h)\) evaluates the input at 101 basis functions. The basis functions \[\notag \phi_k(x) = e^{-(x-c_k)^2/h^2} \] share a common user-specified bandwidth \(h\), while the positions of the centers are set to make the basis functions overlap: \(c_k \te (k-51)h/\sqrt{2}\), with \(k = 1\ldots 101\). The free parameters of the model are the bandwidth \(h\) and weights \(\bw\).

    The model is used to fit a dataset with \(N\te70\) observations each with inputs \(x\in[-1,+1]\). Assume each of the observations has outputs \(y\in[-1,+1]\) also. The model is fitted for any particular \(h\) by transforming the inputs using that bandwidth into a feature matrix \(\Phi\), then minimizing the regularized least squares cost: \[\notag C = (\by-\Phi\bw)^\top(\by-\Phi\bw) + 0.1\bw^\top\bw.\]

    1. Explain why many of the weights will be close to zero when \(h\te0.2\), and why even more weights will probably be close to zero when \(h\te1.0\).

    2. It is suggested that we could choose \(h\) by fitting \(\bw\) for each \(h\) in a grid of values, and pick the \(h\) which led to a fit with the smallest cost \(C\). Explain whether this suggestion is a good idea, or whether you would modify the procedure.

    3. Another data set with inputs \(x\!\in\![-1,+1]\) arrives, but now you notice that all of the observed outputs are larger, \(y\!\in\![1000,1010]\). What problem would we encounter if we performed linear regression as above to this data? How could this problem be fixed?