$$ \newcommand{\ba}{\mathbf{a}} \newcommand{\bphi}{{\boldsymbol{\phi}}} \newcommand{\bv}{\mathbf{v}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\bz}{\mathbf{z}} \newcommand{\g}{\,|\,} \newcommand{\la}{\!\leftarrow\!} \newcommand{\te}{\!=\!} \newcommand{\tp}{\!+\!} $$

MLPR Tutorial Sheet 1

You should attempt the tutorial questions before your tutorial.

I strongly recommend you also discuss these questions—and the course in general—with your peers. You could try to meet up with people from your tutorial group (their email addresses are listed in the tutorial assignments), or other people you have met in lectures.

As a student I’d attempt tutorial questions by myself at first, but usually couldn’t do absolutely everything in a sensible amount of time. (If you can’t do a part, skip it at first and move on!) I’d then meet up with a friend from the class, and we’d pool our understanding. Finally we’d ask someone else, or ask questions in a tutorial, about the parts that neither of us could manage.

  1. Linear Regression and linear transformations:

    Alice fits a function \(f(\bx) = \bw^\top\bx\) to a training set of \(N\) datapoints \(\{\bx^{(n)},y^{(n)}\}\) by least squares. The inputs \(\bx\) are \(D\)-dimensional column vectors.

    Bob has heard that by transforming the inputs \(\bx\) with a vector-valued function \(\bphi\), he can fit an alternative function, \(g(\bx) = \bv^\top\bphi(\bx)\), with the same fitting code. He decides to use a linear transformation \(\bphi(\bx) = A\bx\), where \(A\) is an invertible matrix.

    1. Show that Bob’s procedure will fit the same function as Alice’s original procedure. 
      NB You don’t have to do any extensive mathematical manipulation. You also don’t need a mathematical expression for the least squares weights.

    2. Could Bob’s procedure be better than Alice’s if the matrix \(A\) is not invertible? 
      [If you need a hint, it may help to remind yourself of the discussion involving invertible matrices in the pre-test answers.]

    3. Alice becomes worried about overfitting, adds a regularizer \(\lambda \bw^\top\bw\) to the least-squares error function, and refits the model. Assuming \(A\) is invertible, can Bob choose a regularizer so that he will still always obtain the same function as Alice?

    4. Bonus part: Only do this part this week if you have time. Otherwise review it later. 
      Suppose we wish to find the vector \(\bv\) that minimizes the function \[ (\by - \Phi\bv)^\top(\by - \Phi\bv) + \bv^\top M \bv. \]

      1. Show that \(\bv^\top M \bv = \bv^\top(\frac{1}{2}M + \frac{1}{2}M^\top) \bv\), and hence that we can assume without loss of generality that \(M\) is symmetric.

      2. Why would we usually choose \(M\) to be positive semi-definite in a regularizer, meaning that \(\ba^\top M \ba \ge 0\) for all vectors \(\ba\)?

      3. Assume we can find a factorization \(M = LL^\top\). Can we minimize the function above using a standard routine that can minimize \((\bz-X\bw)^\top(\bz-X\bw)\) with respect to \(\bw\)?

  2. Radial Basis Functions (RBFs):

    In this question we form a linear regression model for one-dimensional inputs: \(f(x) = \bw^\top\bphi(x;h)\), where \(\bphi(x;h)\) evaluates the input at 101 basis functions. The basis functions \[ \phi_k(x) = e^{-(x-c_k)^2/h^2} \] share a common user-specified bandwidth \(h\), while the positions of the centers are set to make the basis functions overlap: \(c_k \te (k-51)h/\sqrt{2}\), with \(k = 1\ldots 101\). The free parameters of the model are the bandwidth \(h\) and weights \(\bw\).

    The model is used to fit a dataset with \(N\te70\) observations each with inputs \(x\in[-1,+1]\). Assume each of the observations has outputs \(y\in[-1,+1]\) also. The model is fitted for any particular \(h\) by transforming the inputs using that bandwidth, then minimizing the regularized least squares cost: \[C = (\by-\Phi\bw)^\top(\by-\Phi\bw) + 0.1\bw^\top\bw.\]

    1. Explain why many of the weights will be close to zero when \(h\te0.2\), and why even more weights will probably be close to zero when \(h\te1.0\).

    2. It is suggested that we could choose \(h\) by fitting \(\bw\) for each \(h\) in a grid of values, and pick the \(h\) which led to a fit with the smallest cost \(C\). Explain whether this suggestion is a good idea, or whether you would modify the procedure.

    3. Another data set with inputs \(x\in[-1,+1]\) arrives, but now you notice that all of the observed outputs are larger, \(y\in[1000,1010]\). What problem would we encounter if we performed linear regression as above to this data? How could this problem be fixed?

  3. Logistic Sigmoids:

    1. Sketch (with pen and paper) a contour plot of the sigmoidal function \[\phi(\bx) = \sigma(\bv^\top\bx + b),\] for \(\bv \te [1~2]^\top\) and \(b \te 5\), where \(\sigma(a) \te 1/(1\tp\exp(-a))\)
      Indicate the precise location of the \(\phi\te 0.5\) contour on your sketch, and give at least a rough indication of some other contours. Also mark the vector \(\bv\) on your diagram, and indicate how its direction is related to the contours. 
      Hints: What happens to \(\phi\) as \(\bx\) moves orthogonal (perpendicular) to \(\bv\)? What happens to \(\phi\) as \(\bx\) moves parallel to \(\bv\)? To draw the \(\phi\te0.5\) contour, it may help to identify special places where it is easy to show that \(\phi\te0.5\).

    2. If \(\bx\) and \(\bv\) were three-dimensional, what would the contours of \(\phi\) look like, and how would they relate to \(\bv\)? (A sketch is not expected.)