$$\notag \newcommand{\E}{\mathbb{E}} \newcommand{\N}{\mathcal{N}} \newcommand{\be}{\mathbf{e}} \newcommand{\bh}{\mathbf{h}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\g}{\,|\,} \newcommand{\nth}{^{(n)}} \newcommand{\te}{\!=\!} $$

MLPR Tutorial Sheet 6

Reminder: If you need more guidance to get started on a question, seek clarifications and hints on the class forum. Move on if you’re getting stuck on a part for a long time. Full answers will be released after the last group meets.


  1. Classification with unbalanced data

    Classification tasks often involve rare outcomes, for example predicting click-through events, fraud detection, and disease screening. We’ll restrict ourselves to binary classification, \(y\in\{0,1\}\), where the positive class is rare: \(P(y\te1)\) is small.

    We are likely to have to collect a lot of data before observing enough of the rare \(y\te1\) events to train a good model. To save resources, it’s common to keep only a small fraction \(f\) of the \(y\te0\) ‘negative examples’. A classifier trained naively on this sub-sampled data would predict positive labels more frequently than they actually occur. For Bayes classifiers we could set the class probabilities based on the original class counts (or domain knowledge). This question explores what to do for logistic regression.

    We write that an input-output pair occurs in the world with probability \(P(\bx,y)\), and that the joint probability of an input-output pair (\(\bx,y\)) and keeping it (\(k\)) is \(P(\bx, y, k) = P(k\g\bx,y)\,P(\bx,y)\), with \(P(k\g\bx,y\te1)\te1\) and \(P(k\g\bx,y\te0)\te f\). Because conditional probabilities are proportional to joint probabilities, we can write: \[\notag P(y\g\bx,k)\propto P(\bx,y\g k) \propto P(\bx, y, k) = \begin{cases} P(\bx,y) & y=1\\ f\,P(\bx,y) & y=0.\end{cases} \]
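
    As a concrete picture of this sub-sampling scheme, here is a minimal Python sketch (the function name and interface are illustrative, not part of the question):

    ```python
    import numpy as np

    def subsample_negatives(X, y, f, seed=0):
        """Keep every y=1 example; keep each y=0 example with probability f."""
        rng = np.random.default_rng(seed)
        keep = (y == 1) | (rng.random(len(y)) < f)
        return X[keep], y[keep]
    ```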

    1. Show that if: \[\notag P(y\g \bx) \propto \begin{cases} c & y=1\\ d & y=0,\end{cases} \] then \(P(y\te1\g \bx) = \sigma(\log c - \log d)\), where \(\sigma(a) = 1/(1+e^{-a})\).

      [This exercise should be revision, given Tutorial 4, Q3.]
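
      The claim is easy to sanity-check numerically before you derive it. A quick Python sketch, with arbitrary positive \(c\) and \(d\):

      ```python
      import numpy as np

      def sigmoid(a):
          return 1 / (1 + np.exp(-a))

      c, d = 0.2, 3.0       # any positive constants
      p1 = c / (c + d)      # P(y=1 | x) after normalizing the two cases
      assert np.isclose(p1, sigmoid(np.log(c) - np.log(d)))
      ```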

    2. We train a logistic regression classifier, with a bias weight, to match subsampled data, so that \(P(y\te1\g \bx, k)\approx \sigma(\bw^\top\bx + b)\). Use the above results to argue that we should add \(\log f\) to the bias parameter to get a model for the real-world distribution \(P(y\g \bx) \propto P(y, \bx)\).

      Have we changed the bias in the direction that you would expect?
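
      A minimal sketch of this correction, using synthetic data and scikit-learn’s LogisticRegression (both choices are illustrative, not part of the question):

      ```python
      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(0)
      f = 0.1  # fraction of negatives kept

      # Synthetic subsampled training set: all positives, thinned negatives.
      X_pos = rng.normal(loc=1.0, size=(100, 2))
      X_neg = rng.normal(loc=-1.0, size=(int(10_000 * f), 2))
      X = np.vstack([X_pos, X_neg])
      y = np.concatenate([np.ones(len(X_pos)), np.zeros(len(X_neg))])

      clf = LogisticRegression().fit(X, y)  # fits P(y=1 | x, k)
      clf.intercept_ += np.log(f)           # bias shift: now approximates P(y=1 | x)
      ```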

    3. We now consider a different approach. We may wish to minimize the loss: \[\notag L = -\E_{P(\bx,y)}\!\left[ \log P(y\g\bx) \right]. \] Multiplying the integrand by \(1 \te \frac{P(\bx,\,y\g k)}{P(\bx,\,y\g k)}\) changes nothing, so we can write: \[\notag\begin{align} L &= -\int \sum_y P(\bx,y\g k)\, \frac{P(\bx,y)}{P(\bx,y\g k)} \,\log P(y\g\bx)\;\mathrm{d}\bx \notag\\ &= -\E_{P(\bx,y\g k)}\!\left[\frac{P(\bx,y)}{P(\bx,y\g k)} \,\log P(y\g\bx)\right] \notag\\ &\approx -\frac{1}{N} \sum_{n=1}^N \frac{P(\bx\nth,y\nth)}{P(\bx\nth,y\nth\g k)} \,\log P(y\nth\g\bx\nth),\notag \end{align}\] where the pairs \((\bx\nth,y\nth)\) come from the subsampled data. This manipulation is a special case of a trick known as importance sampling, which we will see later in lectures in another context. We have converted an expectation under the original data distribution into an expectation under the subsampling distribution, and then replaced the formal expectation with an average over subsampled data.

      Use the above idea to justify multiplying the gradients for \(y\te0\) examples by \(1/f\), when training a logistic regression classifier on subsampled data.
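
      As a concrete sketch of this scheme (minimal numpy; the function name and interface are illustrative):

      ```python
      import numpy as np

      def weighted_logreg_grads(w, b, X, y, f):
          """Gradients of the average negative log likelihood on subsampled
          data, with each y=0 example's contribution multiplied by 1/f."""
          p = 1 / (1 + np.exp(-(X @ w + b)))       # sigma(w'x + b)
          weights = np.where(y == 0, 1 / f, 1.0)   # importance weights
          err = weights * (p - y)
          return X.T @ err / len(y), err.mean()    # d/dw, d/db
      ```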

    4. Hard:  Compare the two methods that we have considered for training models based on subsampled data, giving some pros and cons of each.

  2. Building a toy neural network

    Consider the following classification problem. There are two real-valued features \(x_1\) and \(x_2\), and a binary class label. The class label is determined by \[\notag y = \begin{cases} 1 & \text{if}~x_2 \ge |x_1|\\ 0 & \text{otherwise}. \end{cases} \]

    1. Can this function be perfectly represented by logistic regression, or a feedforward neural network without a hidden layer? Why or why not?

    2. Consider a simpler problem for a moment, the classification problem \[\notag y = \begin{cases} 1 & \text{if}~x_2 \ge x_1\\ 0 & \text{otherwise}. \end{cases} \] Design a single ‘neuron’ that represents this function. Pick the weights by hand. Use the hard threshold function \[\notag \Theta(a) = \begin{cases} 1 & \text{if}~a \ge 0\\ 0 & \text{otherwise}, \end{cases} \] applied to a linear combination of the \(\bx\) inputs.

    3. Now go back to the classification problem at the beginning of this question. Design a two layer feedforward network (that is, one hidden layer with two layers of weights) that represents this function. Use the hard threshold activation function as in the previous question.

      Hints: Use two units in the hidden layer. The unit from the last question will be one of the units, and you will need to design one more. You can then find settings of the weights such that the output unit performs a binary AND operation on the two hidden units.
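
      If you want to check a design numerically, the harness below may help. The weight values in it are one possible setting consistent with the hints, so attempt the question before reading them.

      ```python
      import numpy as np

      def Theta(a):
          return (a >= 0).astype(float)  # hard threshold activation

      # Hidden layer: h1 = Theta(x2 - x1), h2 = Theta(x2 + x1).
      W1 = np.array([[-1.0, 1.0],
                     [ 1.0, 1.0]])
      b1 = np.array([0.0, 0.0])
      # Output unit: binary AND of the two hidden units.
      w2 = np.array([1.0, 1.0]); b2 = -1.5

      rng = np.random.default_rng(0)
      X = rng.uniform(-1, 1, size=(10_000, 2))
      y_true = (X[:, 1] >= np.abs(X[:, 0])).astype(float)
      H = Theta(X @ W1.T + b1)
      y_net = Theta(H @ w2 + b2)
      assert np.all(y_net == y_true)
      ```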

  3. Regression with input-dependent noise

    In lectures we turned a representation of a function into a probabilistic model of real-valued outputs by modelling the residuals as Gaussian noise: \[\notag p(y\g\bx,\theta) = \N(y;\, f(\bx;\theta),\, \sigma^2). \] The noise variance \(\sigma^2\) is often assumed to be a constant, but it could be a function of the input location \(\bx\) (a “heteroscedastic” model).

    A flexible model could set the variance using a neural network: \[\notag \sigma(\bx)^2 = \exp(\bw^{(\sigma)\top}\bh(\bx;\theta) + b^{(\sigma)}), \] where \(\bh\) is a vector of hidden unit values. These could be hidden units from the neural network used to compute the function \(f(\bx;\theta)\), or there could be a separate network to model the variances.
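
    For reference when fitting this model, the negative log-density of a single observation under the Gaussian noise model is: \[\notag -\log p(y\g\bx,\theta) = \frac{1}{2}\log\!\big(2\pi\sigma(\bx)^2\big) + \frac{\big(y - f(\bx;\theta)\big)^2}{2\sigma(\bx)^2}. \]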

    1. Assume that \(\bh\) is the final layer of the same neural network used to compute \(f\). How could we modify the training procedure for a neural network that fits \(f\) by least squares, to fit this new model?
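
      A minimal numpy sketch of the resulting training cost (in practice you would write it in an automatic-differentiation framework, so that gradients flow into both the \(f\) and \(\log\sigma^2\) outputs):

      ```python
      import numpy as np

      def heteroscedastic_nll(y, mean, log_var):
          """Average negative log likelihood of y under N(mean, exp(log_var)),
          where `mean` is f(x; theta) and `log_var` is the activation
          a_sigma, both computed from the network's hidden layer."""
          return np.mean(0.5 * (np.log(2 * np.pi) + log_var
                                + (y - mean) ** 2 / np.exp(log_var)))
      ```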

    2. In the suggestion above, the activation \(a^{(\sigma)} = \bw^{(\sigma)\top}\bh + b^{(\sigma)}\) sets the log of the variance of the observations.

      1. Why not set the variance directly to this activation value, \(\sigma^2 = a^{(\sigma)}\)?

      2. Hard:  Why not set the variance to the square of the activation, \(\sigma^2 = (a^{(\sigma)})^2\)?

    3. Given a test input \(\bx^{(*)}\), the model above outputs both a guess of an output, \(f(\bx^{(*)})\), and an ‘error bar’ \(\sigma(\bx^{(*)})\), which indicates how wrong the guess could be.

      The Bayesian linear regression and Gaussian process models covered in lectures also give error bars on their predictions. What are the pros and cons of the neural network approach in this question? Would you use this neural network to help guide which experiments to run?


  1. Q2 is based on a previous sheet by Amos Storkey, Charles Sutton, and/or Chris Williams.