MLPR w3c - Machine Learning and Pattern Recognition

$\newcommand{\bphi}{{\boldsymbol{\phi}}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\g}{\,|\,} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} \newcommand{\tp}{\!+\!}$

Logistic Regression

There is a lot more that could be said about linear regression. But I’m going to leave most of that for statistics courses. There is also a lot more that could be said about gradient-based optimization, and I’ll return to some of it later. In this note we’ll add a non-linearity to our model, and introduce one of the most used machine learning models.

Transforming the output

We saw that we could attempt to fit the probability of being in a particular class with straightforward linear regression models. However, we are likely to see outputs outside the range $[0,1]$ for some test inputs. We will now force our function to lie in the desired $[0,1]$ range by transforming a linear function with a logistic sigmoid:

$f(\bx; \bw) = \sigma(\bw^\top\bx) = \frac{1}{1+e^{-\bw^\top\bx}}\,.$

As with linear regression, we can replace our input features $\bx$ with a vector of basis function values $\bphi(\bx)$. I won’t clutter the notation with this detail. Wherever you see $\bx$, know that you can replace this input vector with a version that has been transformed in any way you like.
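To make the transformation concrete, here is a minimal NumPy sketch of the model. The function names `sigmoid` and `predict_proba`, and the toy inputs and weights, are my own illustrative choices, not anything fixed by the course.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), applied elementwise."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(X, w):
    """f(x; w) = sigma(w^T x) for each row x of the N x D array X."""
    return sigmoid(X @ w)

# Hypothetical toy inputs (first column is a constant feature) and weights
X = np.array([[1.0, 2.0],
              [1.0, -3.0],
              [1.0, 0.0]])
w = np.array([0.5, 2.0])

linear_out = X @ w           # plain linear function: can fall outside [0, 1]
probs = predict_proba(X, w)  # after the sigmoid: inside (0, 1) for these inputs
print(linear_out, probs)
```

Replacing `X` with a matrix of basis function values would need no change to either function.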

Loss function

As before, we wish to fit the function to match the training data closely. If the labels are zero and one, $y\in\{0,1\}$, we could minimize the square loss

$\sum_{n=1}^N \big(y^{(n)} - f(\bx^{(n)}; \bw)\big)^2.$

However, the Logistic Regression model uses the interpretation of the function as a probability, $f(\bx; \bw) = P(y\te1\g \bx,\bw)$, more directly. Maximum likelihood fitting of this model maximizes the probability of the data:

$L(\bw) = \prod_{n=1}^N P(y^{(n)}\g \bx^{(n)}, \bw),$

for the model with parameters $\bw$. Equivalently, we minimize the negative log-probability of the training labels, which for this model can be written as:

$\text{NLL} = -\log L(\bw) = - \sum_{n=1}^N \log \left[\sigma(\bw^\top\bx^{(n)})^{y^{(n)}}(1 \tm \sigma(\bw^\top\bx^{(n)}))^{1 - y^{(n)}} \right],$

or

$\text{NLL} = - \sum_{n: y^{(n)}=1} \log \sigma(\bw^\top\bx^{(n)}) - \sum_{n: y^{(n)}=0} \log (1 - \sigma(\bw^\top\bx^{(n)})).$

There is a trick to write the cost function more compactly. We transform the labels to be $z^{(n)}\in\{-1,+1\}$ where $z^{(n)}=(2y^{(n)}\tm1)$, and noticing $\sigma(-a) = 1 - \sigma(a)$, we can write:

$\text{NLL} = - \sum_{n=1}^N \log \sigma(z^{(n)}\bw^\top\bx^{(n)}).$

As before, either cost function can have a regularizer added to it to discourage extreme weights.
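As a quick sanity check that the compact form really is the same cost, here is a hedged NumPy sketch that evaluates the two-sum form and the $z$-label form on made-up data. The function names and the synthetic data are assumptions for illustration. The compact version uses the identity $-\log\sigma(a) = \log(1+e^{-a})$, evaluated with `np.logaddexp`, which is one way to avoid taking the log of a number that has rounded to zero.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll_two_sums(w, X, y):
    """NLL as separate sums over the y=1 and the y=0 examples."""
    s = sigmoid(X @ w)
    return -np.sum(np.log(s[y == 1])) - np.sum(np.log(1.0 - s[y == 0]))

def nll_compact(w, X, y):
    """NLL = -sum_n log sigma(z_n w^T x_n), with z_n = 2 y_n - 1."""
    z = 2 * y - 1
    # -log sigma(a) = log(1 + exp(-a)) = logaddexp(0, -a)
    return np.sum(np.logaddexp(0.0, -z * (X @ w)))

# Hypothetical synthetic data, only for illustration
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = rng.integers(0, 2, size=20)   # labels in {0, 1}
w = rng.standard_normal(3)

print(nll_two_sums(w, X, y), nll_compact(w, X, y))  # the two values should agree
```

With all-zero weights every prediction is $0.5$, so both functions should return $N\log 2$, which is a handy baseline when judging a fitted model.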

Maximum likelihood estimation has some good statistical properties. In particular, asymptotically (for lots of data) it is the most efficient estimator. However, the loss can be extreme where confident wrong predictions are made, which could mean that outliers cause more problems than with the square loss approach.

Gradients

The final required ingredient is the gradient vector $\nabla_\bw \text{NLL}$. A gradient-based optimizer can then find the weights that minimize our cost.

The derivative of the logistic sigmoid is as follows¹:

$\pdd{\sigma(a)}{a} = \sigma(a)(1-\sigma(a)),$

which tends to zero at the asymptotes $a\rightarrow \pm\infty$.
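As a sketch of where this expression comes from (a standard calculation, written out here for convenience), differentiate $\sigma(a) = (1+e^{-a})^{-1}$ directly and use $1-\sigma(a) = e^{-a}/(1+e^{-a})$:

$\begin{aligned} \pdd{\sigma(a)}{a} &= \frac{e^{-a}}{(1+e^{-a})^2} = \frac{1}{1+e^{-a}}\cdot\frac{e^{-a}}{1+e^{-a}} = \sigma(a)(1-\sigma(a)). \end{aligned}$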

I like to derive the derivatives using the third form of the cost function NLL, because it’s shorter, although I think most books use the first form. We’ll get an equivalent answer. For brevity, I’ll use $\sigma_n = \sigma(z^{(n)}\bw^\top\bx^{(n)})$. We then apply the chain rule:

$\begin{aligned} \nabla_\bw \text{NLL} &= - \sum_{n=1}^N \nabla_\bw \log \sigma_n = - \sum_{n=1}^N \frac{1}{\sigma_n} \nabla_\bw \sigma_n = - \sum_{n=1}^N \frac{1}{\sigma_n} \sigma_n (1-\sigma_n) \nabla_\bw \big(z^{(n)}\bw^\top\bx^{(n)}\big)\\ &= - \sum_{n=1}^N (1-\sigma_n) z^{(n)}\bx^{(n)}. \end{aligned}$

Interpretation: $\sigma_n$ is the probability assigned to the correct training label. So when the classifier is confident and correct on an example, that example contributes little to the gradient. Stochastic gradient descent will improve the predictions for examples the classifier gets wrong, or is less confident about, by moving the weights along the direction of the corresponding input, with the sign set by the label $z^{(n)}$.
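The gradient expression can be turned into a few lines of NumPy. The sketch below is illustrative only: the function names, the synthetic data, the "true" weights used to generate labels, and the hand-picked step size are all assumptions. It also runs plain batch gradient descent rather than the stochastic variant mentioned above, simply to keep the example short.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nll(w, X, y):
    """Compact cost: -sum_n log sigma(z_n w^T x_n), with z_n = 2 y_n - 1."""
    z = 2 * y - 1
    return np.sum(np.logaddexp(0.0, -z * (X @ w)))

def nll_grad(w, X, y):
    """Gradient of the cost: -sum_n (1 - sigma_n) z_n x_n."""
    z = 2 * y - 1
    sigma_n = sigmoid(z * (X @ w))        # probability given to the correct label
    return -X.T @ ((1.0 - sigma_n) * z)

# Hypothetical synthetic problem: labels sampled from a logistic model
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
w_true = np.array([2.0, -1.0, 0.5])
y = (rng.random(50) < sigmoid(X @ w_true)).astype(int)

w = np.zeros(3)
nll_start = nll(w, X, y)
for _ in range(200):
    w = w - 0.01 * nll_grad(w, X, y)      # small step size, chosen by hand
nll_end = nll(w, X, y)
print(nll_start, nll_end)                 # the cost should have gone down
```

Working through the two label cases shows that `nll_grad` agrees with the expression $\sum_n \big(\sigma(\bw^\top\bx^{(n)}) - y^{(n)}\big)\bx^{(n)}$ that derivations starting from the first form of the cost arrive at.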

Some things to know about gradients

Whenever we can compute a differentiable cost function, it should always be possible to compute all of the derivatives at once in a similar number of operations to one function evaluation. That’s an amazing result from the old field of automatic differentiation. (Caveat: it might take a lot of memory.) So if your derivatives are orders of magnitude more expensive than your cost function, you are probably doing something wrong.

Despite a long history, few people use fully automatic differentiation in machine learning. There are machine learning tools like Theano and TensorFlow that will do most of the work for you. But some people² still need to do some work by hand so that those tools can work on all the models that we might build.

Whether you are differentiating by hand, or writing a compiler to compute derivatives, you need to test your code. Derivatives are easily checked by finite differences:

$\pdd{f(w)}{w} \approx \frac{f(w\tp\epsilon/2) - f(w\tm\epsilon/2)}{\epsilon},$

and so should always be checked. Unless the weights are extreme, I’d normally set $\epsilon\te10^{-5}$. You have to perturb each parameter in turn to evaluate one element of a gradient vector $\nabla_\bw f(\bw)$. Therefore, for $D$-dimensional vectors of derivatives, the computational cost of finite differences scales $D$ times worse than well-written derivative code, as well as being less accurate. Finite differences are a useful check, but not for use in production.
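A finite-difference check along these lines might look like the following sketch. The helper name `finite_diff_grad`, the synthetic data, and the cost and gradient functions being tested are assumptions for illustration. Any scalar function of a weight vector could be passed in instead.

```python
import numpy as np

def nll(w, X, y):
    """Logistic regression cost: -sum_n log sigma(z_n w^T x_n)."""
    z = 2 * y - 1
    return np.sum(np.logaddexp(0.0, -z * (X @ w)))

def nll_grad(w, X, y):
    """Hand-derived gradient: -sum_n (1 - sigma_n) z_n x_n."""
    z = 2 * y - 1
    sigma_n = 1.0 / (1.0 + np.exp(-z * (X @ w)))
    return -X.T @ ((1.0 - sigma_n) * z)

def finite_diff_grad(f, w, eps=1e-5):
    """Central differences, perturbing one parameter at a time (2*D calls to f)."""
    grad = np.zeros_like(w)
    for d in range(w.size):
        step = np.zeros_like(w)
        step[d] = eps / 2
        grad[d] = (f(w + step) - f(w - step)) / eps
    return grad

# Hypothetical synthetic data, only for the check
rng = np.random.default_rng(2)
X = rng.standard_normal((30, 4))
y = rng.integers(0, 2, size=30)
w = rng.standard_normal(4)

approx = finite_diff_grad(lambda v: nll(v, X, y), w)
exact = nll_grad(w, X, y)
max_diff = np.max(np.abs(approx - exact))
print(max_diff)   # should be tiny compared with the gradient entries
```

The loop makes the cost argument visible: one gradient needs $2D$ evaluations of the cost, whereas `nll_grad` does a similar amount of work to a single evaluation.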

Check your understanding