$$ \newcommand{\bphi}{\boldsymbol{\phi}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\g}{\,|\,} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} \newcommand{\tp}{\!+\!} $$

Logistic Regression

There is a lot more that could be said about linear regression. But I’m going to leave most of that for statistics courses. There is also a lot more that could be said about gradient-based optimization, and I’ll return to some of it later. For this note I want to add a non-linearity to our model, and introduce one of the most used machine learning models.

Transforming the output

We saw that we could attempt to fit the probability of being in a particular class with straightforward linear regression models. However, we are likely to see outputs outside the range \([0,1]\) for some test inputs. We will now force our function to lie in the desired \([0,1]\) range by transforming a linear function with a logistic sigmoid: \[ f(\bx; \bw) = \sigma(\bw^\top\bx) = \frac{1}{1+e^{-\bw^\top\bx}}\,. \] As with linear regression, we can replace our input features \(\bx\) with a vector of basis function values \(\bphi(\bx)\). I won’t clutter the notation with this detail. Wherever you see \(\bx\), know that you can replace this input vector with a version that has been transformed in any way you like.
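As a minimal sketch of the function above, here is the sigmoid-transformed linear model in plain Python. The names `sigmoid` and `predict`, and the example weights and inputs, are illustrative assumptions rather than anything defined in this note. The branch on the sign of `a` is one common way to avoid overflow in `exp` for large negative activations.

```python
import math

def sigmoid(a):
    """Logistic sigmoid 1 / (1 + exp(-a)), arranged to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    # For negative a, exp(-a) could overflow, so use the equivalent
    # form exp(a) / (1 + exp(a)).
    ea = math.exp(a)
    return ea / (1.0 + ea)

def predict(w, x):
    """f(x; w) = sigmoid(w^T x) for weights w and input x (lists of floats)."""
    return sigmoid(sum(wd * xd for wd, xd in zip(w, x)))

# Hypothetical weights and inputs, chosen only for illustration.
w = [2.0, -1.0]
for x in ([0.0, 0.0], [3.0, 1.0], [-50.0, 400.0]):
    print(x, predict(w, x))
```

However extreme the activation, the output stays inside \([0,1]\), which is the point of the transformation.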

Loss function

As before, we wish to fit the function to match the training data closely. If the labels are zero and one, \(y\in\{0,1\}\), we could minimize the squared loss \[ \sum_{n=1}^N \big(y^{(n)} - f(\bx^{(n)}; \bw)\big)^2. \] However, the Logistic Regression model uses the interpretation of the function as a probability, \(f(\bx; \bw) = P(y\te1\g \bx,\bw)\), more directly. Maximum likelihood fitting of this model minimizes the negative log-probability of the training labels, which could be written as: \[ \text{NLL} = - \sum_{n=1}^N \log \left[\sigma(\bw^\top\bx^{(n)})^{y^{(n)}}(1 \tm \sigma(\bw^\top\bx^{(n)}))^{1 - y^{(n)}} \right], \] or \[ \text{NLL} = - \sum_{n: y^{(n)}=1} \log \sigma(\bw^\top\bx^{(n)}) - \sum_{n: y^{(n)}=0} \log (1 - \sigma(\bw^\top\bx^{(n)})), \] or using labels \(z^{(n)}\in\{-1,+1\}\) where \(z^{(n)}=(2y^{(n)}\tm1)\), and noticing \(\sigma(-a) = 1 - \sigma(a)\): \[ \text{NLL} = - \sum_{n=1}^N \log \sigma(z^{(n)}\bw^\top\bx^{(n)}). \] As before, either cost function can have a regularizer added to it to discourage extreme weights.
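The three ways of writing the NLL can be checked against each other numerically. The sketch below is plain Python with made-up data; the function names, the toy inputs `X`, labels `y`, and weights `w` are all assumptions for illustration. It uses \(\log(1 - \sigma(a)) = \log\sigma(-a)\) and a `log_sigmoid` helper that avoids overflow.

```python
import math

def log_sigmoid(a):
    """log(sigmoid(a)), arranged so that exp never overflows."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))

def dot(w, x):
    return sum(wd * xd for wd, xd in zip(w, x))

def nll_product_form(w, X, y):
    # First form: -sum_n log[ sigma^y (1 - sigma)^(1 - y) ]
    total = 0.0
    for x, yn in zip(X, y):
        s = math.exp(log_sigmoid(dot(w, x)))
        total -= math.log(s**yn * (1.0 - s)**(1 - yn))
    return total

def nll_split_form(w, X, y):
    # Second form: separate sums over the y=1 and y=0 examples,
    # using log(1 - sigma(a)) = log sigma(-a).
    ones = sum(log_sigmoid(dot(w, x)) for x, yn in zip(X, y) if yn == 1)
    zeros = sum(log_sigmoid(-dot(w, x)) for x, yn in zip(X, y) if yn == 0)
    return -ones - zeros

def nll_signed_form(w, X, y):
    # Third form: labels z = 2y - 1 in {-1, +1}.
    return -sum(log_sigmoid((2 * yn - 1) * dot(w, x)) for x, yn in zip(X, y))

# Hypothetical toy data: first feature is a constant 1 (bias term).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
w = [0.3, -0.8]

print(nll_product_form(w, X, y))
print(nll_split_form(w, X, y))
print(nll_signed_form(w, X, y))
```

The three printed values should agree up to floating-point rounding. The first form is the least robust numerically, because it exponentiates and multiplies probabilities before taking the log.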

Maximum likelihood estimation has some good statistical properties. In particular, asymptotically (for lots of data) it is the most efficient estimator. However, the loss can be extreme where confident wrong predictions are made, which could mean that outliers cause more problems than with the square loss approach.


The final required ingredient is the gradient vector \(\nabla_\bw \text{NLL}\). A gradient-based optimizer can then find the weights that minimize our cost.

The derivative of the logistic sigmoid is as follows[1]: \[ \pdd{\sigma(a)}{a} = \sigma(a)(1-\sigma(a)), \] which tends to zero at the asymptotes \(a\rightarrow \pm\infty\).

I like to derive the derivatives using the third form of the cost function NLL, because it’s shorter, although I think most books use the first form. We’ll get an equivalent answer. For brevity, I’ll use \(\sigma_n = \sigma(z^{(n)}\bw^\top\bx^{(n)})\). We then apply the chain rule: \[\begin{aligned} \nabla_\bw \text{NLL} &= - \sum_{n=1}^N \nabla_\bw \log \sigma_n = - \sum_{n=1}^N \frac{1}{\sigma_n} \nabla_\bw \sigma_n = - \sum_{n=1}^N \frac{1}{\sigma_n} \sigma_n (1-\sigma_n) \nabla_\bw z^{(n)}\bw^\top\bx^{(n)},\\ &= - \sum_{n=1}^N (1-\sigma_n) z^{(n)}\bx^{(n)}. \end{aligned}\]
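The final expression translates directly into code. The sketch below (plain Python, with hypothetical function names and toy data) implements the gradient in the \(z\in\{-1,+1\}\) form derived above, and compares it with the form usually obtained from \(\{0,1\}\) labels, \(\sum_n (\sigma(\bw^\top\bx^{(n)}) - y^{(n)})\,\bx^{(n)}\), which follows by substituting \(z^{(n)} = 2y^{(n)} - 1\) and \(\sigma(-a) = 1 - \sigma(a)\).

```python
import math

def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    ea = math.exp(a)
    return ea / (1.0 + ea)

def dot(w, x):
    return sum(wd * xd for wd, xd in zip(w, x))

def nll_grad(w, X, z):
    """Gradient from the derivation: -sum_n (1 - sigma_n) z_n x_n."""
    grad = [0.0] * len(w)
    for x, zn in zip(X, z):
        sigma_n = sigmoid(zn * dot(w, x))  # probability of the correct label
        for d, xd in enumerate(x):
            grad[d] -= (1.0 - sigma_n) * zn * xd
    return grad

def nll_grad_01(w, X, y):
    """Same gradient with labels y in {0, 1}: sum_n (sigma(w^T x_n) - y_n) x_n."""
    grad = [0.0] * len(w)
    for x, yn in zip(X, y):
        err = sigmoid(dot(w, x)) - yn
        for d, xd in enumerate(x):
            grad[d] += err * xd
    return grad

# Hypothetical toy data: first feature is a constant 1 (bias term).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
y = [1, 0, 1, 0]
z = [2 * yn - 1 for yn in y]
w = [0.3, -0.8]

print(nll_grad(w, X, z))
print(nll_grad_01(w, X, y))
```

Both functions should print the same vector up to rounding.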

Interpretation: \(\sigma_n\) is the probability assigned to the correct training label. So when the classifier is confident and correct on an example, it contributes little to the gradient. Stochastic gradient descent will improve the examples the classifier gets wrong, or is less confident about, by pushing the weights parallel to the direction of the corresponding input.
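To make the interpretation concrete, here is a minimal stochastic gradient descent sketch in plain Python. The step size, number of updates, random seed, and toy data are arbitrary assumptions for illustration, not recommended settings. Each update moves the weights along \(z^{(n)}\bx^{(n)}\), scaled by \((1-\sigma_n)\), so confidently correct examples barely move the weights.

```python
import math
import random

def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    ea = math.exp(a)
    return ea / (1.0 + ea)

def dot(w, x):
    return sum(wd * xd for wd, xd in zip(w, x))

def nll(w, X, z):
    return -sum(math.log(sigmoid(zn * dot(w, x))) for x, zn in zip(X, z))

def sgd(X, z, step=0.1, num_updates=2000, seed=0):
    """Plain SGD on the NLL, one randomly chosen example per update."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for _ in range(num_updates):
        n = rng.randrange(len(X))
        x, zn = X[n], z[n]
        sigma_n = sigmoid(zn * dot(w, x))
        # Step against the per-example gradient -(1 - sigma_n) z_n x_n.
        w = [wd + step * (1.0 - sigma_n) * zn * xd for wd, xd in zip(w, x)]
    return w

# Hypothetical 1-D inputs with a bias feature; labels are not separable.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
z = [-1, -1, +1, -1, +1, +1]

w0 = [0.0, 0.0]
w_fit = sgd(X, z)
print("NLL at zero weights:", nll(w0, X, z))
print("NLL after SGD:      ", nll(w_fit, X, z))
```

With a fixed step size the weights keep jittering around the minimum rather than converging exactly; decaying the step size is the usual remedy.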

Some things to know about gradients

Whenever we can compute a differentiable cost function, it should always be possible to compute all of the derivatives at once in a similar number of operations to one function evaluation. That’s an amazing result from the old field of automatic differentiation. (Caveat: it might take a lot of memory.) So if your derivatives are orders of magnitude more expensive than your cost function, you are probably doing something wrong.

Despite a long history, very few people use fully automatic differentiation. There are machine learning tools like Theano and TensorFlow that will do most of the work for you. But some people[2] still need to do some work by hand so that those tools can work on all the models that we might build.

Whether you are differentiating by hand, or writing a compiler to compute derivatives, you need to test your code. Derivatives are easily checked by finite differences: \[ \pdd{f(w)}{w} \approx \frac{f(w\tp\epsilon/2) - f(w\tm\epsilon/2)}{\epsilon}, \] and so should always be checked. Unless the weights are extreme, I’d normally set \(\epsilon\te10^{-5}\). You have to perturb each parameter in turn to evaluate one element of a gradient vector \(\nabla_\bw f(\bw)\). Therefore, finite differences scale \(D\) times worse than well-written derivative code, as well as being less accurate. They’re a useful check, but not for use in production.
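A finite-difference check along these lines might look as follows. This is a sketch in plain Python; `max_grad_error`, the toy data, and the deliberately broken gradient are assumptions made for illustration. It perturbs each weight in turn by \(\pm\epsilon/2\) and reports the largest disagreement with the analytic gradient of the NLL.

```python
import math

def sigmoid(a):
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    ea = math.exp(a)
    return ea / (1.0 + ea)

def dot(w, x):
    return sum(wd * xd for wd, xd in zip(w, x))

def nll(w, X, z):
    return -sum(math.log(sigmoid(zn * dot(w, x))) for x, zn in zip(X, z))

def nll_grad(w, X, z):
    grad = [0.0] * len(w)
    for x, zn in zip(X, z):
        sigma_n = sigmoid(zn * dot(w, x))
        for d, xd in enumerate(x):
            grad[d] -= (1.0 - sigma_n) * zn * xd
    return grad

def max_grad_error(f, grad_f, w, eps=1e-5):
    """Largest absolute gap between grad_f(w) and central finite differences."""
    analytic = grad_f(w)
    worst = 0.0
    for d in range(len(w)):  # one pair of function evaluations per parameter
        w_plus = list(w)
        w_minus = list(w)
        w_plus[d] += eps / 2
        w_minus[d] -= eps / 2
        numeric = (f(w_plus) - f(w_minus)) / eps
        worst = max(worst, abs(numeric - analytic[d]))
    return worst

# Hypothetical toy data: first feature is a constant 1 (bias term).
X = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
z = [+1, -1, +1, -1]
w = [0.3, -0.8]

f = lambda w: nll(w, X, z)
good = lambda w: nll_grad(w, X, z)
bad = lambda w: [-g for g in nll_grad(w, X, z)]  # deliberate sign bug

print("correct gradient, max error:", max_grad_error(f, good, w))
print("buggy gradient, max error:  ", max_grad_error(f, bad, w))
```

The correct gradient should agree with the finite differences to many decimal places, while the sign bug shows up as an error comparable in size to the gradient itself.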

Further Reading

All machine learning textbooks should have a treatment of logistic regression. You could read an alternative treatment to this note in Barber 17.4.1 and 17.4.4. Or Murphy: quick introduction section 1.4.6, then Chapter 8, which has an in-depth treatment beyond this note.

Tom Minka has a review of alternative batch optimizers for logistic regression.
(Stochastic gradient methods were less popular then, and were not considered.)

For a large-scale practical tool that uses stochastic optimization, check out Vowpal Wabbit:
Its framework includes support for logistic regression. It has various tricks to train fast. It can cope with data, like large-scale text corpora, where you might not know what you want your features to be until you start streaming data.

We don’t have to transform a linear function with the logistic sigmoid. We could instead create a ‘probit’ model, which uses a Gaussian’s cumulative distribution function (cdf) instead of the logistic sigmoid. We can also transform the function to model other data types. For example, count data can be modeled by using the underlying linear function to set the log-rate of a Poisson distribution. These alternatives can be unified as “Generalized Linear Models” (GLMs). R has a widely used glm function.
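As a rough sketch of these two alternatives, in plain Python with hypothetical function names and example values: the probit model swaps the logistic sigmoid for the standard Gaussian cdf, \(\Phi(a) = \frac{1}{2}\big(1 + \mathrm{erf}(a/\sqrt{2})\big)\), and the Poisson model uses \(\bw^\top\bx\) as the log of the rate \(\lambda\), giving \(\log P(k) = k\log\lambda - \lambda - \log k!\) for a count \(k\).

```python
import math

def dot(w, x):
    return sum(wd * xd for wd, xd in zip(w, x))

def probit(a):
    """Standard Gaussian cdf, used in place of the logistic sigmoid."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def probit_predict(w, x):
    """P(y=1 | x, w) under a probit model."""
    return probit(dot(w, x))

def poisson_log_prob(k, w, x):
    """log P(k | x, w) for a count k, where the log-rate is w^T x."""
    log_rate = dot(w, x)
    # log k! is computed as lgamma(k + 1).
    return k * log_rate - math.exp(log_rate) - math.lgamma(k + 1)

# Hypothetical weights and input, chosen only for illustration.
w = [0.5, 1.0]
x = [1.0, 0.3]
print("probit P(y=1):", probit_predict(w, x))
print("Poisson P(k=2):", math.exp(poisson_log_prob(2, w, x)))
```

In both cases fitting would proceed as for logistic regression: write down the negative log-likelihood of the training data and minimize it with a gradient-based optimizer.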

There is a very cool trick with complex numbers to evaluate derivatives to machine precision, which I’d like to share:
It is no faster than finite differences though, so shouldn’t be used, except as a check.
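The link for this trick has not survived here, so the following is an assumption about what is meant: the complex-step approximation \(f'(x) \approx \mathrm{Im}\, f(x + ih)/h\) for a tiny real \(h\). Because no subtraction of nearly equal numbers is involved, \(h\) can be made extremely small without rounding problems. The sketch below (plain Python, hypothetical function names) applies it to the logistic sigmoid and compares against \(\sigma(a)(1-\sigma(a))\). It only works when the function is built from operations that accept complex arguments.

```python
import cmath
import math

def sigmoid_complex(a):
    """Logistic sigmoid written with cmath so it accepts complex inputs."""
    return 1.0 / (1.0 + cmath.exp(-a))

def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) as Im f(x + ih) / h (assumed form of the trick)."""
    return f(complex(x, h)).imag / h

a = 0.7
s = 1.0 / (1.0 + math.exp(-a))
print("complex step:", complex_step_derivative(sigmoid_complex, a))
print("analytic:    ", s * (1.0 - s))
```

As with finite differences, each parameter must be perturbed in turn, so the cost still scales with the number of parameters.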

  1. It’s not hard to show, but I’d give you this result in an exam if you needed it.

  2. Your lecturer is one of them: https://arxiv.org/abs/1602.07527.