$$ \newcommand{\ba}{\mathbf{a}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\g}{\,|\,} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} $$

Softmax and robust regressions

In the previous note we adapted linear regression to work better for binary labels. In this note we give another two examples where we modify the least squares cost. In each case we write down alternative models of our target outputs, find a suitable cost function and differentiate it.

Softmax regression

Softmax regression is a generalization of logistic regression to cases with more than two labels. Some textbooks will simply call this generalization “logistic regression” as well.

If there are \(K\) possible classes, we will model labels with a length-\(K\) one-hot encoding: \[ \by = [0~~0~~\ldots~~0~~1~~0~~\ldots~~0], \] where if data-point \(n\) has label \(c\), then \(y_k^{(n)} = \delta_{kc}\), where \(\delta_{kc}\) is a Kroneker delta.

We could attempt to predict the elements of \(\by\) one at a time, with \(K\) separate one-vs-rest classifiers, which would need combining in some way. Instead we’ll fit a system that is explicit about how to predict the whole vector.

We will make a system with parameters \(W\) that outputs a vector \(\bff\). We know that the target vector \(\by\) contains a single one. We interpret the vector output \(\bff\) as a probability distribution over which bit in the target vector \(\by\) is turned on, giving the model: \[ P(y_k\te1\g \bx, W) = f_k(\bx, W). \] We can start by using a separate regression vector \(\bw^{(k)}\) for each \(f_k\), forcing the number to be positive by exponentiating a linear combination of features: \[ f_k \propto e^{(\bw^{(k)})^\top\bx} \] Moving in the direction \(\bw^{(k)}\) makes it more probable under the model that a set of features belongs to class \(k\). In particular, if \(w^{(k)}_d\) is positive, then large values of \(x_d\) tend to support class \(k\).

However, probabilities have to add up to one, so we normalize the output vector: \[ f_k = \frac{e^{(\bw^{(k)})^\top\bx}}{\sum_{k'}e^{(\bw^{(k')})^\top\bx}}. \] Here \(k'\) is a dummy index used to sum over all of the classes. Our vector-valued function \(\bff\) is parameterized by \(W=\{\bw^{(k)}\}_{k=1}^K\). The classes now compete with each other. If \(w^{(k)}_d\) is large, then large \(x_d\) doesn’t necessarily favour class \(k\). The normalization means that the probability of class \(k\) could be small if \(w^{(j)}_d\) is larger for another class \(j\).

Fitting the model

We want our function evaluated at a training input, \(\bff(\bx^{(n)})\), to predict the corresponding target \(\by^{(n)}\). At best, they will be equal, where we confidently predict the correct the label. However, if the labels are inherently noisy, so that the same input could be labelled differently, making perfect predictions every time is impossible.

Matching targets \(\by\) and function values \(\bff\) by least squares would be consistent. If a setting of the parameters can make \(\bff\) give the probabilities of the labels under the process generating the data, then we will fit those parameters given infinite data. However, “maximum likelihood” estimation is also consistent, and for well-specified models has better asymptotic properties. Compared to least squares, maximum likelihood heavily penalizes confident but wrong function values, which may or may not be seen as desirable. (Consider the maximum loss that can be obtained on a single example under the two cost functions.)

We first find the gradient for the model’s log-probability of observing a single label \(c\) at location \(\bx\): \[\begin{aligned} \log f_c &= (\bw^{(c)})^\top\bx - \log \sum_{k'} e^{(\bw^{(k')})^\top\bx}\\ \nabla_{\bw^{(k)}} \log f_c &= \delta_{kc}\bx - \frac{1}{\sum_{k'} e^{(\bw^{(k')})^\top\bx}} \,\bx\, e^{(\bw^{(k)})^\top\bx}\\ &= (y_k - f_k)\,\bx. \end{aligned}\] In the last line we have substituted \(y_k\te\delta_{kc}\) and the definition of \(f_k\).

To apply stochastic gradient ascent1 on the log-likelihood we would take a training example \((\bx^{(n)},\by^{(n)})\) and move the weight vector for each class parallel to the input \(\bx^{(n)}\). The size and sign of the move depends on the disparity between the binary output \(y_k^{(n)}\) and the prediction \(f_k\).

Alternatively we could use an optimizer that uses a batch of examples, requiring the gradient \[\nabla_{\bw^{(k)}} \sum_n \log f_{c^n} = \sum_n (y_k^{(n)} - f_k(\bx^{(n)}))\,\bx^{(n)}. \] (or minus that value).

Redundant parameters and binary logistic regression

For the special case of two classes, the softmax classifier gives: \[\begin{aligned} P(y\te1\g \bx, W) &= \frac{e^{(\bw^{(1)})^\top\bx}}{e^{(\bw^{(1)})^\top\bx} + e^{(\bw^{(2)})^\top\bx}}\\ &= \frac{1}{1 + e^{(\bw^{(2)} - \bw^{(1)})^\top\bx}} = \sigma\big((\bw^{(1)} - \bw^{(2)})^\top\bx\big). \end{aligned}\] The prediction only depends on the vector \((\bw^{(1)} - \bw^{(2)})\). In logistic regression we normally fit that single vector directly as “\(\bw\)”.

We can fit two vectors, as in the softmax classifier formulations. The parameters are not well specified: adding a constant vector \(\ba\) to both class’s weight vectors gives the same predictions. However, if we add a regularization term to our cost function, the cost function for this model will be minimized by a unique setting of the weights.

Alternatively we could set the weight vector for one of the classes to zero, and fit the rest of the parameters. For example, recovering logistic regression by setting \(\bw^{(2)}\te\mathbf{0}\).

A Robust Model

Real data is often imperfect. The labels attached to training examples, may be incorrect. The logistic/softmax models can model noisy labels by reducing the magnitude of the weight vector. However, a corrupted example with an extreme input \(\bx\) could force the weight vector in some direction to be small, even if many other examples, with less extreme \(\bx\), indicate otherwise. This problem occurs most readily for the negative log-likelihood loss, because a single example can have arbitrarily large loss.

There are various solutions to this issue. We could limit the magnitude of the feature vector: some practitioners only use binary features, or scale their features to have length one. We could try to identify outliers and discard them. We could intervene in the optimization procedure, and limit the size of the update from any individual example, or group of examples.

In the rest of this section we take a probabilistic modelling approach. We take the existing model of the labels, and modify it to reflect the fact we think corruptions could happen. We can then derive a new expression for the log probability and proceed as before. The lesson isn’t necessarily to use this particular robust classifier — I haven’t seen it widely used in practice for generic machine learning tasks. The lesson is that writing down a probabilistic model for your data can suggest ways to improve it. After modifying the model we immediately get a new negative log-likelihood cost function, we can then mechanically obtain gradients and get a new learning algorithm quickly.

We will assume that a binary choice \(m\in\{0,1\}\) was made for each observation: \[ P(m) = \text{Bernoulli}(m, 1\tm\epsilon) = \begin{cases} 1-\epsilon & m=1\\ \epsilon & m=0. \end{cases} \] With probability \(1\tm\epsilon\) we set \(m\te1\) and generate a label using the normal logistic regression process (to keep notation simple, I’ll assume binary classification). Otherwise we pick the label uniformly at random: \[ P(y\te1\g\bx,\bw,m) = \begin{cases} \sigma(\bw^\top\bx) & m=1\\ \frac{1}{2} & m=0. \end{cases} \] As we don’t observe the indicators, we need to write an expression for the probability of just the labels we observe. We do that by introducing the unknown choices, and “marginalizing” (summing) them out again using the sum rule: \[\begin{aligned} P(y\te1\g\bx,\bw) &= \sum_{m\in\{0,1\}} P(y\te1,m\g\bx,\bw)\\ &= \sum_{m\in\{0,1\}} P(y\te1\g\bx,\bw, m)\, P(m)\\ &= (1\tm\epsilon)\sigma(\bw^\top\bx) + \epsilon\frac{1}{2}. \end{aligned}\] In the second line we use the product rule. In general the last term would be \(P(m\g \bx,\bw)\), but in this model we decided to make the choice independent of the input position and the weights.

Now we need the gradients of the log probability. As stated in the previous note, it’s always possible to get these at reasonable cost. I rearranged them by hand to see if they were interpretable. As in the previous logistic regression note, I wrote \(\sigma_n = \sigma(z^{(n)}\bw^\top\bx^{(n)})\) as the probability of getting the \(n\)th label correct, where \(z^{(n)}\te(2y\tm1)\) is a \(\{-1,+1\}\) version of the label. I obtained the expression: \[ \pdd{\log P(z^{(n)}\g\bx^{(n)},\bw)}{\bw} = \frac{1}{1 + \frac{1}{2}\frac{\epsilon}{1-\epsilon}\frac{1}{\sigma_n}}\nabla_\bw \log \sigma_n, \] where \(\nabla_\bw \log \sigma_n = (1\tm\sigma_n) z^{(n)}\bx^{(n)}\) was the gradient of the log-probability for the original logistic regression model. As a check, we recover the original model when \(\epsilon\te0\). If we implemented this equation we could also check it with finite differences (see previous note).

In the original model, the gradient is large for an extreme \(\bx\) where the label is poorly predicted. In the robust model, the contribution from the gradient is small if the probability of being correct, \(\sigma_n\), is much smaller than the probability of a corruption, \(\epsilon\). Really extreme outliers will nearly be ignored.

We could set \(\epsilon\) by hand, for example to \(0.01\). We could try a grid of settings. Or we could optimize \(\epsilon\) along with \(\bw\) using gradient methods. Most gradient-based optimizers need the parameters to be unconstrained, whereas \(\epsilon\) is constrained to \([0,1]\). We can state \(\epsilon \te \sigma(b)\), and optimize \(b \te \text{logit}(\epsilon) \te \log \frac{\epsilon}{1-\epsilon}\) instead.


Transforming the range of numbers that a quantity can take is frequently useful in machine learning. So far in the course we have used \(\exp\) to force a number to be positive, and \(\log\) to make a positive number unconstrained. We used the logistic sigmoid \(\sigma\) to make numbers lie in \([0,1]\), and now we’ve used its inverse, the \(\text{logit}\), to take make values in \([0,1]\) become unconstrained.

Where easily possible, we transform data to have a sensible distribution for a model. Sometimes we have to transform the model output to better match the data. Then if we apply an appropriate differentiable loss function, we can match the model to data with gradient methods. When we have identified a probabilistic model, the loss function can be negative log-probability. We might then see obvious ways that the model could be wrong, which may help us to derive improvements.

Check your understanding

Can you show that any softmax classifier with a weight vector for each class has an equivalent set of parameters with the weights for the final class set to zero, \(\bw^{(K)}\te\mathbf{0}\)?

How would you perform stochastic gradient descent on the models in this note if you wanted to apply L2 regularization of the weights?

A complete description of an algorithm includes any initialization conditions. How would you initialize the weights? Does it matter? What might happen numerically if you initialize with weights that have extremely large magnitude?

For keen students: We could write down a model that states that the standard logistic regression model with no label noise is correct for our data, but that the inputs \(\bx\) are noisy. Why did I not go that route? That is, if you write down a model that says \(\bx\) could be corrupted by noise, do you get stuck? If so, where and why?

Aside: why log probability?

You might have been wondering why we use the log probability to score our models, rather than evaluating and differentiating the probability itself.

One reason is that the log probability of a training set is a sum (rather than product) over examples. That usually makes the maths more convenient. Also we can approximate the sum with a subset to perform stochastic gradient descent.

We also use log probabilities to avoid numerical underflow. Most numerical software (including Matlab and NumPy) works with IEEE floating point. Even with “double” precision, the probability of a sequence of 1,000 coin tosses, \(0.5^{1000}\), underflows to zero. In contrast, we can compute \(1000\log0.5\) with no difficulty.

Finally, the log probability is more appropriate for numerical optimizers. As an example, if we fit the mean of a Gaussian’s negative log PDF by gradient methods, we have a quadratic cost function, which is convex and the ideal cost function for most gradient-based optimization methods. The PDF itself is not convex and has values in a narrow numerical range near zero in the tails. The log-likelihood for non-Gaussian models, such as logistic regression, is sometimes approximately Gaussian in shape when there are many datapoints.

  1. Optimization texts normally talk about gradient descent and minimizing functions. Gradient descent on the negative log-likelihood amounts to the same thing. We take a negative step in the direction of the gradient of the negative likelihood, and the two negatives cancel.