$$\notag \newcommand{\ba}{\mathbf{a}} \newcommand{\bff}{\mathbf{f}} \newcommand{\bw}{\mathbf{w}} \newcommand{\bx}{\mathbf{x}} \newcommand{\by}{\mathbf{y}} \newcommand{\g}{\,|\,} \newcommand{\nth}{^{(n)}} \newcommand{\pdd}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\te}{\!=\!} \newcommand{\tm}{\!-\!} \newcommand{\ttimes}{\!\times\!} $$

Softmax and robust regressions

In the previous note we motivated logistic regression as a modification of linear regression that works better for binary labels. The main trade-off was that logistic regression can no longer use a standard linear least squares solver, but instead runs an iterative optimizer to convergence.

In this note we give two more examples where we modify the least squares cost. In each case we write down an alternative model of our target outputs, find a suitable cost function, and then differentiate it so that we can fit the parameters with a gradient-based method.

1 Softmax regression

Softmax1 regression is a generalization of logistic regression to cases with more than two labels. Some textbooks will simply call this generalization “logistic regression” as well.

If there are \(K\) possible classes, we will model labels with a length-\(K\) one-hot encoding: \[ \by = [0~~0~~\ldots~~0~~1~~0~~\ldots~~0], \] where if data-point \(n\) has label \(c\), then \(y_k^{(n)} = \delta_{kc}\), where \(\delta_{kc}\) is a Kronecker delta.
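
For concreteness, here is a minimal NumPy sketch of building the one-hot target matrix from integer labels; the function name one_hot and the example labels are just for illustration.

```python
import numpy as np

# A sketch of building the one-hot target matrix from integer labels
# c in {0, ..., K-1}: one row per datapoint, with a single 1 per row.
def one_hot(labels, K):
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels] = 1
    return Y

Y = one_hot(np.array([2, 0, 1]), K=3)
# array([[0., 0., 1.],
#        [1., 0., 0.],
#        [0., 1., 0.]])
```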

We could attempt to predict the elements of \(\by\) one at a time, with \(K\) separate one-vs-rest classifiers. We would need a method to combine the outputs of these classifiers. Instead we’ll fit a system that is explicit about how to predict the whole vector.

We will make a system with parameters \(W\) that outputs a vector \(\bff\). We know that the target vector \(\by\) contains a single one. We interpret the vector output \(\bff\) as a probability distribution over which bit in the target vector \(\by\) is turned on, giving the model: \[ P(y_k\te1\g \bx, W) = f_k(\bx; W). \] We can start by using a separate regression vector \(\bw^{(k)}\) for each class, and create a positive score by exponentiating a linear combination of features: \[ s_k = e^{(\bw^{(k)})^\top\bx}. \] Moving the features \(\bx\) in the direction \(\bw^{(k)}\) increases the score for class \(k\). In particular, if \(w^{(k)}_d\) is positive, then large values of \(x_d\) give class \(k\) a large score.

However, probabilities have to add up to one, so our output vector is normalized: \[ f_k = \frac{s_k}{\sum_{k'} s_{k'}} = \frac{e^{(\bw^{(k)})^\top\bx}}{\sum_{k'}e^{(\bw^{(k')})^\top\bx}}. \] Here \(k'\) is a dummy index used to sum over all of the classes. Our vector-valued function \(\bff\) is parameterized by \(W=\{\bw^{(k)}\}_{k=1}^K\). The classes now compete with each other. If \(w^{(k)}_d\) is large, then large \(x_d\) doesn’t necessarily favour class \(k\). The normalization means that the probability of class \(k\) could be small if \(w^{(j)}_d\) is larger for another class \(j\).

If \(W\) is a \(K\ttimes D\) matrix, we might write the above function as \(\bff(\mathbf{x}; W) = \mathrm{softmax}(W\mathbf{x})\).
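
For concreteness, a minimal NumPy sketch of this function follows, assuming \(W\) is stored as a \(K\ttimes D\) array; the helper names are just for illustration. Subtracting the maximum score before exponentiating doesn’t change the answer (a constant shift of the scores cancels in the ratio) but avoids numerical overflow.

```python
import numpy as np

def softmax(a):
    # Subtracting the maximum doesn't change the result (a constant shift
    # of the scores cancels in the ratio), but avoids overflow in exp().
    a = a - np.max(a)
    s = np.exp(a)
    return s / np.sum(s)

def predict_probs(W, x):
    # f(x; W) for a K x D weight matrix W and length-D input x:
    # a length-K vector of probabilities summing to one.
    return softmax(W @ x)
```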

1.1 Fitting the model

We want our function evaluated at a training input, \(\bff(\bx^{(n)})\), to predict the corresponding target \(\by^{(n)}\). At best, they are equal, in which case we confidently predict the correct label. However, if the labels are inherently noisy, so that the same input could be labelled differently, making perfect predictions every time is impossible.

Matching targets \(\by\) and function values \(\bff\) by least squares would be consistent: if a setting of the parameters can make \(\bff\) match the probabilities of the labels under the real process that is generating the data, then we will fit those parameters given infinite data. However, “maximum likelihood” estimation is also consistent, and for well-specified models has faster asymptotic convergence. Compared to least squares, maximum likelihood heavily penalizes confident but wrong function values, which may or may not be seen as desirable. (Consider the maximum loss that can be obtained on a single example under the two cost functions.)

We first find the gradients of the model’s log-probability of a single observation: a label \(c\) when the input features are \(\bx\): \[\begin{align} \log P(y_c\te 1\g \bx, W) = \log f_c &= (\bw^{(c)})^\top\bx - \log \sum_{k'} e^{(\bw^{(k')})^\top\bx}\\ \nabla_{\bw^{(k)}} \log f_c &= \delta_{kc}\bx - \frac{1}{\sum_{k'} e^{(\bw^{(k')})^\top\bx}} \,\bx\, e^{(\bw^{(k)})^\top\bx}\\ &= (y_k - f_k)\,\bx. \end{align}\] In the last line we have substituted \(y_k\te\delta_{kc}\) and the definition of \(f_k\).2

To apply stochastic gradient ascent3 on the log-likelihood we would take a training example \((\bx^{(n)},\by^{(n)})\) and move the weight vector for each class parallel to the input \(\bx^{(n)}\). The size and sign of the move depends on the disparity between the binary output \(y_k^{(n)}\) and the prediction \(f_k\).

Alternatively we could use an optimizer that uses a batch of examples, requiring the gradient \[\nabla_{\bw^{(k)}} \sum_n \log f_{c\nth}\nth = \sum_n (y_k^{(n)} - f_k(\bx^{(n)}))\,\bx^{(n)}, \] or minus that value for the gradient of the batch’s negative log probability.
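
As a hedged sketch of how these batch quantities might be computed (the array shapes and function names are my own choices): with X an \(N\ttimes D\) matrix of inputs, Y the \(N\ttimes K\) one-hot targets, and W the \(K\ttimes D\) weight matrix, the batch negative log probability and its gradient could be evaluated as follows.

```python
import numpy as np

def softmax_rows(A):
    A = A - A.max(axis=1, keepdims=True)   # stabilizing shift, doesn't change answer
    S = np.exp(A)
    return S / S.sum(axis=1, keepdims=True)

def nll_and_grad(W, X, Y):
    """Negative log-likelihood of N examples and its gradient.

    W: K x D weights, X: N x D inputs, Y: N x K one-hot targets.
    """
    F = softmax_rows(X @ W.T)          # N x K matrix, row n is f(x^(n); W)
    nll = -np.sum(Y * np.log(F))       # - sum_n log f_{c^(n)}
    grad_W = -(Y - F).T @ X            # K x D: minus sum_n (y^(n) - f^(n)) x^(n)^T
    return nll, grad_W
```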

1.2 Redundant parameters and binary logistic regression

For the special case of two classes, the softmax classifier gives: \[\begin{align} P(y\te1\g \bx, W) &= \frac{e^{(\bw^{(1)})^\top\bx}}{e^{(\bw^{(1)})^\top\bx} + e^{(\bw^{(2)})^\top\bx}}\\ &= \frac{1}{1 + e^{(\bw^{(2)} - \bw^{(1)})^\top\bx}} = \sigma\big((\bw^{(1)} - \bw^{(2)})^\top\bx\big). \end{align}\] The prediction only depends on the vector \((\bw^{(1)} - \bw^{(2)})\). In logistic regression we normally fit that single vector directly as “\(\bw\)”.

We can fit two vectors, as in the softmax classifier formulation. The parameters are then not well specified: adding a constant vector \(\ba\) to both classes’ weight vectors gives the same predictions. However, if we add a regularization term to our cost function, the cost function for this model will be minimized by a unique setting of the weights.

Alternatively we could set the weight vector for one of the classes to zero, and fit the rest of the parameters. For example, we recover logistic regression by setting \(\bw^{(2)}\te\mathbf{0}\).
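
These equivalences are easy to check numerically. In the sketch below (random values purely illustrative), all four computed probabilities agree up to floating-point error.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def softmax(a):
    a = a - np.max(a)
    s = np.exp(a)
    return s / np.sum(s)

rng = np.random.default_rng(0)
w1, w2, a, x = rng.standard_normal((4, 5))   # two weight vectors, a shift, an input

p_two_vectors = softmax(np.array([w1 @ x, w2 @ x]))[0]              # softmax form
p_shifted     = softmax(np.array([(w1 + a) @ x, (w2 + a) @ x]))[0]  # shift both classes
p_logistic    = sigmoid((w1 - w2) @ x)                              # logistic regression
p_second_zero = softmax(np.array([(w1 - w2) @ x, 0.0]))[0]          # second vector set to 0
# All four probabilities agree up to floating-point error.
```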

2 A Robust Model

Real data is often imperfect. The labels attached to training examples may be incorrect. The logistic/softmax models can model noisy labels by reducing the magnitude of the weight vector. However, a corrupted example with an extreme input \(\bx\) could force the weight vector in some direction to be small, even if many other examples, with less extreme \(\bx\), indicate otherwise. This problem occurs most readily for the negative log-likelihood loss, because a single example can have arbitrarily large loss.

There are various solutions to this issue. We could limit the magnitude of the feature vector: some practitioners only use binary features, or scale their features to have length one. We could try to identify outliers and discard them. We could intervene in the optimization procedure, and limit the size of the update from any individual example, or group of examples.

In the rest of this section we take a probabilistic modelling approach. We take the existing model of the labels, and modify it to reflect the fact that we think corruptions could happen. We can then derive a new expression for the log probability and proceed as before. The lesson isn’t necessarily to use this particular robust classifier — I haven’t seen it widely used in practice for generic machine learning tasks. The lesson is that writing down a probabilistic model for your data can suggest ways to improve it. After modifying the model we immediately get a new negative log-likelihood cost function, from which we can mechanically obtain gradients and so quickly get a new learning algorithm.

We will assume that a binary choice \(m\in\{0,1\}\) was made for each observation, about whether to corrupt the label: \[ P(m) = \text{Bernoulli}(m;\, 1\tm\epsilon) = \begin{cases} 1-\epsilon & m=1\\ \epsilon & m=0. \end{cases} \] With probability \((1\tm\epsilon)\) the model sets \(m\te1\) and generates a label using the normal logistic regression process (to keep notation simple, I’ll assume binary classification). Otherwise, with probability \(\epsilon\), the model sets \(m\te0\) and picks the label uniformly at random: \[ P(y\te1\g\bx,\bw,m) = \begin{cases} \sigma(\bw^\top\bx) & m=1\\ \frac{1}{2} & m=0. \end{cases} \] If we knew the indicator variable \(m\) for each example, we’d use it as a mask: we’d ignore the irrelevant points where \(m\te0\), and fit the remaining points as usual. However, because we don’t observe the indicator variable \(m\), we need to write an expression for the probability of just the label we observe. To include the unknown choice \(m\) on the right-hand side, we need to “marginalize” (sum) it out using the sum rule: \[\begin{align} P(y\te1\g\bx,\bw) &= \sum_{m\in\{0,1\}} P(y\te1,m\g\bx,\bw)\\ &= \sum_{m\in\{0,1\}} P(y\te1\g\bx,\bw, m)\, P(m)\\ &= (1\tm\epsilon)\sigma(\bw^\top\bx) + \epsilon\frac{1}{2}. \end{align}\] In the second line we use the product rule. In general the last term would be \(P(m\g \bx,\bw)\), but in this model we decided to make the choice independent of the input position and the weights.
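
A small sketch of this marginalization for a single input is below; the function and variable names are just for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def prob_y1(ww, x, epsilon):
    # Sum rule: P(y=1 | x, w) = sum_m P(y=1 | x, w, m) P(m)
    p_y1_given_m = {1: sigmoid(ww @ x), 0: 0.5}
    p_m = {1: 1 - epsilon, 0: epsilon}
    return sum(p_y1_given_m[m] * p_m[m] for m in (0, 1))
    # equivalently: (1 - epsilon) * sigmoid(ww @ x) + epsilon / 2
```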

Now we need the gradients of the log probability. As stated in the previous note, it’s always possible to get these at reasonable cost. I rearranged them by hand to see if they were interpretable. As in the previous logistic regression note, I use \(\sigma_n = \sigma(z^{(n)}\bw^\top\bx^{(n)})\) as the probability of getting the \(n\)th label correct under standard logistic regression, where \(z^{(n)}\te(2y^{(n)}\tm1)\) is a \(\{-1,+1\}\) version of the label. I obtained the expression: \[ \pdd{\log P(z^{(n)}\g\bx^{(n)},\bw)}{\bw} = \frac{1}{1 + \frac{1}{2}\frac{\epsilon}{1-\epsilon}\frac{1}{\sigma_n}}\nabla_\bw \log \sigma_n, \] where \(\nabla_\bw \log \sigma_n = (1\tm\sigma_n) z^{(n)}\bx^{(n)}\) is the gradient of the log-probability for the original logistic regression model. As a check, we recover the original model when \(\epsilon\te0\). If we implemented this equation we could also check it with finite differences (see previous note).
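
Below is one possible implementation of this log probability and its gradient, together with the finite-difference check; all names and the random test data are illustrative only.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def robust_log_prob(ww, X, z, epsilon):
    # sum_n log[ (1 - eps) * sigma_n + eps/2 ],  sigma_n = sigma(z^(n) w'x^(n))
    sigma = sigmoid(z * (X @ ww))
    return np.sum(np.log((1 - epsilon) * sigma + epsilon / 2))

def robust_grad(ww, X, z, epsilon):
    # The rearranged gradient: a per-example factor times the original
    # logistic regression gradient (1 - sigma_n) z^(n) x^(n).
    sigma = sigmoid(z * (X @ ww))
    factor = 1 / (1 + 0.5 * (epsilon / (1 - epsilon)) / sigma)
    return X.T @ (factor * (1 - sigma) * z)

# Finite-difference check of the first element of the gradient:
rng = np.random.default_rng(1)
N, D, eps = 20, 3, 0.05
X = rng.standard_normal((N, D))
z = np.sign(rng.standard_normal(N))
ww = rng.standard_normal(D)
delta = 1e-5
ee = np.zeros(D)
ee[0] = delta
fd = (robust_log_prob(ww + ee, X, z, eps)
      - robust_log_prob(ww - ee, X, z, eps)) / (2 * delta)
# fd should closely match robust_grad(ww, X, z, eps)[0]
```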

In the original model, the gradient is large for an extreme \(\bx\) where the label is poorly predicted. In the robust model, the contribution from the gradient is small if the probability of being correct, \(\sigma_n\), is much smaller than the probability of a corruption, \(\epsilon\). Really extreme outliers will be nearly ignored.

We could set \(\epsilon\) by hand, for example to \(0.01\). We could try a grid of settings. Or we could optimize \(\epsilon\) along with \(\bw\) using gradient methods. Most gradient-based optimizers need the parameters to be unconstrained, whereas \(\epsilon\) is constrained to \([0,1]\). We can set \(\epsilon \te \sigma(b)\), and optimize \(b \te \text{logit}(\epsilon) \te \log \frac{\epsilon}{1-\epsilon}\) instead.
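
A minimal sketch of this reparameterization (the initial value \(0.01\) is just an example):

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

b = np.log(0.01 / (1 - 0.01))   # b = logit(epsilon) for an initial epsilon = 0.01
epsilon = sigmoid(b)            # maps any real b back into (0, 1)
```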

3 Some general principles

Transforming the range of numbers that a quantity can take is frequently useful in machine learning. So far in the course we have used \(\exp()\) to force a number to be positive, and \(\log()\) to make a positive number unconstrained. We used the logistic sigmoid \(\sigma()\) to make numbers lie in \([0,1]\), and now we’ve used its inverse, the \(\text{logit}()\), to make values in \([0,1]\) become unconstrained.

Where easily possible, we transform data to have a sensible distribution for a model. Sometimes we have to transform the model output to better match the data. Then if we apply an appropriate differentiable loss function, we can match the model to data with gradient methods. When we have identified a probabilistic model, the loss function can be negative log-probability. We might then see obvious ways that the model could be wrong, which may help us to derive improvements.

3.1 Convexity

We have a recipe for writing down a cost function. Identify a probabilistic model of the data, and minimize the negative log probability. But will we be able to minimize this cost function?

If the cost function is convex the numerical optimization literature provides methods where we can guarantee finding parameters that minimize the cost as much as possible. A function is convex if a straight line drawn between any two points on the cost function surface never goes below the surface. Algebraically:

\[C(\alpha\bw + (1\!-\!\alpha)\bw')\;\le\; \alpha C(\bw) + (1\!-\!\alpha)C(\bw'),\quad\text{for any~}\bw,\bw',0\le\alpha\le1.\]

If the inequality above is strict (\(<\) rather than \(\le\)), the function is strictly convex and has one unique minimum. Otherwise there could be a connected convex set of parameters, where the cost is equal, but as small as possible.

From the definition, it’s easy to show that a sum of convex functions is convex. So if the loss for each individual example is convex in the parameters, then the total training loss will be too.

For logistic regression, the loss, \(-\log\sigma((2y\tm 1)\bw^\top\bx)\), is a convex function of \(\bw\). We can guarantee we’ll find the maximum likelihood solution to within a small numerical tolerance.

For the robust regression model, \(-\log(\epsilon/2 + (1\tm\epsilon)\sigma(z\bw^\top\bx))\) is not convex for \(\epsilon>0\). Plotting the loss for a single example as a function of one weight gives a counterexample (see the sketch below). So, we can’t provide the same guarantees about being able to find the minimum of this loss. We can however try to reduce the loss function using gradient methods, and in practice we’re likely to find parameters with reasonable loss that generalize usefully to new examples. We just can’t guarantee that a better optimizer wouldn’t have found better parameters.
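
Here is a sketch of such a plot, for a single example with \(x\te1\), \(z\te+1\) and \(\epsilon\te0.1\). The loss flattens out for very negative weights because it is bounded above by \(-\log(\epsilon/2)\), so a chord drawn along that tail can dip below the curve.

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

# Robust loss for a single example with x = 1 and z = +1, as a function of w.
epsilon = 0.1
w = np.linspace(-10, 10, 401)
loss = -np.log(epsilon / 2 + (1 - epsilon) * sigmoid(w))

plt.plot(w, loss)
plt.xlabel('w')
plt.ylabel('negative log-probability')
plt.show()
```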

3.2 Why log probability?

You might have been wondering why we use the log probability to score our models, rather than evaluating and differentiating the probability itself.

One reason is that the log probability of a training set is a sum (rather than product) over examples. That usually makes the maths more convenient. Also we can approximate the sum with a subset to perform stochastic gradient descent.

We also use log probabilities to avoid numerical underflow. Most numerical software (including Matlab and NumPy) works with IEEE floating point. Even with “double” precision, the probability of a sequence of 1,000 coin tosses, \(0.5^{1000}\), underflows to zero. In contrast, we can compute \(1000\log0.5\) with no difficulty.
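
A quick check in NumPy:

```python
import numpy as np

p = 0.5 ** 1000               # underflows to exactly 0.0 in double precision
log_p = 1000 * np.log(0.5)    # about -693.1, no numerical problem
```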

Finally, the log probability is more appropriate for numerical optimizers. As an example, if we fit the mean of a Gaussian’s negative log PDF by gradient methods, we have a quadratic cost function, which is convex and the ideal cost function for most gradient-based optimization methods. The PDF itself is not convex and has values in a narrow numerical range near zero in the tails. The log-likelihoods for non-Gaussian models, such as logistic regression, are also sometimes approximately quadratic when there are many datapoints. Even if the model is not convex (like the robust model), optimizing the log-probability (which might look nearly convex, at least near an optimum) usually works better.

4 Check your understanding

Can you show that any softmax classifier with a weight vector for each class has an equivalent set of parameters with the weights for the final class set to zero, \(\bw^{(K)}\te\mathbf{0}\)?

How would you perform stochastic gradient descent on the models in this note if you wanted to apply L2 regularization of the weights?

A complete description of an algorithm includes any initialization conditions. How would you initialize the weights? Does it matter? What might happen numerically if you initialize with weights that have extremely large magnitude?

For keen students: We could write down a model that states that the standard logistic regression model with no label noise is correct for our data, but that the inputs \(\bx\) are noisy. Why did I not go that route? That is, if you write down a model that says \(\bx\) could be corrupted by noise, do you get stuck? If so, where and why?

5 For keen students

The readings for the previous note already mentioned that by reading about “generalized linear models” you could see more examples of probabilistic models for the outputs, beyond those we’ve seen.

Those who already know something about neural networks (or those returning to these notes later) might also think about adding probabilistic models for different types of output to neural networks:


  1. I believe the term softmax was coined by John S. Bridle (1990). Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition. In: F. Fogelman Soulié and J. Hérault (eds.), Neurocomputing: Algorithms, Architectures and Applications, Berlin: Springer-Verlag, pp. 227–236.

  2. It may take some time to digest all of the notation here. We index into the vector of function outputs using the observed label \(c\). We didn’t use \(k\) for this index so we can use \(k\) to index the weight vector we are computing gradients for. We need the gradients for \(k\!\ne\!c\) as well as \(k\te c\). The dummy index \(k'\) is used because we need to consider all of the weight vectors \(\{\bw^{(k')}\}_{k'=1}^K\), when computing the gradient with respect to a particular weight vector \(\bw^{(k)}\).

  3. Optimization texts normally talk about gradient descent and minimizing functions. Gradient descent on the negative log-likelihood amounts to the same thing. We take a negative step in the direction of the gradient of the negative log-likelihood, and the two negatives cancel.