🏷️sec_softmax
In :numref:sec_linear_regression, we introduced linear regression,
working through implementations from scratch in :numref:sec_linear_scratch
and again using high-level APIs of a deep learning framework
in :numref:sec_linear_concise to do the heavy lifting.
Regression is the hammer we reach for when we want to answer "how much?" or "how many?" questions. If you want to predict the number of dollars (price) at which a house will be sold, or the number of wins a baseball team might have, or the number of days that a patient will remain hospitalized before being discharged, then you are probably looking for a regression model.
In practice, we are more often interested in classification: asking not "how much" but "which one":
- Does this email belong in the spam folder or the inbox?
- Is this customer more likely to sign up or not to sign up for a subscription service?
- Does this image depict a donkey, a dog, a cat, or a rooster?
- Which movie is Aston most likely to watch next?
Colloquially, machine learning practitioners overload the word classification to describe two subtly different problems: (i) those where we are interested only in hard assignments of examples to categories (classes); and (ii) those where we wish to make soft assignments, i.e., to assess the probability that each category applies. The distinction tends to get blurred, in part, because often, even when we only care about hard assignments, we still use models that make soft assignments.
🏷️subsec_classification-problem
To get our feet wet, let us start off with
a simple image classification problem.
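Here, each input consists of an image described by four scalar features $x_1, x_2, x_3, x_4$, and each image belongs to one of three categories.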
Next, we have to choose how to represent the labels.
We have two obvious choices.
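Perhaps the most natural impulse would be to choose $y \in \{1, 2, 3\}$, where each integer stands for one category. Integer labels are compact to store, and they are a sensible choice when the categories have some natural ordering.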
But general classification problems do not come with natural orderings among the classes.
Fortunately, statisticians long ago invented a simple way
to represent categorical data: the one-hot encoding.
A one-hot encoding is a vector with as many components as we have categories.
The component corresponding to a particular instance's category is set to 1
and all other components are set to 0.
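In our case, a label $\mathbf{y}$ would be a three-dimensional vector, with $(1, 0, 0)$, $(0, 1, 0)$, and $(0, 0, 1)$ each corresponding to one of the three categories:
$$ \mathbf{y} \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}. $$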
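The one-hot encoding described above can be sketched in a few lines of plain Python. The helper name `one_hot` and the zero-based ordering of the categories are assumptions made for illustration only.

```python
def one_hot(label_index, num_classes):
    """Return a list with a 1 at label_index and 0 everywhere else."""
    if not 0 <= label_index < num_classes:
        raise ValueError("label_index must lie in [0, num_classes)")
    return [1 if j == label_index else 0 for j in range(num_classes)]


# Three categories, indexed 0, 1, 2 (the index order is an assumption).
labels = [one_hot(j, 3) for j in range(3)]
print(labels)  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Each encoded label has exactly one nonzero component, so no ordering among the categories is implied.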
In order to estimate the conditional probabilities associated with all the possible classes,
we need a model with multiple outputs, one per class.
To address classification with linear models,
we will need as many affine functions as we have outputs.
Each output will correspond to its own affine function.
In our case, since we have 4 features and 3 possible output categories,
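we will need 12 scalars to represent the weights ($w$ with subscripts) and 3 scalars to represent the biases ($b$ with subscripts).
We compute three logits, $o_1$, $o_2$, and $o_3$, for each input:
$$ \begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{12} + x_3 w_{13} + x_4 w_{14} + b_1,\\ o_2 &= x_1 w_{21} + x_2 w_{22} + x_3 w_{23} + x_4 w_{24} + b_2,\\ o_3 &= x_1 w_{31} + x_2 w_{32} + x_3 w_{33} + x_4 w_{34} + b_3. \end{aligned} $$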
We can depict this calculation with the neural network diagram shown in :numref:fig_softmaxreg.
Just as in linear regression, softmax regression is also a single-layer neural network.
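And since the calculation of each output, $o_1$, $o_2$, and $o_3$, depends on all inputs, $x_1$, $x_2$, $x_3$, and $x_4$, the output layer of softmax regression can also be described as a fully connected layer.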
To express the model more compactly, we can use linear algebra notation.
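In vector form, we arrive at $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$, where all of the weights are gathered into a $3 \times 4$ matrix $\mathbf{W}$ and the biases into a vector $\mathbf{b}$ with three components. The outputs for a given example $\mathbf{x}$ are then a matrix-vector product of the weights and the input features, plus the biases.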
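As a rough illustration of the affine model with 4 features and 3 outputs, the following sketch computes the logits with one weight row and one bias per class. The function name `affine_logits` and all numeric values are made up for the example.

```python
def affine_logits(W, x, b):
    """Compute o = W x + b, where each row of W belongs to one output class."""
    return [
        sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
        for row, b_i in zip(W, b)
    ]


# Illustrative values only: 3 classes (rows) and 4 features (columns).
W = [
    [0.1, 0.2, 0.3, 0.4],
    [0.0, -0.1, 0.0, 0.1],
    [0.5, 0.5, -0.5, -0.5],
]
b = [0.0, 1.0, -1.0]
x = [1.0, 2.0, 3.0, 4.0]

o = affine_logits(W, x, b)
print(o)  # three logits, one per class
```

The 12 entries of `W` and the 3 entries of `b` are exactly the 15 scalar parameters counted in the text.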
The main approach that we are going to take here is to interpret the outputs of our model as probabilities. We will optimize our parameters to produce probabilities that maximize the likelihood of the observed data. Then, to generate predictions, we will set a threshold, for example, choosing the label with the maximum predicted probability.
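Put formally, we would like any output $\hat{y}_j$ to be interpreted as the probability that a given item belongs to class $j$. Then we can choose the class with the largest output value as our prediction.
You might be tempted to suggest that we interpret
the logits $\mathbf{o}$ directly as our outputs of interest.
However, nothing constrains these numbers to sum to 1, and depending on the inputs they can take negative values. Both violate the basic axioms of probability presented in :numref:sec_prob.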
To interpret our outputs as probabilities, we must guarantee that (even on new data), they will be nonnegative and sum up to 1. Moreover, we need a training objective that encourages the model to estimate probabilities faithfully. Of all instances when a classifier outputs 0.5, we hope that half of those examples will actually belong to the predicted class. This is a property called calibration.
The softmax function, invented in 1959 by the social scientist R. Duncan Luce in the context of choice models, does precisely this. To transform our logits such that they become nonnegative and sum to 1, while requiring that the model remains differentiable, we first exponentiate each logit (ensuring non-negativity) and then divide by their sum (ensuring that they sum to 1):
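$$ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \quad \text{where} \quad \hat{y}_j = \frac{\exp(o_j)}{\sum_k \exp(o_k)}. $$
:eqlabel:eq_softmax_y_and_o
It is easy to see that $\hat{y}_1 + \hat{y}_2 + \hat{y}_3 = 1$ with $0 \leq \hat{y}_j \leq 1$ for all $j$. Thus, $\hat{\mathbf{y}}$ is a proper probability distribution. Because the exponential function is monotonically increasing, the softmax operation does not change the ordering among the logits, so the most likely class is still the one with the largest logit.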
Although softmax is a nonlinear function, the outputs of softmax regression are still determined by an affine transformation of input features; thus, softmax regression is a linear model.
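A minimal sketch of the softmax operation follows: exponentiate each logit, then divide by the sum. Subtracting the largest logit before exponentiating is a common numerical-stability trick that the text does not discuss; it is included here as an assumption and does not change the result, since softmax is unaffected by adding the same constant to every logit.

```python
import math


def softmax(logits):
    """Map a list of logits to nonnegative values that sum to 1."""
    # Shift by the maximum so that exp() never receives a large positive
    # argument (stability trick, not part of the definition itself).
    m = max(logits)
    exps = [math.exp(o - m) for o in logits]
    total = sum(exps)
    return [e / total for e in exps]


o = [3.0, 1.2, -3.0]  # illustrative logits
y_hat = softmax(o)
print(y_hat, sum(y_hat))
```

The largest logit still receives the largest probability, so picking the most likely class gives the same answer before and after the softmax.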
🏷️subsec_softmax_vectorization
To improve computational efficiency and take advantage of GPUs,
we typically carry out vector calculations for minibatches of data.
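Assume that we are given a minibatch $\mathbf{X}$ of examples with feature dimensionality (number of inputs) $d$ and batch size $n$, and that we have $q$ categories in the output. Then the minibatch features satisfy $\mathbf{X} \in \mathbb{R}^{n \times d}$, the weights $\mathbf{W} \in \mathbb{R}^{d \times q}$, and the bias $\mathbf{b} \in \mathbb{R}^{1 \times q}$.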
$$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b}, \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}). \end{aligned} $$
:eqlabel:eq_minibatch_softmax_reg
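This accelerates the dominant operation into
a matrix-matrix product $\mathbf{X} \mathbf{W}$, as opposed to the matrix-vector products we would be executing if we processed one example at a time.
Since each row in $\mathbf{X}$ represents a data example, the softmax operation itself can be computed rowwise: for each row of $\mathbf{O}$, exponentiate all entries and then normalize them by the sum.
With the bias broadcast across rows during the summation $\mathbf{X} \mathbf{W} + \mathbf{b}$ in :eqref:eq_minibatch_softmax_reg,
both the minibatch logits $\mathbf{O}$ and the output probabilities $\hat{\mathbf{Y}}$ are matrices with one row per example and one column per class.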
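The minibatch computation can be sketched with nested lists standing in for matrices. This assumes the layout used in :eqref:eq_minibatch_softmax_reg, with one example per row of `X` and one class per column of `W`; the names `softmax_rows` and `minibatch_forward` and the numbers are hypothetical. A real implementation would hand the matrix product to a tensor library rather than to Python loops.

```python
import math


def softmax_rows(O):
    """Apply softmax independently to every row of O."""
    result = []
    for row in O:
        m = max(row)  # stability shift; does not change the output
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


def minibatch_forward(X, W, b):
    """Compute O = X W + b and Y_hat = softmax(O) for a whole minibatch."""
    n, d, q = len(X), len(W), len(b)
    # The same bias vector b is added to every row (broadcasting).
    O = [
        [sum(X[i][k] * W[k][j] for k in range(d)) + b[j] for j in range(q)]
        for i in range(n)
    ]
    return O, softmax_rows(O)


# Illustrative shapes: n = 2 examples, d = 4 features, q = 3 classes.
X = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0]]
W = [[0.1, 0.0, 0.5], [0.2, -0.1, 0.5], [0.3, 0.0, -0.5], [0.4, 0.1, -0.5]]
b = [0.0, 1.0, -1.0]

O, Y_hat = minibatch_forward(X, W, b)
print(len(O), len(O[0]))  # 2 3
```

Both `O` and `Y_hat` have one row per example and one column per class, and every row of `Y_hat` sums to 1.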
Next, we need a loss function to measure
the quality of our predicted probabilities.
We will rely on maximum likelihood estimation,
the very same concept that we encountered
when providing a probabilistic justification
for the mean squared error objective in linear regression
(:numref:subsec_normal_distribution_and_squared_loss).
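The softmax function gives us a vector $\hat{\mathbf{y}}$, which we can interpret as estimated conditional probabilities of each class given any input $\mathbf{x}$. Suppose that the entire dataset $\{\mathbf{X}, \mathbf{Y}\}$ has $n$ examples, where the example indexed by $i$ consists of a feature vector $\mathbf{x}^{(i)}$ and a one-hot label vector $\mathbf{y}^{(i)}$. We can compare the estimates with reality by checking how probable the actual classes are according to our model, given the features:
$$ P(\mathbf{Y} \mid \mathbf{X}) = \prod_{i=1}^n P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}). $$
According to maximum likelihood estimation,
we maximize $P(\mathbf{Y} \mid \mathbf{X})$, which is equivalent to minimizing the negative log-likelihood:
$$ -\log P(\mathbf{Y} \mid \mathbf{X}) = \sum_{i=1}^n -\log P(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)}) = \sum_{i=1}^n l(\mathbf{y}^{(i)}, \hat{\mathbf{y}}^{(i)}), $$
where for any pair of label $\mathbf{y}$ and model prediction $\hat{\mathbf{y}}$ over $q$ classes, the loss function $l$ is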
$$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_{j=1}^q y_j \log \hat{y}_j. $$
:eqlabel:eq_l_cross_entropy
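For reasons explained later on, the loss function in :eqref:eq_l_cross_entropy is commonly called the cross-entropy loss.
Since $\mathbf{y}$ is a one-hot vector of length $q$, the sum over all its coordinates $j$ vanishes for all but one term. Since all $\hat{y}_j$ are predicted probabilities, their logarithm is never larger than 0, so the loss is nonnegative and reaches 0 only when the model assigns probability 1 to the actual label.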
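To make the loss in :eqref:eq_l_cross_entropy concrete, here is a small sketch that evaluates it for a one-hot label and two made-up prediction vectors. Terms with $y_j = 0$ are skipped, which matches the convention that they contribute nothing to the sum; the function name and the numbers are illustrative assumptions.

```python
import math


def cross_entropy(y, y_hat):
    """Compute l(y, y_hat) = -sum_j y_j * log(y_hat_j)."""
    # Skip terms with y_j == 0 so that log(0) is never evaluated for them.
    return -sum(
        y_j * math.log(p_j) for y_j, p_j in zip(y, y_hat) if y_j > 0
    )


y = [0, 1, 0]                  # one-hot label: the second class is correct
confident = [0.05, 0.9, 0.05]  # illustrative predictions
unsure = [0.4, 0.3, 0.3]

print(cross_entropy(y, confident))  # small loss
print(cross_entropy(y, unsure))     # larger loss
```

With a one-hot label, only the predicted probability of the correct class matters, and the loss grows as that probability shrinks.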
🏷️subsec_softmax_and_derivatives
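Since the softmax and the corresponding loss are so common,
it is worth understanding a bit better how they are computed.
Plugging :eqref:eq_softmax_y_and_o into the definition of the loss
in :eqref:eq_l_cross_entropy, we obtain:
$$ \begin{aligned} l(\mathbf{y}, \hat{\mathbf{y}}) &= - \sum_{j=1}^q y_j \log \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} \\ &= \sum_{j=1}^q y_j \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j \\ &= \log \sum_{k=1}^q \exp(o_k) - \sum_{j=1}^q y_j o_j. \end{aligned} $$
The last step uses the fact that the entries of $\mathbf{y}$ sum to 1.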
To understand a bit better what is going on,
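consider the derivative of the loss with respect to any logit $o_j$. We get
$$ \partial_{o_j} l(\mathbf{y}, \hat{\mathbf{y}}) = \frac{\exp(o_j)}{\sum_{k=1}^q \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j. $$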
In other words, the derivative is the difference
between the probability assigned by our model,
as expressed by the softmax operation,
and what actually happened, as expressed by elements in the one-hot label vector.
In this sense, it is very similar to what we saw in regression,
where the gradient was the difference
between the observation $y$ and the estimate $\hat{y}$.
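The claim that the derivative equals the predicted probability minus the label can be checked numerically. The sketch below writes the loss directly in terms of the logits, as $\log \sum_k \exp(o_k) - \sum_j y_j o_j$ (which follows from substituting the softmax into the cross-entropy when the label entries sum to 1), and compares the analytic gradient against central finite differences. The function names, the step size `eps`, and the test values are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def loss_from_logits(y, o):
    """Cross-entropy of softmax(o) against y, written in terms of logits."""
    m = max(o)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in o))
    return log_sum_exp - sum(y_j * o_j for y_j, o_j in zip(y, o))


def analytic_grad(y, o):
    """Gradient with respect to the logits: softmax(o)_j - y_j."""
    return [p - y_j for p, y_j in zip(softmax(o), y)]


def numeric_grad(y, o, eps=1e-6):
    """Central finite-difference approximation of the same gradient."""
    grads = []
    for j in range(len(o)):
        up = o[:j] + [o[j] + eps] + o[j + 1:]
        down = o[:j] + [o[j] - eps] + o[j + 1:]
        diff = loss_from_logits(y, up) - loss_from_logits(y, down)
        grads.append(diff / (2 * eps))
    return grads


y = [0.0, 1.0, 0.0]
o = [3.0, 1.2, -3.0]
g_exact = analytic_grad(y, o)
g_approx = numeric_grad(y, o)
print(g_exact)
print(g_approx)
```

The two gradients agree closely, and the components of the analytic gradient sum to zero because both the softmax output and the label sum to 1.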
Now consider the case where we observe not just a single outcome
but an entire distribution over outcomes.
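We can use the same representation as before for the label $\mathbf{y}$. The only difference is that rather than a vector containing only binary entries, say $(0, 0, 1)$, we now have a generic probability vector, say $(0.1, 0.2, 0.7)$. The math that we used previously to define the loss $l$ in :eqref:eq_l_cross_entropy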
still works out fine,
just that the interpretation is slightly more general.
It is the expected value of the loss for a distribution over labels.
This loss is called the cross-entropy loss and it is
one of the most commonly used losses for classification problems.
We can demystify the name by introducing just the basics of information theory.
If you wish to understand more details of information theory,
you may further refer to the online appendix on information theory.
Information theory deals with the problem of encoding, decoding, transmitting, and manipulating information (also known as data) in as concise a form as possible.
The central idea in information theory is to quantify the information content in data.
This quantity places a hard limit on our ability to compress the data.
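In information theory, this quantity is called the entropy of a distribution $P$, and it is defined as
$$ H[P] = \sum_j - P(j) \log P(j). $$
:eqlabel:eq_softmax_reg_entropy
One of the fundamental theorems of information theory states
that in order to encode data drawn randomly from the distribution $P$, we need at least $H[P]$ "nats". A nat is the equivalent of a bit when using a code with base $e$ rather than base 2; one nat is $\frac{1}{\log(2)} \approx 1.44$ bits.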
You might be wondering what compression has to do with prediction. Imagine that we have a stream of data that we want to compress. If it is always easy for us to predict the next token, then this data is easy to compress! Take the extreme example where every token in the stream always takes the same value. That is a very boring data stream! And not only is it boring, but it is also easy to predict. Because the tokens are always the same, we do not have to transmit any information to communicate the contents of the stream. Easy to predict, easy to compress.
However, if we cannot perfectly predict every event,
then we might sometimes be surprised.
Our surprise is greater when we assigned an event lower probability.
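Claude Shannon settled on $\log \frac{1}{P(j)} = -\log P(j)$ to quantify one's surprisal at observing an event $j$ having assigned it a (subjective) probability $P(j)$. The entropy defined in :eqref:eq_softmax_reg_entropy
is then the expected surprisal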
when one assigned the correct probabilities
that truly match the data-generating process.
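So if entropy is the level of surprise experienced
by someone who knows the true probability,
then you might be wondering, what is cross-entropy?
The cross-entropy from $P$ to $Q$, denoted $H(P, Q)$, is the expected surprisal of an observer with subjective probabilities $Q$ upon seeing data that were actually generated according to probabilities $P$:
$$ H(P, Q) = \sum_j - P(j) \log Q(j). $$
The lowest possible cross-entropy is achieved when $P = Q$, in which case it equals the entropy of $P$.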
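The relationship between entropy and cross-entropy can be explored numerically. The sketch below assumes the standard definitions $H[P] = \sum_j -P(j) \log P(j)$ and $H(P, Q) = \sum_j -P(j) \log Q(j)$, measured in nats; the distributions `p` and `q` are made up for illustration.

```python
import math


def entropy(p):
    """Entropy of p in nats; zero-probability outcomes contribute nothing."""
    return -sum(pj * math.log(pj) for pj in p if pj > 0)


def cross_entropy(p, q):
    """Expected surprisal under beliefs q when data are drawn from p."""
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q) if pj > 0)


p = [0.1, 0.2, 0.7]              # illustrative "true" distribution
q = [1.0 / 3, 1.0 / 3, 1.0 / 3]  # an observer who believes in a uniform one

print(entropy(p))                # surprise of someone who knows p
print(cross_entropy(p, q))       # surprise of someone who believes q
print(entropy(q) / math.log(2))  # entropy of the uniform q, converted to bits
```

An observer with the wrong beliefs is never less surprised on average than one who knows the true distribution, and the two quantities coincide when the beliefs are correct.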
In short, we can think of the cross-entropy classification objective in two ways: (i) as maximizing the likelihood of the observed data; and (ii) as minimizing our surprisal (and thus the number of bits) required to communicate the labels.
After training the softmax regression model, given any example features, we can predict the probability of each output class. Normally, we use the class with the highest predicted probability as the output class. The prediction is correct if it is consistent with the actual class (label). In the next part of the experiment, we will use accuracy to evaluate the model's performance. This is equal to the ratio between the number of correct predictions and the total number of predictions.
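Prediction and accuracy as described above can be sketched as follows. The helper names `predict` and `accuracy`, the integer class indices, and the numbers are hypothetical; ties are broken in favor of the lowest class index, which is an arbitrary choice.

```python
def predict(y_hat):
    """Return the index of the class with the highest predicted probability."""
    return max(range(len(y_hat)), key=lambda j: y_hat[j])


def accuracy(predicted_probs, labels):
    """Fraction of examples whose predicted class matches the label."""
    correct = sum(predict(p) == y for p, y in zip(predicted_probs, labels))
    return correct / len(labels)


# Illustrative predictions for four examples and their true class indices.
probs = [
    [0.1, 0.8, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
labels = [1, 0, 2, 1]

print(accuracy(probs, labels))  # 0.75
```

Three of the four predictions match their labels; the last example is misclassified because class 0 receives a slightly higher probability than the true class 1.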
- The softmax operation takes a vector and maps it into probabilities.
- Softmax regression applies to classification problems. It uses the softmax operation to output a probability distribution over the classes.
- Cross-entropy is a good measure of the difference between two probability distributions. It measures the number of bits needed to encode the data given our model.
- We can explore the connection between exponential families and the softmax in some more depth.
    - Compute the second derivative of the cross-entropy loss $l(\mathbf{y},\hat{\mathbf{y}})$ for the softmax.
    - Compute the variance of the distribution given by $\mathrm{softmax}(\mathbf{o})$ and show that it matches the second derivative computed above.
- Assume that we have three classes which occur with equal probability, i.e., the probability vector is $(\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$.
    - What is the problem if we try to design a binary code for it?
    - Can you design a better code? Hint: what happens if we try to encode two independent observations? What if we encode $n$ observations jointly?
- Softmax is a misnomer for the mapping introduced above (but everyone in deep learning uses it). The real softmax is defined as $\mathrm{RealSoftMax}(a, b) = \log (\exp(a) + \exp(b))$.
    - Prove that $\mathrm{RealSoftMax}(a, b) > \mathrm{max}(a, b)$.
    - Prove that this holds for $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b)$, provided that $\lambda > 0$.
    - Show that for $\lambda \to \infty$ we have $\lambda^{-1} \mathrm{RealSoftMax}(\lambda a, \lambda b) \to \mathrm{max}(a, b)$.
    - What does the soft-min look like?
    - Extend this to more than two numbers.