🏷️sec_backprop
So far, we have trained our models with minibatch stochastic gradient descent. However, when we implemented the algorithm, we only worried about the calculations involved in forward propagation through the model. When it came time to calculate the gradients, we just invoked the backpropagation function provided by the deep learning framework.
The automatic calculation of gradients (automatic differentiation) profoundly simplifies the implementation of deep learning algorithms. Before automatic differentiation, even small changes to complicated models required recalculating complicated derivatives by hand. Surprisingly often, academic papers had to allocate numerous pages to deriving update rules. While we must continue to rely on automatic differentiation so we can focus on the interesting parts, you ought to know how these gradients are calculated under the hood if you want to go beyond a shallow understanding of deep learning.
In this section, we take a deep dive
into the details of backward propagation
(more commonly called backpropagation).
To convey some insight for both the
techniques and their implementations,
we rely on some basic mathematics and computational graphs.
To start, we focus our exposition on
a one-hidden-layer MLP
with weight decay (
Forward propagation (or forward pass) refers to the calculation and storage of intermediate variables (including outputs) for a neural network, in order from the input layer to the output layer. We now work step by step through the mechanics of a neural network with one hidden layer. This may seem tedious, but in the eternal words of funk virtuoso James Brown, you must "pay the cost to be the boss".
For the sake of simplicity, let's assume
that the input example is
where
The hidden layer output
Assuming that the loss function is
According to the definition of
eq_forward-s
where the Frobenius norm of the matrix
is simply the
We refer to
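The forward pass described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the book's reference code: the names `W1`, `W2`, `lam`, and `forward` are illustrative, the activation is assumed to be `tanh`, the loss is assumed to be squared error, and the regularization term is assumed to be scaled by `lam / 2`. The hidden layer has no bias, matching the simple network in this section.

```python
import numpy as np

def forward(x, y, W1, W2, lam):
    """Forward pass of a bias-free one-hidden-layer MLP with weight decay.

    Returns the objective J and a cache of the stored intermediate variables.
    """
    z = W1 @ x                      # hidden pre-activation
    h = np.tanh(z)                  # hidden layer output (tanh is an assumption)
    o = W2 @ h                      # output layer variable
    L = 0.5 * np.sum((o - y) ** 2)  # loss term (squared error is an assumption)
    # Squared Frobenius norm = sum of squared entries of the flattened matrix.
    s = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    J = L + s                       # regularized objective
    return J, {"z": z, "h": h, "o": o, "L": L, "s": s}

rng = np.random.default_rng(0)
d, n_hidden, q = 4, 3, 2            # input, hidden, and output sizes
x = rng.normal(size=d)
y = rng.normal(size=q)
W1 = rng.normal(size=(n_hidden, d))
W2 = rng.normal(size=(q, n_hidden))

J, cache = forward(x, y, W1, W2, lam=0.1)
```

The cache holds exactly the intermediate variables that a later backward pass would need, which is why the forward pass stores them rather than discarding them.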
Plotting computational graphs helps us visualize
the dependencies of operators
and variables within the calculation.
:numref:fig_forward
contains the graph associated
with the simple network described above,
where squares denote variables and circles denote operators.
The lower-left corner signifies the input
and the upper-right corner is the output.
Notice that the directions of the arrows
(which illustrate data flow)
are primarily rightward and upward.
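The dependency structure of such a graph can also be written down as data. The sketch below is only an illustration: the node names are hypothetical labels for the variables of the simple network, and operator nodes (the circles in the figure) are omitted for brevity. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to derive an order in which every variable is computed after the variables it depends on.

```python
from graphlib import TopologicalSorter

# Each variable maps to the set of variables it depends on.
# All names here are hypothetical labels, not taken from the figure.
graph = {
    "z": {"x", "W1"},   # hidden pre-activation
    "h": {"z"},         # hidden layer output
    "o": {"h", "W2"},   # output layer variable
    "L": {"o", "y"},    # loss term
    "s": {"W1", "W2"},  # regularization term
    "J": {"L", "s"},    # objective
}

# Forward propagation follows the direction of the dependencies.
forward_order = list(TopologicalSorter(graph).static_order())

# Traversing the same graph in reverse starts from the objective.
backward_order = forward_order[::-1]
```

Reversing the forward order is what a backward traversal of the graph looks like: the objective comes first and the inputs and parameters come last.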
Backpropagation refers to the method of calculating
the gradient of neural network parameters.
In short, the method traverses the network in reverse order,
from the output to the input layer,
according to the chain rule from calculus.
The algorithm stores any intermediate variables
(partial derivatives)
required while calculating the gradient
with respect to some parameters.
Assume that we have functions
Here we use the
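The chain rule for a composition of two functions can be checked numerically. The sketch below assumes a linear map `f` followed by a sum of squares `g`; both functions and all names are illustrative choices, not part of the original text. It shows that multiplying the two derivatives requires a transposition before the product has the same shape as the input.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
x = rng.normal(size=4)

def f(x):
    return A @ x              # y = f(x), shape (3,)

def g(y):
    return np.sum(y ** 2)     # z = g(y), scalar

y = f(x)
dz_dy = 2 * y                 # gradient of g at y, shape (3,)
dy_dx = A                     # Jacobian of f, shape (3, 4)
dz_dx = dy_dx.T @ dz_dy       # chain rule: transpose, then multiply

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.array([
    (g(f(x + eps * e)) - g(f(x - eps * e))) / (2 * eps)
    for e in np.eye(4)
])
```

Here `dz_dx` has the same shape as `x`, and `numeric` should agree with it up to floating-point error.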
Recall that
the parameters of the simple network with one hidden layer,
whose computational graph is in :numref:fig_forward,
are
Next, we compute the gradient of the objective function
with respect to variable of the output layer
Next, we calculate the gradients of the regularization term with respect to both parameters:
Now we are able to calculate the gradient
eq_backprop-J-h
To obtain the gradient with respect to
Since the activation function
Finally, we can obtain the gradient
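The backward pass can be sketched by hand and verified against finite differences. As before, this is an illustration under assumptions rather than a definitive implementation: `tanh` activation, squared-error loss, a regularization term scaled by `lam / 2`, no bias, and illustrative names throughout (`forward`, `backward`, `numeric_grad`, `W1`, `W2`, `lam`).

```python
import numpy as np

def forward(x, y, W1, W2, lam):
    z = W1 @ x
    h = np.tanh(z)                               # assumed activation
    o = W2 @ h
    L = 0.5 * np.sum((o - y) ** 2)               # assumed loss
    s = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    return L + s, (h, o)                         # objective and cached values

def backward(x, y, W1, W2, lam, cache):
    h, o = cache                                 # reuse stored forward values
    dJ_do = o - y                                # gradient w.r.t. output variable
    dJ_dW2 = np.outer(dJ_do, h) + lam * W2       # loss part plus regularization part
    dJ_dh = W2.T @ dJ_do                         # propagate to hidden layer output
    dJ_dz = dJ_dh * (1.0 - h ** 2)               # elementwise: tanh'(z) = 1 - tanh(z)^2
    dJ_dW1 = np.outer(dJ_dz, x) + lam * W1       # gradient for parameters nearest the input
    return dJ_dW1, dJ_dW2

def numeric_grad(fn, W, eps=1e-6):
    """Central finite differences of fn() with respect to each entry of W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        plus = fn()
        W[idx] = old - eps
        minus = fn()
        W[idx] = old                             # restore the original entry
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x, y = rng.normal(size=4), rng.normal(size=2)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(2, 3))
lam = 0.1

J, cache = forward(x, y, W1, W2, lam)
dW1, dW2 = backward(x, y, W1, W2, lam, cache)

objective = lambda: forward(x, y, W1, W2, lam)[0]
num_dW1 = numeric_grad(objective, W1)
num_dW2 = numeric_grad(objective, W2)
```

Note that `backward` visits the variables in the reverse of the order used by `forward`, and that the gradient for `W2` needs the cached hidden layer output `h`.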
When training neural networks, forward and backward propagation depend on each other. In particular, for forward propagation, we traverse the computational graph in the direction of dependencies and compute all the variables on its path. These are then used for backpropagation where the compute order on the graph is reversed.
Take the aforementioned simple network as an example to illustrate.
On the one hand,
computing the regularization term :eqref:eq_forward-s
during forward propagation
depends on the current values of the model parameters.
On the other hand,
the gradient calculation in :eqref:eq_backprop-J-h
during backpropagation
depends on the current value of the hidden layer output,
which is given by forward propagation.
Therefore, when training neural networks, once model parameters are initialized, we alternate forward propagation with backpropagation, updating model parameters using the gradients given by backpropagation. Note that backpropagation reuses the intermediate values stored during forward propagation to avoid duplicate calculations. One consequence is that we need to retain the intermediate values until backpropagation is complete. This is also one of the reasons why training requires significantly more memory than plain prediction. Moreover, the size of these intermediate values is roughly proportional to the number of network layers and the batch size. Thus, training deeper networks with larger batch sizes more easily leads to out-of-memory errors.
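The rough proportionality between stored intermediate values, depth, and batch size can be made concrete with a back-of-the-envelope count. The sketch below is a deliberately crude estimate under loud assumptions: fully connected layers, one stored activation vector per layer per example, 4 bytes per value (32-bit floats), and no accounting for parameters, gradients, optimizer state, or framework overhead. The function names are hypothetical.

```python
def stored_values(batch_size, layer_widths):
    """Count activations kept for backprop: one vector per layer per example."""
    return batch_size * sum(layer_widths)

def activation_bytes(batch_size, layer_widths, bytes_per_value=4):
    """Rough activation memory, assuming 32-bit floats by default."""
    return stored_values(batch_size, layer_widths) * bytes_per_value

# Four hidden layers of width 256, batch size 32.
small = activation_bytes(batch_size=32, layer_widths=[256] * 4)

# Doubling both the depth and the batch size quadruples the estimate.
large = activation_bytes(batch_size=64, layer_widths=[256] * 8)
```

Under these assumptions, `small` is 131072 bytes and `large` is 524288 bytes. During prediction, each layer's activations can be discarded as soon as the next layer has been computed, so this storage cost does not accumulate in the same way.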
Forward propagation sequentially calculates and stores intermediate variables within the computational graph defined by the neural network. It proceeds from the input to the output layer. Backpropagation sequentially calculates and stores the gradients of intermediate variables and parameters within the neural network in the reversed order. When training deep learning models, forward propagation and backpropagation are interdependent, and training requires significantly more memory than prediction.
- Assume that the inputs $\mathbf{X}$ to some scalar function $f$ are $n \times m$ matrices. What is the dimensionality of the gradient of $f$ with respect to $\mathbf{X}$?
- Add a bias to the hidden layer of the model described in this section (you do not need to include bias in the regularization term).
    - Draw the corresponding computational graph.
    - Derive the forward and backward propagation equations.
- Compute the memory footprint for training and prediction in the model described in this section.
- Assume that you want to compute second derivatives. What happens to the computational graph? How long do you expect the calculation to take?
- Assume that the computational graph is too large for your GPU.
    - Can you partition it over more than one GPU?
    - What are the advantages and disadvantages over training on a smaller minibatch?