:label:`sec_sgd`
In earlier chapters we kept using stochastic gradient descent in our training procedure, however, without explaining why it works.
To shed some light on it,
we just described the basic principles of gradient descent
in :numref:`sec_gd`.
In this section, we go on to discuss
stochastic gradient descent in greater detail.
```{.python .input}
#@tab mxnet
%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np, npx
npx.set_np()
```
```{.python .input}
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import math
import torch
```
```{.python .input}
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import math
import tensorflow as tf
```
In deep learning, the objective function is usually the average of the loss functions for each example in the training dataset.
Given a training dataset of $n$ examples, we assume that $f_i(\mathbf{x})$ is the loss function with respect to the training example of index $i$, where $\mathbf{x}$ is the parameter vector. Then we arrive at the objective function

$$f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n f_i(\mathbf{x}).$$
The gradient of the objective function at $\mathbf{x}$ is computed as

$$\nabla f(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}).$$
If gradient descent is used, the computational cost for each independent variable iteration is $\mathcal{O}(n)$, which grows linearly with $n$. Therefore, when the training dataset is larger, the cost of gradient descent for each iteration will be higher.
Stochastic gradient descent (SGD) reduces the computational cost at each iteration. At each iteration of stochastic gradient descent, we uniformly sample an index $i \in \{1, \ldots, n\}$ for data examples at random, and compute the gradient $\nabla f_i(\mathbf{x})$ to update $\mathbf{x}$:

$$\mathbf{x} \leftarrow \mathbf{x} - \eta \nabla f_i(\mathbf{x}),$$

where $\eta$ is the learning rate. The computational cost for each iteration thus drops from $\mathcal{O}(n)$ for gradient descent to the constant $\mathcal{O}(1)$. Moreover, the stochastic gradient $\nabla f_i(\mathbf{x})$ is an unbiased estimate of the full gradient $\nabla f(\mathbf{x})$ because
$$\mathbb{E}_i \nabla f_i(\mathbf{x}) = \frac{1}{n} \sum_{i = 1}^n \nabla f_i(\mathbf{x}) = \nabla f(\mathbf{x}).$$
This means that, on average, the stochastic gradient is a good estimate of the gradient.
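As a quick sanity check of this unbiasedness claim (not part of the original example), we can average many uniformly sampled per-example gradients of a tiny synthetic objective and compare against the full gradient; all names and constants below are made up for illustration.

```python
import random

# Tiny synthetic problem: f_i(x) = 0.5 * (x - a_i)^2, so grad f_i(x) = x - a_i
# and the full gradient is x - mean(a). Averaging many uniformly sampled
# per-example gradients should be close to the full gradient.
random.seed(0)
a = [random.gauss(0.0, 1.0) for _ in range(1000)]
x = 2.0
full_grad = x - sum(a) / len(a)
sampled = [x - a[random.randrange(len(a))] for _ in range(100000)]
print(full_grad, sum(sampled) / len(sampled))  # the two values should agree closely
```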
Now, let's compare it with gradient descent by adding random noise with a mean of 0 and a variance of 1 to the gradient, so as to simulate stochastic gradient descent.
```{.python .input}
#@tab all
def f(x1, x2):  # Objective function
    return x1 ** 2 + 2 * x2 ** 2

def f_grad(x1, x2):  # Gradient of the objective function
    return 2 * x1, 4 * x2
```
```{.python .input}
#@tab mxnet
def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += d2l.normal(0.0, 1, (1,))
    g2 += d2l.normal(0.0, 1, (1,))
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)
```
```{.python .input}
#@tab pytorch
def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += torch.normal(0.0, 1, (1,)).item()
    g2 += torch.normal(0.0, 1, (1,)).item()
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)
```
```{.python .input}
#@tab tensorflow
def sgd(x1, x2, s1, s2, f_grad):
    g1, g2 = f_grad(x1, x2)
    # Simulate noisy gradient
    g1 += d2l.normal([1], 0.0, 1)
    g2 += d2l.normal([1], 0.0, 1)
    eta_t = eta * lr()
    return (x1 - eta_t * g1, x2 - eta_t * g2, 0, 0)
```
```{.python .input}
#@tab all
def constant_lr():
    return 1

eta = 0.1
lr = constant_lr  # Constant learning rate
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
```
As we can see, the trajectory of the variables in stochastic gradient descent is much more noisy than the one we observed in gradient descent in :numref:`sec_gd`. This is due to the stochastic nature of the gradient. That is, even when we arrive near the minimum, we are still subject to the uncertainty injected by the instantaneous gradient via $\eta \nabla f_i(\mathbf{x})$. Even after many steps the quality does not improve, which leaves us with the only alternative: to reduce the learning rate $\eta$ *dynamically* as optimization progresses.
This is also the reason for adding a learning rate function `lr` into the `sgd` step function. In the example above any functionality for learning rate scheduling lies dormant, since we set the associated `lr` function to be constant.
Replacing $\eta$ with a time-dependent learning rate $\eta(t)$ adds to the complexity of controlling convergence of an optimization algorithm. In particular, we need to figure out how rapidly $\eta$ should decay. If it is too quick, we will stop optimizing prematurely. If we decrease it too slowly, we waste too much time on optimization. A few basic strategies for adjusting $\eta$ over time are

$$
\begin{aligned}
    \eta(t) & = \eta_i \text{ if } t_i \leq t \leq t_{i+1}  && \text{piecewise constant} \\
    \eta(t) & = \eta_0 \cdot e^{-\lambda t} && \text{exponential decay} \\
    \eta(t) & = \eta_0 \cdot (\beta t + 1)^{-\alpha} && \text{polynomial decay.}
\end{aligned}
$$

In the first *piecewise constant* scenario we decrease the learning rate, e.g., whenever progress in optimization stalls. This is a common strategy for training deep networks. Alternatively we could decrease it much more aggressively by an *exponential decay*. Unfortunately this often leads to premature stopping before the algorithm has converged. A popular choice is *polynomial decay* with $\alpha = 0.5$, which is known to be well behaved in the convex case.
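The exponential and polynomial schedules are implemented below; the piecewise constant case is not, so here is a minimal sketch in the same style. The step boundaries (20, 40) and rates are arbitrary choices for illustration, not values used elsewhere in this section.

```python
def piecewise_constant_lr():
    # Global step counter, defined outside and updated inside, mirroring the
    # schedules used below; boundaries and rates are illustrative only
    global t
    t += 1
    if t < 20:
        return 1.0
    elif t < 40:
        return 0.1
    else:
        return 0.01

t = 1
print([piecewise_constant_lr() for _ in range(50)][::10])  # sample the schedule
# It could be plugged into the trainer in the same way as the schedules below:
# t = 1; lr = piecewise_constant_lr
# d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
```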
Let's see what the exponential decay looks like in practice.
```{.python .input}
#@tab all
def exponential_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return math.exp(-0.1 * t)

t = 1
lr = exponential_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=1000, f_grad=f_grad))
```
As expected, the variance in the parameters is significantly reduced. However, this comes at the expense of failing to converge to the optimal solution $\mathbf{x} = (0, 0)$: even after 1000 iteration steps we are still very far away from it. Indeed, the algorithm fails to converge at all. On the other hand, if we use a polynomial decay where the learning rate decays with the inverse square root of the number of steps, convergence improves after only 50 steps.
```{.python .input}
#@tab all
def polynomial_lr():
    # Global variable that is defined outside this function and updated inside
    global t
    t += 1
    return (1 + 0.1 * t) ** (-0.5)

t = 1
lr = polynomial_lr
d2l.show_trace_2d(f, d2l.train_2d(sgd, steps=50, f_grad=f_grad))
```
There exist many more choices for how to set the learning rate. For instance, we could start with a small rate, then rapidly ramp up, and then decrease it again, albeit more slowly. We could even alternate between smaller and larger learning rates. A large variety of such schedules exists. For now let's focus on learning rate schedules for which a comprehensive theoretical analysis is possible, i.e., on learning rates in a convex setting. For general nonconvex problems it is very difficult to obtain meaningful convergence guarantees, since in general minimizing nonlinear nonconvex problems is NP-hard. For a survey see, e.g., the excellent lecture notes of Tibshirani 2015.
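As one concrete (hypothetical) instance of the "ramp up, then decay" idea just mentioned, a schedule in the same style as those above might look as follows; the warmup length and decay exponent are arbitrary and not taken from this section.

```python
def warmup_then_decay_lr():
    # Illustrative only: ramp the rate up linearly for the first 10 steps,
    # then decay it with the inverse square root of the remaining step count
    global t
    t += 1
    warmup_steps = 10
    if t <= warmup_steps:
        return t / warmup_steps
    return (t - warmup_steps) ** (-0.5)

t = 1
print([round(warmup_then_decay_lr(), 3) for _ in range(30)])
```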
The following convergence analysis of stochastic gradient descent for convex objective functions
is optional and primarily serves to convey more intuition about the problem.
We limit ourselves to one of the simplest proofs :cite:`Nesterov.Vial.2000`.
Significantly more advanced proof techniques exist, e.g., whenever the objective function is particularly well behaved.
Suppose that the objective function $f(\boldsymbol{\xi}, \mathbf{x})$ is convex in $\mathbf{x}$ for all $\boldsymbol{\xi}$. More concretely, we consider the stochastic gradient descent update
$$\mathbf{x}_{t+1} = \mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}),$$
where $f(\boldsymbol{\xi}_t, \mathbf{x})$ is the objective function with respect to the training example $\boldsymbol{\xi}_t$ drawn from some distribution at step $t$ and $\mathbf{x}$ is the model parameter. Denote by

$$R(\mathbf{x}) = E_{\boldsymbol{\xi}}[f(\boldsymbol{\xi}, \mathbf{x})]$$

the expected risk and by $R^*$ its minimum with regard to $\mathbf{x}$. Last let $\mathbf{x}^*$ be the minimizer (we assume that it exists within the domain where $\mathbf{x}$ is defined). In this case we can track the distance between the current parameter $\mathbf{x}_t$ at time $t$ and the risk minimizer $\mathbf{x}^*$ and see whether it improves over time:
$$\begin{aligned}
&\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \\
=& \|\mathbf{x}_{t} - \eta_t \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x}) - \mathbf{x}^*\|^2 \\
=& \|\mathbf{x}_{t} - \mathbf{x}^*\|^2 + \eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 - 2 \eta_t \left\langle \mathbf{x}_t - \mathbf{x}^*, \partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\right\rangle.
\end{aligned}$$
:eqlabel:`eq_sgd-xt+1-xstar`
We assume that the $\ell_2$ norm of the stochastic gradient $\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})$ is bounded by some constant $L$, hence we have that

$$\eta_t^2 \|\partial_\mathbf{x} f(\boldsymbol{\xi}_t, \mathbf{x})\|^2 \leq \eta_t^2 L^2.$$
:eqlabel:`eq_sgd-L`
We are mostly interested in how the distance between $\mathbf{x}_t$ and $\mathbf{x}^*$ changes *in expectation*. In fact, for any specific sequence of steps the distance might well increase, depending on whichever $\boldsymbol{\xi}_t$ we encounter. Hence we need to bound the inner product. Since for any convex function $f$ it holds that $f(\mathbf{y}) \geq f(\mathbf{x}) + \langle f'(\mathbf{x}), \mathbf{y} - \mathbf{x} \rangle$ for all $\mathbf{x}$ and $\mathbf{y}$, by convexity we have
$$f(\boldsymbol{\xi}_t, \mathbf{x}^*) \geq f(\boldsymbol{\xi}_t, \mathbf{x}_t) + \left\langle \mathbf{x}^* - \mathbf{x}_t, \partial_{\mathbf{x}} f(\boldsymbol{\xi}_t, \mathbf{x}_t) \right\rangle.$$
:eqlabel:`eq_sgd-f-xi-xstar`
Plugging both inequalities :eqref:`eq_sgd-L` and :eqref:`eq_sgd-f-xi-xstar` into :eqref:`eq_sgd-xt+1-xstar` we obtain a bound on the distance between parameters at time $t+1$ as follows:
$$\|\mathbf{x}_{t} - \mathbf{x}^*\|^2 - \|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2 \geq 2 \eta_t (f(\boldsymbol{\xi}_t, \mathbf{x}_t) - f(\boldsymbol{\xi}_t, \mathbf{x}^*)) - \eta_t^2 L^2.$$
:eqlabel:`eqref_sgd-xt-diff`
This means that we make progress as long as the difference between the current loss and the optimal loss outweighs $\eta_t L^2/2$. Since this difference is bound to converge to zero, it follows that the learning rate $\eta_t$ also needs to *vanish*.
Next we take expectations over :eqref:`eqref_sgd-xt-diff`. This yields
$$E\left[\|\mathbf{x}_{t} - \mathbf{x}^*\|^2\right] - E\left[\|\mathbf{x}_{t+1} - \mathbf{x}^*\|^2\right] \geq 2 \eta_t [E[R(\mathbf{x}_t)] - R^*] - \eta_t^2 L^2.$$
The last step involves summing over the inequalities for $t \in \{1, \ldots, T\}$. Since the sum telescopes and by dropping the lower term we obtain
$$\|\mathbf{x}_1 - \mathbf{x}^*\|^2 \geq 2 \left(\sum_{t=1}^T \eta_t\right) [E[R(\mathbf{x}_t)] - R^*] - L^2 \sum_{t=1}^T \eta_t^2.$$
:eqlabel:`eq_sgd-x1-xstar`
Note that we exploited that $\mathbf{x}_1$ is given and thus the expectation can be dropped. Last define
$$\bar{\mathbf{x}} \stackrel{\mathrm{def}}{=} \frac{\sum_{t=1}^T \eta_t \mathbf{x}_t}{\sum_{t=1}^T \eta_t}.$$
Since
$$E\left(\frac{\sum_{t=1}^T \eta_t R(\mathbf{x}_t)}{\sum_{t=1}^T \eta_t}\right) = \frac{\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)]}{\sum_{t=1}^T \eta_t} = E[R(\mathbf{x}_t)],$$
by Jensen's inequality (applied with weights $\eta_t / \sum_{t=1}^T \eta_t$ in :eqref:`eq_jensens-inequality`) and convexity of $R$ it follows that $E[R(\mathbf{x}_t)] \geq E[R(\bar{\mathbf{x}})]$, thus
$$\sum_{t=1}^T \eta_t E[R(\mathbf{x}_t)] \geq \sum_{t=1}^T \eta_t E\left[R(\bar{\mathbf{x}})\right].$$
Plugging this into the inequality :eqref:`eq_sgd-x1-xstar` yields the bound

$$E\left[R(\bar{\mathbf{x}})\right] - R^* \leq \frac{r^2 + L^2 \sum_{t=1}^T \eta_t^2}{2 \sum_{t=1}^T \eta_t},$$

where $r^2 \stackrel{\mathrm{def}}{=} \|\mathbf{x}_1 - \mathbf{x}^*\|^2$ is a bound on the distance between the initial choice of parameters and the final outcome. In short, the speed of convergence depends on how the norm of the stochastic gradient is bounded ($L$) and how far away from optimality the initial parameter value is ($r$). Note that the bound is in terms of $\bar{\mathbf{x}}$ rather than $\mathbf{x}_T$. This is the case since $\bar{\mathbf{x}}$ is a smoothed version of the optimization path. Whenever $r$, $L$, and $T$ are known we can pick the learning rate $\eta = r/(L \sqrt{T})$. This yields the upper bound $rL/\sqrt{T}$. That is, we converge with rate $\mathcal{O}(1/\sqrt{T})$ to the optimal solution.
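To get a feel for this rate, here is a small numeric sketch that plugs hypothetical values of $r$ and $L$ into the learning rate and bound above; the values of `r`, `L`, and `T` are purely illustrative.

```python
import math

# Hypothetical values of r (initial distance) and L (gradient norm bound);
# they only serve to show how the guarantee shrinks as T grows.
r, L = 1.0, 1.0
for T in [10, 100, 1000, 10000]:
    eta = r / (L * math.sqrt(T))   # learning rate suggested by the analysis
    bound = r * L / math.sqrt(T)   # upper bound on E[R(x_bar)] - R^*
    print(T, round(eta, 4), round(bound, 4))
```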
So far we have played a bit fast and loose when it comes to talking about stochastic gradient descent. We posited that we draw instances $x_i$, typically with labels $y_i$, from some distribution $p(x, y)$ and that we use this to update the model parameters in some manner. In particular, for a finite sample size we simply argued that the discrete distribution $p(x, y) = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}(x) \delta_{y_i}(y)$ for some functions $\delta_{x_i}$ and $\delta_{y_i}$ allows us to perform stochastic gradient descent over it.
However, this is not really what we did. In the toy examples in the current section we simply added noise to an otherwise non-stochastic gradient, i.e., we pretended to have pairs $(x_i, y_i)$. It turns out that this is justified here (see the exercises for a detailed discussion). More troubling is that in all previous discussions we clearly did not do this. Instead we iterated over all instances *exactly once*. To see why this is preferable consider the converse, namely that we sample $n$ observations from the discrete distribution *with replacement*. The probability of choosing an element $i$ at random is $1/n$. Thus the probability of choosing it *at least* once is

$$P(\mathrm{choose~} i) = 1 - P(\mathrm{omit~} i) = 1 - (1-1/n)^n \approx 1-e^{-1} \approx 0.63.$$
A similar reasoning shows that the probability of picking some sample (i.e., training example) *exactly once* is given by

$${n \choose 1} \frac{1}{n} \left(1-\frac{1}{n}\right)^{n-1} = \frac{n}{n-1} \left(1-\frac{1}{n}\right)^{n} \approx e^{-1} \approx 0.37.$$
Sampling with replacement leads to an increased variance and decreased data efficiency relative to sampling without replacement. Hence, in practice we perform the latter (and this is the default choice throughout this book). Last note that repeated passes through the training dataset traverse it in a different random order.
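A short numeric check of the two with-replacement probabilities above (purely illustrative; not part of the original examples): for large $n$ they approach $1 - 1/e$ and $1/e$, respectively.

```python
import math

# Probability of seeing a given example at least once / exactly once when
# drawing n samples with replacement from n examples.
for n in [10, 100, 1000, 100000]:
    at_least_once = 1 - (1 - 1 / n) ** n
    exactly_once = n * (1 / n) * (1 - 1 / n) ** (n - 1)
    print(n, round(at_least_once, 4), round(exactly_once, 4))
print(round(1 - math.exp(-1), 4), round(math.exp(-1), 4))  # limiting values
```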
- For convex problems we can prove that for a wide choice of learning rates stochastic gradient descent will converge to the optimal solution.
- For deep learning this is generally not the case. However, the analysis of convex problems gives us useful insight into how to approach optimization, namely to reduce the learning rate progressively, albeit not too quickly.
- Problems occur when the learning rate is too small or too large. In practice a suitable learning rate is often found only after multiple experiments.
- When there are more examples in the training dataset, it costs more to compute each iteration for gradient descent, so stochastic gradient descent is preferred in these cases.
- Optimality guarantees for stochastic gradient descent are in general not available in nonconvex cases since the number of local minima that require checking might well be exponential.
- Experiment with different learning rate schedules for stochastic gradient descent and with different numbers of iterations. In particular, plot the distance from the optimal solution $(0, 0)$ as a function of the number of iterations.
- Prove that for the function $f(x_1, x_2) = x_1^2 + 2 x_2^2$ adding normal noise to the gradient is equivalent to minimizing a loss function $f(\mathbf{x}, \mathbf{w}) = (x_1 - w_1)^2 + 2 (x_2 - w_2)^2$ where $\mathbf{x}$ is drawn from a normal distribution.
- Compare convergence of stochastic gradient descent when you sample from $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ with replacement and when you sample without replacement.
- How would you change the stochastic gradient descent solver if some gradient (or rather some coordinate associated with it) was consistently larger than all the other gradients?
- Assume that $f(x) = x^2 (1 + \sin x)$. How many local minima does $f$ have? Can you change $f$ in such a way that to minimize it one needs to evaluate all the local minima?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: