:label:`sec_linear_regression`
Regression refers to a set of methods for modeling the relationship between one or more independent variables and a dependent variable. In the natural sciences and social sciences, the purpose of regression is most often to characterize the relationship between the inputs and outputs. Machine learning, on the other hand, is most often concerned with prediction.
Regression problems pop up whenever we want to predict a numerical value. Common examples include predicting prices (of homes, stocks, etc.), predicting length of stay (for patients in the hospital), demand forecasting (for retail sales), among countless others. Not every prediction problem is a classic regression problem. In subsequent sections, we will introduce classification problems, where the goal is to predict membership among a set of categories.
Linear regression may be both the simplest and most popular among the standard tools for regression. Dating back to the dawn of the 19th century, linear regression flows from a few simple assumptions. First, we assume that the relationship between the independent variables $\mathbf{x}$ and the dependent variable $y$ is linear, i.e., that $y$ can be expressed as a weighted sum of the elements in $\mathbf{x}$, given some noise on the observations. Second, we assume that any noise is well behaved (following a Gaussian distribution).
To motivate the approach, let us start with a running example. Suppose that we wish to estimate the prices of houses (in dollars) based on their area (in square feet) and age (in years). To actually fit a model for predicting house prices, we would need to get our hands on a dataset consisting of sales for which we know the sale price, area, and age for each home. In the terminology of machine learning, the dataset is called a training dataset or training set, and each row (here the data corresponding to one sale) is called an example (or data point, data instance, sample). The thing we are trying to predict (price) is called a label (or target). The independent variables (age and area) upon which the predictions are based are called features (or covariates).
Typically, we will use $n$ to denote the number of examples in our dataset. We index the data examples by $i$, denoting each input as $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]^\top$ and the corresponding label as $y^{(i)}$.
:label:`subsec_linear_model`
The linearity assumption just says that the target (price) can be expressed as a weighted sum of the features (area and age):
$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b.$$
:eqlabel:`eq_price-area`
In :eqref:`eq_price-area`, $w_{\mathrm{area}}$ and $w_{\mathrm{age}}$ are called weights, and $b$ is called a bias (also called an offset or intercept). The weights determine the influence of each feature on our prediction, and the bias determines what value the predicted price takes when all of the features are zero. Strictly speaking, :eqref:`eq_price-area` is an affine transformation of input features, which is characterized by a linear transformation of features via weighted sum, combined with a translation via the added bias.
Given a dataset, our goal is to choose the weights $\mathbf{w}$ and the bias $b$ such that, on average, the predictions made according to our model best fit the true prices observed in the data.
In disciplines where it is common to focus
on datasets with just a few features,
explicitly expressing models long-form like this is common.
In machine learning, we usually work with high-dimensional datasets,
so it is more convenient to employ linear algebra notation.
When our inputs consist of $d$ features, we express our prediction $\hat{y}$ (in general the "hat" symbol denotes estimates) as

$$\hat{y} = w_1  x_1 + \ldots + w_d  x_d + b.$$

Collecting all features into a vector $\mathbf{x} \in \mathbb{R}^d$ and all weights into a vector $\mathbf{w} \in \mathbb{R}^d$, we can express our model compactly using a dot product:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b.$$
:eqlabel:`eq_linreg-y`
In :eqref:`eq_linreg-y`, the vector $\mathbf{x}$ corresponds to the features of a single data example. We will often find it convenient to refer to the features of our entire dataset of $n$ examples via the design matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$, which contains one row for every example and one column for every feature. For a collection of features $\mathbf{X}$, the predictions $\hat{\mathbf{y}} \in \mathbb{R}^n$ can be expressed via the matrix-vector product

$$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b,$$

where broadcasting (see :numref:`subsec_broadcasting`) is applied during the summation.
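To make the vectorized form concrete, here is a minimal sketch (not taken from the book's code; the feature values, weights, and bias below are made up) showing how $\hat{\mathbf{y}} = \mathbf{X}\mathbf{w} + b$ is computed with a matrix-vector product and a broadcast scalar bias.

```python
import numpy as np

# Hypothetical data: each row is one example with features (area, age)
X = np.array([[1200.0, 20.0],
              [1500.0,  5.0],
              [ 900.0, 40.0]])
w = np.array([100.0, -500.0])   # made-up weights
b = 50000.0                     # made-up bias

# Matrix-vector product plus a scalar bias, broadcast across all examples
y_hat = X @ w + b
print(y_hat)                    # one prediction per example
```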
Given features of a training dataset $\mathbf{X}$ and corresponding (known) labels $\mathbf{y}$, the goal of linear regression is to find the weight vector $\mathbf{w}$ and the bias term $b$ such that, given features of a new data example sampled from the same distribution as $\mathbf{X}$, the new example's label will (in expectation) be predicted with the lowest error.
Even if we believe that the best model for predicting $y$ given $\mathbf{x}$ is linear, we would not expect to find a real-world dataset of $n$ examples where $y^{(i)}$ exactly equals $\mathbf{w}^\top \mathbf{x}^{(i)} + b$ for all $1 \leq i \leq n$. For example, whatever instruments we use to observe the features $\mathbf{X}$ and labels $\mathbf{y}$ might suffer a small amount of measurement error. Thus, even when we are confident that the underlying relationship is linear, we will incorporate a noise term to account for such errors.
Before we can go about searching for the best parameters (or model parameters) $\mathbf{w}$ and $b$, we will need two more things: (i) a quality measure for some given model; and (ii) a procedure for updating the model to improve its quality.
Before we start thinking about how to fit our model,
we need to determine a measure of fitness.
The loss function quantifies the distance
between the real and predicted value of the target.
The loss will usually be a non-negative number
where smaller values are better
and perfect predictions incur a loss of 0.
The most popular loss function in regression problems
is the squared error.
When our prediction for example $i$ is $\hat{y}^{(i)}$ and the corresponding true label is $y^{(i)}$, the squared error is given by:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2.$$

The constant $\frac{1}{2}$ makes no real difference but proves notationally convenient, canceling out when we take the derivative of the loss. To make things more concrete, :numref:`fig_fit_linreg` illustrates fitting a linear regression model to a one-dimensional regression problem.
Note that large differences between estimates $\hat{y}^{(i)}$ and observations $y^{(i)}$ lead to even larger contributions to the loss, due to the quadratic dependence. To measure the quality of a model on the entire dataset of $n$ examples, we simply average (or equivalently, sum) the losses on the training set:

$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
When training the model, we want to find parameters ($\mathbf{w}^*, b^*$) that minimize the total loss across all training examples:

$$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\  L(\mathbf{w}, b).$$
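As a quick illustration (a sketch using made-up numbers, not the book's implementation; `squared_loss` is a hypothetical helper), the average squared loss above can be computed in a fully vectorized way:

```python
import numpy as np

def squared_loss(y_hat, y):
    """Per-example squared error l(w, b) = 0.5 * (y_hat - y)**2."""
    return 0.5 * (y_hat - y) ** 2

# Hypothetical features, labels, and parameters
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([5.0, 10.5, 17.0])
w, b = np.array([1.0, 2.0]), 0.0

y_hat = X @ w + b
L = squared_loss(y_hat, y).mean()   # average loss over the training set
print(L)
```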
Linear regression happens to be an unusually simple optimization problem.
Unlike most other models that we will encounter in this book,
linear regression can be solved analytically by applying a simple formula.
To start, we can subsume the bias $b$ into the parameter $\mathbf{w}$ by appending a column consisting of all ones to the design matrix. Then our prediction problem is to minimize $\|\mathbf{y} - \mathbf{X}\mathbf{w}\|^2$. There is just one critical point on the loss surface and it corresponds to the minimum of the loss over the entire domain. Taking the derivative of the loss with respect to $\mathbf{w}$ and setting it equal to zero yields the analytic (closed-form) solution:

$$\mathbf{w}^* = (\mathbf X^\top \mathbf X)^{-1}\mathbf X^\top \mathbf{y}.$$
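As a sanity check, here is a minimal sketch (not the book's code; the data values are made up) of computing this closed-form solution with NumPy. In practice one would use a least-squares solver, as below, rather than forming the matrix inverse explicitly.

```python
import numpy as np

# Hypothetical design matrix (area, age) and prices
X = np.array([[1200.0, 20.0], [1500.0, 5.0], [900.0, 40.0], [2000.0, 10.0]])
y = np.array([250000.0, 340000.0, 180000.0, 420000.0])

# Subsume the bias by appending a column of ones
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# w* = (X^T X)^{-1} X^T y, computed via a numerically stable least-squares solve
w_star, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w_star)   # the last entry plays the role of the bias b
```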
While simple problems like linear regression may admit analytic solutions, you should not get used to such good fortune. Although analytic solutions allow for nice mathematical analysis, the requirement of an analytic solution is so restrictive that it would exclude all of deep learning.
Even in cases where we cannot solve the models analytically, it turns out that we can still train models effectively in practice. Moreover, for many tasks, those difficult-to-optimize models turn out to be so much better that figuring out how to train them ends up being well worth the trouble.
The key technique for optimizing nearly any deep learning model, and which we will call upon throughout this book, consists of iteratively reducing the error by updating the parameters in the direction that incrementally lowers the loss function. This algorithm is called gradient descent.
The most naive application of gradient descent consists of taking the derivative of the loss function, which is an average of the losses computed on every single example in the dataset. In practice, this can be extremely slow: we must pass over the entire dataset before making a single update. Thus, we will often settle for sampling a random minibatch of examples every time we need to compute the update, a variant called minibatch stochastic gradient descent.
In each iteration, we first randomly sample a minibatch $\mathcal{B}$ consisting of a fixed number of training examples. We then compute the derivative (gradient) of the average loss on the minibatch with regard to the model parameters. Finally, we multiply the gradient by a predetermined positive value $\eta$ and subtract the resulting term from the current parameter values.
We can express the update mathematically as follows ($\partial$ denotes the partial derivative):

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b).$$
To summarize, steps of the algorithm are the following: (i) we initialize the values of the model parameters, typically at random; (ii) we iteratively sample random minibatches from the data, updating the parameters in the direction of the negative gradient. For quadratic losses and affine transformations, we can write this out explicitly as follows:
$$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right),\\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$
:eqlabel:`eq_linreg_batch_update`
Note that $\mathbf{w}$ and $\mathbf{x}$ are vectors in :eqref:`eq_linreg_batch_update`. Here, the more elegant vector notation makes the math much more readable than expressing things in terms of individual coefficients, say $w_1, w_2, \ldots, w_d$. The set cardinality $|\mathcal{B}|$ represents the number of examples in each minibatch (the batch size) and $\eta$ denotes the learning rate.
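The following is a minimal sketch of this procedure on synthetic data (the learning rate, batch size, and number of epochs are hypothetical choices for illustration; this is not the implementation used later in the book):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 2
true_w, true_b = np.array([2.0, -3.4]), 4.2
X = rng.normal(size=(n, d))
y = X @ true_w + true_b + rng.normal(scale=0.01, size=n)   # noisy labels

w, b = np.zeros(d), 0.0          # (i) initialize the parameters
eta, batch_size = 0.03, 10
for epoch in range(3):
    # (ii) sample random minibatches and step along the negative gradient
    idx = rng.permutation(n)
    for start in range(0, n, batch_size):
        batch = idx[start:start + batch_size]
        err = X[batch] @ w + b - y[batch]        # w^T x + b - y per example
        w -= eta / batch_size * X[batch].T @ err
        b -= eta / batch_size * err.sum()
print(w, b)   # should end up close to true_w and true_b
```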
After training for some predetermined number of iterations
(or until some other stopping criteria are met),
we record the estimated model parameters, denoted $\hat{\mathbf{w}}, \hat{b}$. Note that even if our function is truly linear and noiseless, these parameters will not be the exact minimizers of the loss, because, although the algorithm converges slowly towards the minimizers, it cannot achieve them exactly in a finite number of steps.
Linear regression happens to be a learning problem where there is only one minimum over the entire domain. However, for more complicated models, like deep networks, the loss surfaces contain many minima. Fortunately, for reasons that are not yet fully understood, deep learning practitioners seldom struggle to find parameters that minimize the loss on training sets. The more formidable task is to find parameters that will achieve low loss on data that we have not seen before, a challenge called generalization. We return to these topics throughout the book.
Given the learned linear regression model $\hat{\mathbf{w}}^\top \mathbf{x} + \hat{b}$, we can now estimate the price of a new house (not contained in the training data) given its area $x_1$ and age $x_2$. Estimating targets given features is commonly called prediction or inference.
We will try to stick with prediction because calling this step inference, despite emerging as standard jargon in deep learning, is somewhat of a misnomer. In statistics, inference more often denotes estimating parameters based on a dataset. This misuse of terminology is a common source of confusion when deep learning practitioners talk to statisticians.
When training our models, we typically want to process whole minibatches of examples simultaneously. Doing this efficiently requires that we vectorize the calculations and leverage fast linear algebra libraries rather than writing costly for-loops in Python.
```{.python .input}
%matplotlib inline
from d2l import mxnet as d2l
import math
from mxnet import np
import time
```
```{.python .input}
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import math
import torch
import numpy as np
import time
```
```{.python .input}
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import math
import tensorflow as tf
import numpy as np
import time
```
To illustrate why this matters so much,
we can consider two methods for adding vectors.
To start we instantiate two 10000-dimensional vectors
containing all ones.
In one method we will loop over the vectors with a Python for-loop.
In the other method we will rely on a single call to `+`.
```{.python .input}
#@tab all
n = 10000
a = d2l.ones(n)
b = d2l.ones(n)
```
Since we will benchmark the running time frequently in this book, let us define a timer.
```{.python .input}
#@tab all
class Timer:  #@save
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        """Start the timer."""
        self.tik = time.time()

    def stop(self):
        """Stop the timer and record the time in a list."""
        self.times.append(time.time() - self.tik)
        return self.times[-1]

    def avg(self):
        """Return the average time."""
        return sum(self.times) / len(self.times)

    def sum(self):
        """Return the sum of time."""
        return sum(self.times)

    def cumsum(self):
        """Return the accumulated time."""
        return np.array(self.times).cumsum().tolist()
```
Now we can benchmark the workloads. First, we add them, one coordinate at a time, using a for-loop.
```{.python .input}
#@tab mxnet, pytorch
c = d2l.zeros(n)
timer = Timer()
for i in range(n):
    c[i] = a[i] + b[i]
f'{timer.stop():.5f} sec'
```
```{.python .input}
#@tab tensorflow
c = tf.Variable(d2l.zeros(n))
timer = Timer()
for i in range(n):
    c[i].assign(a[i] + b[i])
f'{timer.stop():.5f} sec'
```
Alternatively, we rely on the overloaded `+` operator to compute the elementwise sum.
```{.python .input}
#@tab all
timer.start()
d = a + b
f'{timer.stop():.5f} sec'
```
You probably noticed that the second method is dramatically faster than the first. Vectorizing code often yields order-of-magnitude speedups. Moreover, we push more of the mathematics to the library and need not write as many calculations ourselves, reducing the potential for errors.
:label:`subsec_normal_distribution_and_squared_loss`
While you can already get your hands dirty using only the information above, in the following we can more formally motivate the square loss objective via assumptions about the distribution of noise.
Linear regression was invented by Gauss in 1795,
who also discovered the normal distribution (also called the Gaussian).
It turns out that the connection between
the normal distribution and linear regression
runs deeper than common parentage.
To refresh your memory, the probability density of a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$) is given as

$$p(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (x - \mu)^2\right).$$
Below we define a Python function to compute the normal distribution.
```{.python .input}
#@tab all
def normal(x, mu, sigma):
    p = 1 / math.sqrt(2 * math.pi * sigma**2)
    return p * np.exp(-0.5 / sigma**2 * (x - mu)**2)
```
We can now visualize the normal distributions.
```{.python .input}
#@tab all
# Use numpy again for visualization
x = np.arange(-7, 7, 0.01)

# Mean and standard deviation pairs
params = [(0, 1), (0, 2), (3, 1)]
d2l.plot(x, [normal(x, mu, sigma) for mu, sigma in params], xlabel='x',
         ylabel='p(x)', figsize=(4.5, 2.5),
         legend=[f'mean {mu}, std {sigma}' for mu, sigma in params])
```
As we can see, changing the mean corresponds to a shift along the $x$-axis, and increasing the variance spreads the distribution out, lowering its peak.
One way to motivate linear regression with the mean squared error loss function (or simply squared loss) is to formally assume that observations arise from noisy measurements, where the noise is normally distributed as follows:

$$y = \mathbf{w}^\top \mathbf{x} + b + \epsilon \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2).$$
Thus, we can now write out the likelihood of seeing a particular $y$ for a given $\mathbf{x}$ via

$$P(y \mid \mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{1}{2 \sigma^2} (y - \mathbf{w}^\top \mathbf{x} - b)^2\right).$$
Now, according to the principle of maximum likelihood, the best values of parameters $\mathbf{w}$ and $b$ are those that maximize the likelihood of the entire dataset:

$$P(\mathbf y \mid \mathbf X) = \prod_{i=1}^{n} p(y^{(i)} \mid \mathbf{x}^{(i)}).$$
Estimators chosen according to the principle of maximum likelihood
are called maximum likelihood estimators.
While maximizing the product of many exponential functions might look difficult, we can simplify things significantly, without changing the objective, by maximizing the log of the likelihood instead. For historical reasons, optimizations are more often expressed as minimization rather than maximization. So, without changing anything, we can minimize the negative log-likelihood $-\log P(\mathbf y \mid \mathbf X)$. Working out the mathematics gives us:

$$-\log P(\mathbf y \mid \mathbf X) = \sum_{i=1}^n \frac{1}{2} \log(2 \pi \sigma^2) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2.$$
Now we just need one more assumption: that $\sigma$ is some fixed constant. Thus we can ignore the first term, because it does not depend on $\mathbf{w}$ or $b$. The second term is identical to the squared error loss introduced earlier, except for the multiplicative constant $\frac{1}{\sigma^2}$. Fortunately, the solution does not depend on $\sigma$. It follows that minimizing the mean squared error is equivalent to maximum likelihood estimation of a linear model under the assumption of additive Gaussian noise.
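As a quick numerical check (a hypothetical sketch on synthetic data, not part of the book's code), we can verify that minimizing the negative log-likelihood by gradient descent recovers essentially the same parameters as the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -3.4]) + 4.2 + rng.normal(scale=0.1, size=200)
X_aug = np.hstack([X, np.ones((200, 1))])   # absorb the bias into w

sigma, w = 0.1, np.zeros(3)
for _ in range(2000):
    # Gradient of the negative log-likelihood (up to additive constants)
    grad = X_aug.T @ (X_aug @ w - y) / sigma**2
    w -= 1e-5 * grad

w_ls, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(np.allclose(w, w_ls, atol=1e-3))   # True: same minimizer
```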
So far we only talked about linear models. While neural networks cover a much richer family of models, we can begin thinking of the linear model as a neural network by expressing it in the language of neural networks. To begin, let us start by rewriting things in a "layer" notation.
Deep learning practitioners like to draw diagrams
to visualize what is happening in their models.
In :numref:`fig_single_neuron`, we depict our linear regression model as a neural network. Note that these diagrams highlight the connectivity pattern, such as how each input is connected to the output, but not the values taken by the weights or biases. For the neural network shown in :numref:`fig_single_neuron`, the inputs are $x_1, \ldots, x_d$, so the number of inputs (or feature dimensionality) in the input layer is $d$. The output of the network is $o_1$, so the number of outputs in the output layer is 1. Since the input values are all given and there is just a single computed neuron, we conventionally do not count the input layer, so the number of layers for this network is 1.
We can think of linear regression models as neural networks
consisting of just a single artificial neuron,
or as single-layer neural networks.
Since for linear regression, every input is connected to every output (in this case there is only one output), we can regard this transformation (the output layer in :numref:`fig_single_neuron`) as a fully-connected layer or dense layer.
We will talk a lot more about networks composed of such layers
in the next chapter.
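For readers who prefer code, a single fully-connected layer with two inputs and one output implements exactly this model. The following is a small PyTorch-based sketch (the framework choice and the input values here are illustrative, not prescribed by this section):

```python
import torch
from torch import nn

# Linear regression as a single dense (fully-connected) layer:
# two input features (area, age), one output (price)
net = nn.Linear(in_features=2, out_features=1)

x = torch.tensor([[1200.0, 20.0]])   # one made-up example
y_hat = net(x)                        # computes x @ w.T + b
print(y_hat.shape)                    # torch.Size([1, 1])
```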
Since linear regression (invented in 1795)
predates computational neuroscience,
it might seem anachronistic to describe
linear regression as a neural network.
To see why linear models were a natural place to begin
when the cyberneticists/neurophysiologists
Warren McCulloch and Walter Pitts began to develop
models of artificial neurons,
consider the cartoonish picture
of a biological neuron in :numref:`fig_Neuron`, consisting of
dendrites (input terminals),
the nucleus (CPU), the axon (output wire),
and the axon terminals (output terminals),
enabling connections to other neurons via synapses.
Information $x_i$ arriving from other neurons (or environmental sensors) is received in the dendrites. In particular, that information is weighted by synaptic weights $w_i$, determining the effect of the inputs (e.g., activation or inhibition via the product $x_i w_i$). The weighted inputs arriving from multiple sources are aggregated in the nucleus as a weighted sum $y = \sum_i x_i w_i + b$, and this information is then sent for further processing in the axon, typically after some nonlinear processing via $\sigma(y)$. From there it either reaches its destination (e.g., a muscle) or is fed into another neuron via its dendrites.
Certainly, the high-level idea that many such units could be cobbled together with the right connectivity and right learning algorithm to produce far more interesting and complex behavior than any one neuron alone could express owes to our study of real biological neural systems.
At the same time, most research in deep learning today
draws little direct inspiration from neuroscience.
We invoke Stuart Russell and Peter Norvig who,
in their classic AI text book
Artificial Intelligence: A Modern Approach :cite:`Russell.Norvig.2016`,
pointed out that although airplanes might have been inspired by birds,
ornithology has not been the primary driver
of aeronautics innovation for some centuries.
Likewise, inspiration in deep learning these days
comes in equal or greater measure from mathematics,
statistics, and computer science.
- Key ingredients in a machine learning model are training data, a loss function, an optimization algorithm, and quite obviously, the model itself.
- Vectorizing makes everything better (mostly math) and faster (mostly code).
- Minimizing an objective function and performing maximum likelihood estimation can mean the same thing.
- Linear regression models are neural networks, too.
- Assume that we have some data $x_1, \ldots, x_n \in \mathbb{R}$. Our goal is to find a constant $b$ such that $\sum_i (x_i - b)^2$ is minimized.
    - Find an analytic solution for the optimal value of $b$.
    - How does this problem and its solution relate to the normal distribution?
- Derive the analytic solution to the optimization problem for linear regression with squared error. To keep things simple, you can omit the bias $b$ from the problem (we can do this in principled fashion by adding one column to $\mathbf X$ consisting of all ones).
    - Write out the optimization problem in matrix and vector notation (treat all the data as a single matrix, and all the target values as a single vector).
    - Compute the gradient of the loss with respect to $w$.
    - Find the analytic solution by setting the gradient equal to zero and solving the matrix equation.
    - When might this be better than using stochastic gradient descent? When might this method break?
- Assume that the noise model governing the additive noise $\epsilon$ is the exponential distribution. That is, $p(\epsilon) = \frac{1}{2} \exp(-|\epsilon|)$.
    - Write out the negative log-likelihood of the data under the model $-\log P(\mathbf y \mid \mathbf X)$.
    - Can you find a closed form solution?
    - Suggest a stochastic gradient descent algorithm to solve this problem. What could possibly go wrong (hint: what happens near the stationary point as we keep on updating the parameters)? Can you fix this?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: