🏷️sec_linear-algebra
By now, we can load datasets into tensors and manipulate these tensors with basic mathematical operations. To start building sophisticated models, we will also need a few tools from linear algebra. This section offers a gentle introduction to the most essential concepts, starting from scalar arithmetic and ramping up to matrix multiplication.
Most everyday mathematics
consists of manipulating
numbers one at a time.
Formally, we call these values scalars.
For example, the temperature in Palo Alto
is a balmy $72$ degrees Fahrenheit.
We denote scalars
by ordinary lower-cased letters
(e.g., $x$, $y$, and $z$)
and the space of all (continuous) real-valued scalars by $\mathbb{R}$.
(Scalars are implemented as tensors that contain only one element.) Below, we assign two scalars and perform the familiar addition, multiplication, division, and exponentiation operations.
from mxnet import np, npx
npx.set_np()
x = np.array(3.0)
y = np.array(2.0)
x + y, x * y, x / y, x ** y
#@tab pytorch
import torch
x = torch.tensor(3.0)
y = torch.tensor(2.0)
x + y, x * y, x / y, x**y
#@tab tensorflow
import tensorflow as tf
x = tf.constant(3.0)
y = tf.constant(2.0)
x + y, x * y, x / y, x**y
For our purposes, [you can think of vectors
as fixed-length arrays of scalars.]
As with their code counterparts,
we call these values the elements of the vector
(synonyms include entries and components).
When vectors represent examples from real-world datasets,
their values hold some real-world significance.
For example, if we were training a model to predict
the risk of a loan defaulting,
we might associate each applicant with a vector
whose components correspond to quantities
like their income, length of employment,
or number of previous defaults.
If we were studying heart attack risk,
each vector might represent a patient
and its components might correspond to
their most recent vital signs, cholesterol levels,
minutes of exercise per day, etc.
We denote vectors by bold lowercase letters
(e.g., $\mathbf{x}$, $\mathbf{y}$, and $\mathbf{z}$).
Vectors are implemented as first-order tensors,
i.e., tensors with a single axis:
x = np.arange(3)
x
#@tab pytorch
x = torch.arange(3)
x
#@tab tensorflow
x = tf.range(3)
x
We can refer to an element of a vector by using a subscript.
For example, $x_2$ denotes the second element of $\mathbf{x}$.
Since $x_2$ is a scalar, we do not bold it.
By default, we visualize vectors by stacking their elements vertically:
$$\mathbf{x} =\begin{bmatrix}x_{1} \\ \vdots \\ x_{n}\end{bmatrix},$$
:eqlabel:eq_vec_def
Here $x_1, \ldots, x_n$ are elements of the vector.
Later, we will distinguish between such column vectors
and row vectors whose elements are stacked horizontally.
Recall that we access a tensor's elements via indexing:
x[2]
#@tab pytorch
x[2]
#@tab tensorflow
x[2]
To indicate that a vector contains $n$ elements,
we write $\mathbf{x} \in \mathbb{R}^n$.
Formally, we call $n$ the dimensionality of the vector.
In code, this corresponds to the tensor's length,
accessible via Python's built-in `len` function.
len(x)
#@tab pytorch
len(x)
#@tab tensorflow
len(x)
We can also access the length via the `shape` attribute.
The shape is a tuple that indicates a tensor's length along each axis.
(Tensors with just one axis have shapes with just one element.)
x.shape
#@tab pytorch
x.shape
#@tab tensorflow
x.shape
Oftentimes, the word "dimension" gets overloaded to mean both the number of axes and the length along a particular axis. To avoid this confusion, we use order to refer to the number of axes and dimensionality exclusively to refer to the number of components.
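To make the distinction concrete, here is a minimal PyTorch sketch of our own (the particular tensors are arbitrary) contrasting a tensor's order, its number of axes, with a vector's dimensionality, its number of components.

```python
import torch

v = torch.arange(5)     # a vector with dimensionality 5 (five components)
M = torch.zeros(3, 4)   # a matrix, i.e., a second-order tensor

# order = number of axes; dimensionality = number of components of a vector
v.ndim, v.shape, M.ndim, M.shape
```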
Just as scalars are zeroth-order tensors and vectors are first-order tensors,
matrices are second-order tensors.
We denote matrices by bold capital letters
(e.g., $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$)
and represent them in code by tensors with two axes.
The expression $\mathbf{A} \in \mathbb{R}^{m \times n}$ indicates
that a matrix $\mathbf{A}$ contains $m \times n$ real-valued scalars,
arranged as $m$ rows and $n$ columns.
When $m = n$, we say that a matrix is square.
To refer to an individual element, we subscript both the row and column indices,
e.g., $a_{ij}$ is the value in $\mathbf{A}$'s $i^{\textrm{th}}$ row and $j^{\textrm{th}}$ column:
$$\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix}.$$
:eqlabel:eq_matrix_def
In code, we represent a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$
by a second-order tensor with shape ($m$, $n$).
We can convert any appropriately sized tensor into an $m \times n$ matrix
by passing the desired shape to `reshape`:
A = np.arange(6).reshape(3, 2)
A
#@tab pytorch
A = torch.arange(6).reshape(3, 2)
A
#@tab tensorflow
A = tf.reshape(tf.range(6), (3, 2))
A
Sometimes, we want to flip the axes.
When we exchange a matrix's rows and columns,
the result is called its transpose.
Formally, we signify a matrix $\mathbf{A}$'s transpose by $\mathbf{A}^\top$:
if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for all $i$ and $j$.
Thus, the transpose of an $m \times n$ matrix is an $n \times m$ matrix.
In code, we can access any matrix's transpose as follows:
A.T
#@tab pytorch
A.T
#@tab tensorflow
tf.transpose(A)
[Symmetric matrices are the subset of square matrices
that are equal to their own transposes: $\mathbf{A} = \mathbf{A}^\top$.]
A = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T
#@tab pytorch
A = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == A.T
#@tab tensorflow
A = tf.constant([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
A == tf.transpose(A)
Matrices are useful for representing datasets. Typically, rows correspond to individual records and columns correspond to distinct attributes.
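As an illustration (the applicants and attribute values below are invented, not drawn from a real dataset), a tiny design matrix might look as follows, with one row per record and one column per attribute.

```python
import torch

# Each row is one (made-up) loan applicant; each column is one attribute:
# [annual income in $1000s, years employed, number of previous defaults]
applicants = torch.tensor([[65.0, 3.0, 0.0],
                           [120.0, 10.0, 1.0],
                           [43.0, 1.0, 2.0]])
applicants.shape   # 3 records, 3 attributes
```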
While you can go far in your machine learning journey
with only scalars, vectors, and matrices,
eventually you may need to work with
higher-order [tensors].
Tensors give us a generic way of describing
extensions to $n^{\textrm{th}}$-order arrays.
We denote general tensors by capital letters with a special font face
(e.g., $\mathsf{X}$, $\mathsf{Y}$, and $\mathsf{Z}$),
and their indexing (e.g., $x_{ijk}$) follows naturally from that of matrices.
Tensors will become more important
when we start working with images.
Each image arrives as a third-order tensor
with axes corresponding to the height, width, and channel,
where the channel axis stacks the color intensities (red, green, and blue).
Higher-order tensors are constructed just like vectors and matrices,
by growing the number of shape components:
np.arange(24).reshape(2, 3, 4)
#@tab pytorch
torch.arange(24).reshape(2, 3, 4)
#@tab tensorflow
tf.reshape(tf.range(24), (2, 3, 4))
Scalars, vectors, matrices, and higher-order tensors all have some handy properties. For example, elementwise operations produce outputs that have the same shape as their operands.
A = np.arange(6).reshape(2, 3)
B = A.copy() # Assign a copy of `A` to `B` by allocating new memory
A, A + B
#@tab pytorch
A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = A.clone() # Assign a copy of `A` to `B` by allocating new memory
A, A + B
#@tab tensorflow
A = tf.reshape(tf.range(6, dtype=tf.float32), (2, 3))
B = A # No cloning here: TensorFlow tensors are immutable, so `B` simply refers to the same values as `A`
A, A + B
The [elementwise product of two matrices
is called their Hadamard product] (denoted $\odot$):
the entry in row $i$ and column $j$ of $\mathbf{A} \odot \mathbf{B}$
is the product $a_{ij} b_{ij}$ of the corresponding entries of $\mathbf{A}$ and $\mathbf{B}$.
A * B
#@tab pytorch
A * B
#@tab tensorflow
A * B
[Adding or multiplying a scalar and a tensor] produces a result with the same shape as the original tensor. Here, each element of the tensor is added to (or multiplied by) the scalar.
a = 2
X = np.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
#@tab pytorch
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
#@tab tensorflow
a = 2
X = tf.reshape(tf.range(24), (2, 3, 4))
a + X, (a * X).shape
🏷️subsec_lin-alg-reduction
Often, we wish to calculate [the sum of a tensor's elements.]
To express the sum of the elements in a vector $\mathbf{x}$ of length $n$,
we write $\sum_{i=1}^n x_i$. There is a simple function to calculate it:
x = np.arange(3)
x, x.sum()
#@tab pytorch
x = torch.arange(3, dtype=torch.float32)
x, x.sum()
#@tab tensorflow
x = tf.range(3, dtype=tf.float32)
x, tf.reduce_sum(x)
To express [sums over the elements of tensors of arbitrary shape],
we simply sum over all of their axes.
For example, the sum of the elements
of an $m \times n$ matrix $\mathbf{A}$ is written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$:
A.shape, A.sum()
#@tab pytorch
A.shape, A.sum()
#@tab tensorflow
A.shape, tf.reduce_sum(A)
By default, invoking the sum function
reduces a tensor along all of its axes,
eventually producing a scalar.
Our libraries also allow us to [specify the axes
along which the tensor should be reduced.]
To sum over all elements along the rows (axis 0),
we specify `axis=0` in `sum`.
Since the input matrix reduces along axis 0
to generate the output vector,
this axis is missing from the shape of the output:
A.shape, A.sum(axis=0).shape
#@tab pytorch
A.shape, A.sum(axis=0).shape
#@tab tensorflow
A.shape, tf.reduce_sum(A, axis=0).shape
Specifying `axis=1` will reduce the column dimension (axis 1) by summing up the elements of all the columns:
A.shape, A.sum(axis=1).shape
#@tab pytorch
A.shape, A.sum(axis=1).shape
#@tab tensorflow
A.shape, tf.reduce_sum(A, axis=1).shape
Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.
A.sum(axis=[0, 1]) == A.sum() # Same as `A.sum()`
#@tab pytorch
A.sum(axis=[0, 1]) == A.sum() # Same as `A.sum()`
#@tab tensorflow
tf.reduce_sum(A, axis=[0, 1]), tf.reduce_sum(A) # Same as `tf.reduce_sum(A)`
[A related quantity is the mean, also called the average.]
We calculate the mean by dividing the sum
by the total number of elements.
Because computing the mean is so common,
it gets a dedicated library function
that works analogously to `sum`.
A.mean(), A.sum() / A.size
#@tab pytorch
A.mean(), A.sum() / A.numel()
#@tab tensorflow
tf.reduce_mean(A), tf.reduce_sum(A) / tf.size(A).numpy()
Likewise, the function for calculating the mean can also reduce a tensor along specific axes.
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
#@tab pytorch
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
#@tab tensorflow
tf.reduce_mean(A, axis=0), tf.reduce_sum(A, axis=0) / A.shape[0]
🏷️subsec_lin-alg-non-reduction
Sometimes it can be useful to [keep the number of axes unchanged] when invoking the function for calculating the sum or mean. This matters when we want to use the broadcast mechanism.
sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape
#@tab pytorch
sum_A = A.sum(axis=1, keepdims=True)
sum_A, sum_A.shape
#@tab tensorflow
sum_A = tf.reduce_sum(A, axis=1, keepdims=True)
sum_A, sum_A.shape
For instance, since `sum_A`
keeps its two axes after summing each row,
we can divide `A` by `sum_A` with broadcasting
to create a matrix where each row sums up to $1$:
A / sum_A
#@tab pytorch
A / sum_A
#@tab tensorflow
A / sum_A
If we want to calculate the cumulative sum of the elements of `A`
along some axis, say `axis=0` (row by row),
we can call the `cumsum` function.
By design, this function does not reduce the input tensor along any axis:
A.cumsum(axis=0)
#@tab pytorch
A.cumsum(axis=0)
#@tab tensorflow
tf.cumsum(A, axis=0)
So far, we have only performed elementwise operations, sums, and averages.
And if this was all we could do, linear algebra
would not deserve its own section.
Fortunately, this is where things get more interesting.
One of the most fundamental operations is the dot product.
Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$,
their dot product $\mathbf{x}^\top \mathbf{y}$
(also known as their inner product, $\langle \mathbf{x}, \mathbf{y} \rangle$)
is a sum over the products of the elements at the same position:
$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{d} x_i y_i$.
y = np.ones(3)
x, y, np.dot(x, y)
#@tab pytorch
y = torch.ones(3, dtype = torch.float32)
x, y, torch.dot(x, y)
#@tab tensorflow
y = tf.ones(3, dtype=tf.float32)
x, y, tf.tensordot(x, y, axes=1)
Equivalently, we can calculate the dot product of two vectors by performing an elementwise multiplication followed by a sum:
np.sum(x * y)
#@tab pytorch
torch.sum(x * y)
#@tab tensorflow
tf.reduce_sum(x * y)
Dot products are useful in a wide range of contexts.
For example, given some set of values,
denoted by a vector $\mathbf{x} \in \mathbb{R}^n$,
and a set of weights, denoted by $\mathbf{w} \in \mathbb{R}^n$,
the weighted sum of the values in $\mathbf{x}$
according to the weights $\mathbf{w}$
can be expressed as the dot product $\mathbf{x}^\top \mathbf{w}$.
When the weights are nonnegative and sum to one,
the dot product expresses a weighted average.
After normalizing two vectors to have unit length,
the dot products express the cosine of the angle between them.
Later in this section, we will formally introduce this notion of length.
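For instance, the following minimal PyTorch sketch (with made-up values and weights) shows a dot product computing a weighted average.

```python
import torch

values = torch.tensor([2.0, 4.0, 6.0])
weights = torch.tensor([0.2, 0.3, 0.5])   # nonnegative and sum to 1

# The dot product of values and weights computes the weighted average
torch.dot(values, weights)                # 0.2*2 + 0.3*4 + 0.5*6 = 4.6
```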
Now that we know how to calculate dot products,
we can begin to understand the product
between an $m \times n$ matrix $\mathbf{A}$
and an $n$-dimensional vector $\mathbf{x}$.
To start off, we visualize our matrix in terms of its row vectors,
$$\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_{m} \end{bmatrix},$$
where each $\mathbf{a}^\top_{i} \in \mathbb{R}^n$ is a row vector
representing the $i^{\textrm{th}}$ row of the matrix $\mathbf{A}$.
[The matrix-vector product $\mathbf{A}\mathbf{x}$
is simply a column vector of length $m$,
whose $i^{\textrm{th}}$ element is the dot product $\mathbf{a}^\top_i \mathbf{x}$:]
$$\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_{m} \end{bmatrix}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{x} \\ \mathbf{a}^\top_{2} \mathbf{x} \\ \vdots \\ \mathbf{a}^\top_{m} \mathbf{x} \end{bmatrix}.$$
We can think of multiplication with a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$
as a transformation that projects vectors from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$.
These transformations are remarkably useful.
For example, we can represent rotations
as multiplications by certain square matrices.
Matrix-vector products also describe the key calculation involved
in computing the outputs of each layer in a neural network
given the outputs from the previous layer.
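To make the transformation view concrete, the sketch below (our own example; the angle is chosen arbitrarily) applies a $2 \times 2$ rotation matrix to a vector and checks that its Euclidean length is unchanged.

```python
import math
import torch

theta = math.pi / 4                            # rotate by 45 degrees
R = torch.tensor([[math.cos(theta), -math.sin(theta)],
                  [math.sin(theta),  math.cos(theta)]])
v = torch.tensor([1.0, 0.0])

w = torch.mv(R, v)                             # the matrix-vector product rotates v
w, torch.norm(v), torch.norm(w)                # the Euclidean length is preserved
```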
:begin_tab:mxnet
To express a matrix-vector product in code,
we use the same `dot` function.
The operation is inferred
based on the type of the arguments.
Note that the column dimension of A
(its length along axis 1)
must be the same as the dimension of x
(its length).
:end_tab:
:begin_tab:pytorch
To express a matrix-vector product in code,
we use the `mv` function.
Note that the column dimension of A
(its length along axis 1)
must be the same as the dimension of x
(its length).
PyTorch has a convenience operator `@`
that can execute both matrix-vector
and matrix-matrix products
(depending on its arguments).
Thus we can write `A@x`.
:end_tab:
:begin_tab:tensorflow
To express a matrix-vector product in code,
we use the `matvec` function.
Note that the column dimension of A
(its length along axis 1)
must be the same as the dimension of x
(its length).
:end_tab:
A.shape, x.shape, np.dot(A, x)
#@tab pytorch
A.shape, x.shape, torch.mv(A, x), A@x
#@tab tensorflow
A.shape, x.shape, tf.linalg.matvec(A, x)
If you've gotten the hang of dot products and matrix-vector products, then matrix-matrix multiplication should be straightforward.
Say that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times m}$.
Let $\mathbf{a}^\top_{i} \in \mathbb{R}^k$ denote the row vector
representing the $i^{\textrm{th}}$ row of the matrix $\mathbf{A}$
and let $\mathbf{b}_{j} \in \mathbb{R}^k$ denote the column vector
from the $j^{\textrm{th}}$ column of the matrix $\mathbf{B}$:
$$\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_{n} \end{bmatrix}, \quad \mathbf{B}=\begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \end{bmatrix}.$$
To form the matrix product $\mathbf{C} = \mathbf{A}\mathbf{B} \in \mathbb{R}^{n \times m}$,
we simply compute each element $c_{ij}$
as the dot product between the $i^{\textrm{th}}$ row of $\mathbf{A}$
and the $j^{\textrm{th}}$ column of $\mathbf{B}$, i.e., $\mathbf{a}^\top_i \mathbf{b}_j$:
$$\mathbf{C} = \mathbf{AB} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_{n} \end{bmatrix} \begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2 & \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\ \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2 & \cdots & \mathbf{a}^\top_{n} \mathbf{b}_m \end{bmatrix}.$$
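As a sanity check (the explicit loop below is our own illustration, not how you would compute products in practice), we can assemble $\mathbf{C}$ entry by entry from dot products and compare it with the library routine.

```python
import torch

A = torch.arange(6, dtype=torch.float32).reshape(2, 3)
B = torch.ones(3, 4)

# Build C entry by entry: c_ij is the dot product of row i of A and column j of B
C = torch.empty(2, 4)
for i in range(A.shape[0]):
    for j in range(B.shape[1]):
        C[i, j] = torch.dot(A[i, :], B[:, j])

torch.allclose(C, A @ B)   # True: the loop agrees with the library routine
```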
[We can think of the matrix-matrix multiplication $\mathbf{AB}$
as performing $m$ matrix-vector products
(or $m \times n$ dot products)
and stitching the results together to form an $n \times m$ matrix.]
In the following snippet, we perform matrix multiplication on `A` and `B`.
Here, `A` is a matrix with 2 rows and 3 columns,
and `B` is a matrix with 3 rows and 4 columns.
After multiplication, we obtain a matrix with 2 rows and 4 columns.
B = np.ones(shape=(3, 4))
np.dot(A, B)
#@tab pytorch
B = torch.ones(3, 4)
torch.mm(A, B), A@B
#@tab tensorflow
B = tf.ones((3, 4), tf.float32)
tf.matmul(A, B)
The term matrix-matrix multiplication is often simplified to matrix multiplication, and should not be confused with the Hadamard product.
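To keep the two operations apart, here is a small PyTorch sketch (with arbitrarily chosen square matrices of the same shape) contrasting the Hadamard product with the matrix product; in general they give different results.

```python
import torch

A = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0]])
B = torch.tensor([[0.0, 1.0],
                  [1.0, 0.0]])

# Hadamard product (elementwise) vs. matrix multiplication (rows dotted with columns)
A * B, A @ B
```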
🏷️subsec_lin-algebra-norms
Some of the most useful operators in linear algebra are norms.
Informally, the norm of a vector tells us how big it is.
For instance, the $\ell_2$ norm measures the (Euclidean) length of a vector.
Here, we are employing a notion of size that concerns
the magnitude of a vector's components (not its dimensionality).
A norm is a function $\| \cdot \|$ that maps a vector to a scalar
and satisfies the following three properties:
- Given any vector $\mathbf{x}$, if we scale (all elements of) the vector by a scalar $\alpha \in \mathbb{R}$, its norm scales accordingly: $$\|\alpha \mathbf{x}\| = |\alpha| \|\mathbf{x}\|.$$
- For any vectors $\mathbf{x}$ and $\mathbf{y}$, norms satisfy the triangle inequality: $$\|\mathbf{x} + \mathbf{y}\| \leq \|\mathbf{x}\| + \|\mathbf{y}\|.$$
- The norm of a vector is nonnegative and it only vanishes if the vector is zero: $$\|\mathbf{x}\| > 0 \textrm{ for all } \mathbf{x} \neq 0.$$
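The following PyTorch sketch (with arbitrary example vectors) checks these three properties numerically for the Euclidean norm.

```python
import torch

x = torch.tensor([3.0, -4.0])
y = torch.tensor([1.0, 2.0])
alpha = -2.5

# Absolute homogeneity, triangle inequality, and positivity, checked numerically
homogeneity = torch.isclose(torch.norm(alpha * x), abs(alpha) * torch.norm(x))
triangle = torch.norm(x + y) <= torch.norm(x) + torch.norm(y)
positivity = (torch.norm(x) > 0) & (torch.norm(torch.zeros(2)) == 0)
homogeneity, triangle, positivity
```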
Many functions are valid norms and different norms
encode different notions of size.
The Euclidean norm, which we all learned in elementary school geometry
when calculating the hypotenuse of a right triangle,
is the square root of the sum of squares of a vector's elements.
Formally, this is called [the $\ell_2$ norm] and expressed as
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2}.$$
The method `norm` calculates the $\ell_2$ norm:
u = np.array([3, -4])
np.linalg.norm(u)
#@tab pytorch
u = torch.tensor([3.0, -4.0])
torch.norm(u)
#@tab tensorflow
u = tf.constant([3.0, -4.0])
tf.norm(u)
[The $\ell_1$ norm is also common
and the associated distance measure is called the Manhattan distance.]
By definition, the $\ell_1$ norm sums the absolute values of a vector's elements:
$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$$
Compared to the $\ell_2$ norm, it is less sensitive to outliers.
To compute the $\ell_1$ norm, we compose the absolute value with the sum operation:
np.abs(u).sum()
#@tab pytorch
torch.abs(u).sum()
#@tab tensorflow
tf.reduce_sum(tf.abs(u))
Both the $\ell_2$ and $\ell_1$ norms are special cases of the more general $\ell_p$ norms:
$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.$$
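As a quick illustration (the choice $p = 3$ is arbitrary), we can evaluate an $\ell_p$ norm directly from the definition and compare it with the library routine.

```python
import torch

u = torch.tensor([3.0, -4.0])
p = 3   # an arbitrary choice of p

# l_p norm straight from the definition: (sum_i |u_i|^p)^(1/p)
lp_from_definition = torch.abs(u).pow(p).sum().pow(1 / p)
lp_from_definition, torch.norm(u, p=p)
```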
In the case of matrices, matters are more complicated.
After all, matrices can be viewed both as collections of individual entries
and as objects that operate on vectors and transform them into other vectors.
For instance, we can ask by how much longer
the matrix-vector product $\mathbf{X} \mathbf{v}$ could be relative to $\mathbf{v}$.
This line of thinking leads to what is called the spectral norm.
For now, we introduce [the Frobenius norm, which is much easier to compute].
It is defined as the square root of the sum of the squares of a matrix's elements:
[$$\|\mathbf{X}\|_\textrm{F} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$$]
The Frobenius norm behaves as if it were
an $\ell_2$ norm of a matrix-shaped vector.
Invoking the following function will calculate the Frobenius norm of a matrix:
np.linalg.norm(np.ones((4, 9)))
#@tab pytorch
torch.norm(torch.ones((4, 9)))
#@tab tensorflow
tf.norm(tf.ones((4, 9)))
While we do not want to get too far ahead of ourselves, we can plant some intuition already about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; maximize the revenue associated with a recommender model; minimize the distance between predictions and the ground-truth observations; minimize the distance between representations of photos of the same person while maximizing the distance between representations of photos of different people. These distances, which constitute the objectives of deep learning algorithms, are often expressed as norms.
In this section, we reviewed all the linear algebra that you will need to understand a remarkable chunk of modern deep learning. There is a lot more to linear algebra and much of it is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order tensors to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And we believe you will be more inclined to learn more mathematics once you have gotten your hands dirty applying machine learning to real datasets. So while we reserve the right to introduce more mathematics later on, we wrap up this section here.
If you are eager to learn more linear algebra,
there are many excellent books and online resources.
For a more advanced crash course, consider checking out
:cite:Strang.1993,Kolter.2008,Petersen.Pedersen.ea.2008.
To recap:
- Scalars, vectors, matrices, and tensors are the basic mathematical objects used in linear algebra and have zero, one, two, and an arbitrary number of axes, respectively.
- Tensors can be sliced or reduced along specified axes via indexing, or via operations such as `sum` and `mean`, respectively.
- Elementwise products are called Hadamard products. By contrast, dot products, matrix-vector products, and matrix-matrix products are not elementwise operations and in general return objects that have shapes different from those of the operands.
- Compared to Hadamard products, matrix-matrix products take considerably longer to compute (cubic rather than quadratic time).
- Norms capture various notions of the magnitude of a vector, and are commonly applied to the difference of two vectors to measure their distance (see the short sketch after this summary).
- Common vector norms include the $\ell_1$ and $\ell_2$ norms, and common matrix norms include the spectral and Frobenius norms.
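As noted in the summary, a common use of norms is measuring the distance between two vectors as the norm of their difference; the following minimal sketch (with made-up vectors) computes the Euclidean and Manhattan distances.

```python
import torch

prediction = torch.tensor([2.0, -1.0, 0.5])
target = torch.tensor([1.5, -0.5, 0.0])

# Distances as norms of the difference vector: Euclidean (l2) and Manhattan (l1)
torch.norm(prediction - target), torch.abs(prediction - target).sum()
```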
- Prove that the transpose of the transpose of a matrix is the matrix itself: $(\mathbf{A}^\top)^\top = \mathbf{A}$.
- Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that sum and transposition commute: $\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$.
- Given any square matrix $\mathbf{A}$, is $\mathbf{A} + \mathbf{A}^\top$ always symmetric? Can you prove the result by using only the results of the previous two exercises?
- We defined the tensor `X` of shape (2, 3, 4) in this section. What is the output of `len(X)`? Write your answer without implementing any code, then check your answer using code.
- For a tensor `X` of arbitrary shape, does `len(X)` always correspond to the length of a certain axis of `X`? What is that axis?
- Run `A / A.sum(axis=1)` and see what happens. Can you analyze the reason?
- When traveling between two points in downtown Manhattan, what is the distance that you need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can you travel diagonally?
- Consider a tensor of shape (2, 3, 4). What are the shapes of the summation outputs along axes 0, 1, and 2?
- Feed a tensor with three or more axes to the `linalg.norm` function and observe its output. What does this function compute for tensors of arbitrary shape?
- Define three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{14}}$, for instance initialized with Gaussian random variables. You want to compute the product $\mathbf{A} \mathbf{B} \mathbf{C}$. Is there any difference in memory footprint and speed, depending on whether you compute $(\mathbf{A} \mathbf{B}) \mathbf{C}$ or $\mathbf{A} (\mathbf{B} \mathbf{C})$? Why?
- Define three large matrices, say $\mathbf{A} \in \mathbb{R}^{2^{10} \times 2^{16}}$, $\mathbf{B} \in \mathbb{R}^{2^{16} \times 2^{5}}$ and $\mathbf{C} \in \mathbb{R}^{2^{5} \times 2^{16}}$. Is there any difference in speed depending on whether you compute $\mathbf{A} \mathbf{B}$ or $\mathbf{A} \mathbf{C}^\top$? Why? What changes if you initialize $\mathbf{C} = \mathbf{B}^\top$ without cloning memory? Why?
- Define three matrices, say $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{100 \times 200}$. Constitute a tensor with three axes by stacking $[\mathbf{A}, \mathbf{B}, \mathbf{C}]$. What is the dimensionality? Slice out the second coordinate of the third axis to recover $\mathbf{B}$. Check that your answer is correct.
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: