:label:sec_linear-algebra
Now that you can store and manipulate data, let us briefly review the subset of basic linear algebra that you will need to understand and implement most of the models covered in this book. Below, we introduce the basic mathematical objects, arithmetic, and operations in linear algebra, expressing each of them through mathematical notation and the corresponding implementation in code.
If you never studied linear algebra or machine learning,
then your past experience with math probably consisted
of thinking about one number at a time.
And, if you ever balanced a checkbook
or even paid for dinner at a restaurant
then you already know how to do basic things
like adding and multiplying pairs of numbers.
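For example, the temperature in Palo Alto at any given moment is a single number.
Formally, we call values consisting of just one numerical quantity scalars.
In this book, we adopt the mathematical notation
where scalar variables are denoted
by ordinary lower-cased letters (e.g., $x$ and $y$).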
(A scalar is represented by a tensor with just one element.) In the next snippet, we instantiate two scalars and perform some familiar arithmetic operations with them, namely addition, multiplication, division, and exponentiation.
from mxnet import np, npx
npx.set_np()
x = np.array(3.0)
y = np.array(2.0)
x + y, x * y, x / y, x ** y
#@tab pytorch
import torch
x = torch.tensor(3.0)
y = torch.tensor(2.0)
x + y, x * y, x / y, x**y
#@tab tensorflow
import tensorflow as tf
x = tf.constant(3.0)
y = tf.constant(2.0)
x + y, x * y, x / y, x**y
[You can think of a vector as simply a list of scalar values.]
We call these values the elements (entries or components) of the vector.
When our vectors represent examples from our dataset,
their values hold some real-world significance.
For example, if we were training a model to predict
the risk that a loan defaults,
we might associate each applicant with a vector
whose components correspond to their income,
length of employment, number of previous defaults, and other factors.
If we were studying the risk of heart attacks hospital patients potentially face,
we might represent each patient by a vector
whose components capture their most recent vital signs,
cholesterol levels, minutes of exercise per day, etc.
In math notation, we will usually denote vectors as bold-faced,
lower-cased letters (e.g., $\mathbf{x}$).
We work with vectors via one-dimensional tensors. In general, tensors can have arbitrary lengths, subject to the memory limits of your machine.
x = np.arange(4)
x
#@tab pytorch
x = torch.arange(4)
x
#@tab tensorflow
x = tf.range(4)
x
We can refer to any element of a vector by using a subscript.
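For example, we can refer to the $i^\mathrm{th}$ element of $\mathbf{x}$ by $x_i$.
In math notation, a vector $\mathbf{x}$ with $n$ elements can be written as a column: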
$$\mathbf{x} =\begin{bmatrix}x_{1} \\ x_{2} \\ \vdots \\ x_{n}\end{bmatrix},$$
:eqlabel:eq_vec_def
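where $x_1, \ldots, x_n$ are elements of the vector.
In code, we access any element by indexing into the tensor.
Note that indices in code start at 0, so `x[3]` is the fourth element.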
x[3]
#@tab pytorch
x[3]
#@tab tensorflow
x[3]
Let us revisit some concepts from :numref:sec_ndarray.
A vector is just an array of numbers.
And just as every array has a length, so does every vector.
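In math notation, if we want to say that a vector $\mathbf{x}$
consists of $n$ real-valued scalars, we can express this as $\mathbf{x} \in \mathbb{R}^n$.
The length of a vector is commonly called the dimension of the vector.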
As with an ordinary Python array,
we [can access the length of a tensor]
by calling Python's built-in `len()` function.
len(x)
#@tab pytorch
len(x)
#@tab tensorflow
len(x)
When a tensor represents a vector (with precisely one axis),
we can also access its length via the `.shape` attribute.
The shape is a tuple that lists the length (dimensionality)
along each axis of the tensor.
(For tensors with just one axis, the shape has just one element.)
x.shape
#@tab pytorch
x.shape
#@tab tensorflow
x.shape
Note that the word "dimension" tends to get overloaded in these contexts and this tends to confuse people. To clarify, we use the dimensionality of a vector or an axis to refer to its length, i.e., the number of elements of a vector or an axis. However, we use the dimensionality of a tensor to refer to the number of axes that a tensor has. In this sense, the dimensionality of some axis of a tensor will be the length of that axis.
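The three senses of "dimension" distinguished above can be told apart in code. The following is a minimal sketch using plain NumPy (an assumption made for illustration; the framework tensors used in this section expose analogous attributes). Here `ndim` counts the axes and `shape` lists the length of each axis.

```python
import numpy as np

x = np.arange(4)                    # a vector: one axis of length 4
X = np.arange(24).reshape(2, 3, 4)  # a tensor with three axes

# "Dimensionality of a vector": the number of elements along its single axis.
vector_length = x.shape[0]

# "Dimensionality of a tensor": the number of axes it has.
num_axes_x = x.ndim
num_axes_X = X.ndim

# "Dimensionality of an axis": the length of that particular axis.
length_of_axis_1 = X.shape[1]

print(vector_length, num_axes_x, num_axes_X, length_of_axis_1)
```

Keeping `ndim` and `shape` apart in this way avoids most of the confusion that the overloaded word tends to cause.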
Just as vectors generalize scalars from order zero to order one,
matrices generalize vectors from order one to order two.
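Matrices, which we will typically denote with bold-faced, capital letters
(e.g., $\mathbf{A}$), are represented in code as tensors with two axes.
In math notation, we use $\mathbf{A} \in \mathbb{R}^{m \times n}$
to express that the matrix $\mathbf{A}$ consists of $m$ rows and $n$ columns of real-valued scalars.
We can illustrate any matrix as a table, where each element $a_{ij}$ belongs to the $i^{\mathrm{th}}$ row and $j^{\mathrm{th}}$ column: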
$$\mathbf{A}=\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix}.$$
:eqlabel:eq_matrix_def
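For any $\mathbf{A} \in \mathbb{R}^{m \times n}$, the shape of $\mathbf{A}$ is ($m$, $n$) or $m \times n$.
When a matrix has the same number of rows and columns, it is called a square matrix.
We can create an $m \times n$ matrix
by specifying a shape with two components $m$ and $n$
when calling any of our favorite functions for instantiating a tensor.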
A = np.arange(20).reshape(5, 4)
A
#@tab pytorch
A = torch.arange(20).reshape(5, 4)
A
#@tab tensorflow
A = tf.reshape(tf.range(20), (5, 4))
A
We can access the scalar element $a_{ij}$ of a matrix $\mathbf{A}$ in :eqref:eq_matrix_def
by specifying the indices for the row ($i$) and column ($j$),
such as $[\mathbf{A}]_{ij}$.
When the scalar elements of a matrix $\mathbf{A}$, such as in :eqref:eq_matrix_def, are not given,
we may simply use the lower-case letter of the matrix $\mathbf{A}$ with the index subscript, $a_{ij}$,
to refer to $[\mathbf{A}]_{ij}$.
To keep notation simple, commas are inserted to separate indices only when necessary,
such as $a_{2, 3j}$.
Sometimes, we want to flip the axes.
When we exchange a matrix's rows and columns,
the result is called the transpose of the matrix.
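Formally, we signify a matrix $\mathbf{A}$'s transpose by $\mathbf{A}^\top$,
and if $\mathbf{B} = \mathbf{A}^\top$, then $b_{ij} = a_{ji}$ for any $i$ and $j$.
Thus, the transpose of $\mathbf{A}$ in :eqref:eq_matrix_def
is
an $n \times m$ matrix:
$$\mathbf{A}^\top = \begin{bmatrix} a_{11} & a_{21} & \cdots & a_{m1} \\ a_{12} & a_{22} & \cdots & a_{m2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{1n} & a_{2n} & \cdots & a_{mn} \end{bmatrix}.$$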
Now we access a (matrix's transpose) in code.
A.T
#@tab pytorch
A.T
#@tab tensorflow
tf.transpose(A)
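As a special type of square matrix,
a symmetric matrix $\mathbf{A}$ is equal to its transpose:
$\mathbf{A} = \mathbf{A}^\top$.
Here we define a symmetric matrix `B`.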
B = np.array([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
B
#@tab pytorch
B = torch.tensor([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
B
#@tab tensorflow
B = tf.constant([[1, 2, 3], [2, 0, 4], [3, 4, 5]])
B
Now we compare `B` with its transpose.
B == B.T
#@tab pytorch
B == B.T
#@tab tensorflow
B == tf.transpose(B)
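The transpose relations can also be spot-checked numerically. The sketch below assumes plain NumPy and uses a small made-up square matrix `S`; it checks one entry of the transpose, the double transpose, and the symmetry of `S + S.T`. A numerical check like this is only a sanity test, not a proof.

```python
import numpy as np

A = np.arange(20).reshape(5, 4)

# Element (i, j) of the transpose equals element (j, i) of the original.
i, j = 1, 3
entry_matches = A.T[i, j] == A[j, i]

# Transposing twice returns the original matrix.
double_transpose_ok = np.array_equal(A.T.T, A)

# For this square matrix S, the sum S + S.T equals its own transpose.
S = np.arange(9).reshape(3, 3)
sym = S + S.T
is_symmetric = np.array_equal(sym, sym.T)

print(entry_matches, double_transpose_ok, is_symmetric)
```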
Matrices are useful data structures:
they allow us to organize data that have different modalities of variation.
For example, rows in our matrix might correspond to different houses (data examples),
while columns might correspond to different attributes.
This should sound familiar if you have ever used spreadsheet software or
have read :numref:sec_pandas.
Thus, although the default orientation of a single vector is a column vector,
in a matrix that represents a tabular dataset,
it is more conventional to treat each data example as a row vector in the matrix.
And, as we will see in later chapters,
this convention will enable common deep learning practices.
For example, along the outermost axis of a tensor,
we can access or enumerate minibatches of data examples,
or just data examples if no minibatch exists.
Just as vectors generalize scalars, and matrices generalize vectors, we can build data structures with even more axes.
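Tensors
("tensors" in this subsection refer to algebraic objects)
give us a generic way of describing $n$-dimensional arrays with an arbitrary number of axes.
Vectors, for example, are first-order tensors, and matrices are second-order tensors.
Tensors will become more important when we start working with images,
which arrive as $n$-dimensional arrays with 3 axes corresponding to the height, width, and a channel axis for stacking the color channels (red, green, and blue).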
X = np.arange(24).reshape(2, 3, 4)
X
#@tab pytorch
X = torch.arange(24).reshape(2, 3, 4)
X
#@tab tensorflow
X = tf.reshape(tf.range(24), (2, 3, 4))
X
Scalars, vectors, matrices, and tensors ("tensors" in this subsection refer to algebraic objects) of an arbitrary number of axes have some nice properties that often come in handy. For example, you might have noticed from the definition of an elementwise operation that any elementwise unary operation does not change the shape of its operand. Similarly, [given any two tensors with the same shape, the result of any binary elementwise operation will be a tensor of that same shape.] For example, adding two matrices of the same shape performs elementwise addition over these two matrices.
A = np.arange(20).reshape(5, 4)
B = A.copy() # Assign a copy of `A` to `B` by allocating new memory
A, A + B
#@tab pytorch
A = torch.arange(20, dtype=torch.float32).reshape(5, 4)
B = A.clone() # Assign a copy of `A` to `B` by allocating new memory
A, A + B
#@tab tensorflow
A = tf.reshape(tf.range(20, dtype=tf.float32), (5, 4))
B = A # No cloning of `A` to `B` by allocating new memory
A, A + B
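Specifically,
elementwise multiplication of two matrices is called their Hadamard product
(math notation $\odot$).
Consider a matrix $\mathbf{B} \in \mathbb{R}^{m \times n}$ whose element in row $i$ and column $j$ is $b_{ij}$.
The Hadamard product of matrices $\mathbf{A}$ (defined in :eqref:eq_matrix_def) and $\mathbf{B}$ is
$$\mathbf{A} \odot \mathbf{B} = \begin{bmatrix} a_{11} b_{11} & a_{12} b_{12} & \cdots & a_{1n} b_{1n} \\ a_{21} b_{21} & a_{22} b_{22} & \cdots & a_{2n} b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} b_{m1} & a_{m2} b_{m2} & \cdots & a_{mn} b_{mn} \end{bmatrix}.$$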
A * B
#@tab pytorch
A * B
#@tab tensorflow
A * B
[Multiplying or adding a tensor by a scalar] also does not change the shape of the tensor, where each element of the operand tensor will be added or multiplied by the scalar.
a = 2
X = np.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
#@tab pytorch
a = 2
X = torch.arange(24).reshape(2, 3, 4)
a + X, (a * X).shape
#@tab tensorflow
a = 2
X = tf.reshape(tf.range(24), (2, 3, 4))
a + X, (a * X).shape
:label:subseq_lin-alg-reduction
One useful operation that we can perform with arbitrary tensors
is to
calculate [the sum of their elements.]
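In mathematical notation, we express sums using the $\sum$ symbol.
To express the sum of the elements in a vector $\mathbf{x}$ of length $n$, we write $\sum_{i=1}^n x_i$.
In code, we can just call the function for calculating the sum.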
x = np.arange(4)
x, x.sum()
#@tab pytorch
x = torch.arange(4, dtype=torch.float32)
x, x.sum()
#@tab tensorflow
x = tf.range(4, dtype=tf.float32)
x, tf.reduce_sum(x)
We can express [sums over the elements of tensors of arbitrary shape.]
For example, the sum of the elements of an $m \times n$ matrix $\mathbf{A}$ could be written $\sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}$.
A.shape, A.sum()
#@tab pytorch
A.shape, A.sum()
#@tab tensorflow
A.shape, tf.reduce_sum(A)
By default, invoking the function for calculating the sum
reduces a tensor along all its axes to a scalar.
We can also [specify the axes along which the tensor is reduced via summation.]
Take matrices as an example.
To reduce the row dimension (axis 0) by summing up elements of all the rows,
we specify `axis=0` when invoking the function.
Since the input matrix reduces along axis 0 to generate the output vector,
the dimension of axis 0 of the input is lost in the output shape.
A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape
#@tab pytorch
A_sum_axis0 = A.sum(axis=0)
A_sum_axis0, A_sum_axis0.shape
#@tab tensorflow
A_sum_axis0 = tf.reduce_sum(A, axis=0)
A_sum_axis0, A_sum_axis0.shape
Specifying `axis=1` will reduce the column dimension (axis 1) by summing up elements of all the columns.
Thus, the dimension of axis 1 of the input is lost in the output shape.
A_sum_axis1 = A.sum(axis=1)
A_sum_axis1, A_sum_axis1.shape
#@tab pytorch
A_sum_axis1 = A.sum(axis=1)
A_sum_axis1, A_sum_axis1.shape
#@tab tensorflow
A_sum_axis1 = tf.reduce_sum(A, axis=1)
A_sum_axis1, A_sum_axis1.shape
Reducing a matrix along both rows and columns via summation is equivalent to summing up all the elements of the matrix.
A.sum(axis=[0, 1]) # Same as `A.sum()`
#@tab pytorch
A.sum(axis=[0, 1]) # Same as `A.sum()`
#@tab tensorflow
tf.reduce_sum(A, axis=[0, 1]) # Same as `tf.reduce_sum(A)`
[A related quantity is the mean, which is also called the average.] We calculate the mean by dividing the sum by the total number of elements. In code, we could just call the function for calculating the mean on tensors of arbitrary shape.
A.mean(), A.sum() / A.size
#@tab pytorch
A.mean(), A.sum() / A.numel()
#@tab tensorflow
tf.reduce_mean(A), tf.reduce_sum(A) / tf.size(A).numpy()
Likewise, the function for calculating the mean can also reduce a tensor along the specified axes.
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
#@tab pytorch
A.mean(axis=0), A.sum(axis=0) / A.shape[0]
#@tab tensorflow
tf.reduce_mean(A, axis=0), tf.reduce_sum(A, axis=0) / A.shape[0]
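To make the effect of the `axis` argument on shapes concrete, here is a minimal sketch in plain NumPy (an assumption for illustration; the framework calls above behave analogously). It rebuilds a 5 by 4 matrix like `A` and inspects what each reduction returns.

```python
import numpy as np

A = np.arange(20, dtype=np.float32).reshape(5, 4)

col_sums = A.sum(axis=0)  # collapses axis 0 (rows): one sum per column
row_sums = A.sum(axis=1)  # collapses axis 1 (columns): one sum per row

# The mean along an axis is the sum along that axis divided by its length.
col_means = A.mean(axis=0)

print(col_sums.shape, row_sums.shape)
print(col_sums, row_sums)
```

The axis named in the call is the one that disappears from the output shape, which is why `axis=0` yields 4 values and `axis=1` yields 5.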
:label:subseq_lin-alg-non-reduction
However, sometimes it can be useful to [keep the number of axes unchanged] when invoking the function for calculating the sum or mean.
sum_A = A.sum(axis=1, keepdims=True)
sum_A
#@tab pytorch
sum_A = A.sum(axis=1, keepdims=True)
sum_A
#@tab tensorflow
sum_A = tf.reduce_sum(A, axis=1, keepdims=True)
sum_A
For instance,
since `sum_A` still keeps its two axes after summing each row, we can divide `A` by `sum_A` with broadcasting.
A / sum_A
#@tab pytorch
A / sum_A
#@tab tensorflow
A / sum_A
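The difference that `keepdims=True` makes is easiest to see in the shapes. The sketch below assumes plain NumPy and a 5 by 4 matrix built like `A`; the names `reduced`, `kept`, and `normalized` are introduced here only for illustration.

```python
import numpy as np

A = np.arange(20, dtype=np.float32).reshape(5, 4)

reduced = A.sum(axis=1)              # shape (5,): axis 1 is dropped
kept = A.sum(axis=1, keepdims=True)  # shape (5, 1): axis 1 is kept with length 1

# Broadcasting stretches the length-1 axis of `kept` across the 4 columns,
# so every row of `A` is divided by its own sum.
normalized = A / kept

print(reduced.shape, kept.shape, normalized.shape)
print(normalized.sum(axis=1))  # each row now sums to approximately 1
```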
If we want to calculate the cumulative sum of elements of `A` along some axis, say `axis=0` (row by row),
we can call the `cumsum` function. This function will not reduce the input tensor along any axis.
A.cumsum(axis=0)
#@tab pytorch
A.cumsum(axis=0)
#@tab tensorflow
tf.cumsum(A, axis=0)
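As a quick check of the claim that the cumulative sum does not reduce any axis, the following sketch (plain NumPy assumed, with the illustrative name `running`) compares shapes and confirms that the last row of the running total equals the ordinary sum along axis 0.

```python
import numpy as np

A = np.arange(20, dtype=np.float32).reshape(5, 4)
running = A.cumsum(axis=0)  # running total down the rows

# The shape is unchanged, and the last row holds the full sum along axis 0.
print(running.shape)
print(running[-1], A.sum(axis=0))
```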
So far, we have only performed elementwise operations, sums, and averages. And if this was all we could do, linear algebra probably would not deserve its own section. However, one of the most fundamental operations is the dot product.
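Given two vectors $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$,
their dot product $\mathbf{x}^\top \mathbf{y}$ is a sum over the products of the elements at the same position:
$\mathbf{x}^\top \mathbf{y} = \sum_{i=1}^{n} x_i y_i$.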
y = np.ones(4)
x, y, np.dot(x, y)
#@tab pytorch
y = torch.ones(4, dtype=torch.float32)
x, y, torch.dot(x, y)
#@tab tensorflow
y = tf.ones(4, dtype=tf.float32)
x, y, tf.tensordot(x, y, axes=1)
Note that (we can express the dot product of two vectors equivalently by performing an elementwise multiplication and then a sum:)
np.sum(x * y)
#@tab pytorch
torch.sum(x * y)
#@tab tensorflow
tf.reduce_sum(x * y)
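As one illustration of how dot products get used, the sketch below computes a weighted average and the cosine of the angle between two vectors. The values, weights, and vectors are made up for the example, and plain NumPy is assumed rather than any of the three frameworks.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0])     # hypothetical measurements
weights = np.array([0.5, 0.25, 0.25])  # hypothetical weights: non-negative, sum to 1

# With such weights, the dot product is a weighted average of the values.
weighted_average = np.dot(values, weights)

# After scaling two vectors to unit length, their dot product is the
# cosine of the angle between them.
u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])
u_unit = u / np.sqrt(np.dot(u, u))
v_unit = v / np.sqrt(np.dot(v, v))
cosine = np.dot(u_unit, v_unit)

print(weighted_average, cosine)
```

Here the two vectors are 45 degrees apart, so the cosine comes out close to 0.707.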
Dot products are useful in a wide range of contexts.
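For example, given some set of values,
denoted by a vector $\mathbf{x} \in \mathbb{R}^n$, and a set of weights denoted by $\mathbf{w} \in \mathbb{R}^n$,
the weighted sum of the values in $\mathbf{x}$ according to the weights $\mathbf{w}$
can be expressed as the dot product $\mathbf{x}^\top \mathbf{w}$.
When the weights are non-negative and sum to one, the dot product expresses a weighted average.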
Now that we know how to calculate dot products,
we can begin to understand matrix-vector products.
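Recall the matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$
and the vector $\mathbf{x} \in \mathbb{R}^n$
defined in :eqref:eq_matrix_def
and :eqref:eq_vec_def
respectively.
Let us start off by visualizing the matrix $\mathbf{A}$ in terms of its row vectors
$$\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix},$$
where each $\mathbf{a}^\top_{i} \in \mathbb{R}^n$ is a row vector representing the $i^\mathrm{th}$ row of the matrix $\mathbf{A}$.
The matrix-vector product $\mathbf{A}\mathbf{x}$ is simply a column vector of length $m$,
whose $i^\mathrm{th}$ element is the dot product $\mathbf{a}^\top_i \mathbf{x}$:
$$\mathbf{A}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_m \\ \end{bmatrix}\mathbf{x} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{x} \\ \mathbf{a}^\top_{2} \mathbf{x} \\ \vdots \\ \mathbf{a}^\top_{m} \mathbf{x} \\ \end{bmatrix}.$$
We can think of multiplication by a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ as a transformation that projects vectors from $\mathbb{R}^{n}$ to $\mathbb{R}^{m}$.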
:begin_tab:mxnet
Expressing matrix-vector products in code with tensors,
we use the same `dot` function as for dot products.
When we call `np.dot(A, x)` with a matrix `A` and a vector `x`,
the matrix-vector product is performed.
Note that the column dimension of `A` (its length along axis 1)
must be the same as the dimension of `x` (its length).
:end_tab:
:begin_tab:pytorch
Expressing matrix-vector products in code with tensors, we use
the `mv` function. When we call `torch.mv(A, x)` with a matrix
`A` and a vector `x`, the matrix-vector product is performed.
Note that the column dimension of `A` (its length along axis 1)
must be the same as the dimension of `x` (its length).
:end_tab:
:begin_tab:tensorflow
Expressing matrix-vector products in code with tensors, we use
the `matvec` function. When we call `tf.linalg.matvec(A, x)` with a
matrix `A` and a vector `x`, the matrix-vector product is
performed. Note that the column dimension of `A` (its length along axis 1)
must be the same as the dimension of `x` (its length).
:end_tab:
A.shape, x.shape, np.dot(A, x)
#@tab pytorch
A.shape, x.shape, torch.mv(A, x)
#@tab tensorflow
A.shape, x.shape, tf.linalg.matvec(A, x)
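The view of a matrix-vector product as one dot product per row can be written out explicitly. The following sketch assumes plain NumPy and rebuilds `A` and `x` with the same values as above; `row_by_row` is a name introduced here for illustration.

```python
import numpy as np

A = np.arange(20, dtype=np.float32).reshape(5, 4)
x = np.arange(4, dtype=np.float32)

# One dot product per row of `A`, stacked into a vector of length 5.
row_by_row = np.array([np.dot(row, x) for row in A])

# The library routine computes the same matrix-vector product in one call.
direct = np.dot(A, x)

print(row_by_row)
print(direct.shape)
```

The explicit loop is only meant to mirror the definition; in practice the single library call is the idiomatic and much faster choice.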
If you have gotten the hang of dot products and matrix-vector products, then matrix-matrix multiplication should be straightforward.
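Say that we have two matrices $\mathbf{A} \in \mathbb{R}^{n \times k}$ and $\mathbf{B} \in \mathbb{R}^{k \times m}$.
Denote by $\mathbf{a}^\top_{i} \in \mathbb{R}^k$ the row vector representing the $i^\mathrm{th}$ row of the matrix $\mathbf{A}$, and let $\mathbf{b}_{j} \in \mathbb{R}^k$ be the column vector from the $j^\mathrm{th}$ column of the matrix $\mathbf{B}$.
To produce the matrix product $\mathbf{C} = \mathbf{AB}$, it is easiest to think of $\mathbf{A}$ in terms of its row vectors and $\mathbf{B}$ in terms of its column vectors: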
$$\mathbf{A}= \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix}, \quad \mathbf{B}=\begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix}. $$
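Then the matrix product $\mathbf{C} \in \mathbb{R}^{n \times m}$ is produced by computing each element $c_{ij}$ as the dot product $\mathbf{a}^\top_i \mathbf{b}_j$: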
$$\mathbf{C} = \mathbf{AB} = \begin{bmatrix} \mathbf{a}^\top_{1} \\ \mathbf{a}^\top_{2} \\ \vdots \\ \mathbf{a}^\top_n \\ \end{bmatrix} \begin{bmatrix} \mathbf{b}_{1} & \mathbf{b}_{2} & \cdots & \mathbf{b}_{m} \\ \end{bmatrix} = \begin{bmatrix} \mathbf{a}^\top_{1} \mathbf{b}_1 & \mathbf{a}^\top_{1}\mathbf{b}_2 & \cdots & \mathbf{a}^\top_{1} \mathbf{b}_m \\ \mathbf{a}^\top_{2}\mathbf{b}_1 & \mathbf{a}^\top_{2} \mathbf{b}_2 & \cdots & \mathbf{a}^\top_{2} \mathbf{b}_m \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{a}^\top_{n} \mathbf{b}_1 & \mathbf{a}^\top_{n}\mathbf{b}_2 & \cdots & \mathbf{a}^\top_{n} \mathbf{b}_m \end{bmatrix}. $$
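We can think of the matrix-matrix multiplication $\mathbf{AB}$ as simply performing $m$ matrix-vector products and stitching the results together to form an $n \times m$ matrix.
In the following snippet, we perform matrix multiplication on `A` and `B`.
Here, `A` is a matrix with 5 rows and 4 columns,
and `B` is a matrix with 4 rows and 3 columns.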
After multiplication, we obtain a matrix with 5 rows and 3 columns.
B = np.ones(shape=(4, 3))
np.dot(A, B)
#@tab pytorch
B = torch.ones(4, 3)
torch.mm(A, B)
#@tab tensorflow
B = tf.ones((4, 3), tf.float32)
tf.matmul(A, B)
Matrix-matrix multiplication can be simply called matrix multiplication, and should not be confused with the Hadamard product.
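To underline the difference between matrix multiplication and the Hadamard product, here is a minimal sketch in plain NumPy (assumed for illustration). It also rebuilds the matrix product column by column from matrix-vector products; the names `stitched` and `H` are introduced only for this example.

```python
import numpy as np

A = np.arange(20, dtype=np.float32).reshape(5, 4)
B = np.ones((4, 3), dtype=np.float32)

# Matrix multiplication: (5, 4) times (4, 3) gives (5, 3).
C = np.matmul(A, B)

# Column j of C is the matrix-vector product of A with column j of B.
stitched = np.stack([np.dot(A, B[:, j]) for j in range(B.shape[1])], axis=1)

# The Hadamard product needs operands of the same shape and keeps that shape.
H = A * A

print(C.shape, H.shape)
print(np.allclose(C, stitched))
```

Matrix multiplication contracts the shared inner axis and can change the shape, whereas the Hadamard product acts entry by entry and leaves the shape untouched.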
:label:subsec_lin-algebra-norms
Some of the most useful operators in linear algebra are norms. Informally, the norm of a vector tells us how big a vector is. The notion of size under consideration here concerns not dimensionality but rather the magnitude of the components.
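In linear algebra, a vector norm is a function $f$ that maps a vector
to a scalar, satisfying a handful of properties.
Given any vector $\mathbf{x}$, the first property says
that if we scale all the elements of a vector by a constant factor $\alpha$,
its norm also scales by the absolute value of the same constant factor:
$$f(\alpha \mathbf{x}) = |\alpha| f(\mathbf{x}).$$
The second property is the familiar triangle inequality:
$$f(\mathbf{x} + \mathbf{y}) \leq f(\mathbf{x}) + f(\mathbf{y}).$$
The third property simply says that the norm must be non-negative:
$$f(\mathbf{x}) \geq 0.$$
That makes sense, as in most contexts the smallest size for anything is 0. The final property requires that the smallest norm is achieved and only achieved by a vector consisting of all zeros:
$$\forall i, [\mathbf{x}]_i = 0 \Leftrightarrow f(\mathbf{x})=0.$$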
You might notice that norms sound a lot like measures of distance.
And if you remember Euclidean distances
(think Pythagoras' theorem) from grade school,
then the concepts of non-negativity and the triangle inequality might ring a bell.
In fact, the Euclidean distance is a norm:
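specifically it is the $L_2$ norm.
Suppose that the elements in the $n$-dimensional vector $\mathbf{x}$ are $x_1, \ldots, x_n$.
The $L_2$ norm of $\mathbf{x}$ is the square root of the sum of the squares of the vector elements:
$$\|\mathbf{x}\|_2 = \sqrt{\sum_{i=1}^n x_i^2},$$
where the subscript $2$ is often omitted in $L_2$ norms, i.e., $\|\mathbf{x}\|$ is equivalent to $\|\mathbf{x}\|_2$.
In code, we can calculate the $L_2$ norm of a vector as follows.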
u = np.array([3, -4])
np.linalg.norm(u)
#@tab pytorch
u = torch.tensor([3.0, -4.0])
torch.norm(u)
#@tab tensorflow
u = tf.constant([3.0, -4.0])
tf.norm(u)
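The defining properties of a norm listed earlier can be spot-checked numerically for the Euclidean norm. The sketch below assumes plain NumPy and its `np.linalg.norm` function; the vectors and the scaling factor are arbitrary choices made for illustration, and a numerical check is not a proof.

```python
import numpy as np

x = np.array([3.0, -4.0])
y = np.array([1.0, 2.0])
alpha = -2.5  # arbitrary scaling factor

norm = np.linalg.norm  # Euclidean norm for vectors by default

# Scaling a vector scales its norm by the absolute value of the factor.
scales = np.isclose(norm(alpha * x), abs(alpha) * norm(x))

# Triangle inequality.
triangle = norm(x + y) <= norm(x) + norm(y)

# Non-negativity, with zero reached by the all-zeros vector.
non_negative = norm(x) >= 0
zero_at_origin = norm(np.zeros(2)) == 0

print(scales, triangle, non_negative, zero_at_origin)
```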
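In deep learning, we work more often
with the squared $L_2$ norm.
You will also frequently encounter the $L_1$ norm,
which is expressed as the sum of the absolute values of the vector elements:
$$\|\mathbf{x}\|_1 = \sum_{i=1}^n \left|x_i \right|.$$
As compared with the $L_2$ norm, it is less influenced by outliers.
To calculate the $L_1$ norm, we compose the absolute value function with a sum over the elements.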
np.abs(u).sum()
#@tab pytorch
torch.abs(u).sum()
#@tab tensorflow
tf.reduce_sum(tf.abs(u))
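Both the $L_2$ norm and the $L_1$ norm
are special cases of the more general $L_p$ norm:
$$\|\mathbf{x}\|_p = \left(\sum_{i=1}^n \left|x_i \right|^p \right)^{1/p}.$$
Analogous to $L_2$ norms of vectors,
the Frobenius norm of a matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$
is the square root of the sum of the squares of the matrix elements:
$$\|\mathbf{X}\|_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n x_{ij}^2}.$$
The Frobenius norm satisfies all the properties of vector norms.
It behaves as if it were an $L_2$ norm of a matrix-shaped vector.
Invoking the following function will calculate the Frobenius norm of a matrix.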
np.linalg.norm(np.ones((4, 9)))
#@tab pytorch
torch.norm(torch.ones((4, 9)))
#@tab tensorflow
tf.norm(tf.ones((4, 9)))
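The relationship between these norms can be sketched directly from the $L_p$ formula. The helper `lp_norm` below is a hypothetical function written for this illustration (it is not part of any library), and plain NumPy is assumed.

```python
import numpy as np

def lp_norm(x, p):
    """L_p norm of a vector, written directly from the definition."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

u = np.array([3.0, -4.0])
l1 = lp_norm(u, 1)  # sum of absolute values
l2 = lp_norm(u, 2)  # square root of the sum of squares

# The Frobenius norm of a matrix equals the L2 norm of its flattened entries.
X = np.ones((4, 9))
frobenius = np.linalg.norm(X)
flattened_l2 = lp_norm(X.reshape(-1), 2)

print(l1, l2, frobenius, flattened_l2)
```

Setting `p` to 1 or 2 recovers the two norms computed earlier for `u`, and flattening the matrix shows why the Frobenius norm behaves like a vector norm.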
:label:subsec_norms_and_objectives
While we do not want to get too far ahead of ourselves, we can plant some intuition already about why these concepts are useful. In deep learning, we are often trying to solve optimization problems: maximize the probability assigned to observed data; minimize the distance between predictions and the ground-truth observations. Assign vector representations to items (like words, products, or news articles) such that the distance between similar items is minimized, and the distance between dissimilar items is maximized. Oftentimes, the objectives, perhaps the most important components of deep learning algorithms (besides the data), are expressed as norms.
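As a small, hedged illustration of an objective expressed as a norm, the sketch below measures the distance between predictions and ground-truth observations. The numbers are made up, plain NumPy is assumed, and the choice between the squared $L_2$ norm and the $L_1$ norm is shown only as an example, not as a recommendation.

```python
import numpy as np

predictions = np.array([2.5, 0.0, 2.0, 8.0])  # made-up model outputs
targets = np.array([3.0, -0.5, 2.0, 7.0])     # made-up ground truth

residual = predictions - targets

squared_l2 = np.sum(residual ** 2)  # squared L2 norm of the residual
l1 = np.sum(np.abs(residual))       # L1 norm of the residual

print(squared_l2, l1)
```

Minimizing either quantity pushes the predictions toward the targets; the two differ in how strongly they penalize large individual errors.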
In just this section, we have taught you all the linear algebra that you will need to understand a remarkable chunk of modern deep learning. There is a lot more to linear algebra and a lot of that mathematics is useful for machine learning. For example, matrices can be decomposed into factors, and these decompositions can reveal low-dimensional structure in real-world datasets. There are entire subfields of machine learning that focus on using matrix decompositions and their generalizations to high-order tensors to discover structure in datasets and solve prediction problems. But this book focuses on deep learning. And we believe you will be much more inclined to learn more mathematics once you have gotten your hands dirty deploying useful machine learning models on real datasets. So while we reserve the right to introduce more mathematics much later on, we will wrap up this section here.
If you are eager to learn more about linear algebra,
you may refer to either the
online appendix on linear algebraic operations
or other excellent resources :cite:Strang.1993,Kolter.2008,Petersen.Pedersen.ea.2008.
- Scalars, vectors, matrices, and tensors are basic mathematical objects in linear algebra.
- Vectors generalize scalars, and matrices generalize vectors.
- Scalars, vectors, matrices, and tensors have zero, one, two, and an arbitrary number of axes, respectively.
- A tensor can be reduced along the specified axes by `sum` and `mean`.
- Elementwise multiplication of two matrices is called their Hadamard product. It is different from matrix multiplication.
- In deep learning, we often work with norms such as the $L_1$ norm, the $L_2$ norm, and the Frobenius norm.
- We can perform a variety of operations over scalars, vectors, matrices, and tensors.
- Prove that the transpose of a matrix $\mathbf{A}$'s transpose is $\mathbf{A}$: $(\mathbf{A}^\top)^\top = \mathbf{A}$.
- Given two matrices $\mathbf{A}$ and $\mathbf{B}$, show that the sum of transposes is equal to the transpose of a sum: $\mathbf{A}^\top + \mathbf{B}^\top = (\mathbf{A} + \mathbf{B})^\top$.
- Given any square matrix $\mathbf{A}$, is $\mathbf{A} + \mathbf{A}^\top$ always symmetric? Why?
- We defined the tensor `X` of shape (2, 3, 4) in this section. What is the output of `len(X)`?
- For a tensor `X` of arbitrary shape, does `len(X)` always correspond to the length of a certain axis of `X`? What is that axis?
- Run `A / A.sum(axis=1)` and see what happens. Can you analyze the reason?
- When traveling between two points in Manhattan, what is the distance that you need to cover in terms of the coordinates, i.e., in terms of avenues and streets? Can you travel diagonally?
- Consider a tensor with shape (2, 3, 4). What are the shapes of the summation outputs along axis 0, 1, and 2?
- Feed a tensor with 3 or more axes to the `linalg.norm` function and observe its output. What does this function compute for tensors of arbitrary shape?