🏷️sec_eigendecompositions
Eigenvalues are often one of the most useful notions we will encounter when studying linear algebra; however, as a beginner, it is easy to overlook their importance. Below, we introduce eigendecomposition and try to convey some sense of just why it is so important.
Suppose that we have a matrix $\mathbf{A}$ with the following entries:

$$
\mathbf{A} = \begin{bmatrix}
2 & 0 \\
0 & -1
\end{bmatrix}.
$$

If we apply $\mathbf{A}$ to any vector $\mathbf{v} = [x, y]^\top$, we obtain a vector $\mathbf{A}\mathbf{v} = [2x, -y]^\top$. This has an intuitive interpretation: stretch the vector to be twice as wide in the $x$-direction, and then flip it in the $y$-direction.

However, there are some vectors for which something remains unchanged.
Namely $[1, 0]^\top$ gets sent to $[2, 0]^\top$ and $[0, 1]^\top$ gets sent to $[0, -1]^\top$. These vectors are still in the same line, and the only modification is that the matrix stretches them by a factor of $2$ and $-1$ respectively. We call such vectors *eigenvectors* and the factors they are stretched by *eigenvalues*.

In general, if we can find a number $\lambda$ and a vector $\mathbf{v}$ such that

$$\mathbf{A}\mathbf{v} = \lambda \mathbf{v},$$

we say that $\mathbf{v}$ is an *eigenvector* for $\mathbf{A}$ and $\lambda$ is an *eigenvalue*.
Let's figure out how to find them. By subtracting off $\lambda \mathbf{v}$ from both sides, and then factoring out the vector, we see the above is equivalent to:

$$(\mathbf{A} - \lambda \mathbf{I})\mathbf{v} = 0.$$
:eqlabel:`eq_eigvalue_der`

For :eqref:`eq_eigvalue_der` to have a nonzero solution $\mathbf{v}$, we see that $(\mathbf{A} - \lambda \mathbf{I})$ must compress some direction down to zero, hence it is not invertible, and thus its determinant is zero. Thus, we can find the *eigenvalues* by finding for what $\lambda$ we have $\det(\mathbf{A}-\lambda \mathbf{I}) = 0$. Once we find the eigenvalues, we can solve $\mathbf{A}\mathbf{v} = \lambda \mathbf{v}$ to find the associated *eigenvector(s)*.
Let's see this with a more challenging matrix

$$
\mathbf{A} = \begin{bmatrix}
2 & 1 \\
2 & 3
\end{bmatrix}.
$$

If we consider $\det(\mathbf{A}-\lambda \mathbf{I}) = 0$, we see this is equivalent to the polynomial equation $0 = (2-\lambda)(3-\lambda) - 2 = (4-\lambda)(1-\lambda)$. Thus, two eigenvalues are $4$ and $1$. To find the associated vectors, we then need to solve

$$
\begin{bmatrix}
2 & 1 \\
2 & 3
\end{bmatrix}
\begin{bmatrix}x \\ y\end{bmatrix} = \begin{bmatrix}x \\ y\end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix}
2 & 1 \\
2 & 3
\end{bmatrix}
\begin{bmatrix}x \\ y\end{bmatrix} = \begin{bmatrix}4x \\ 4y\end{bmatrix}.
$$

We can solve this with the vectors $[1, -1]^\top$ and $[1, 2]^\top$ respectively.
We can check this in code using the built-in `numpy.linalg.eig` routine.
#@tab mxnet
%matplotlib inline
from d2l import mxnet as d2l
from IPython import display
import numpy as np
np.linalg.eig(np.array([[2, 1], [2, 3]]))
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
from IPython import display
import torch
torch.linalg.eig(torch.tensor([[2, 1], [2, 3]], dtype=torch.float64))
#@tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
from IPython import display
import tensorflow as tf
tf.linalg.eig(tf.constant([[2, 1], [2, 3]], dtype=tf.float64))
Note that numpy
normalizes the eigenvectors to be of length one,
whereas we took ours to be of arbitrary length.
Additionally, the choice of sign is arbitrary.
However, the vectors computed are parallel
to the ones we found by hand with the same eigenvalues.
Let's continue the previous example one step further. Let $\mathbf{W}$ be the matrix whose columns are the eigenvectors of the matrix $\mathbf{A}$, and let $\boldsymbol{\Sigma}$ be the matrix with the associated eigenvalues on the diagonal. Then the definition of eigenvalues and eigenvectors tells us that

$$\mathbf{A}\mathbf{W} = \mathbf{W}\boldsymbol{\Sigma}.$$

The matrix $\mathbf{W}$ is invertible, so we may multiply both sides by $\mathbf{W}^{-1}$ on the right, and we see that we may write

$$\mathbf{A} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^{-1}.$$
:eqlabel:`eq_eig_decomp`
In the next section we will see some nice consequences of this, but for now we need only know that such a decomposition will exist as long as we can find a full collection of linearly independent eigenvectors (so that $\mathbf{W}$ is invertible).
One nice thing about eigendecompositions :eqref:`eq_eig_decomp` is that we can write many operations we usually encounter cleanly in terms of the eigendecomposition. As a first example, consider:

$$
\mathbf{A}^n = \overbrace{\mathbf{A}\cdots \mathbf{A}}^{n\text{ times}} = \overbrace{(\mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^{-1})\cdots(\mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^{-1})}^{n\text{ times}} = \mathbf{W}\overbrace{\boldsymbol{\Sigma}\cdots\boldsymbol{\Sigma}}^{n\text{ times}}\mathbf{W}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^n\mathbf{W}^{-1}.
$$

This tells us that for any positive power of a matrix, the eigendecomposition is obtained by just raising the eigenvalues to the same power. The same can be shown for negative powers, so if we want to invert a matrix we need only consider

$$\mathbf{A}^{-1} = \mathbf{W}\boldsymbol{\Sigma}^{-1}\mathbf{W}^{-1},$$

or in other words, just invert each eigenvalue. This will work as long as each eigenvalue is non-zero, so we see that invertible is the same as having no zero eigenvalues.
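As a quick sanity check (a minimal numpy sketch, not from the original text), we can verify these identities numerically for the $2\times 2$ matrix from before; the names `W` and `Sigma` simply mirror the notation above.

```python
import numpy as np

A = np.array([[2.0, 1.0], [2.0, 3.0]])
lam, W = np.linalg.eig(A)          # eigenvalues and eigenvector columns
Sigma = np.diag(lam)
W_inv = np.linalg.inv(W)

# A is recovered from its eigendecomposition: A = W Sigma W^{-1}
print(np.allclose(A, W @ Sigma @ W_inv))

# Positive powers: raise the eigenvalues to the same power
print(np.allclose(np.linalg.matrix_power(A, 3), W @ np.diag(lam**3) @ W_inv))

# The inverse: invert each (nonzero) eigenvalue
print(np.allclose(np.linalg.inv(A), W @ np.diag(1 / lam) @ W_inv))
```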
Indeed, additional work can show that if $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of a matrix, then the determinant of that matrix is

$$\det(\mathbf{A}) = \lambda_1 \cdots \lambda_n,$$

or the product of all the eigenvalues. This makes sense intuitively because whatever stretching $\mathbf{W}$ does, $\mathbf{W}^{-1}$ undoes, so in the end the only stretching that happens is by multiplication by the diagonal matrix $\boldsymbol{\Sigma}$, which stretches volumes by the product of the diagonal elements.
Finally, recall that the rank was the maximum number of linearly independent columns of your matrix. By examining the eigendecomposition closely, we can see that the rank is the same as the number of non-zero eigenvalues of $\mathbf{A}$.
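Both of these facts can be checked numerically. The following is a small sketch (plain numpy, with a rank-one example `B` chosen for illustration), not part of the original text.

```python
import numpy as np

A = np.array([[2.0, 1.0], [2.0, 3.0]])
lam = np.linalg.eigvals(A)

# The determinant equals the product of the eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(lam)))

# A diagonalizable but singular example: a symmetric rank-one matrix
B = np.outer([1.0, 2.0], [1.0, 2.0])          # [[1, 2], [2, 4]]
lam_B = np.linalg.eigvals(B)

# For such matrices the rank equals the number of nonzero eigenvalues
print(np.linalg.matrix_rank(B), int(np.sum(~np.isclose(lam_B, 0))))
```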
The examples could continue, but hopefully the point is clear: eigendecomposition can simplify many linear-algebraic computations and is a fundamental operation underlying many numerical algorithms and much of the analysis that we do in linear algebra.
It is not always possible to find enough linearly independent eigenvectors for the above process to work. For instance the matrix

$$
\mathbf{A} = \begin{bmatrix}
1 & 1 \\
0 & 1
\end{bmatrix}
$$

has only a single eigenvector, namely $(1, 0)^\top$. To handle such matrices, we require more advanced techniques than we can cover here (such as the Jordan normal form or the singular value decomposition), so we will often restrict our attention to those matrices where we can guarantee the existence of a full set of eigenvectors.
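We can see the problem numerically as well. In the sketch below (assuming the matrix above), `numpy` still returns two eigenvector columns, but they are numerically parallel, so the eigenvector matrix is not invertible and :eqref:`eq_eig_decomp` is unavailable.

```python
import numpy as np

# A defective matrix: the eigenvalue 1 is repeated,
# but there is only one eigenvector direction
A = np.array([[1.0, 1.0], [0.0, 1.0]])
lam, W = np.linalg.eig(A)
print(lam)   # [1. 1.]
print(W)     # both columns are (numerically) multiples of [1, 0]

# W is numerically singular, so W^{-1} does not exist
print(np.linalg.matrix_rank(W))   # expect 1
```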
The most commonly encountered family are the symmetric matrices, which are those matrices where $\mathbf{A} = \mathbf{A}^\top$. In this case, we may take $\mathbf{W}$ to be an *orthogonal matrix*, that is, a matrix whose columns are mutually perpendicular unit-length vectors so that $\mathbf{W}^\top = \mathbf{W}^{-1}$, and all the eigenvalues will be real. Thus, for symmetric matrices we may write :eqref:`eq_eig_decomp` as

$$\mathbf{A} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^\top.$$
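For instance, here is a minimal numpy sketch (with an arbitrary symmetric matrix chosen for illustration) checking that the eigenvector matrix returned by the symmetric routine `numpy.linalg.eigh` is orthogonal and that $\mathbf{A} = \mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^\top$.

```python
import numpy as np

# An arbitrary symmetric matrix
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
lam, W = np.linalg.eigh(A)   # symmetric-aware routine; eigenvalues are real

# The eigenvector matrix is orthogonal: W^T W = I
print(np.allclose(W.T @ W, np.eye(3)))

# For symmetric matrices, W^T plays the role of W^{-1}
print(np.allclose(A, W @ np.diag(lam) @ W.T))
```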
Eigenvalues are often difficult to reason about intuitively. If presented with an arbitrary matrix, there is little that can be said about what the eigenvalues are without computing them. There is, however, one theorem, the *Gershgorin circle theorem*, that can make them easy to approximate well when the largest values are on the diagonal.
Let $\mathbf{A} = (a_{ij})$ be any square matrix ($n\times n$). Define $r_i = \sum_{j \neq i} |a_{ij}|$, and let $\mathcal{D}_i$ be the disc in the complex plane with center $a_{ii}$ and radius $r_i$. Then, every eigenvalue of $\mathbf{A}$ is contained in at least one of the $\mathcal{D}_i$.
This can be a bit much to unpack, so let's look at an example. Consider the matrix:

$$
\mathbf{A} = \begin{bmatrix}
1.0 & 0.1 & 0.1 & 0.1 \\
0.1 & 3.0 & 0.2 & 0.3 \\
0.1 & 0.2 & 5.0 & 0.5 \\
0.1 & 0.3 & 0.5 & 9.0
\end{bmatrix}.
$$

We have $r_1 = 0.3$, $r_2 = 0.6$, $r_3 = 0.8$, and $r_4 = 0.9$. The matrix is symmetric, so all the eigenvalues are real. This means that every eigenvalue will lie in one of the ranges $[a_{ii}-r_i, a_{ii}+r_i]$, namely $[0.7, 1.3]$, $[2.4, 3.6]$, $[4.2, 5.8]$, and $[8.1, 9.9]$.
Performing the numerical computation shows that the eigenvalues are approximately $0.99$, $2.97$, $4.95$, and $9.08$, all comfortably inside the ranges provided.
#@tab mxnet
A = np.array([[1.0, 0.1, 0.1, 0.1],
[0.1, 3.0, 0.2, 0.3],
[0.1, 0.2, 5.0, 0.5],
[0.1, 0.3, 0.5, 9.0]])
v, _ = np.linalg.eig(A)
v
#@tab pytorch
A = torch.tensor([[1.0, 0.1, 0.1, 0.1],
[0.1, 3.0, 0.2, 0.3],
[0.1, 0.2, 5.0, 0.5],
[0.1, 0.3, 0.5, 9.0]])
v, _ = torch.linalg.eigh(A)
v
#@tab tensorflow
A = tf.constant([[1.0, 0.1, 0.1, 0.1],
[0.1, 3.0, 0.2, 0.3],
[0.1, 0.2, 5.0, 0.5],
[0.1, 0.3, 0.5, 9.0]])
v, _ = tf.linalg.eigh(A)
v
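To connect the computation back to the theorem, here is a short sketch (plain numpy, using the same matrix) that computes the Gershgorin radii $r_i$ and the resulting intervals directly:

```python
import numpy as np

A = np.array([[1.0, 0.1, 0.1, 0.1],
              [0.1, 3.0, 0.2, 0.3],
              [0.1, 0.2, 5.0, 0.5],
              [0.1, 0.3, 0.5, 9.0]])

# Gershgorin radius of row i: the sum of the off-diagonal absolute values
d = np.diag(A)
r = np.abs(A).sum(axis=1) - np.abs(d)
print(r)   # [0.3 0.6 0.8 0.9]

# Since A is symmetric, every eigenvalue lies in one of these real intervals
for center, radius in zip(d, r):
    print(f'[{center - radius:.1f}, {center + radius:.1f}]')
```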
In this way, eigenvalues can be approximated, and the approximations will be fairly accurate in the case that the diagonal is significantly larger than all the other elements.
It is a small thing, but with a complex and subtle topic like eigendecomposition, it is good to get any intuitive grasp we can.
Now that we understand what eigenvectors are in principle, let's see how they can be used to provide a deep understanding of a problem central to neural network behavior: proper weight initialization.
The full mathematical investigation of the initialization
of deep neural networks is beyond the scope of the text,
but we can see a toy version here to understand
how eigenvalues can help us see how these models work.
As we know, neural networks operate by interspersing layers
of linear transformations with non-linear operations.
For simplicity here, we will assume that there is no non-linearity, and that the transformation is a single repeated matrix operation $\mathbf{A}$, so that the output of our model is

$$\mathbf{v}_{out} = \mathbf{A}\cdot \mathbf{A}\cdots \mathbf{A}\, \mathbf{v}_{in} = \mathbf{A}^N \mathbf{v}_{in}.$$
When these models are initialized, $\mathbf{A}$ is taken to be a random matrix with Gaussian entries, so let's make one of those. To be concrete, we start with a mean zero, variance one Gaussian distributed $5 \times 5$ matrix.
#@tab mxnet
np.random.seed(8675309)
k = 5
A = np.random.randn(k, k)
A
#@tab pytorch
torch.manual_seed(42)
k = 5
A = torch.randn(k, k, dtype=torch.float64)
A
#@tab tensorflow
k = 5
A = tf.random.normal((k, k), dtype=tf.float64)
A
For simplicity in our toy model, we will assume that the data vector we feed in, $\mathbf{v}_{in}$, is a random five dimensional Gaussian vector. Let's think about what we want to happen. If repeated application of $\mathbf{A}$ stretches a random vector out to be extremely long, then small changes in the input are amplified into enormous changes in the output. On the flip side, if $\mathbf{A}$ shrinks random vectors to be shorter, then after running through many layers the vector essentially shrinks to nothing, and the output no longer depends on the input.
We need to walk the narrow line between growth and decay to make sure that our output changes depending on our input, but not too much!
Let's see what happens when we repeatedly multiply our matrix $\mathbf{A}$ against a random input vector and keep track of the norm.
#@tab mxnet
# Calculate the sequence of norms after repeatedly applying `A`
v_in = np.random.randn(k, 1)
norm_list = [np.linalg.norm(v_in)]
for i in range(1, 100):
v_in = A.dot(v_in)
norm_list.append(np.linalg.norm(v_in))
d2l.plot(np.arange(0, 100), norm_list, 'Iteration', 'Value')
#@tab pytorch
# Calculate the sequence of norms after repeatedly applying `A`
v_in = torch.randn(k, 1, dtype=torch.float64)
norm_list = [torch.norm(v_in).item()]
for i in range(1, 100):
v_in = A @ v_in
norm_list.append(torch.norm(v_in).item())
d2l.plot(torch.arange(0, 100), norm_list, 'Iteration', 'Value')
#@tab tensorflow
# Calculate the sequence of norms after repeatedly applying `A`
v_in = tf.random.normal((k, 1), dtype=tf.float64)
norm_list = [tf.norm(v_in).numpy()]
for i in range(1, 100):
v_in = tf.matmul(A, v_in)
norm_list.append(tf.norm(v_in).numpy())
d2l.plot(tf.range(0, 100), norm_list, 'Iteration', 'Value')
The norm is growing uncontrollably! Indeed if we take the list of quotients, we will see a pattern.
#@tab mxnet
# Compute the scaling factor of the norms
norm_ratio_list = []
for i in range(1, 100):
norm_ratio_list.append(norm_list[i]/norm_list[i - 1])
d2l.plot(np.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
#@tab pytorch
# Compute the scaling factor of the norms
norm_ratio_list = []
for i in range(1, 100):
norm_ratio_list.append(norm_list[i]/norm_list[i - 1])
d2l.plot(torch.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
#@tab tensorflow
# Compute the scaling factor of the norms
norm_ratio_list = []
for i in range(1, 100):
norm_ratio_list.append(norm_list[i]/norm_list[i - 1])
d2l.plot(tf.range(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
If we look at the last portion of the above computation, we see that the random vector is stretched by a factor of `1.974459321485[...]`, where the digits at the end shift a little, but the stretching factor is stable.
We have seen that eigenvectors and eigenvalues correspond to the amount something is stretched, but that was for specific vectors, and specific stretches. Let's take a look at what they are for $\mathbf{A}$.
#@tab mxnet
# Compute the eigenvalues
eigs = np.linalg.eigvals(A).tolist()
norm_eigs = [np.absolute(x) for x in eigs]
norm_eigs.sort()
print(f'norms of eigenvalues: {norm_eigs}')
#@tab pytorch
# Compute the eigenvalues
eigs = torch.linalg.eigvals(A).tolist()
norm_eigs = [abs(x) for x in eigs]
norm_eigs.sort()
print(f'norms of eigenvalues: {norm_eigs}')
#@tab tensorflow
# Compute the eigenvalues
eigs = tf.linalg.eig(A)[0].numpy().tolist()  # `A` is not symmetric, so use the general `eig`
norm_eigs = [abs(x) for x in eigs]
norm_eigs.sort()
print(f'norms of eigenvalues: {norm_eigs}')
We see something a bit unexpected happening here: that number we identified before for the long term stretching of our matrix $\mathbf{A}$ applied to a random vector is exactly the largest eigenvalue of $\mathbf{A}$. This is clearly not a coincidence!

But, if we now think about what is happening geometrically, this starts to make sense. Consider a random vector. This random vector points a little in every direction, so in particular, it points at least a little bit in the same direction as the eigenvector of $\mathbf{A}$ associated with the largest eigenvalue, the so-called *principal eigenvector*. Each application of $\mathbf{A}$ stretches the component along the principal eigenvector more than any other component, so after many applications our random vector becomes nearly aligned with the principal eigenvector, and each further application stretches it by essentially the largest eigenvalue. Indeed, this observation is the basis of the *power iteration* algorithm for finding the largest eigenvalue and eigenvector of a matrix :cite:`Van-Loan.Golub.1983`.
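Though the text above only names it, power iteration itself takes only a few lines. The sketch below (plain numpy, not from the original) estimates the principal eigenvalue and eigenvector of a matrix with a single dominant eigenvalue.

```python
import numpy as np

def power_iteration(A, num_steps=100):
    """Estimate the principal eigenvalue/eigenvector by repeated multiplication."""
    v = np.random.randn(A.shape[0])
    for _ in range(num_steps):
        v = A @ v
        v = v / np.linalg.norm(v)   # renormalize to keep the vector finite
    # With a unit vector, the Rayleigh quotient v^T A v estimates the eigenvalue
    return v @ A @ v, v

A = np.array([[2.0, 1.0], [2.0, 3.0]])
eig_est, vec_est = power_iteration(A)
print(eig_est)   # approximately 4, the largest eigenvalue of A
```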
Now, from the above discussion, we concluded that we do not want a random vector to be stretched or squished at all; we would like random vectors to stay about the same size throughout the entire process. To do so, we rescale our matrix by the principal eigenvalue so that the largest eigenvalue is now just one. Let's see what happens in this case.
#@tab mxnet
# Rescale the matrix `A`
A /= norm_eigs[-1]
# Do the same experiment again
v_in = np.random.randn(k, 1)
norm_list = [np.linalg.norm(v_in)]
for i in range(1, 100):
v_in = A.dot(v_in)
norm_list.append(np.linalg.norm(v_in))
d2l.plot(np.arange(0, 100), norm_list, 'Iteration', 'Value')
#@tab pytorch
# Rescale the matrix `A`
A /= norm_eigs[-1]
# Do the same experiment again
v_in = torch.randn(k, 1, dtype=torch.float64)
norm_list = [torch.norm(v_in).item()]
for i in range(1, 100):
v_in = A @ v_in
norm_list.append(torch.norm(v_in).item())
d2l.plot(torch.arange(0, 100), norm_list, 'Iteration', 'Value')
#@tab tensorflow
# Rescale the matrix `A`
A /= norm_eigs[-1]
# Do the same experiment again
v_in = tf.random.normal((k, 1), dtype=tf.float64)
norm_list = [tf.norm(v_in).numpy()]
for i in range(1, 100):
v_in = tf.matmul(A, v_in)
norm_list.append(tf.norm(v_in).numpy())
d2l.plot(tf.range(0, 100), norm_list, 'Iteration', 'Value')
We can also plot the ratio between consecutive norms as before and see that indeed it stabilizes.
#@tab mxnet
# Also plot the ratio
norm_ratio_list = []
for i in range(1, 100):
norm_ratio_list.append(norm_list[i]/norm_list[i-1])
d2l.plot(np.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
#@tab pytorch
# Also plot the ratio
norm_ratio_list = []
for i in range(1, 100):
norm_ratio_list.append(norm_list[i]/norm_list[i-1])
d2l.plot(torch.arange(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
#@tab tensorflow
# Also plot the ratio
norm_ratio_list = []
for i in range(1, 100):
norm_ratio_list.append(norm_list[i]/norm_list[i-1])
d2l.plot(tf.range(1, 100), norm_ratio_list, 'Iteration', 'Ratio')
We now see exactly what we hoped for!
After normalizing the matrix by the principal eigenvalue,
we see that the random data does not explode as before,
but rather eventually equilibrates to a specific value.
It would be nice to be able to do these things from first principles, and it turns out that if we look deeply at the mathematics of it, we can see that the largest eigenvalue of a large random matrix with independent mean zero, variance one Gaussian entries is on average about $\sqrt{n}$, or in our case $\sqrt{5} \approx 2.2$, due to a fascinating fact known as the *circular law* :cite:`Ginibre.1965`.
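As a rough empirical check (a sketch under the stated random-matrix assumptions, not part of the original text), we can compare the largest eigenvalue magnitude of a few Gaussian random matrices with $\sqrt{n}$:

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [5, 50, 500]:
    M = rng.standard_normal((n, n))          # mean zero, variance one entries
    largest = np.abs(np.linalg.eigvals(M)).max()
    # The circular law suggests the spectrum fills a disc of radius about sqrt(n)
    print(n, round(largest, 2), round(np.sqrt(n), 2))
```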
The relationship between the eigenvalues (and a related object called singular values) of random matrices has been shown to have deep connections to proper initialization of neural networks, as was discussed in :citet:`Pennington.Schoenholz.Ganguli.2017`
and subsequent works.
- Eigenvectors are vectors which are stretched by a matrix without changing direction.
- Eigenvalues are the amount that the eigenvectors are stretched by the application of the matrix.
- The eigendecomposition of a matrix can allow for many operations to be reduced to operations on the eigenvalues.
- The Gershgorin Circle Theorem can provide approximate values for the eigenvalues of a matrix.
- The behavior of iterated matrix powers depends primarily on the size of the largest eigenvalue. This understanding has many applications in the theory of neural network initialization.
- What are the eigenvalues and eigenvectors of $$ \mathbf{A} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}? $$
- What are the eigenvalues and eigenvectors of the following matrix, and what is strange about this example compared to the previous one? $$ \mathbf{A} = \begin{bmatrix} 2 & 1 \\ 0 & 2 \end{bmatrix}. $$
- Without computing the eigenvalues, is it possible that the smallest eigenvalue of the following matrix is less than
$0.5$? Note: this problem can be done in your head. $$ \mathbf{A} = \begin{bmatrix} 3.0 & 0.1 & 0.3 & 1.0 \\ 0.1 & 1.0 & 0.1 & 0.2 \\ 0.3 & 0.1 & 5.0 & 0.0 \\ 1.0 & 0.2 & 0.0 & 1.8 \end{bmatrix}. $$
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: