Merge branch 'master' into jax_synx
AnirudhDagar committed Aug 27, 2022
2 parents bcfc480 + 5571434 commit 98f01e8
Showing 50 changed files with 3,553 additions and 5,904 deletions.
2 changes: 1 addition & 1 deletion Jenkinsfile
@@ -59,7 +59,7 @@ stage("Build and Publish") {

sh label:"Build HTML", script:"""set -ex
conda activate ${ENV_NAME}
./static/build_html.sh
./static/build_html.sh ${env.BRANCH_NAME} ${JOB_NAME}
"""

sh label:"Build PDF", script:"""set -ex
@@ -42,7 +42,7 @@ DistilBERT (lightweight via knowledge distillation) :cite:`sanh2019distilbert`,
and
ELECTRA (replaced token detection) :cite:`clark2019electra`.
Moreover, BERT inspired transformer pretraining in computer vision, such as with vision transformers
:cite:`Dosovitskiy.Beyer.Kolesnikov.ea.2021`, Swin transformers :cite:`liu2021swin`, and MAE (masked autoencoders) `he2022masked`.
:cite:`Dosovitskiy.Beyer.Kolesnikov.ea.2021`, Swin transformers :cite:`liu2021swin`, and MAE (masked autoencoders) :cite:`he2022masked`.

## Encoder-Decoder

8 changes: 5 additions & 3 deletions chapter_attention-mechanisms-and-transformers/transformer.md
Expand Up @@ -13,7 +13,7 @@ Notably,
self-attention
enjoys both parallel computation and
the shortest maximum path length.
Therefore natually,
Therefore naturally,
it is appealing to design deep architectures
by using self-attention.
Unlike earlier self-attention models
@@ -239,9 +239,11 @@ In :numref:`sec_batch_norm`,
we explained how batch normalization
recenters and rescales across the examples within
a minibatch.
Layer normalization is the same as batch normalization
As discussed in :numref:`subsec_layer-normalization-in-bn`,
layer normalization is the same as batch normalization
except that the former
normalizes across the feature dimension.
normalizes across the feature dimension,
thus enjoying benefits of scale independence and batch size independence.
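
To make the distinction concrete, here is a minimal sketch, assuming PyTorch (the tensor values and variable names are purely illustrative): batch normalization standardizes each feature across the examples of a minibatch, whereas layer normalization standardizes the features within each example.

```python
import torch
from torch import nn

X = torch.tensor([[1.0, 2.0], [2.0, 3.0]])
ln = nn.LayerNorm(2)      # normalizes each row: per example, across features
bn = nn.BatchNorm1d(2)    # normalizes each column: per feature, across the batch
# In training mode both yield zero-mean, unit-variance outputs,
# but along different dimensions.
print('layer norm:', ln(X))
print('batch norm:', bn(X))
```
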
Despite its pervasive applications
in computer vision,
batch normalization
92 changes: 50 additions & 42 deletions chapter_convolutional-modern/alexnet.md

Large diffs are not rendered by default.

114 changes: 59 additions & 55 deletions chapter_convolutional-modern/batch-norm.md

Large diffs are not rendered by default.

37 changes: 16 additions & 21 deletions chapter_convolutional-modern/densenet.md
@@ -7,10 +7,7 @@ tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
:label:`sec_densenet`

ResNet significantly changed the view of how to parametrize the functions in deep networks. *DenseNet* (dense convolutional network) is to some extent the logical extension of this :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.
As a result,
DenseNet
is characterized by
both the connectivity pattern where
DenseNet is characterized by both the connectivity pattern where
each layer connects to all the preceding layers
and the concatenation operation (rather than the addition operator in ResNet) to preserve and reuse features
from earlier layers.
@@ -21,7 +18,7 @@ To understand how to arrive at it, let's take a small detour to mathematics.

Recall the Taylor expansion for functions. For the point $x = 0$ it can be written as

$$f(x) = f(0) + f'(0) x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \ldots.$$
$$f(x) = f(0) + x \cdot \left[f'(0) + x \cdot \left[\frac{f''(0)}{2!} + x \cdot \left[\frac{f'''(0)}{3!} + \ldots \right]\right]\right].$$


The key point is that it decomposes a function into increasingly higher order terms. In a similar vein, ResNet decomposes functions into
@@ -30,9 +27,8 @@ $$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}).$$

That is, ResNet decomposes $f$ into a simple linear term and a more complex
nonlinear one.
What if we want to capture (not necessarily add) information beyond two terms?
One solution was
DenseNet :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.
What if we wanted to capture (not necessarily add) information beyond two terms?
One such solution is DenseNet :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.

![The main difference between ResNet (left) and DenseNet (right) in cross-layer connections: use of addition and use of concatenation. ](../img/densenet-block.svg)
:label:`fig_densenet_block`
@@ -43,16 +39,17 @@ As a result, we perform a mapping from $\mathbf{x}$ to its values after applying
$$\mathbf{x} \to \left[
\mathbf{x},
f_1(\mathbf{x}),
f_2([\mathbf{x}, f_1(\mathbf{x})]), f_3([\mathbf{x}, f_1(\mathbf{x}), f_2([\mathbf{x}, f_1(\mathbf{x})])]), \ldots\right].$$
f_2\left(\left[\mathbf{x}, f_1\left(\mathbf{x}\right)\right]\right), f_3\left(\left[\mathbf{x}, f_1\left(\mathbf{x}\right), f_2\left(\left[\mathbf{x}, f_1\left(\mathbf{x}\right)\right]\right)\right]\right), \ldots\right].$$

In the end, all these functions are combined in an MLP to reduce the number of features again. In terms of implementation this is quite simple:
rather than adding terms, we concatenate them. The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The last layer of such a chain is densely connected to all previous layers. The dense connections are shown in :numref:`fig_densenet`.

![Dense connections in DenseNet.](../img/densenet.svg)
![Dense connections in DenseNet. Note how the dimensionality increases with depth.](../img/densenet.svg)
:label:`fig_densenet`
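
The following minimal sketch (PyTorch assumed; `g` is just a stand-in for a learned block) contrasts the two connection patterns: a residual connection adds terms and keeps the channel count fixed, while a dense connection concatenates them along the channel dimension.

```python
import torch

X = torch.randn(1, 3, 8, 8)
g = lambda t: torch.relu(t)  # placeholder for a learned nonlinear block

resnet_style = X + g(X)                       # addition: channels stay at 3
densenet_style = torch.cat((X, g(X)), dim=1)  # concatenation: channels grow to 6
print(resnet_style.shape, densenet_style.shape)
```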


The main components that compose a DenseNet are *dense blocks* and *transition layers*. The former define how the inputs and outputs are concatenated, while the latter control the number of channels so that it is not too large.
The main components that compose a DenseNet are *dense blocks* and *transition layers*. The former define how the inputs and outputs are concatenated, while the latter control the number of channels so that it is not too large,
since the expansion $\mathbf{x} \to \left[\mathbf{x}, f_1(\mathbf{x}),
f_2\left(\left[\mathbf{x}, f_1\left(\mathbf{x}\right)\right]\right), \ldots \right]$ can be quite high-dimensional.


## [**Dense Blocks**]
@@ -111,7 +108,7 @@ class ConvBlock(tf.keras.layers.Layer):
return y
```

A *dense block* consists of multiple convolution blocks, each using the same number of output channels. In the forward propagation, however, we concatenate the input and output of each convolution block on the channel dimension.
A *dense block* consists of multiple convolution blocks, each using the same number of output channels. In the forward propagation, however, we concatenate the input and output of each convolution block on the channel dimension. Lazy evaluation allows us to adjust the dimensionality automatically.

```{.python .input}
%%tab mxnet
@@ -125,8 +122,7 @@ class DenseBlock(nn.Block):
def forward(self, X):
for blk in self.net:
Y = blk(X)
# Concatenate the input and output of each block on the channel
# dimension
# Concatenate input and output of each block along the channels
X = np.concatenate((X, Y), axis=1)
return X
```
@@ -144,8 +140,7 @@ class DenseBlock(nn.Module):
def forward(self, X):
for blk in self.net:
Y = blk(X)
# Concatenate the input and output of each block on the channel
# dimension
# Concatenate input and output of each block along the channels
X = torch.cat((X, Y), dim=1)
return X
```
@@ -167,7 +162,7 @@ class DenseBlock(tf.keras.layers.Layer):

In the following example,
we [**define a `DenseBlock` instance**] with 2 convolution blocks of 10 output channels.
When using an input with 3 channels, we will get an output with $3+2\times 10=23$ channels. The number of convolution block channels controls the growth in the number of output channels relative to the number of input channels. This is also referred to as the *growth rate*.
When using an input with 3 channels, we will get an output with $3 + 10 + 10=23$ channels. The number of convolution block channels controls the growth in the number of output channels relative to the number of input channels. This is also referred to as the *growth rate*.
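
As a rough, self-contained sketch of this arithmetic (PyTorch assumed; `conv_block` here is a hypothetical helper, not necessarily the one defined in the changed file): two blocks with 10 output channels each turn a 3-channel input into $3 + 10 + 10 = 23$ channels.

```python
import torch
from torch import nn

def conv_block(num_channels):  # hypothetical helper: BN -> ReLU -> 3x3 conv
    return nn.Sequential(nn.LazyBatchNorm2d(), nn.ReLU(),
                         nn.LazyConv2d(num_channels, kernel_size=3, padding=1))

blocks = nn.ModuleList([conv_block(10) for _ in range(2)])
X = torch.randn(4, 3, 8, 8)
for blk in blocks:
    X = torch.cat((X, blk(X)), dim=1)  # channels: 3 -> 13 -> 23
print(X.shape)  # torch.Size([4, 23, 8, 8])
```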

```{.python .input}
%%tab all
@@ -185,7 +180,7 @@ Y.shape

## [**Transition Layers**]

Since each dense block will increase the number of channels, adding too many of them will lead to an excessively complex model. A *transition layer* is used to control the complexity of the model. It reduces the number of channels by using the $1\times 1$ convolutional layer and halves the height and width of the average pooling layer with a stride of 2, further reducing the complexity of the model.
Since each dense block will increase the number of channels, adding too many of them will lead to an excessively complex model. A *transition layer* is used to control the complexity of the model. It reduces the number of channels by using a $1\times 1$ convolution. Moreover, it halves the height and width via average pooling with a stride of 2.
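
A rough sketch of such a transition layer (PyTorch assumed; not necessarily the exact implementation in the changed file): a $1\times 1$ convolution shrinks the channel count, and average pooling with a stride of 2 halves the height and width.

```python
import torch
from torch import nn

def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),   # reduce channels
        nn.AvgPool2d(kernel_size=2, stride=2))        # halve height and width

X = torch.randn(4, 23, 8, 8)
print(transition_block(10)(X).shape)  # torch.Size([4, 10, 4, 4])
```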

```{.python .input}
%%tab mxnet
@@ -367,15 +362,15 @@ Although these concatenation operations
reuse features to achieve computational efficiency,
unfortunately they lead to heavy GPU memory consumption.
As a result,
applying DenseNet may require more complex memory-efficient implementations that may increase training time :cite:`pleiss2017memory`.
applying DenseNet may require more memory-efficient implementations that may increase training time :cite:`pleiss2017memory`.


## Exercises

1. Why do we use average pooling rather than max-pooling in the transition layer?
1. One of the advantages mentioned in the DenseNet paper is that its model parameters are smaller than those of ResNet. Why is this the case?
1. One problem for which DenseNet has been criticized is its high memory consumption.
1. Is this really the case? Try to change the input shape to $224\times 224$ to see the actual GPU memory consumption.
1. Is this really the case? Try to change the input shape to $224\times 224$ to see the actual GPU memory consumption empirically.
1. Can you think of an alternative means of reducing the memory consumption? How would you need to change the framework?
1. Implement the various DenseNet versions presented in Table 1 of the DenseNet paper :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.
1. Design an MLP-based model by applying the DenseNet idea. Apply it to the housing price prediction task in :numref:`sec_kaggle_house`.
24 changes: 13 additions & 11 deletions chapter_convolutional-modern/googlenet.md
@@ -9,11 +9,13 @@ tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
In 2014, *GoogLeNet*
won the ImageNet Challenge :cite:`Szegedy.Liu.Jia.ea.2015`, using a structure
that combined the strengths of NiN :cite:`Lin.Chen.Yan.2013`, repeated blocks :cite:`Simonyan.Zisserman.2014`,
and a cocktail of convolution kernels. It is arguably also the first network that exhibits a clear distinction among the stem, body, and head in a CNN. This design pattern has persisted ever since in the design of deep networks: the *stem* is given by the first 2-3 convolutions that operate on the image. They extract low-level features from the underlying images. This is followed by a *body* of convolutional blocks. Finally, the *head* maps the features obtained so far to the required classification, segmentation, detection, or tracking problem at hand.
and a cocktail of convolution kernels. It is arguably also the first network that exhibits a clear distinction among the stem (data ingest), body (data processing), and head (prediction) in a CNN. This design pattern has persisted ever since in the design of deep networks: the *stem* is given by the first 2--3 convolutions that operate on the image. They extract low-level features from the underlying images. This is followed by a *body* of convolutional blocks. Finally, the *head* maps the features obtained so far to the required classification, segmentation, detection, or tracking problem at hand.
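
As an illustration of this stem-body-head pattern, here is a minimal skeleton, assuming PyTorch; the layer sizes are arbitrary placeholders rather than GoogLeNet's actual configuration.

```python
import torch
from torch import nn

net = nn.Sequential(
    # Stem: the first few convolutions, extracting low-level features
    nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    # Body: a stack of convolutional blocks (a single placeholder block here)
    nn.LazyConv2d(128, kernel_size=3, padding=1), nn.ReLU(),
    # Head: map the extracted features to the prediction task
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(), nn.LazyLinear(10))

print(net(torch.randn(1, 3, 96, 96)).shape)  # torch.Size([1, 10])
```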

The key contribution in GoogLeNet was the design of the network body. It solved the problem of selecting
convolution kernels in an ingenious way. While other works tried to identify which convolution, ranging from $1 \times 1$ to $11 \times 11$, would be best, it simply *concatenated* multi-branch convolutions.
In what follows we introduce a slightly simplified version of GoogLeNet. The simplifications are due to the fact that tricks to stabilize training, in particular intermediate loss functions, are no longer needed due to the availability of improved training algorithms.
In what follows we introduce a slightly simplified version of GoogLeNet: the original design included a number of tricks to stabilize training through intermediate loss functions, applied to multiple layers of the network.
They are no longer necessary due to the availability of improved training algorithms.


## (**Inception Blocks**)

@@ -37,7 +39,7 @@ The four branches all use appropriate padding to give the input and output the s
Finally, the outputs along each branch are concatenated
along the channel dimension and comprise the block's output.
The commonly-tuned hyperparameters of the Inception block
are the number of output channels per layer.
are the number of output channels per layer, i.e., how to allocate capacity among convolutions of different size.
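
A condensed sketch of such a multi-branch block (PyTorch assumed; the channel allocations `c1` through `c4` below are illustrative choices, not GoogLeNet's actual numbers). Concatenation works because every branch preserves the spatial dimensions.

```python
import torch
from torch import nn
import torch.nn.functional as F

class Inception(nn.Module):
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        # Branch 1: 1x1 convolution
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2: 1x1 convolution followed by 3x3 convolution
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3: 1x1 convolution followed by 5x5 convolution
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4: 3x3 max-pooling followed by 1x1 convolution
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)

blk = Inception(16, (16, 32), (8, 16), 16)
X = torch.randn(1, 3, 32, 32)
print(blk(X).shape)  # channels: 16 + 32 + 16 + 16 = 80
```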

```{.python .input}
%%tab mxnet
@@ -146,7 +148,7 @@ and global average pooling in its head to generate its estimates.
Max-pooling between inception blocks reduces the dimensionality.
At its stem, the first module is similar to AlexNet and LeNet.

![The GoogLeNet architecture.](../img/inception-full.svg)
![The GoogLeNet architecture.](../img/inception-full-90.svg)
:label:`fig_inception_full`

We can now implement GoogLeNet piece by piece. Let's begin with the stem.
@@ -315,6 +317,8 @@ def b5(self):
tf.keras.layers.Flatten()])
```

Now that we defined all blocks `b1` through `b5`, it's just a matter of assembling them all into a full network.

```{.python .input}
%%tab all
@d2l.add_to_class(GoogleNet)
@@ -337,9 +341,9 @@ def __init__(self, lr=0.1, num_classes=10):
```

The GoogLeNet model is computationally complex. Note the large number of
relatively arbitrary hyperparameters in terms of the number of channels chosen.
This work was done before scientists started using automatic tools to
optimize network designs.
relatively arbitrary hyperparameters in terms of the number of channels chosen, the number of blocks prior to dimensionality reduction, the relative partitioning of capacity across channels, etc. Much of it is due to the
fact that at the time when GoogLeNet was introduced, automatic tools for network definition or design exploration
were not yet available. For instance, by now we take it for granted that a competent deep learning framework is capable of inferring dimensionalities of input tensors automatically. At the time, many such configurations had to be specified explicitly by the experimenter, thus often slowing down active experimentation. Moreover, the tools needed for automatic exploration were still in flux and initial experiments largely amounted to costly brute force exploration, genetic algorithms, and similar strategies.

For now the only modification we will carry out is to
[**reduce the input height and width from 224 to 96
@@ -386,9 +390,7 @@ with d2l.try_gpu():

A key feature of GoogLeNet is that it is actually *cheaper* to compute than its predecessors
while simultaneously providing improved accuracy. This marks the beginning of a much more deliberate
network design that trades off the cost of evaluating a network with a reduction in errors. It also marks the beginning of experimentation at a block level with network design hyperparameters, even though it was entirely manual at the time. This is largely due to the fact that deep learning frameworks in 2015 still lacked much of the design flexibility
that we now take for granted. Moreover, full network optimization is costly and at the time training on ImageNet still
proved computationally challenging.
network design that trades off the cost of evaluating a network with a reduction in errors. It also marks the beginning of experimentation at a block level with network design hyperparameters, even though it was entirely manual at the time. We will revisit this topic in :numref:`sec_cnn-design` when discussing strategies for network structure exploration.

Over the following sections we will encounter a number of design choices (e.g., batch normalization, residual connections, and channel grouping) that allow us to improve networks significantly. For now, you can be proud to have implemented what is arguably the first truly modern CNN.

@@ -410,7 +412,7 @@ Over the following sections we will encounter a number of design choices (e.g.,
1. Can you design a variant of GoogLeNet that works on Fashion-MNIST's native resolution of $28 \times 28$ pixels? How would you need to change the stem, the body, and the head of the network, if anything at all?
1. Compare the model parameter sizes of AlexNet, VGG, NiN, and GoogLeNet. How do the latter two network
architectures significantly reduce the model parameter size?
1. Compare the amount of computation needed in GoogLeNet and AlexNet. How does this affect the design of an accelerator chip, e.g., in terms of memory size, amount of computation, and the benefit of specialized operations?
1. Compare the amount of computation needed in GoogLeNet and AlexNet. How does this affect the design of an accelerator chip, e.g., in terms of memory size, memory bandwidth, cache size, the amount of computation, and the benefit of specialized operations?

:begin_tab:`mxnet`
[Discussions](https://discuss.d2l.ai/t/81)