Merge branch 'master' into jax_synx
AnirudhDagar committed Aug 27, 2022
2 parents bcfc480 + 5571434 commit 98f01e8
Showing 50 changed files with 3,553 additions and 5,904 deletions.
2 changes: 1 addition & 1 deletion Jenkinsfile
@@ -59,7 +59,7 @@ stage("Build and Publish") {

sh label:"Build HTML", script:"""set -ex
conda activate ${ENV_NAME}
./static/build_html.sh
./static/build_html.sh ${env.BRANCH_NAME} ${JOB_NAME}
"""

sh label:"Build PDF", script:"""set -ex
@@ -42,7 +42,7 @@ DistilBERT (lightweight via knowledge distillation) :cite:`sanh2019distilbert`,
and
ELECTRA (replaced token detection) :cite:`clark2019electra`.
Moreover, BERT inspired transformer pretraining in computer vision, such as with vision transformers
:cite:`Dosovitskiy.Beyer.Kolesnikov.ea.2021`, Swin transformers :cite:`liu2021swin`, and MAE (masked autoencoders) `he2022masked`.
:cite:`Dosovitskiy.Beyer.Kolesnikov.ea.2021`, Swin transformers :cite:`liu2021swin`, and MAE (masked autoencoders) :cite:`he2022masked`.

## Encoder-Decoder

8 changes: 5 additions & 3 deletions chapter_attention-mechanisms-and-transformers/transformer.md
Expand Up @@ -13,7 +13,7 @@ Notably,
self-attention
enjoys both parallel computation and
the shortest maximum path length.
Therefore natually,
Therefore naturally,
it is appealing to design deep architectures
by using self-attention.
Unlike earlier self-attention models
@@ -239,9 +239,11 @@ In :numref:`sec_batch_norm`,
we explained how batch normalization
recenters and rescales across the examples within
a minibatch.
Layer normalization is the same as batch normalization
As discussed in :numref:`subsec_layer-normalization-in-bn`,
layer normalization is the same as batch normalization
except that the former
normalizes across the feature dimension.
normalizes across the feature dimension,
thus enjoying benefits of scale independence and batch size independence.
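
To make the distinction concrete, here is a minimal sketch, assuming PyTorch (the tensor values and variable names are purely illustrative): batch normalization standardizes each feature across the examples of a minibatch, whereas layer normalization standardizes the features within each example.

```python
import torch
from torch import nn

X = torch.tensor([[1.0, 2.0], [2.0, 3.0]])
ln = nn.LayerNorm(2)      # normalizes each row: per example, across features
bn = nn.BatchNorm1d(2)    # normalizes each column: per feature, across the batch
# In training mode both yield zero-mean, unit-variance outputs,
# but along different dimensions.
print('layer norm:', ln(X))
print('batch norm:', bn(X))
```
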
Despite its pervasive applications
in computer vision,
batch normalization
92 changes: 50 additions & 42 deletions chapter_convolutional-modern/alexnet.md

Large diffs are not rendered by default.

114 changes: 59 additions & 55 deletions chapter_convolutional-modern/batch-norm.md

Large diffs are not rendered by default.

37 changes: 16 additions & 21 deletions chapter_convolutional-modern/densenet.md
@@ -7,10 +7,7 @@ tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
:label:`sec_densenet`

ResNet significantly changed the view of how to parametrize the functions in deep networks. *DenseNet* (dense convolutional network) is to some extent the logical extension of this :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.
As a result,
DenseNet
is characterized by
both the connectivity pattern where
DenseNet is characterized by both the connectivity pattern where
each layer connects to all the preceding layers
and the concatenation operation (rather than the addition operator in ResNet) to preserve and reuse features
from earlier layers.
@@ -21,7 +18,7 @@ To understand how to arrive at it, let's take a small detour to mathematics.

Recall the Taylor expansion for functions. For the point $x = 0$ it can be written as

$$f(x) = f(0) + f'(0) x + \frac{f''(0)}{2!} x^2 + \frac{f'''(0)}{3!} x^3 + \ldots.$$
$$f(x) = f(0) + x \cdot \left[f'(0) + x \cdot \left[\frac{f''(0)}{2!} + x \cdot \left[\frac{f'''(0)}{3!} + \ldots \right]\right]\right].$$


The key point is that it decomposes a function into increasingly higher order terms. In a similar vein, ResNet decomposes functions into
@@ -30,9 +27,8 @@ $$f(\mathbf{x}) = \mathbf{x} + g(\mathbf{x}).$$

That is, ResNet decomposes $f$ into a simple linear term and a more complex
nonlinear one.
What if we want to capture (not necessarily add) information beyond two terms?
One solution was
DenseNet :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.
What if we wanted to capture (not necessarily add) information beyond two terms?
One such solution is DenseNet :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.

![The main difference between ResNet (left) and DenseNet (right) in cross-layer connections: use of addition and use of concatenation. ](../img/densenet-block.svg)
:label:`fig_densenet_block`
@@ -43,16 +39,17 @@ As a result, we perform a mapping from $\mathbf{x}$ to its values after applying
$$\mathbf{x} \to \left[
\mathbf{x},
f_1(\mathbf{x}),
f_2([\mathbf{x}, f_1(\mathbf{x})]), f_3([\mathbf{x}, f_1(\mathbf{x}), f_2([\mathbf{x}, f_1(\mathbf{x})])]), \ldots\right].$$
f_2\left(\left[\mathbf{x}, f_1\left(\mathbf{x}\right)\right]\right), f_3\left(\left[\mathbf{x}, f_1\left(\mathbf{x}\right), f_2\left(\left[\mathbf{x}, f_1\left(\mathbf{x}\right)\right]\right)\right]\right), \ldots\right].$$

In the end, all these functions are combined in an MLP to reduce the number of features again. In terms of implementation this is quite simple:
rather than adding terms, we concatenate them. The name DenseNet arises from the fact that the dependency graph between variables becomes quite dense. The last layer of such a chain is densely connected to all previous layers. The dense connections are shown in :numref:`fig_densenet`.

![Dense connections in DenseNet.](../img/densenet.svg)
![Dense connections in DenseNet. Note how the dimensionality increases with depth.](../img/densenet.svg)
:label:`fig_densenet`
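
The following minimal sketch (PyTorch assumed; `g` is just a stand-in for a learned block) contrasts the two connection patterns: a residual connection adds terms and keeps the channel count fixed, while a dense connection concatenates them along the channel dimension.

```python
import torch

X = torch.randn(1, 3, 8, 8)
g = lambda t: torch.relu(t)  # placeholder for a learned nonlinear block

resnet_style = X + g(X)                       # addition: channels stay at 3
densenet_style = torch.cat((X, g(X)), dim=1)  # concatenation: channels grow to 6
print(resnet_style.shape, densenet_style.shape)
```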


The main components that compose a DenseNet are *dense blocks* and *transition layers*. The former define how the inputs and outputs are concatenated, while the latter control the number of channels so that it is not too large.
The main components that compose a DenseNet are *dense blocks* and *transition layers*. The former define how the inputs and outputs are concatenated, while the latter control the number of channels so that it is not too large,
since the expansion $\mathbf{x} \to \left[\mathbf{x}, f_1(\mathbf{x}),
f_2\left(\left[\mathbf{x}, f_1\left(\mathbf{x}\right)\right]\right), \ldots \right]$ can be quite high-dimensional.


## [**Dense Blocks**]
@@ -111,7 +108,7 @@ class ConvBlock(tf.keras.layers.Layer):
return y
```

A *dense block* consists of multiple convolution blocks, each using the same number of output channels. In the forward propagation, however, we concatenate the input and output of each convolution block on the channel dimension.
A *dense block* consists of multiple convolution blocks, each using the same number of output channels. In the forward propagation, however, we concatenate the input and output of each convolution block on the channel dimension. Lazy evaluation allows us to adjust the dimensionality automatically.

```{.python .input}
%%tab mxnet
@@ -125,8 +122,7 @@ class DenseBlock(nn.Block):
def forward(self, X):
for blk in self.net:
Y = blk(X)
# Concatenate the input and output of each block on the channel
# dimension
# Concatenate input and output of each block along the channels
X = np.concatenate((X, Y), axis=1)
return X
```
@@ -144,8 +140,7 @@ class DenseBlock(nn.Module):
def forward(self, X):
for blk in self.net:
Y = blk(X)
# Concatenate the input and output of each block on the channel
# dimension
# Concatenate input and output of each block along the channels
X = torch.cat((X, Y), dim=1)
return X
```
@@ -167,7 +162,7 @@ class DenseBlock(tf.keras.layers.Layer):

In the following example,
we [**define a `DenseBlock` instance**] with 2 convolution blocks of 10 output channels.
When using an input with 3 channels, we will get an output with $3+2\times 10=23$ channels. The number of convolution block channels controls the growth in the number of output channels relative to the number of input channels. This is also referred to as the *growth rate*.
When using an input with 3 channels, we will get an output with $3 + 10 + 10=23$ channels. The number of convolution block channels controls the growth in the number of output channels relative to the number of input channels. This is also referred to as the *growth rate*.
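
As a rough, self-contained sketch of this arithmetic (PyTorch assumed; `conv_block` here is a hypothetical helper, not necessarily the one defined in the changed file): two blocks with 10 output channels each turn a 3-channel input into $3 + 10 + 10 = 23$ channels.

```python
import torch
from torch import nn

def conv_block(num_channels):  # hypothetical helper: BN -> ReLU -> 3x3 conv
    return nn.Sequential(nn.LazyBatchNorm2d(), nn.ReLU(),
                         nn.LazyConv2d(num_channels, kernel_size=3, padding=1))

blocks = nn.ModuleList([conv_block(10) for _ in range(2)])
X = torch.randn(4, 3, 8, 8)
for blk in blocks:
    X = torch.cat((X, blk(X)), dim=1)  # channels: 3 -> 13 -> 23
print(X.shape)  # torch.Size([4, 23, 8, 8])
```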

```{.python .input}
%%tab all
@@ -185,7 +180,7 @@ Y.shape

## [**Transition Layers**]

Since each dense block will increase the number of channels, adding too many of them will lead to an excessively complex model. A *transition layer* is used to control the complexity of the model. It reduces the number of channels by using the $1\times 1$ convolutional layer and halves the height and width of the average pooling layer with a stride of 2, further reducing the complexity of the model.
Since each dense block will increase the number of channels, adding too many of them will lead to an excessively complex model. A *transition layer* is used to control the complexity of the model. It reduces the number of channels by using a $1\times 1$ convolution. Moreover, it halves the height and width via average pooling with a stride of 2.
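
A rough sketch of such a transition layer (PyTorch assumed; not necessarily the exact implementation in the changed file): a $1\times 1$ convolution shrinks the channel count, and average pooling with a stride of 2 halves the height and width.

```python
import torch
from torch import nn

def transition_block(num_channels):
    return nn.Sequential(
        nn.LazyBatchNorm2d(), nn.ReLU(),
        nn.LazyConv2d(num_channels, kernel_size=1),   # reduce channels
        nn.AvgPool2d(kernel_size=2, stride=2))        # halve height and width

X = torch.randn(4, 23, 8, 8)
print(transition_block(10)(X).shape)  # torch.Size([4, 10, 4, 4])
```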

```{.python .input}
%%tab mxnet
@@ -367,15 +362,15 @@ Although these concatenation operations
reuse features to achieve computational efficiency,
unfortunately they lead to heavy GPU memory consumption.
As a result,
applying DenseNet may require more complex memory-efficient implementations that may increase training time :cite:`pleiss2017memory`.
applying DenseNet may require more memory-efficient implementations that may increase training time :cite:`pleiss2017memory`.


## Exercises

1. Why do we use average pooling rather than max-pooling in the transition layer?
1. One of the advantages mentioned in the DenseNet paper is that its model parameters are smaller than those of ResNet. Why is this the case?
1. One problem for which DenseNet has been criticized is its high memory consumption.
1. Is this really the case? Try to change the input shape to $224\times 224$ to see the actual GPU memory consumption.
1. Is this really the case? Try to change the input shape to $224\times 224$ to see the actual GPU memory consumption empirically.
1. Can you think of an alternative means of reducing the memory consumption? How would you need to change the framework?
1. Implement the various DenseNet versions presented in Table 1 of the DenseNet paper :cite:`Huang.Liu.Van-Der-Maaten.ea.2017`.
1. Design an MLP-based model by applying the DenseNet idea. Apply it to the housing price prediction task in :numref:`sec_kaggle_house`.
24 changes: 13 additions & 11 deletions chapter_convolutional-modern/googlenet.md
@@ -9,11 +9,13 @@ tab.interact_select(['mxnet', 'pytorch', 'tensorflow'])
In 2014, *GoogLeNet*
won the ImageNet Challenge :cite:`Szegedy.Liu.Jia.ea.2015`, using a structure
that combined the strengths of NiN :cite:`Lin.Chen.Yan.2013`, repeated blocks :cite:`Simonyan.Zisserman.2014`,
and a cocktail of convolution kernels. It is arguably also the first network that exhibits a clear distinction among the stem, body, and head in a CNN. This design pattern has persisted ever since in the design of deep networks: the *stem* is given by the first 2-3 convolutions that operate on the image. They extract low-level features from the underlying images. This is followed by a *body* of convolutional blocks. Finally, the *head* maps the features obtained so far to the required classification, segmentation, detection, or tracking problem at hand.
and a cocktail of convolution kernels. It is arguably also the first network that exhibits a clear distinction among the stem (data ingest), body (data processing), and head (prediction) in a CNN. This design pattern has persisted ever since in the design of deep networks: the *stem* is given by the first 2--3 convolutions that operate on the image. They extract low-level features from the underlying images. This is followed by a *body* of convolutional blocks. Finally, the *head* maps the features obtained so far to the required classification, segmentation, detection, or tracking problem at hand.
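
As an illustration of this stem-body-head pattern, here is a minimal skeleton, assuming PyTorch; the layer sizes are arbitrary placeholders rather than GoogLeNet's actual configuration.

```python
import torch
from torch import nn

net = nn.Sequential(
    # Stem: the first few convolutions, extracting low-level features
    nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
    # Body: a stack of convolutional blocks (a single placeholder block here)
    nn.LazyConv2d(128, kernel_size=3, padding=1), nn.ReLU(),
    # Head: map the extracted features to the prediction task
    nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(), nn.LazyLinear(10))

print(net(torch.randn(1, 3, 96, 96)).shape)  # torch.Size([1, 10])
```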

The key contribution in GoogLeNet was the design of the network body. It solved the problem of selecting
convolution kernels in an ingenious way. While other works tried to identify which convolution, ranging from $1 \times 1$ to $11 \times 11$, would be best, it simply *concatenated* multi-branch convolutions.
In what follows we introduce a slightly simplified version of GoogLeNet. The simplifications are due to the fact that tricks to stabilize training, in particular intermediate loss functions, are no longer needed due to the availability of improved training algorithms.
In what follows we introduce a slightly simplified version of GoogLeNet: the original design included a number of tricks to stabilize training through intermediate loss functions, applied to multiple layers of the network.
They are no longer necessary due to the availability of improved training algorithms.


## (**Inception Blocks**)

@@ -37,7 +39,7 @@ The four branches all use appropriate padding to give the input and output the s
Finally, the outputs along each branch are concatenated
along the channel dimension and comprise the block's output.
The commonly-tuned hyperparameters of the Inception block
are the number of output channels per layer.
are the number of output channels per layer, i.e., how to allocate capacity among convolutions of different size.
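
A condensed sketch of such a multi-branch block (PyTorch assumed; the channel allocations `c1` through `c4` below are illustrative choices, not GoogLeNet's actual numbers). Concatenation works because every branch preserves the spatial dimensions.

```python
import torch
from torch import nn
import torch.nn.functional as F

class Inception(nn.Module):
    def __init__(self, c1, c2, c3, c4):
        super().__init__()
        # Branch 1: 1x1 convolution
        self.b1_1 = nn.LazyConv2d(c1, kernel_size=1)
        # Branch 2: 1x1 convolution followed by 3x3 convolution
        self.b2_1 = nn.LazyConv2d(c2[0], kernel_size=1)
        self.b2_2 = nn.LazyConv2d(c2[1], kernel_size=3, padding=1)
        # Branch 3: 1x1 convolution followed by 5x5 convolution
        self.b3_1 = nn.LazyConv2d(c3[0], kernel_size=1)
        self.b3_2 = nn.LazyConv2d(c3[1], kernel_size=5, padding=2)
        # Branch 4: 3x3 max-pooling followed by 1x1 convolution
        self.b4_1 = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.b4_2 = nn.LazyConv2d(c4, kernel_size=1)

    def forward(self, x):
        b1 = F.relu(self.b1_1(x))
        b2 = F.relu(self.b2_2(F.relu(self.b2_1(x))))
        b3 = F.relu(self.b3_2(F.relu(self.b3_1(x))))
        b4 = F.relu(self.b4_2(self.b4_1(x)))
        return torch.cat((b1, b2, b3, b4), dim=1)

blk = Inception(16, (16, 32), (8, 16), 16)
X = torch.randn(1, 3, 32, 32)
print(blk(X).shape)  # channels: 16 + 32 + 16 + 16 = 80
```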

```{.python .input}
%%tab mxnet
@@ -146,7 +148,7 @@ and global average pooling in its head to generate its estimates.
Max-pooling between inception blocks reduces the dimensionality.
At its stem, the first module is similar to AlexNet and LeNet.

![The GoogLeNet architecture.](../img/inception-full.svg)
![The GoogLeNet architecture.](../img/inception-full-90.svg)
:label:`fig_inception_full`

We can now implement GoogLeNet piece by piece. Let's begin with the stem.
@@ -315,6 +317,8 @@ def b5(self):
tf.keras.layers.Flatten()])
```

Now that we defined all blocks `b1` through `b5`, it's just a matter of assembling them all into a full network.

```{.python .input}
%%tab all
@d2l.add_to_class(GoogleNet)
@@ -337,9 +341,9 @@ def __init__(self, lr=0.1, num_classes=10):
```

The GoogLeNet model is computationally complex. Note the large number of
relatively arbitrary hyperparameters in terms of the number of channels chosen.
This work was done before scientists started using automatic tools to
optimize network designs.
relatively arbitrary hyperparameters in terms of the number of channels chosen, the number of blocks prior to dimensionality reduction, the relative partitioning of capacity across channels, etc. Much of it is due to the
fact that at the time when GoogLeNet was introduced, automatic tools for network definition or design exploration
were not yet available. For instance, by now we take it for granted that a competent deep learning framework is capable of inferring dimensionalities of input tensors automatically. At the time, many such configurations had to be specified explicitly by the experimenter, thus often slowing down active experimentation. Moreover, the tools needed for automatic exploration were still in flux and initial experiments largely amounted to costly brute force exploration, genetic algorithms, and similar strategies.

For now the only modification we will carry out is to
[**reduce the input height and width from 224 to 96
@@ -386,9 +390,7 @@ with d2l.try_gpu():

A key feature of GoogLeNet is that it is actually *cheaper* to compute than its predecessors
while simultaneously providing improved accuracy. This marks the beginning of a much more deliberate
network design that trades off the cost of evaluating a network with a reduction in errors. It also marks the beginning of experimentation at a block level with network design hyperparameters, even though it was entirely manual at the time. This is largely due to the fact that deep learning frameworks in 2015 still lacked much of the design flexibility
that we now take for granted. Moreover, full network optimization is costly and at the time training on ImageNet still
proved computationally challenging.
network design that trades off the cost of evaluating a network with a reduction in errors. It also marks the beginning of experimentation at a block level with network design hyperparameters, even though it was entirely manual at the time. We will revisit this topic in :numref:`sec_cnn-design` when discussing strategies for network structure exploration.

Over the following sections we will encounter a number of design choices (e.g., batch normalization, residual connections, and channel grouping) that allow us to improve networks significantly. For now, you can be proud to have implemented what is arguably the first truly modern CNN.

@@ -410,7 +412,7 @@ Over the following sections we will encounter a number of design choices (e.g.,
1. Can you design a variant of GoogLeNet that works on Fashion-MNIST's native resolution of $28 \times 28$ pixels? How would you need to change the stem, the body, and the head of the network, if anything at all?
1. Compare the model parameter sizes of AlexNet, VGG, NiN, and GoogLeNet. How do the latter two network
architectures significantly reduce the model parameter size?
1. Compare the amount of computation needed in GoogLeNet and AlexNet. How does this affect the design of an accelerator chip, e.g., in terms of memory size, amount of computation, and the benefit of specialized operations?
1. Compare the amount of computation needed in GoogLeNet and AlexNet. How does this affect the design of an accelerator chip, e.g., in terms of memory size, memory bandwidth, cache size, the amount of computation, and the benefit of specialized operations?

:begin_tab:`mxnet`
[Discussions](https://discuss.d2l.ai/t/81)