:label:`sec_resnet`
As we design increasingly deeper networks it becomes imperative to understand how adding layers can increase the complexity and expressiveness of the network. Even more important is the ability to design networks where adding layers makes networks strictly more expressive rather than just different. To make some progress we need a bit of mathematics.
Consider $\mathcal{F}$, the class of functions that a specific network architecture (together with learning rates and other hyperparameter settings) can reach. That is, for all $f \in \mathcal{F}$ there exists some set of parameters (e.g., weights and biases) that can be obtained through training on a suitable dataset. Let us assume that $f^*$ is the "truth" function that we really would like to find. If it is in $\mathcal{F}$, we are in good shape, but typically we will not be quite so lucky. Instead, we will try to find some $f^*_\mathcal{F}$ which is our best bet within $\mathcal{F}$. For instance, given a dataset with features $\mathbf{X}$ and labels $\mathbf{y}$, we might try finding it by solving the following optimization problem:

$$f^*_\mathcal{F} := \mathop{\mathrm{argmin}}_f L(\mathbf{X}, \mathbf{y}, f) \text{ subject to } f \in \mathcal{F}.$$

It is only reasonable to assume that if we design a different and more powerful architecture $\mathcal{F}'$ we should arrive at a better outcome. In other words, we would expect that $f^*_{\mathcal{F}'}$ is "better" than $f^*_{\mathcal{F}}$. However, if $\mathcal{F} \not\subseteq \mathcal{F}'$ there is no guarantee that this should even happen. In fact, $f^*_{\mathcal{F}'}$ might well be worse. As illustrated by :numref:`fig_functionclasses`, for non-nested function classes, a larger function class does not always move closer to the "truth" function $f^*$. For instance, on the left of :numref:`fig_functionclasses`, though $\mathcal{F}_3$ is closer to $f^*$ than $\mathcal{F}_1$, $\mathcal{F}_6$ moves away and there is no guarantee that further increasing the complexity can reduce the distance from $f^*$. With nested function classes, where $\mathcal{F}_1 \subseteq \ldots \subseteq \mathcal{F}_6$, on the right of :numref:`fig_functionclasses`, we can avoid the aforementioned issue of the non-nested function classes.

Thus, only if larger function classes contain the smaller ones are we guaranteed that increasing them strictly increases the expressive power of the network. For deep neural networks, if we can train the newly-added layer into an identity function $f(\mathbf{x}) = \mathbf{x}$, the new model will be as effective as the original model. As the new model may get a better solution to fit the training dataset, the added layer might make it easier to reduce training errors.
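To make this concrete, here is a minimal PyTorch sketch (our own illustration with an arbitrary toy network and a hypothetical layer width of 4, not anything from the original recipe): appending a linear layer whose weight is initialized to the identity matrix and whose bias is zero leaves the computed function unchanged, so the deeper model starts out exactly as expressive as the shallower one and can only improve from there during training.

```python
import torch
from torch import nn

torch.manual_seed(0)
shallow = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4))

# Append an extra layer initialized to the identity mapping: weight = I, bias = 0
extra = nn.Linear(4, 4)
nn.init.eye_(extra.weight)
nn.init.zeros_(extra.bias)
deeper = nn.Sequential(shallow, extra)

X = torch.randn(2, 4)
# The deeper model initially computes exactly the same function as the shallow one
print(torch.allclose(shallow(X), deeper(X)))  # True
```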
This is the question that He et al. considered when working on very deep computer vision models :cite:`He.Zhang.Ren.ea.2016`.
At the heart of their proposed residual network (ResNet) is the idea that every additional layer should
more easily
contain the identity function as one of its elements.
These considerations are rather profound but they led to a surprisingly simple
solution, a residual block.
With it, ResNet won the ImageNet Large Scale Visual Recognition Challenge in 2015. The design had a profound influence on how to
build deep neural networks.
Let us focus on a local part of a neural network, as depicted in :numref:`fig_residual_block`. Denote the input by $\mathbf{x}$. We assume that the desired underlying mapping we want to obtain by learning is $f(\mathbf{x})$, to be used as the input to the activation function on the top. On the left of :numref:`fig_residual_block`, the portion within the dotted-line box must directly learn the mapping $f(\mathbf{x})$. On the right, the portion within the dotted-line box needs to learn the residual mapping $f(\mathbf{x}) - \mathbf{x}$, which is how the residual block derives its name. If the identity mapping $f(\mathbf{x}) = \mathbf{x}$ is the desired underlying mapping, the residual mapping is easier to learn: we only need to push the weights and biases of the upper weight layer (e.g., a fully-connected or convolutional layer) within the dotted-line box to zero. The right figure in :numref:`fig_residual_block` illustrates the residual block of ResNet, where the solid line carrying the layer input $\mathbf{x}$ to the addition operator is called a *residual connection* (or *shortcut connection*). With residual blocks, inputs can forward propagate faster through the residual connections across layers.
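As a quick sanity check, the following toy sketch (our own illustration with hypothetical layer sizes; a plain two-layer MLP stands in for the convolutional branch) shows that pushing the parameters of the upper weight layer in the residual branch to zero makes the whole block $\mathbf{x} + g(\mathbf{x})$ compute the identity mapping exactly:

```python
import torch
from torch import nn

class ToyResidual(nn.Module):
    """A toy residual block: x + g(x), where g is a small MLP."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, x):
        return x + self.fc2(torch.relu(self.fc1(x)))

blk = ToyResidual(4)
# Push the parameters of the upper weight layer to zero: g(x) becomes 0,
# so the block reduces to the identity mapping
nn.init.zeros_(blk.fc2.weight)
nn.init.zeros_(blk.fc2.bias)

x = torch.randn(2, 4)
print(torch.allclose(blk(x), x))  # True
```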
ResNet follows VGG's full $3\times 3$ convolutional layer design. The residual block has two $3\times 3$ convolutional layers with the same number of output channels. Each convolutional layer is followed by a batch normalization layer and a ReLU activation function. Then, we skip these two convolution operations and add the input directly before the final ReLU activation function. This kind of design requires that the output of the two convolutional layers has to be of the same shape as the input, so that they can be added together. If we want to change the number of channels, we need to introduce an additional $1\times 1$ convolutional layer to transform the input into the desired shape for the addition operation. Let us have a look at the code below.
from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.gluon import nn
npx.set_np()
class Residual(nn.Block):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1, **kwargs):
        super().__init__(**kwargs)
        # Two 3x3 convolutional layers with the same number of output channels
        self.conv1 = nn.Conv2D(num_channels, kernel_size=3, padding=1,
                               strides=strides)
        self.conv2 = nn.Conv2D(num_channels, kernel_size=3, padding=1)
        # Optional 1x1 convolution to transform the input on the shortcut branch
        if use_1x1conv:
            self.conv3 = nn.Conv2D(num_channels, kernel_size=1,
                                   strides=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm()
        self.bn2 = nn.BatchNorm()

    def forward(self, X):
        Y = npx.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        # Add the (possibly transformed) input before the final ReLU
        return npx.relu(Y + X)
#@tab pytorch
from d2l import torch as d2l
import torch
from torch import nn
from torch.nn import functional as F
class Residual(nn.Module):  #@save
    """The Residual block of ResNet."""
    def __init__(self, input_channels, num_channels,
                 use_1x1conv=False, strides=1):
        super().__init__()
        # Two 3x3 convolutional layers with the same number of output channels
        self.conv1 = nn.Conv2d(input_channels, num_channels,
                               kernel_size=3, padding=1, stride=strides)
        self.conv2 = nn.Conv2d(num_channels, num_channels,
                               kernel_size=3, padding=1)
        # Optional 1x1 convolution to transform the input on the shortcut branch
        if use_1x1conv:
            self.conv3 = nn.Conv2d(input_channels, num_channels,
                                   kernel_size=1, stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.BatchNorm2d(num_channels)
        self.bn2 = nn.BatchNorm2d(num_channels)

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        # Add the (possibly transformed) input before the final ReLU
        Y += X
        return F.relu(Y)
#@tab tensorflow
from d2l import tensorflow as d2l
import tensorflow as tf
class Residual(tf.keras.Model):  #@save
    """The Residual block of ResNet."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        # Two 3x3 convolutional layers with the same number of output channels
        self.conv1 = tf.keras.layers.Conv2D(
            num_channels, padding='same', kernel_size=3, strides=strides)
        self.conv2 = tf.keras.layers.Conv2D(
            num_channels, kernel_size=3, padding='same')
        # Optional 1x1 convolution to transform the input on the shortcut branch
        self.conv3 = None
        if use_1x1conv:
            self.conv3 = tf.keras.layers.Conv2D(
                num_channels, kernel_size=1, strides=strides)
        self.bn1 = tf.keras.layers.BatchNormalization()
        self.bn2 = tf.keras.layers.BatchNormalization()

    def call(self, X):
        Y = tf.keras.activations.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3 is not None:
            X = self.conv3(X)
        # Add the (possibly transformed) input before the final ReLU
        Y += X
        return tf.keras.activations.relu(Y)
This code generates two types of networks: one where we add the input to the output before applying the ReLU nonlinearity whenever `use_1x1conv=False`, and one where we adjust channels and resolution by means of a $1 \times 1$ convolution before adding. :numref:`fig_resnet_block` illustrates this.
Now let us look at a situation where the input and output are of the same shape.
blk = Residual(3)
blk.initialize()
X = np.random.uniform(size=(4, 3, 6, 6))
blk(X).shape
#@tab pytorch
blk = Residual(3,3)
X = torch.rand(4, 3, 6, 6)
Y = blk(X)
Y.shape
#@tab tensorflow
blk = Residual(3)
X = tf.random.uniform((4, 6, 6, 3))
Y = blk(X)
Y.shape
We also have the option to halve the output height and width while increasing the number of output channels.
blk = Residual(6, use_1x1conv=True, strides=2)
blk.initialize()
blk(X).shape
#@tab pytorch
blk = Residual(3,6, use_1x1conv=True, strides=2)
blk(X).shape
#@tab tensorflow
blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape
The first two layers of ResNet are the same as those of the GoogLeNet we described before: the $7\times 7$ convolutional layer with 64 output channels and a stride of 2 is followed by the $3\times 3$ maximum pooling layer with a stride of 2. The difference is the batch normalization layer added after each convolutional layer in ResNet.
net = nn.Sequential()
net.add(nn.Conv2D(64, kernel_size=7, strides=2, padding=3),
nn.BatchNorm(), nn.Activation('relu'),
nn.MaxPool2D(pool_size=3, strides=2, padding=1))
#@tab pytorch
b1 = nn.Sequential(nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3),
nn.BatchNorm2d(64), nn.ReLU(),
nn.MaxPool2d(kernel_size=3, stride=2, padding=1))
#@tab tensorflow
b1 = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(64, kernel_size=7, strides=2, padding='same'),
tf.keras.layers.BatchNormalization(),
tf.keras.layers.Activation('relu'),
tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding='same')])
GoogLeNet uses four modules made up of Inception blocks. However, ResNet uses four modules made up of residual blocks, each of which uses several residual blocks with the same number of output channels. The number of channels in the first module is the same as the number of input channels. Since a maximum pooling layer with a stride of 2 has already been used, it is not necessary to reduce the height and width. In the first residual block for each of the subsequent modules, the number of channels is doubled compared with that of the previous module, and the height and width are halved.
Now, we implement this module. Note that special processing has been performed on the first module.
def resnet_block(num_channels, num_residuals, first_block=False):
    blk = nn.Sequential()
    for i in range(num_residuals):
        if i == 0 and not first_block:
            # The first block of every module (except the first) halves the
            # height and width and doubles the number of channels
            blk.add(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.add(Residual(num_channels))
    return blk
#@tab pytorch
def resnet_block(input_channels, num_channels, num_residuals,
                 first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            # The first block of every module (except the first) halves the
            # height and width and doubles the number of channels
            blk.append(Residual(input_channels, num_channels,
                                use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels, num_channels))
    return blk
#@tab tensorflow
class ResnetBlock(tf.keras.layers.Layer):
    def __init__(self, num_channels, num_residuals, first_block=False,
                 **kwargs):
        super(ResnetBlock, self).__init__(**kwargs)
        self.residual_layers = []
        for i in range(num_residuals):
            if i == 0 and not first_block:
                # The first block of every module (except the first) halves
                # the height and width and doubles the number of channels
                self.residual_layers.append(
                    Residual(num_channels, use_1x1conv=True, strides=2))
            else:
                self.residual_layers.append(Residual(num_channels))

    def call(self, X):
        for layer in self.residual_layers.layers:
            X = layer(X)
        return X
Then, we add all the modules to ResNet. Here, two residual blocks are used for each module.
net.add(resnet_block(64, 2, first_block=True),
resnet_block(128, 2),
resnet_block(256, 2),
resnet_block(512, 2))
#@tab pytorch
b2 = nn.Sequential(*resnet_block(64, 64, 2, first_block=True))
b3 = nn.Sequential(*resnet_block(64, 128, 2))
b4 = nn.Sequential(*resnet_block(128, 256, 2))
b5 = nn.Sequential(*resnet_block(256, 512, 2))
#@tab tensorflow
b2 = ResnetBlock(64, 2, first_block=True)
b3 = ResnetBlock(128, 2)
b4 = ResnetBlock(256, 2)
b5 = ResnetBlock(512, 2)
Finally, just like GoogLeNet, we add a global average pooling layer, followed by the fully-connected layer output.
net.add(nn.GlobalAvgPool2D(), nn.Dense(10))
#@tab pytorch
net = nn.Sequential(b1, b2, b3, b4, b5,
nn.AdaptiveAvgPool2d((1,1)),
nn.Flatten(), nn.Linear(512, 10))
#@tab tensorflow
# Recall that we define this as a function so we can reuse it later and run it
# within `tf.distribute.MirroredStrategy`'s scope to utilize various
# computational resources, e.g., GPUs. Also note that even though we have
# created b1, b2, b3, b4, and b5 above, we will recreate them inside this
# function's scope instead
def net():
    return tf.keras.Sequential([
        # The following layers are the same as b1 that we created earlier
        tf.keras.layers.Conv2D(64, kernel_size=7, strides=2, padding='same'),
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation('relu'),
        tf.keras.layers.MaxPool2D(pool_size=3, strides=2, padding='same'),
        # The following layers are the same as b2, b3, b4, and b5 that we
        # created earlier
        ResnetBlock(64, 2, first_block=True),
        ResnetBlock(128, 2),
        ResnetBlock(256, 2),
        ResnetBlock(512, 2),
        tf.keras.layers.GlobalAvgPool2D(),
        tf.keras.layers.Dense(units=10)])
There are 4 convolutional layers in each module (excluding the $1\times 1$ convolutional layers on the shortcuts). Together with the first $7\times 7$ convolutional layer and the final fully-connected layer, there are 18 layers in total. Therefore, this model is commonly known as ResNet-18. By configuring different numbers of channels and residual blocks in the modules, we can create different ResNet models, such as the deeper 152-layer ResNet-152. Although the main architecture of ResNet is similar to that of GoogLeNet, ResNet's structure is simpler and easier to modify. All these factors have resulted in the rapid and widespread use of ResNet. :numref:`fig_resnet18` depicts the full ResNet-18.
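As a quick tally (counting only the convolutional and fully-connected layers, as is conventional for the name):

$$\underbrace{1}_{7\times 7\ \text{conv}} \;+\; \underbrace{4 \times 2 \times 2}_{\text{4 modules} \,\times\, \text{2 residual blocks} \,\times\, \text{2 convs each}} \;+\; \underbrace{1}_{\text{fully-connected}} \;=\; 18.$$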
Before training ResNet, let us observe how the input shape changes across different modules in ResNet. As in all the previous architectures, the resolution decreases while the number of channels increases up until the point where a global average pooling layer aggregates all features.
X = np.random.uniform(size=(1, 1, 224, 224))
net.initialize()
for layer in net:
    X = layer(X)
    print(layer.name, 'output shape:\t', X.shape)
#@tab pytorch
X = torch.rand(size=(1, 1, 224, 224))
for layer in net:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)
#@tab tensorflow
X = tf.random.uniform(shape=(1, 224, 224, 1))
for layer in net().layers:
    X = layer(X)
    print(layer.__class__.__name__, 'output shape:\t', X.shape)
We train ResNet on the Fashion-MNIST dataset, just like before.
#@tab all
lr, num_epochs, batch_size = 0.05, 10, 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, resize=96)
d2l.train_ch6(net, train_iter, test_iter, num_epochs, lr)
- Nested function classes are desirable. Learning an additional layer in deep neural networks as an identity function (though this is an extreme case) should be made easy.
- The residual mapping can learn the identity function more easily, such as pushing parameters in the weight layer to zero.
- We can train an effective deep neural network by having residual blocks. Inputs can forward propagate faster through the residual connections across layers (see the short calculation after this summary).
- ResNet had a major influence on the design of subsequent deep neural networks, both convolutional and sequential in nature.
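The propagation claim above can be made concrete with a one-line calculation. Writing the core of a residual block (ignoring the final activation) as $\mathbf{y} = \mathbf{x} + g(\mathbf{x})$, the chain rule gives

$$\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{I} + \frac{\partial g(\mathbf{x})}{\partial \mathbf{x}},$$

so both the input in the forward pass and the gradient in the backward pass have a direct path through the identity term, even when the learned branch $g$ contributes little.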
- What are the major differences between the Inception block in :numref:`fig_inception` and the residual block? After removing some paths in the Inception block, how are they related to each other?
- Refer to Table 1 in the ResNet paper :cite:`He.Zhang.Ren.ea.2016` to implement different variants.
- For deeper networks, ResNet introduces a "bottleneck" architecture to reduce model complexity. Try to implement it.
- In subsequent versions of ResNet, the authors changed the "convolution, batch normalization, and activation" structure to the "batch normalization, activation, and convolution" structure. Make this improvement yourself. See Figure 1 in :cite:`He.Zhang.Ren.ea.2016*1` for details.
- Why can't we just increase the complexity of functions without bound, even if the function classes are nested?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: