From 5277c3d1e6b9122554f9156e578d4c04ba5014f5 Mon Sep 17 00:00:00 2001
From: Aston Zhang
Date: Mon, 14 Jun 2021 04:46:22 +0000
Subject: [PATCH] add fcn

---
 chapter_computer-vision/fcn.md             | 315 ++++++++++++
 chapter_computer-vision/fcn_origin.md      | 450 ++++++++++++++++++
 .../semantic-segmentation-and-dataset.md   |   6 +-
 chapter_computer-vision/transposed-conv.md |  20 +-
 .../transposed-conv_origin.md              |   4 +-
 5 files changed, 780 insertions(+), 15 deletions(-)
 create mode 100644 chapter_computer-vision/fcn.md
 create mode 100644 chapter_computer-vision/fcn_origin.md

diff --git a/chapter_computer-vision/fcn.md b/chapter_computer-vision/fcn.md
new file mode 100644
index 000000000..1bcc8eda3
--- /dev/null
+++ b/chapter_computer-vision/fcn.md
@@ -0,0 +1,315 @@
# 完全卷积网络
:label:`sec_fcn`

正如 :numref:`sec_semantic_segmentation` 中所讨论的，语义分割是对图像中的每个像素进行分类。完全卷积网络（fully convolutional network，FCN）使用卷积神经网络将图像像素变换为像素类别 :cite:`Long.Shelhamer.Darrell.2015`。与我们之前在图像分类或目标检测中遇到的 CNN 不同，完全卷积网络会把中间特征图的高度和宽度变换回输入图像的高度和宽度：这是通过 :numref:`sec_transposed_conv` 中引入的转置卷积层实现的。因此，分类输出与输入图像在像素级别上一一对应：任一输出像素处的通道维度保存着同一空间位置的输入像素的分类结果。

```{.python .input}
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import gluon, image, init, np, npx
from mxnet.gluon import nn

npx.set_np()
```

```{.python .input}
#@tab pytorch
%matplotlib inline
from d2l import torch as d2l
import torch
import torchvision
from torch import nn
from torch.nn import functional as F
```

## 模型

下面我们描述完全卷积网络模型的基本设计。如 :numref:`fig_fcn` 所示，该模型首先使用 CNN 提取图像特征，然后通过 $1\times 1$ 卷积层将通道数变换为类别数，最后通过 :numref:`sec_transposed_conv` 中引入的转置卷积将特征图的高度和宽度变换为输入图像的高度和宽度。因此，模型输出的高度和宽度与输入图像相同，且输出通道包含了同一空间位置的输入像素的预测类别。

![Fully convolutional network.](../img/fcn.svg)
:label:`fig_fcn`

下面，我们 [**使用在 ImageNet 数据集上预训练的 ResNet-18 模型来提取图像特征**]，并将该模型实例记为 `pretrained_net`。该模型的最后几层包括全局平均汇聚层和全连接层：完全卷积网络中不需要它们。

```{.python .input}
pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
pretrained_net.features[-3:], pretrained_net.output
```

```{.python .input}
#@tab pytorch
pretrained_net = torchvision.models.resnet18(pretrained=True)
list(pretrained_net.children())[-3:]
```

接下来，我们 [**创建完全卷积网络实例 `net`**]。它复制了 ResNet-18 中的所有预训练层，但除去最后的全局平均汇聚层和最接近输出的全连接层。

```{.python .input}
net = nn.HybridSequential()
for layer in pretrained_net.features[:-2]:
    net.add(layer)
```

```{.python .input}
#@tab pytorch
net = nn.Sequential(*list(pretrained_net.children())[:-2])
```

给定高度和宽度分别为 320 和 480 的输入，`net` 的正向传播会将输入的高度和宽度缩小到原来的 1/32，即 10 和 15。

```{.python .input}
X = np.random.uniform(size=(1, 3, 320, 480))
net(X).shape
```

```{.python .input}
#@tab pytorch
X = torch.rand(size=(1, 3, 320, 480))
net(X).shape
```

接下来，我们 [**使用 $1\times 1$ 卷积层将输出通道数变换为 Pascal VOC2012 数据集的类别数（21 类）。**] 最后，我们需要（**将特征图的高度和宽度增加 32 倍**），把它们变回输入图像的高度和宽度。回想一下 :numref:`sec_padding` 中卷积层输出形状的计算方法。由于 $(320-64+16\times2+32)/32=10$ 且 $(480-64+16\times2+32)/32=15$，我们构造一个步幅为 $32$ 的转置卷积层，并将卷积核的高度和宽度设为 $64$、填充设为 $16$。一般来说，对于步幅 $s$、填充 $s/2$（假设 $s/2$ 是整数）以及高度和宽度均为 $2s$ 的卷积核，转置卷积会将输入的高度和宽度放大 $s$ 倍。

```{.python .input}
num_classes = 21
net.add(nn.Conv2D(num_classes, kernel_size=1),
        nn.Conv2DTranspose(
            num_classes, kernel_size=64, padding=16, strides=32))
```

```{.python .input}
#@tab pytorch
num_classes = 21
net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1))
net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes,
                                    kernel_size=64, padding=16, stride=32))
```
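下面给出一个简单的形状检验草图（并非书中原有代码，这里假设使用 PyTorch 实现）：按照上面的规则（步幅 $s=32$、填充 $s/2=16$、卷积核大小 $2s=64$），转置卷积层确实能把 $10\times 15$ 的特征图放大为 $320\times 480$。

```{.python .input}
#@tab pytorch
# 示意性检验（非书中原有代码）：转置卷积的输出尺寸为 (n - 1) * s - 2 * p + k
check_layer = nn.ConvTranspose2d(num_classes, num_classes,
                                 kernel_size=64, padding=16, stride=32)
print(check_layer(torch.rand(1, num_classes, 10, 15)).shape)
# 预期输出：torch.Size([1, 21, 320, 480])
```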
## [**初始化转置卷积层**]

我们已经知道，转置卷积层可以增加特征图的高度和宽度。在图像处理中，我们有时需要把图像放大，即*上采样*（upsampling）。*双线性插值*（bilinear interpolation）是常用的上采样技术之一，也经常用于初始化转置卷积层。

为了解释双线性插值，假设给定输入图像，我们想要计算上采样输出图像的每个像素。例如，为了计算输出图像在坐标 $(x, y)$ 处的像素，先将 $(x, y)$ 映射到输入图像的坐标 $(x', y')$ 上，例如按照输入尺寸与输出尺寸之比来映射。注意，映射后的 $x'$ 和 $y'$ 是实数。然后，在输入图像上找到离坐标 $(x', y')$ 最近的 4 个像素。最后，输出图像在坐标 $(x, y)$ 处的像素，依据输入图像上这 4 个最近的像素以及它们与 $(x', y')$ 的相对距离来计算。

双线性插值的上采样可以通过转置卷积层实现，其卷积核由下面的 `bilinear_kernel` 函数构造。限于篇幅，我们只给出 `bilinear_kernel` 函数的实现，不讨论其算法设计。

```{.python .input}
def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (np.arange(kernel_size).reshape(-1, 1),
          np.arange(kernel_size).reshape(1, -1))
    filt = (1 - np.abs(og[0] - center) / factor) * \
           (1 - np.abs(og[1] - center) / factor)
    weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return np.array(weight)
```

```{.python .input}
#@tab pytorch
def bilinear_kernel(in_channels, out_channels, kernel_size):
    factor = (kernel_size + 1) // 2
    if kernel_size % 2 == 1:
        center = factor - 1
    else:
        center = factor - 0.5
    og = (torch.arange(kernel_size).reshape(-1, 1),
          torch.arange(kernel_size).reshape(1, -1))
    filt = (1 - torch.abs(og[0] - center) / factor) * \
           (1 - torch.abs(og[1] - center) / factor)
    weight = torch.zeros((in_channels, out_channels,
                          kernel_size, kernel_size))
    weight[range(in_channels), range(out_channels), :, :] = filt
    return weight
```

让我们 [**用转置卷积层试验一下双线性插值的上采样**]。我们构造一个将输入的高度和宽度加倍的转置卷积层，并用 `bilinear_kernel` 函数初始化它的卷积核。

```{.python .input}
conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2)
conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4)))
```

```{.python .input}
#@tab pytorch
conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2,
                                bias=False)
conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4));
```

读取图像 `X`，并把上采样的输出记为 `Y`。为了打印图像，我们需要调整通道维度的位置。

```{.python .input}
img = image.imread('../img/catdog.jpg')
X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0) / 255
Y = conv_trans(X)
out_img = Y[0].transpose(1, 2, 0)
```

```{.python .input}
#@tab pytorch
img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg'))
X = img.unsqueeze(0)
Y = conv_trans(X)
out_img = Y[0].permute(1, 2, 0).detach()
```

可以看到，转置卷积层将图像的高度和宽度分别放大了 2 倍。除了坐标刻度不同，双线性插值放大后的图像与 :numref:`sec_bbox` 中打印的原始图像看起来没有区别。

```{.python .input}
d2l.set_figsize()
print('input image shape:', img.shape)
d2l.plt.imshow(img.asnumpy());
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img.asnumpy());
```

```{.python .input}
#@tab pytorch
d2l.set_figsize()
print('input image shape:', img.permute(1, 2, 0).shape)
d2l.plt.imshow(img.permute(1, 2, 0));
print('output image shape:', out_img.shape)
d2l.plt.imshow(out_img);
```

在完全卷积网络中，我们 [**用双线性插值的上采样来初始化转置卷积层。对于 $1\times 1$ 卷积层，我们使用 Xavier 初始化。**]

```{.python .input}
W = bilinear_kernel(num_classes, num_classes, 64)
net[-1].initialize(init.Constant(W))
net[-2].initialize(init=init.Xavier())
```

```{.python .input}
#@tab pytorch
W = bilinear_kernel(num_classes, num_classes, 64)
net.transpose_conv.weight.data.copy_(W);
```

## [**读取数据集**]

我们读取 :numref:`sec_semantic_segmentation` 中介绍的语义分割数据集。随机裁剪的输出图像形状指定为 $320\times 480$：高度和宽度都可以被 $32$ 整除。

```{.python .input}
#@tab all
batch_size, crop_size = 32, (320, 480)
train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size)
```
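作为一个示意性的检查（并非书中原有代码，这里假设 `train_iter` 的行为与 :numref:`sec_semantic_segmentation` 中一致），可以打印一个小批量的形状：特征是形状为（批量大小, 3, 320, 480）的图像，标签是形状为（批量大小, 320, 480）的逐像素类别索引。

```{.python .input}
#@tab pytorch
# 示意性检查（非书中原有代码）：查看一个小批量中特征和标签的形状
for features, labels in train_iter:
    print(features.shape)  # 预期为 torch.Size([32, 3, 320, 480])
    print(labels.shape)    # 预期为 torch.Size([32, 320, 480])
    break
```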
## [**训练**]

现在我们可以训练构建好的完全卷积网络了。这里的损失函数和精度计算与前面章节图像分类中的并没有本质区别。由于我们使用转置卷积层的输出通道来预测每个像素的类别，因此在损失计算中指定了通道这一维度。此外，精度是根据所有像素预测类别的正确与否来计算的。

```{.python .input}
num_epochs, lr, wd, devices = 5, 0.1, 1e-3, d2l.try_all_gpus()
loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1)
net.collect_params().reset_ctx(devices)
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': lr, 'wd': wd})
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
```

```{.python .input}
#@tab pytorch
def loss(inputs, targets):
    return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1)

num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus()
trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd)
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
```

## [**预测**]

预测时，我们需要对输入图像的各个通道做标准化，并把图像转换为 CNN 所需的四维输入格式。

```{.python .input}
def predict(img):
    X = test_iter._dataset.normalize_image(img)
    X = np.expand_dims(X.transpose(2, 0, 1), axis=0)
    pred = net(X.as_in_ctx(devices[0])).argmax(axis=1)
    return pred.reshape(pred.shape[1], pred.shape[2])
```

```{.python .input}
#@tab pytorch
def predict(img):
    X = test_iter.dataset.normalize_image(img).unsqueeze(0)
    pred = net(X.to(devices[0])).argmax(dim=1)
    return pred.reshape(pred.shape[1], pred.shape[2])
```

为了 [**可视化每个像素的预测类别**]，我们将预测类别映射回它们在数据集中标注的颜色。

```{.python .input}
def label2image(pred):
    colormap = np.array(d2l.VOC_COLORMAP, ctx=devices[0], dtype='uint8')
    X = pred.astype('int32')
    return colormap[X, :]
```

```{.python .input}
#@tab pytorch
def label2image(pred):
    colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0])
    X = pred.long()
    return colormap[X, :]
```

测试数据集中图像的大小和形状各不相同。由于模型使用了步幅为 32 的转置卷积层，当输入图像的高度或宽度不能被 32 整除时，转置卷积层输出的高度或宽度会与输入图像的形状有偏差。为了解决这个问题，我们可以在图像中裁剪出多个高度和宽度均为 32 的整数倍的矩形区域，并分别对这些区域中的像素做正向传播。注意，这些矩形区域的并集需要完整覆盖输入图像。当一个像素被多个矩形区域覆盖时，可以把它在不同区域中的转置卷积输出取平均后再输入 softmax 运算，从而预测类别。

为简单起见，我们只读取几张较大的测试图像，并从图像的左上角开始裁剪一个 $320\times480$ 的区域用于预测。对于这些测试图像，我们逐行打印它们的裁剪区域、预测结果和真实标注。

```{.python .input}
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 480, 320)
    X = image.fixed_crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);
```

```{.python .input}
#@tab pytorch
voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012')
test_images, test_labels = d2l.read_voc_images(voc_dir, False)
n, imgs = 4, []
for i in range(n):
    crop_rect = (0, 0, 320, 480)
    X = torchvision.transforms.functional.crop(test_images[i], *crop_rect)
    pred = label2image(predict(X))
    imgs += [X.permute(1,2,0), pred.cpu(),
             torchvision.transforms.functional.crop(
                 test_labels[i], *crop_rect).permute(1,2,0)]
d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2);
```

## 摘要

* 完全卷积网络先使用 CNN 提取图像特征，然后通过 $1\times 1$ 卷积层将通道数变换为类别数，最后通过转置卷积将特征图的高度和宽度变换为输入图像的高度和宽度。
* 在完全卷积网络中，我们可以用双线性插值的上采样来初始化转置卷积层。

## 练习

1. 如果在实验中对转置卷积层使用 Xavier 初始化，结果会有什么变化？
1. 你能通过调节超参数来进一步提升模型的精度吗？
1. 预测测试图像中所有像素的类别。
1. 
最初的完全卷积网络论文还使用 CNN 中间层 :cite:`Long.Shelhamer.Darrell.2015` 的输出。尝试实施这个想法。 + +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/377) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1582) +:end_tab: diff --git a/chapter_computer-vision/fcn_origin.md b/chapter_computer-vision/fcn_origin.md new file mode 100644 index 000000000..f10843d01 --- /dev/null +++ b/chapter_computer-vision/fcn_origin.md @@ -0,0 +1,450 @@ +# Fully Convolutional Networks +:label:`sec_fcn` + +As discussed in :numref:`sec_semantic_segmentation`, +semantic segmentation +classifies images in pixel level. +A fully convolutional network (FCN) +uses a convolutional neural network to +transform image pixels to pixel classes :cite:`Long.Shelhamer.Darrell.2015`. +Unlike the CNNs that we encountered earlier +for image classification +or object detection, +a fully convolutional network +transforms +the height and width of intermediate feature maps +back to those of the input image: +this is achieved by +the transposed convolutional layer +introduced in :numref:`sec_transposed_conv`. +As a result, +the classification output +and the input image +have a one-to-one correspondence +in pixel level: +the channel dimension at any output pixel +holds the classification results +for the input pixel at the same spatial position. + +```{.python .input} +%matplotlib inline +from d2l import mxnet as d2l +from mxnet import gluon, image, init, np, npx +from mxnet.gluon import nn + +npx.set_np() +``` + +```{.python .input} +#@tab pytorch +%matplotlib inline +from d2l import torch as d2l +import torch +import torchvision +from torch import nn +from torch.nn import functional as F +``` + +## The Model + +Here we describe the basic design of the fully convolutional network model. +As shown in :numref:`fig_fcn`, +this model first uses a CNN to extract image features, +then transforms the number of channels into +the number of classes +via a $1\times 1$ convolutional layer, +and finally transforms the height and width of +the feature maps +to those +of the input image via +the transposed convolution introduced in :numref:`sec_transposed_conv`. +As a result, +the model output has the same height and width as the input image, +where the output channel contains the predicted classes +for the input pixel at the same spatial position. + + +![Fully convolutional network.](../img/fcn.svg) +:label:`fig_fcn` + +Below, we [**use a ResNet-18 model pretrained on the ImageNet dataset to extract image features**] +and denote the model instance as `pretrained_net`. +The last few layers of this model +include a global average pooling layer +and a fully-connected layer: +they are not needed +in the fully convolutional network. + +```{.python .input} +pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True) +pretrained_net.features[-3:], pretrained_net.output +``` + +```{.python .input} +#@tab pytorch +pretrained_net = torchvision.models.resnet18(pretrained=True) +list(pretrained_net.children())[-3:] +``` + +Next, we [**create the fully convolutional network instance `net`**]. +It copies all the pretrained layers in the ResNet-18 +except for the final global average pooling layer +and the fully-connected layer that are closest +to the output. 
+ +```{.python .input} +net = nn.HybridSequential() +for layer in pretrained_net.features[:-2]: + net.add(layer) +``` + +```{.python .input} +#@tab pytorch +net = nn.Sequential(*list(pretrained_net.children())[:-2]) +``` + +Given an input with height and width of 320 and 480 respectively, +the forward propagation of `net` +reduces the input height and width to 1/32 of the original, namely 10 and 15. + +```{.python .input} +X = np.random.uniform(size=(1, 3, 320, 480)) +net(X).shape +``` + +```{.python .input} +#@tab pytorch +X = torch.rand(size=(1, 3, 320, 480)) +net(X).shape +``` + +Next, we [**use a $1\times 1$ convolutional layer to transform the number of output channels into the number of classes (21) of the Pascal VOC2012 dataset.**] +Finally, we need to (**increase the height and width of the feature maps by 32 times**) to change them back to the height and width of the input image. +Recall how to calculate +the output shape of a convolutional layer in :numref:`sec_padding`. +Since $(320-64+16\times2+32)/32=10$ and $(480-64+16\times2+32)/32=15$, we construct a transposed convolutional layer with stride of $32$, +setting +the height and width of the kernel +to $64$, the padding to $16$. +In general, +we can see that +for stride $s$, +padding $s/2$ (assuming $s/2$ is an integer), +and the height and width of the kernel $2s$, +the transposed convolution will increase +the height and width of the input by $s$ times. + +```{.python .input} +num_classes = 21 +net.add(nn.Conv2D(num_classes, kernel_size=1), + nn.Conv2DTranspose( + num_classes, kernel_size=64, padding=16, strides=32)) +``` + +```{.python .input} +#@tab pytorch +num_classes = 21 +net.add_module('final_conv', nn.Conv2d(512, num_classes, kernel_size=1)) +net.add_module('transpose_conv', nn.ConvTranspose2d(num_classes, num_classes, + kernel_size=64, padding=16, stride=32)) +``` + +## [**Initializing Transposed Convolutional Layers**] + + +We already know that +transposed convolutional layers can increase +the height and width of +feature maps. +In image processing, we may need to scale up +an image, i.e., *upsampling*. +*Bilinear interpolation* +is one of the commonly used upsampling techniques. +It is also often used for initializing transposed convolutional layers. + +To explain bilinear interpolation, +say that +given an input image +we want to +calculate each pixel +of the upsampled output image. +In order to calculate the pixel of the output image +at coordinate $(x, y)$, +first map $(x, y)$ to coordinate $(x', y')$ on the input image, for example, according to the ratio of the input size to the output size. +Note that the mapped $x′$ and $y′$ are real numbers. +Then, find the four pixels closest to coordinate +$(x', y')$ on the input image. +Finally, the pixel of the output image at coordinate $(x, y)$ is calculated based on these four closest pixels +on the input image and their relative distance from $(x', y')$. + +Upsampling of bilinear interpolation +can be implemented by the transposed convolutional layer +with the kernel constructed by the following `bilinear_kernel` function. +Due to space limitations, we only provide the implementation of the `bilinear_kernel` function below +without discussions on its algorithm design. 
+ +```{.python .input} +def bilinear_kernel(in_channels, out_channels, kernel_size): + factor = (kernel_size + 1) // 2 + if kernel_size % 2 == 1: + center = factor - 1 + else: + center = factor - 0.5 + og = (np.arange(kernel_size).reshape(-1, 1), + np.arange(kernel_size).reshape(1, -1)) + filt = (1 - np.abs(og[0] - center) / factor) * \ + (1 - np.abs(og[1] - center) / factor) + weight = np.zeros((in_channels, out_channels, kernel_size, kernel_size)) + weight[range(in_channels), range(out_channels), :, :] = filt + return np.array(weight) +``` + +```{.python .input} +#@tab pytorch +def bilinear_kernel(in_channels, out_channels, kernel_size): + factor = (kernel_size + 1) // 2 + if kernel_size % 2 == 1: + center = factor - 1 + else: + center = factor - 0.5 + og = (torch.arange(kernel_size).reshape(-1, 1), + torch.arange(kernel_size).reshape(1, -1)) + filt = (1 - torch.abs(og[0] - center) / factor) * \ + (1 - torch.abs(og[1] - center) / factor) + weight = torch.zeros((in_channels, out_channels, + kernel_size, kernel_size)) + weight[range(in_channels), range(out_channels), :, :] = filt + return weight +``` + +Let us [**experiment with upsampling of bilinear interpolation**] +that is implemented by a transposed convolutional layer. +We construct a transposed convolutional layer that +doubles the height and weight, +and initialize its kernel with the `bilinear_kernel` function. + +```{.python .input} +conv_trans = nn.Conv2DTranspose(3, kernel_size=4, padding=1, strides=2) +conv_trans.initialize(init.Constant(bilinear_kernel(3, 3, 4))) +``` + +```{.python .input} +#@tab pytorch +conv_trans = nn.ConvTranspose2d(3, 3, kernel_size=4, padding=1, stride=2, + bias=False) +conv_trans.weight.data.copy_(bilinear_kernel(3, 3, 4)); +``` + +Read the image `X` and assign the upsampling output to `Y`. In order to print the image, we need to adjust the position of the channel dimension. + +```{.python .input} +img = image.imread('../img/catdog.jpg') +X = np.expand_dims(img.astype('float32').transpose(2, 0, 1), axis=0) / 255 +Y = conv_trans(X) +out_img = Y[0].transpose(1, 2, 0) +``` + +```{.python .input} +#@tab pytorch +img = torchvision.transforms.ToTensor()(d2l.Image.open('../img/catdog.jpg')) +X = img.unsqueeze(0) +Y = conv_trans(X) +out_img = Y[0].permute(1, 2, 0).detach() +``` + +As we can see, the transposed convolutional layer increases both the height and width of the image by a factor of two. +Except for the different scales in coordinates, +the image scaled up by bilinear interpolation and the original image printed in :numref:`sec_bbox` look the same. + +```{.python .input} +d2l.set_figsize() +print('input image shape:', img.shape) +d2l.plt.imshow(img.asnumpy()); +print('output image shape:', out_img.shape) +d2l.plt.imshow(out_img.asnumpy()); +``` + +```{.python .input} +#@tab pytorch +d2l.set_figsize() +print('input image shape:', img.permute(1, 2, 0).shape) +d2l.plt.imshow(img.permute(1, 2, 0)); +print('output image shape:', out_img.shape) +d2l.plt.imshow(out_img); +``` + +In a fully convolutional network, we [**initialize the transposed convolutional layer with upsampling of bilinear interpolation. 
For the $1\times 1$ convolutional layer, we use Xavier initialization.**] + +```{.python .input} +W = bilinear_kernel(num_classes, num_classes, 64) +net[-1].initialize(init.Constant(W)) +net[-2].initialize(init=init.Xavier()) +``` + +```{.python .input} +#@tab pytorch +W = bilinear_kernel(num_classes, num_classes, 64) +net.transpose_conv.weight.data.copy_(W); +``` + +## [**Reading the Dataset**] + +We read +the semantic segmentation dataset +as introduced in :numref:`sec_semantic_segmentation`. +The output image shape of random cropping is +specified as $320\times 480$: both the height and width are divisible by $32$. + +```{.python .input} +#@tab all +batch_size, crop_size = 32, (320, 480) +train_iter, test_iter = d2l.load_data_voc(batch_size, crop_size) +``` + +## [**Training**] + + +Now we can train our constructed +fully convolutional network. +The loss function and accuracy calculation here +are not essentially different from those in image classification of earlier chapters. +Because we use the output channel of the +transposed convolutional layer to +predict the class for each pixel, +the channel dimension is specified in the loss calculation. +In addition, the accuracy is calculated +based on correctness +of the predicted class for all the pixels. + +```{.python .input} +num_epochs, lr, wd, devices = 5, 0.1, 1e-3, d2l.try_all_gpus() +loss = gluon.loss.SoftmaxCrossEntropyLoss(axis=1) +net.collect_params().reset_ctx(devices) +trainer = gluon.Trainer(net.collect_params(), 'sgd', + {'learning_rate': lr, 'wd': wd}) +d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices) +``` + +```{.python .input} +#@tab pytorch +def loss(inputs, targets): + return F.cross_entropy(inputs, targets, reduction='none').mean(1).mean(1) + +num_epochs, lr, wd, devices = 5, 0.001, 1e-3, d2l.try_all_gpus() +trainer = torch.optim.SGD(net.parameters(), lr=lr, weight_decay=wd) +d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices) +``` + +## [**Prediction**] + + +When predicting, we need to standardize the input image +in each channel and transform the image into the four-dimensional input format required by the CNN. + + +```{.python .input} +def predict(img): + X = test_iter._dataset.normalize_image(img) + X = np.expand_dims(X.transpose(2, 0, 1), axis=0) + pred = net(X.as_in_ctx(devices[0])).argmax(axis=1) + return pred.reshape(pred.shape[1], pred.shape[2]) +``` + +```{.python .input} +#@tab pytorch +def predict(img): + X = test_iter.dataset.normalize_image(img).unsqueeze(0) + pred = net(X.to(devices[0])).argmax(dim=1) + return pred.reshape(pred.shape[1], pred.shape[2]) +``` + +To [**visualize the predicted class**] of each pixel, we map the predicted class back to its label color in the dataset. + +```{.python .input} +def label2image(pred): + colormap = np.array(d2l.VOC_COLORMAP, ctx=devices[0], dtype='uint8') + X = pred.astype('int32') + return colormap[X, :] +``` + +```{.python .input} +#@tab pytorch +def label2image(pred): + colormap = torch.tensor(d2l.VOC_COLORMAP, device=devices[0]) + X = pred.long() + return colormap[X, :] +``` + +Images in the test dataset vary in size and shape. +Since the model uses a transposed convolutional layer with stride of 32, +when the height or width of an input image is indivisible by 32, +the output height or width of the +transposed convolutional layer will deviate from the shape of the input image. 
+In order to address this issue, +we can crop multiple rectangular areas with height and width that are integer multiples of 32 in the image, +and perform forward propagation +on the pixels in these areas separately. +Note that +the union of these rectangular areas needs to completely cover the input image. +When a pixel is covered by multiple rectangular areas, +the average of the transposed convolution outputs +in separate areas for this same pixel +can be input to +the softmax operation +to predict the class. + + +For simplicity, we only read a few larger test images, +and crop a $320\times480$ area for prediction starting from the upper-left corner of an image. +For these test images, we +print their cropped areas, +prediction results, +and ground-truth row by row. + +```{.python .input} +voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012') +test_images, test_labels = d2l.read_voc_images(voc_dir, False) +n, imgs = 4, [] +for i in range(n): + crop_rect = (0, 0, 480, 320) + X = image.fixed_crop(test_images[i], *crop_rect) + pred = label2image(predict(X)) + imgs += [X, pred, image.fixed_crop(test_labels[i], *crop_rect)] +d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2); +``` + +```{.python .input} +#@tab pytorch +voc_dir = d2l.download_extract('voc2012', 'VOCdevkit/VOC2012') +test_images, test_labels = d2l.read_voc_images(voc_dir, False) +n, imgs = 4, [] +for i in range(n): + crop_rect = (0, 0, 320, 480) + X = torchvision.transforms.functional.crop(test_images[i], *crop_rect) + pred = label2image(predict(X)) + imgs += [X.permute(1,2,0), pred.cpu(), + torchvision.transforms.functional.crop( + test_labels[i], *crop_rect).permute(1,2,0)] +d2l.show_images(imgs[::3] + imgs[1::3] + imgs[2::3], 3, n, scale=2); +``` + +## Summary + +* The fully convolutional network first uses a CNN to extract image features, then transforms the number of channels into the number of classes via a $1\times 1$ convolutional layer, and finally transforms the height and width of the feature maps to those of the input image via the transposed convolution. +* In a fully convolutional network, we can use upsampling of bilinear interpolation to initialize the transposed convolutional layer. + + +## Exercises + +1. If we use Xavier initialization for the transposed convolutional layer in the experiment, how does the result change? +1. Can you further improve the accuracy of the model by tuning the hyperparameters? +1. Predict the classes of all pixels in test images. +1. The original fully convolutional network paper also uses outputs of some intermediate CNN layers :cite:`Long.Shelhamer.Darrell.2015`. Try to implement this idea. 
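As a rough, hedged starting point for the last exercise (a sketch under assumptions, not the architecture from the paper), one could fuse the 1/16-resolution feature map of the ResNet-18 backbone with the 1/32-resolution class scores before the final upsampling, in the spirit of FCN-16s; the layer names below are illustrative.

```{.python .input}
#@tab pytorch
# A hedged FCN-16s-style sketch (not the book's code): fuse stride-16 features
# with stride-32 class scores, then upsample by 16 back to the input size.
num_classes = 21
backbone = torchvision.models.resnet18(pretrained=True)
layers = list(backbone.children())[:-2]   # drop global average pooling and fc
stage1 = nn.Sequential(*layers[:-1])      # output stride 16, 256 channels
stage2 = layers[-1]                       # last residual stage, stride 32

score16 = nn.Conv2d(256, num_classes, kernel_size=1)
score32 = nn.Conv2d(512, num_classes, kernel_size=1)
up2 = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=4, padding=1,
                         stride=2, bias=False)    # 2x upsampling
up16 = nn.ConvTranspose2d(num_classes, num_classes, kernel_size=32, padding=8,
                          stride=16, bias=False)  # 16x upsampling

def fcn16s_forward(X):
    f16 = stage1(X)                        # (N, 256, H/16, W/16)
    f32 = stage2(f16)                      # (N, 512, H/32, W/32)
    fused = up2(score32(f32)) + score16(f16)
    return up16(fused)                     # (N, num_classes, H, W)

print(fcn16s_forward(torch.rand(1, 3, 320, 480)).shape)
# expected: torch.Size([1, 21, 320, 480])
```

Initializing `up2` and `up16` with the `bilinear_kernel` function above would be a natural refinement.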
+ +:begin_tab:`mxnet` +[Discussions](https://discuss.d2l.ai/t/377) +:end_tab: + +:begin_tab:`pytorch` +[Discussions](https://discuss.d2l.ai/t/1582) +:end_tab: diff --git a/chapter_computer-vision/semantic-segmentation-and-dataset.md b/chapter_computer-vision/semantic-segmentation-and-dataset.md index a10eb049e..8eabe4fb1 100644 --- a/chapter_computer-vision/semantic-segmentation-and-dataset.md +++ b/chapter_computer-vision/semantic-segmentation-and-dataset.md @@ -1,7 +1,7 @@ # 语义分割和数据集 :label:`sec_semantic_segmentation` -在 :numref:`sec_bbox`—:numref:`sec_rcnn` 中讨论物体检测任务时,矩形边界框用于标记和预测图像中的对象。本节将讨论 * 语义分割 * 的问题,重点介绍如何将图像划分为属于不同语义类的区域。与对象检测不同的是,语义分割可以识别并理解像素级别图像中的内容:其语义区域的标注和预测以像素级别为单位。:numref:`fig_segmentation` 显示语义分割中图像的狗、猫和背景的标签。与对象检测相比,在语义分段中标记的像素级边框显然更加细粒度。 +在 :numref:`sec_bbox`—:numref:`sec_rcnn` 中讨论物体检测任务时,矩形边界框用于标记和预测图像中的对象。本节将讨论 * 语义分割 * 的问题,重点介绍如何将图像划分为属于不同语义类的区域。与对象检测不同,语义分割可以识别并理解像素级别图像中的内容:其语义区域的标注和预测是以像素级别进行的。:numref:`fig_segmentation` 显示语义分割中图像的狗、猫和背景的标签。与对象检测相比,在语义分段中标记的像素级边框显然更加细粒度。 ![Labels of the dog, cat, and background of the image in semantic segmentation.](../img/segmentation.svg) :label:`fig_segmentation` @@ -123,7 +123,7 @@ VOC_CLASSES = ['background', 'aeroplane', 'bicycle', 'bird', 'boat', 'potted plant', 'sheep', 'sofa', 'train', 'tv/monitor'] ``` -通过上面定义的两个常量,我们可以方便地 [** 查找标签中每个像素的类索引 **]。我们定义了 `voc_colormap2label` 函数来构建从上述 RGB 颜色值到类别索引的映射,而 `voc_label_indices` 函数将任何 RGB 值映射到此 Pascal VOC2012 数据集中的类索引。 +通过上面定义的两个常量,我们可以方便地 [** 查找标签中每个像素的类索引 **]。我们定义了 `voc_colormap2label` 函数来构建从上述 RGB 颜色值到类索引的映射,而 `voc_label_indices` 函数将任何 RGB 值映射到此 Pascal VOC2012 数据集中的类索引。 ```{.python .input} #@save @@ -216,7 +216,7 @@ d2l.show_images(imgs[::2] + imgs[1::2], 2, n); ### [** 自定义语义分段数据集类 **] -我们通过继承高级 API 提供的 `Dataset` 类来定义自定义语义分割数据集类 `VOCSegDataset`。通过实现 `__getitem__` 函数,我们可以任意访问数据集中索引为 `idx` 的输入图像以及该图像中每个像素的类索引。由于数据集中的某些图像的尺寸小于随机裁剪的输出大小,因此这些示例将通过自定义 `filter` 函数过滤掉。此外,我们还定义了 `normalize_image` 函数来标准化输入图像的三个 RGB 通道的值。 +我们通过继承高级 API 提供的 `Dataset` 类来定义自定义语义分割数据集类 `VOCSegDataset`。通过实现 `__getitem__` 函数,我们可以任意访问数据集中索引为 `idx` 的输入图像以及该图像中每个像素的类索引。由于数据集中的某些图像的大小小小于随机裁剪的输出大小,因此通过自定义 `filter` 函数过滤掉这些示例。此外,我们还定义了 `normalize_image` 函数来标准化输入图像的三个 RGB 通道的值。 ```{.python .input} #@save diff --git a/chapter_computer-vision/transposed-conv.md b/chapter_computer-vision/transposed-conv.md index 4ebc837a2..598140658 100644 --- a/chapter_computer-vision/transposed-conv.md +++ b/chapter_computer-vision/transposed-conv.md @@ -1,7 +1,7 @@ # 转置卷积 :label:`sec_transposed_conv` -到目前为止,我们看到的 CNN 图层,例如卷积图层 (:numref:`sec_conv_layer`) 和汇集图层 (:numref:`sec_pooling`),通常会减少(向下采样)输入的空间维度(高度和宽度),或者保持它们不变。在按像素级进行分类的语义分段中,如果输入和输出的空间维度相同,将很方便。例如,一个输出像素处的通道维度可以在同一空间位置保存输入像素的分类结果。 +到目前为止,我们看到的 CNN 图层,例如卷积图层 (:numref:`sec_conv_layer`) 和合并图层 (:numref:`sec_pooling`),通常会减少(向下采样)输入的空间维度(高度和宽度),或者保持它们不变。在按像素级进行分类的语义分段中,如果输入和输出的空间维度相同,将很方便。例如,一个输出像素处的通道维度可以在同一空间位置保存输入像素的分类结果。 为了实现这一点,特别是在空间维度被 CNN 图层减小后,我们可以使用另一种类型的 CNN 图层,这种类型可以增加(上采样)中间要素地图的空间维度。在本节中,我们将介绍 *转置卷积 *,也称为 * 分数步长卷积 * :cite:`Dumoulin.Visin.2016`, @@ -24,7 +24,7 @@ from d2l import torch as d2l ## 基本操作 -现在忽略渠道,让我们从基本的转置卷积操作开始,步幅为 1 且没有填充。假设我们得到了一个 $n_h \times n_w$ 输入张量和一个 $k_h \times k_w$ 内核。滑动内核窗口的步幅为 1,每行 $n_w$ 次,每列 $n_h$ 次,共产生 $n_h n_w$ 个中间结果。每个中间结果都是一个 $(n_h + k_h - 1) \times (n_w + k_w - 1)$ 张量,初始化为零。为了计算每个中间张量,输入张量中的每个元素都乘以内核,以便产生的 $k_h \times k_w$ 张量替换每个中间张量中的一个部分。请注意,每个中间张量中被替换部分的位置对应于用于计算的输入张量中元素的位置。最后,对所有中间结果进行总结以产生产出结果。 +现在忽略渠道,让我们从基本的转置卷积操作开始,步幅为 1 且没有填充。假设我们获得了一个 $n_h \times n_w$ 输入张量和一个 $k_h \times k_w$ 内核。滑动内核窗口的步幅为 1,每行 $n_w$ 次,每列 $n_h$ 次,共产生 $n_h n_w$ 
个中间结果。每个中间结果都是一个 $(n_h + k_h - 1) \times (n_w + k_w - 1)$ 张量,初始化为零。为了计算每个中间张量,输入张量中的每个元素都乘以内核,以便产生的 $k_h \times k_w$ 张量替换每个中间张量中的一部分。请注意,每个中间张量中被替换部分的位置对应于用于计算的输入张量中元素的位置。最后,对所有中间结果进行总结以产生产出结果。 例如,:numref:`fig_trans_conv` 说明了如何为 $2\times 2$ 输入张量计算 $2\times 2$ 内核的转置卷积。 @@ -46,7 +46,7 @@ def trans_conv(X, K): 与通过内核减少 * 输入元素的常规卷积(在 :numref:`sec_conv_layer` 中)相比,转置的卷积 *广播 * 输入元素 -通过内核,从而产生大于输入的输出。我们可以构造基本二维转置卷积运算的输入张量 `X` 和从 :numref:`fig_trans_conv` 到 [** 验证上述实现的输出 **] 的内核张量 `X` 和内核张量 `K`。 +通过内核,从而产生大于输入的输出。我们可以构造基本二维转置卷积运算的输入张量 `X` 和内核张量 `K` 从 :numref:`fig_trans_conv` 到 [** 验证上述实现的输出 **]。 ```{.python .input} #@tab all @@ -89,7 +89,7 @@ tconv.weight.data = K tconv(X) ``` -在转置卷积中,步幅指定为中间结果(因此输出),而不是输入。使用 :numref:`fig_trans_conv` 的相同输入和内核张量,将步幅从 1 改为 2 可以增加中间张量的高度和权重,因此输出张量在 :numref:`fig_trans_conv_stride2` 中。 +在转置卷积中,步幅指定为中间结果(因此输出),而不是输入。使用 :numref:`fig_trans_conv` 的相同输入和内核张量,将步幅从 1 更改为 2 会增加中间张量的高度和权重,因此输出张量在 :numref:`fig_trans_conv_stride2` 中。 ![Transposed convolution with a $2\times 2$ kernel with stride of 2. The shaded portions are a portion of an intermediate tensor as well as the input and kernel tensor elements used for the computation.](../img/trans_conv_stride2.svg) :label:`fig_trans_conv_stride2` @@ -109,9 +109,9 @@ tconv.weight.data = K tconv(X) ``` -对于多个输入和输出通道,转置卷积的工作方式与常规卷积相同。假设输入有 $c_i$ 通道,并且转置卷积为每个输入通道分配一个 $k_h\times k_w$ 内核张量。当指定多个输出通道时,我们将为每个输出通道有一个 $c_i\times k_h\times k_w$ 内核。 +对于多个输入和输出通道,转置卷积的工作方式与常规卷积相同。假设输入有 $c_i$ 个通道,并且转置卷积为每个输入通道分配一个 $k_h\times k_w$ 内核张量。当指定多个输出通道时,我们将为每个输出通道有一个 $c_i\times k_h\times k_w$ 内核。 -总而言之,如果我们将 $\mathsf{X}$ 馈入卷积层 $f$ 以输出 $\mathsf{Y}=f(\mathsf{X})$ 并创建一个与 $f$ 相同的超参数的转置卷积层 $g$,但输出通道数量是 $\mathsf{X}$ 中的通道数,那么 $g(Y)$ 将具有与 $f$ 相同的超参数的转置卷积层 $g$,那么 $g(Y)$ 的形状将与 $g(Y)$ 相同$\mathsf{X}$。可以在下面的示例中说明这一点。 +同样,如果我们将 $\mathsf{X}$ 馈入卷积层 $f$ 以输出 $\mathsf{Y}=f(\mathsf{X})$ 并创建一个与 $f$ 相同的超参数的转置卷积层 $g$,但输出通道数量是 $\mathsf{X}$ 中的通道数,那么 $g(Y)$ 的形状将与 $\mathsf{X}$ 相同,那么 $g(Y)$ 的形状将与 $g(Y)$ 相同$\mathsf{X}$。可以在下面的示例中说明这一点。 ```{.python .input} X = np.random.uniform(size=(1, 10, 16, 16)) @@ -143,7 +143,7 @@ Y = d2l.corr2d(X, K) Y ``` -接下来,我们将卷积内核 `K` 重写为包含大量零的稀疏权重矩阵 `W`。权重矩阵的形状是($4$、$9$),其中非零元素来自卷积内核 `K`。 +接下来,我们将卷积内核 `K` 重写为包含大量零的稀疏权重矩阵 `W`。权重矩阵的形状是($4$,$9$),其中非零元素来自卷积内核 `K`。 ```{.python .input} #@tab all @@ -164,7 +164,7 @@ W Y == d2l.matmul(W, d2l.reshape(X, -1)).reshape(2, 2) ``` -同样,我们可以使用矩阵乘法来实现转置卷积。在下面的示例中,我们从上面的常规卷积中取 $2 \times 2$ 输出 `Y` 作为转置卷积的输入。要通过乘以矩阵来实现这个操作,我们只需要将权重矩阵 `W` 用新的形状 $(9, 4)$ 转置为 $(9, 4)$。 +同样,我们可以使用矩阵乘法来实现转置卷积。在下面的示例中,我们将上面的常规卷积的 $2 \times 2$ 输出 `Y` 作为转置卷积的输入。为了通过乘以矩阵来实现这个操作,我们只需要将权重矩阵 `W` 用新的形状 $(9, 4)$ 转置为 $(9, 4)$。 ```{.python .input} #@tab all @@ -172,12 +172,12 @@ Z = trans_conv(Y, K) Z == d2l.matmul(W.T, d2l.reshape(Y, -1)).reshape(3, 3) ``` -考虑通过乘以矩阵来实现卷积。给定输入向量 $\mathbf{x}$ 和权重矩阵 $\mathbf{W}$,卷积的正向传播函数可以通过将其输入与权重矩阵相乘并输出向量 $\mathbf{y}=\mathbf{W}\mathbf{x}$ 来实现。由于反向传播遵循链规则和 $\nabla_{\mathbf{x}}\mathbf{y}=\mathbf{W}^\top$,因此卷积的反向传播函数可以通过将其输入与转置的权重矩阵 $\mathbf{W}^\top$ 相乘来实现。因此,转置卷积层只能交换卷积层的正向传播函数和反向传播函数:它的正向传播和反向传播函数分别将输入向量与 $\mathbf{W}^\top$ 和 $\mathbf{W}$ 相乘。 +考虑通过乘以矩阵来实现卷积。给定输入向量 $\mathbf{x}$ 和权重矩阵 $\mathbf{W}$,卷积的正向传播函数可以通过将其输入与权重矩阵相乘并输出向量 $\mathbf{y}=\mathbf{W}\mathbf{x}$ 来实现。由于反向传播遵循链规则和 $\nabla_{\mathbf{x}}\mathbf{y}=\mathbf{W}^\top$,因此卷积的反向传播函数可以通过将其输入与转置的权重矩阵 $\mathbf{W}^\top$ 相乘来实现。因此,转置卷积层只能交换卷积层的正向传播函数和反向传播函数:它的正向传播和反向传播函数分别将其输入向量与 $\mathbf{W}^\top$ 和 $\mathbf{W}$ 相乘。 ## 摘要 * 与通过内核减少输入元素的常规卷积相反,转置的卷积通过内核广播输入元素,从而产生的输出大于输入。 -* 如果我们将 $\mathsf{X}$ 输入卷积层 $f$ 以输出 $\mathsf{Y}=f(\mathsf{X})$ 并创建一个与 $f$ 相同的超参数的转置卷积层 $g$,但输出通道数是 $\mathsf{X}$ 
中的通道数量,那么 $g(Y)$ 将具有与 $\mathsf{X}$ 相同的超参数,那么 $g(Y)$ 的形状将与 $\mathsf{X}$ 相同。 +* 如果我们将 $\mathsf{X}$ 输入卷积层 $f$ 以输出 $\mathsf{Y}=f(\mathsf{X})$ 并创建一个与 $f$ 相同的超参数的转置卷积层 $g$,但输出通道数是 $\mathsf{X}$ 中的通道数量,那么 $g(Y)$ 的形状将与 $\mathsf{X}$ 相同。 * 我们可以使用矩阵乘法来实现卷积。转置的卷积层只能交换正向传播函数和卷积层的反向传播函数。 ## 练习 diff --git a/chapter_computer-vision/transposed-conv_origin.md b/chapter_computer-vision/transposed-conv_origin.md index c439f2bdd..ccb0ee142 100644 --- a/chapter_computer-vision/transposed-conv_origin.md +++ b/chapter_computer-vision/transposed-conv_origin.md @@ -201,7 +201,7 @@ are specified, we will have a $c_i\times k_h\times k_w$ kernel for each output channel. -As in all, if we feed $\mathsf{X}$ into a convolutional layer $f$ to output $\mathsf{Y}=f(\mathsf{X})$ and create a transposed convolution layer $g$ with the same hyperparameters as $f$ except +As in all, if we feed $\mathsf{X}$ into a convolutional layer $f$ to output $\mathsf{Y}=f(\mathsf{X})$ and create a transposed convolutional layer $g$ with the same hyperparameters as $f$ except for the number of output channels being the number of channels in $\mathsf{X}$, then $g(Y)$ will have the same shape as $\mathsf{X}$. @@ -317,7 +317,7 @@ $\mathbf{W}^\top$ and $\mathbf{W}$, respectively. ## Summary * In contrast to the regular convolution that reduces input elements via the kernel, the transposed convolution broadcasts input elements via the kernel, thereby producing an output that is larger than the input. -* If we feed $\mathsf{X}$ into a convolutional layer $f$ to output $\mathsf{Y}=f(\mathsf{X})$ and create a transposed convolution layer $g$ with the same hyperparameters as $f$ except for the number of output channels being the number of channels in $\mathsf{X}$, then $g(Y)$ will have the same shape as $\mathsf{X}$. +* If we feed $\mathsf{X}$ into a convolutional layer $f$ to output $\mathsf{Y}=f(\mathsf{X})$ and create a transposed convolutional layer $g$ with the same hyperparameters as $f$ except for the number of output channels being the number of channels in $\mathsf{X}$, then $g(Y)$ will have the same shape as $\mathsf{X}$. * We can implement convolutions using matrix multiplications. The transposed convolutional layer can just exchange the forward propagation function and the backpropagation function of the convolutional layer.
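As a small, self-contained numerical check of the matrix view summarized above (an illustrative sketch, not code from the book; the helper names are made up): for a $3\times 3$ input and a $2\times 2$ kernel, the convolution equals the sparse weight matrix `W` times the flattened input, and the basic transposed convolution equals `W.T` times the flattened output.

```{.python .input}
#@tab pytorch
# Sketch: verify conv == W @ vec(X) and trans_conv == W.T @ vec(Y) numerically.
import torch

def corr2d(X, K):  # plain 2D cross-correlation
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

def trans_conv(X, K):  # basic transposed convolution (stride 1, no padding)
    h, w = K.shape
    Y = torch.zeros((X.shape[0] + h - 1, X.shape[1] + w - 1))
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            Y[i:i + h, j:j + w] += X[i, j] * K
    return Y

def kernel2matrix(K):  # 4x9 sparse weight matrix for a 3x3 input, 2x2 kernel
    k, W = torch.zeros(5), torch.zeros((4, 9))
    k[:2], k[3:5] = K[0, :], K[1, :]
    W[0, :5], W[1, 1:6], W[2, 3:8], W[3, 4:9] = k, k, k, k
    return W

X = torch.arange(9.0).reshape(3, 3)
K = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
W, Y = kernel2matrix(K), corr2d(X, K)
print(torch.allclose(Y, (W @ X.reshape(-1)).reshape(2, 2)))                   # True
print(torch.allclose(trans_conv(Y, K), (W.T @ Y.reshape(-1)).reshape(3, 3)))  # True
```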