If you use social sharing apps or happen to be an amateur photographer, you are familiar with filters. Filters can alter the color styles of photos to make the background sharper or people's faces whiter. However, a filter generally can only change one aspect of a photo. To create the ideal photo, you often need to try many different filter combinations. This process is as complex as tuning the hyper-parameters of a model.
In this section, we will discuss how we can use convolutional neural networks (CNNs) to automatically apply the style of one image to another image, an operation known as style transfer [1]. Here, we need two input images: one content image and one style image. We use a neural network to alter the content image so that its style mirrors that of the style image. In Figure 9.12, the content image is a landscape photo the author took in Mount Rainier National Park near Seattle. The style image is an oil painting of oak trees in autumn. The output composite image retains the overall shapes of the objects in the content image, but applies the oil painting brushwork of the style image and makes the overall color more vivid.
Figure 9.13 gives an overview of the CNN-based style transfer method. First, we initialize the composite image, for example as the content image. This composite image is the only variable that needs to be updated in the style transfer process, i.e. the model parameter to be updated in style transfer. Then, we select a pre-trained CNN to extract image features. Its model parameters do not need to be updated during training. The deep CNN uses multiple neural layers that successively extract image features, and we can select the outputs of certain layers to use as content features or style features. In the structure in Figure 9.13, the pre-trained neural network contains three convolutional layers: the second layer outputs the image content features, while the outputs of the first and third layers are used as style features. Next, we use forward propagation (in the direction of the solid lines) to compute the style transfer loss function and backward propagation (in the direction of the dotted lines) to update the model parameter, constantly updating the composite image. The loss functions used in style transfer generally have three parts:
1. Content loss is used to make the composite image approximate the content image with respect to content features.
2. Style loss is used to make the composite image approximate the style image in terms of style features.
3. Total variation loss helps reduce the noise in the composite image.
Finally, after model training ends, we output the model parameter of style transfer, i.e. the final composite image.
Next, we will perform an experiment to help us better understand the technical details of style transfer.
First, we read the content and style images. From the coordinate axes shown alongside the plots, we can see that the two images have different dimensions.
import sys
sys.path.insert(0, '..')
%matplotlib inline
import d2l
from mxnet import autograd, gluon, image, init, nd
from mxnet.gluon import model_zoo, nn
import time
d2l.set_figsize()
content_img = image.imread('../img/rainier.jpg')
d2l.plt.imshow(content_img.asnumpy());
style_img = image.imread('../img/autumn_oak.jpg')
d2l.plt.imshow(style_img.asnumpy());
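We can also print the shapes of the two image arrays to confirm that their sizes differ (the exact values depend on the image files used):
print(content_img.shape, style_img.shape)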
Below, we define the functions for image preprocessing and postprocessing. The preprocess function normalizes each of the three RGB channels of the input image and transforms the result into a format that can be input to the CNN. The postprocess function restores the pixel values in the output image to their original values before normalization. Because the image printing function requires that each pixel has a floating-point value from 0 to 1, we use the clip function to replace values smaller than 0 or greater than 1 with 0 or 1, respectively.
rgb_mean = nd.array([0.485, 0.456, 0.406])
rgb_std = nd.array([0.229, 0.224, 0.225])
def preprocess(img, image_shape):
    img = image.imresize(img, *image_shape)
    img = (img.astype('float32') / 255 - rgb_mean) / rgb_std
    return img.transpose((2, 0, 1)).expand_dims(axis=0)

def postprocess(img):
    img = img[0].as_in_context(rgb_std.context)
    return (img.transpose((1, 2, 0)) * rgb_std + rgb_mean).clip(0, 1)
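As a quick check (using a width of 225 and a height of 150 pixels purely for illustration), preprocess returns a four-dimensional array in (batch, channel, height, width) format, and postprocess maps it back to a (height, width, channel) image with values in [0, 1]:
X_check = preprocess(content_img, (225, 150))
print(X_check.shape, postprocess(X_check).shape)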
We use the VGG-19 model pre-trained on the ImageNet data set to extract image features[1].
pretrained_net = model_zoo.vision.vgg19(pretrained=True)
To extract image content and style features, we can select the outputs of certain layers in the VGG network. In general, the closer an output is to the input layer, the easier it is to extract image detail information; the farther away it is, the easier it is to extract global information. To prevent the composite image from retaining too many details of the content image, we select a VGG network layer near the output layer to output the image content features. This layer is called the content layer. We also select the outputs of different layers from the VGG network for matching local and global styles. These are called the style layers. As we mentioned in the "Networks Using Blocks (VGG)" section, VGG networks have five convolutional blocks. In this experiment, we select the last convolutional layer of the fourth convolutional block as the content layer and the first convolutional layer of each block as a style layer. We can obtain the indexes for these layers by printing the pretrained_net instance.
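For example, printing the features attribute of the pre-trained network (the part we index into below) lists the layers together with their indexes:
print(pretrained_net.features)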
style_layers, content_layers = [0, 5, 10, 19, 28], [25]
During feature extraction, we only need to use the VGG layers from the input layer up to the content or style layer nearest the output layer. Below, we build a new network, net, which retains only the layers of the VGG network that we need. We then use net to extract features.
net = nn.Sequential()
for i in range(max(content_layers + style_layers) + 1):
    net.add(pretrained_net.features[i])
Given the input X, if we simply call the forward computation net(X), we can only obtain the output of the last layer. Because we also need the outputs of the intermediate layers, we perform layer-by-layer computation and retain the content and style layer outputs.
def extract_features(X, content_layers, style_layers):
    contents = []
    styles = []
    for i in range(len(net)):
        X = net[i](X)
        if i in style_layers:
            styles.append(X)
        if i in content_layers:
            contents.append(X)
    return contents, styles
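As a small check, applying extract_features to a preprocessed image should return one content output and five style outputs, matching the numbers of content and style layers chosen above (run here on the CPU before the network is moved to another device):
contents_tmp, styles_tmp = extract_features(
    preprocess(content_img, (225, 150)), content_layers, style_layers)
print(len(contents_tmp), len(styles_tmp))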
Next, we define two functions: the get_contents function obtains the content features extracted from the content image, while the get_styles function obtains the style features extracted from the style image. Because we do not need to change the parameters of the pre-trained VGG model during training, we can extract the content features from the content image and the style features from the style image before training starts. As the composite image is the model parameter that must be updated during style transfer, we can only call the extract_features function during training to extract the content and style features of the composite image.
def get_contents(image_shape, ctx):
    content_X = preprocess(content_img, image_shape).copyto(ctx)
    contents_Y, _ = extract_features(content_X, content_layers, style_layers)
    return content_X, contents_Y

def get_styles(image_shape, ctx):
    style_X = preprocess(style_img, image_shape).copyto(ctx)
    _, styles_Y = extract_features(style_X, content_layers, style_layers)
    return style_X, styles_Y
Next, we will look at the loss function used for style transfer. The loss function includes the content loss, style loss, and total variation loss.
Similar to the loss function used in linear regression, content loss uses a square error function to measure the difference in content features between the composite image and the content image. The two inputs of the square error function are both content layer outputs obtained from the extract_features function.
def content_loss(Y_hat, Y):
    return (Y_hat - Y).square().mean()
Style loss, similar to content loss, uses a square error function to measure the difference in style between the composite image and the style image. To express the style output by the style layers, we first use the extract_features function to compute the style layer output. Assuming that the output has 1 example, $c$ channels, a height of $h$, and a width of $w$, we can transform the output into a matrix $\mathbf{X}$ with $c$ rows and $hw$ columns. This matrix can be viewed as the concatenation of $c$ vectors $\mathbf{x}_1, \ldots, \mathbf{x}_c$ of length $hw$, where $\mathbf{x}_i$ represents the style feature of channel $i$. In the Gram matrix $\mathbf{X}\mathbf{X}^\top \in \mathbb{R}^{c \times c}$ of these vectors, the element in row $i$ and column $j$ is the inner product of $\mathbf{x}_i$ and $\mathbf{x}_j$ and represents the correlation of the style features of channels $i$ and $j$. We use this Gram matrix to represent the style output by a style layer. Note that when $hw$ is large, the entries of the Gram matrix tend to be large, and that the height and width of the Gram matrix are both the number of channels $c$. To ensure that the style loss is not affected by the size of these values, we define the gram function below to divide the Gram matrix by the number of its elements, i.e. $chw$.
def gram(X):
    num_channels, n = X.shape[1], X.size // X.shape[1]
    X = X.reshape((num_channels, n))
    return nd.dot(X, X.T) / (num_channels * n)
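As a small illustration, for a feature map with 3 channels the Gram matrix is 3×3 regardless of the spatial size (the random input below is purely for demonstration):
X_demo = nd.random.normal(shape=(1, 3, 4, 4))
print(gram(X_demo).shape)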
Naturally, the two Gram matrix inputs of the square error function for style loss are taken from the style layer outputs of the composite image and the style image. Here, we assume that the Gram matrix of the style image, gram_Y, has been computed in advance.
def style_loss(Y_hat, gram_Y):
    return (gram(Y_hat) - gram_Y).square().mean()
Sometimes, the composite images we learn have a lot of high-frequency noise, that is, particularly bright or dark individual pixels. One common noise reduction method is total variation denoising. Assuming that $x_{i, j}$ denotes the pixel value at the coordinate $(i, j)$, the total variation loss is

$$\sum_{i, j} \left|x_{i, j} - x_{i+1, j}\right| + \left|x_{i, j} - x_{i, j+1}\right|.$$

Minimizing this loss makes the values of neighboring pixels as similar as possible.
def tv_loss(Y_hat):
    return 0.5 * ((Y_hat[:, :, 1:, :] - Y_hat[:, :, :-1, :]).abs().mean() +
                  (Y_hat[:, :, :, 1:] - Y_hat[:, :, :, :-1]).abs().mean())
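As a quick sanity check, a perfectly smooth image, for example an all-ones array, has zero total variation loss, since all neighboring pixel differences vanish:
print(tv_loss(nd.ones((1, 3, 8, 8))))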
The loss function for style transfer is the weighted sum of the content loss, style loss, and total variation loss. By adjusting these weight hyper-parameters, we can balance the retained content, transferred style, and noise reduction in the composite image according to their relative importance.
content_weight, style_weight, tv_weight = 1, 1e3, 10
def compute_loss(X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram):
    # Calculate the content, style, and total variation losses respectively
    contents_l = [content_loss(Y_hat, Y) * content_weight for Y_hat, Y in zip(
        contents_Y_hat, contents_Y)]
    styles_l = [style_loss(Y_hat, Y) * style_weight for Y_hat, Y in zip(
        styles_Y_hat, styles_Y_gram)]
    tv_l = tv_loss(X) * tv_weight
    # Add up all the losses
    l = nd.add_n(*styles_l) + nd.add_n(*contents_l) + tv_l
    return contents_l, styles_l, tv_l, l
In style transfer, the composite image is the only variable that needs to be updated. Therefore, we can define a simple model, GeneratedImage, and treat the composite image as a model parameter. In this model, the forward computation simply returns the model parameter.
class GeneratedImage(nn.Block):
    def __init__(self, img_shape, **kwargs):
        super(GeneratedImage, self).__init__(**kwargs)
        self.weight = self.params.get('weight', shape=img_shape)

    def forward(self):
        return self.weight.data()
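For illustration, instantiating GeneratedImage with a hypothetical image shape and initializing it shows that the forward computation just returns the image-shaped parameter:
g_demo = GeneratedImage((1, 3, 150, 225))
g_demo.initialize()
print(g_demo().shape)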
Next, we define the get_inits function. This function creates a composite image model instance and initializes it to the image X. The Gram matrices of the style image for the various style layers, styles_Y_gram, are computed prior to training.
def get_inits(X, ctx, lr, styles_Y):
    gen_img = GeneratedImage(X.shape)
    gen_img.initialize(init.Constant(X), ctx=ctx, force_reinit=True)
    trainer = gluon.Trainer(gen_img.collect_params(), 'adam',
                            {'learning_rate': lr})
    styles_Y_gram = [gram(Y) for Y in styles_Y]
    return gen_img(), styles_Y_gram, trainer
During model training, we constantly extract the content and style features of the composite image and calculate the loss function. Recall our discussion in the section on asynchronous computation of how synchronization functions force the front end to wait for computation results. Because we only call the asscalar synchronization function every 50 epochs, the process may occupy a great deal of memory. Therefore, we call the waitall synchronization function during every epoch.
def train(X, contents_Y, styles_Y, ctx, lr, max_epochs, lr_decay_epoch):
    X, styles_Y_gram, trainer = get_inits(X, ctx, lr, styles_Y)
    for i in range(max_epochs):
        start = time.time()
        with autograd.record():
            contents_Y_hat, styles_Y_hat = extract_features(
                X, content_layers, style_layers)
            contents_l, styles_l, tv_l, l = compute_loss(
                X, contents_Y_hat, styles_Y_hat, contents_Y, styles_Y_gram)
        l.backward()
        trainer.step(1)
        nd.waitall()
        if i % 50 == 0 and i != 0:
            print('epoch %3d, content loss %.2f, style loss %.2f, '
                  'TV loss %.2f, %.2f sec'
                  % (i, nd.add_n(*contents_l).asscalar(),
                     nd.add_n(*styles_l).asscalar(), tv_l.asscalar(),
                     time.time() - start))
        if i % lr_decay_epoch == 0 and i != 0:
            trainer.set_learning_rate(trainer.learning_rate * 0.1)
            print('change lr to %.1e' % trainer.learning_rate)
    return X
Next, we start to train the model. First, we set the height and width of the content and style images to 150 and 225 pixels respectively. We use the content image to initialize the composite image.
ctx, image_shape = d2l.try_gpu(), (225, 150)
net.collect_params().reset_ctx(ctx)
content_X, contents_Y = get_contents(image_shape, ctx)
_, styles_Y = get_styles(image_shape, ctx)
output = train(content_X, contents_Y, styles_Y, ctx, 0.01, 500, 200)
Next, we save the trained composite image. As you can see, the composite image in Figure 9.14 retains the scenery and objects of the content image, while introducing the color of the style image. Because the image is relatively small, the details are a bit fuzzy.
d2l.plt.imsave('../img/neural-style-1.png', postprocess(output).asnumpy())
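We can also display the composite image directly with imshow:
d2l.plt.imshow(postprocess(output).asnumpy());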
To obtain a clearer composite image, we train the model using a larger image size:
image_shape = (450, 300)
_, content_Y = get_contents(image_shape, ctx)
_, style_Y = get_styles(image_shape, ctx)
X = preprocess(postprocess(output) * 255, image_shape)
output = train(X, content_Y, style_Y, ctx, 0.01, 300, 100)
d2l.plt.imsave('../img/neural-style-2.png', postprocess(output).asnumpy())
As you can see, each epoch takes more time due to the larger image size. As shown in Figure 9.15, the composite image produced retains more detail due to its larger size. The composite image not only has large blocks of color like the style image, but these blocks even have the subtle texture of brush strokes.
- The loss functions used in style transfer generally have three parts: 1. Content loss is used to make the composite image approximate the content image as regards content features. 2. Style loss is used to make the composite image approximate the style image in terms of style features. 3. Total variation loss helps reduce the noise in the composite image.
- We can use a pre-trained CNN to extract image features and minimize the loss function to continuously update the composite image.
- We use a Gram matrix to represent the style output by the style layers.
- How does the output change when you select different content and style layers?
- Adjust the weight hyper-parameters in the loss function. Does the output retain more content or have less noise?
- Use different content and style images. Can you create more interesting composite images?
[1] Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2414-2423).