This repository documents the code and our work for our CS 534 Artificial Intelligence term project at Worcester Polytechnic Institute (WPI), MA.
The goal is to use deep learning techniques, specifically generative modelling, to achieve unpaired image translation from
- Unmasked face domain to masked face domain
- Masked face domain to unmasked face domain
(Presentation) (Paper)
We use two different datasets and curate them according to our use case:
- The Flickr-Faces-HQ (FFHQ) dataset for unmasked images
- The MaskedFace-Net dataset for masked images
For unmasked faces, we use FFHQ, a high-quality image dataset of human faces originally created as a benchmark for generative adversarial networks (GANs). The dataset consists of 70,000 high-quality PNG images at 1024×1024 resolution, with substantial variation and diversity in the subjects and the objects appearing in the frame.
For our masked face dataset, we use MaskedFace-Net, a dataset of 133,783 images of human faces with correctly or incorrectly worn masks, built on top of FFHQ. The masks are photoshopped onto the faces, and although the dataset is based on FFHQ, most of its images show incorrectly worn masks.
For both datasets, we curate images based on the number of faces in the image, mask placement, mask clarity, and how realistic the mask looks in the image. We use a subset of 6,000 masked and 6,000 unmasked images for training, and 1,000 images from each domain for testing.
Sample image from the unmasked real domain
Sample image from the masked real domain
The curated dataset can be found in the following drive link: https://drive.google.com/drive/folders/1qKIMJx949qAPC71GlGS1cGPceBISQGma?usp=sharing
From here on, we refer to
- the unmasked domain as domain A
- the masked domain as domain B
We use the CycleGAN architecture to transform images of unmasked faces into masked faces and masked faces back into unmasked faces. The problem is made more intricate by the fact that no paired dataset exists for masked and unmasked faces, where we define a paired dataset as one containing the same face both with and without a mask.
This leads us to the task of unpaired image-to-image translation. CycleGANs have previously been used for tasks such as horse↔zebra and summer↔winter translation. The task is formulated as translating an image from a source domain X to a target domain Y in the absence of paired examples. The goal is to learn a mapping G : X → Y such that the distribution of images G(X) is indistinguishable from the distribution Y, enforced with an adversarial loss. Because this mapping is highly under-constrained, CycleGAN couples it with an inverse mapping F : Y → X and introduces a cycle-consistency loss that enforces F(G(X)) ≈ X and G(F(Y)) ≈ Y.
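Concretely, the full objective from the CycleGAN paper couples the two adversarial losses with the cycle-consistency term, weighted by λ (set to 10 in the paper):

```latex
% Adversarial loss for the mapping G : X -> Y with discriminator D_Y
\mathcal{L}_{\text{GAN}}(G, D_Y, X, Y) =
  \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D_Y(y)]
  + \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log(1 - D_Y(G(x)))]

% Cycle-consistency loss enforcing F(G(x)) \approx x and G(F(y)) \approx y
\mathcal{L}_{\text{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\lVert F(G(x)) - x \rVert_1\big]
  + \mathbb{E}_{y \sim p_{\text{data}}(y)}\big[\lVert G(F(y)) - y \rVert_1\big]

% Full objective
\mathcal{L}(G, F, D_X, D_Y) =
  \mathcal{L}_{\text{GAN}}(G, D_Y, X, Y)
  + \mathcal{L}_{\text{GAN}}(F, D_X, Y, X)
  + \lambda \, \mathcal{L}_{\text{cyc}}(G, F)
```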
To explain GANs: Generative Adversarial Networks comprise two networks, a generator G(x) and a discriminator D(x), trained simultaneously (GANs have been applied to datasets of images, video, and audio). The generator tries to generate data that follows the underlying distribution of the training data, whereas the discriminator tries to tell the fake images apart from the real ones. They play an adversarial game in which the generator tries to fool the discriminator by generating data similar to the training set; the discriminator is fooled when the generated fakes are so realistic that it can't tell them apart. The generator G(x) maps random noise to images and, in doing so, learns the data distribution needed to generate realistic ones. The discriminator receives both the generator's fakes and real images from the training set and learns to differentiate between them: its output D(x) is the probability that the input is real, so an output near 1.0 means the input is judged real and an output near 0 means it is judged fake. The generator's goal is therefore to drive the discriminator's output towards 1 (real) for all of its fakes.
In a nutshell, we have a generator that takes random noise as input and outputs a fake image. The discriminator, on the other hand, takes both the fakes produced by the generator and real images as input, and outputs 1 for real / 0 for fake. If the discriminator correctly identifies a fake, the generator is penalized and the discriminator is rewarded; if the discriminator incorrectly identifies a fake as real, the generator is rewarded for fooling the discriminator and the discriminator is penalized. The generator and the discriminator play this adversarial game while improving and updating each other.
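As a minimal sketch of this adversarial game in PyTorch (the tiny linear networks, sizes, and learning rates here are illustrative placeholders, not our project's models):

```python
import torch
import torch.nn as nn

# Illustrative placeholder networks; real models would be deep CNNs.
G = nn.Sequential(nn.Linear(100, 784), nn.Tanh())   # noise -> fake image
D = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())  # image -> P(real)

opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):                     # real: (batch, 784) tensor
    batch = real.size(0)
    fake = G(torch.randn(batch, 100))

    # Discriminator: push D(real) -> 1 and D(fake) -> 0.
    opt_D.zero_grad()
    loss_D = bce(D(real), torch.ones(batch, 1)) + \
             bce(D(fake.detach()), torch.zeros(batch, 1))
    loss_D.backward()
    opt_D.step()

    # Generator: rewarded when D labels its fakes as real (output 1).
    opt_G.zero_grad()
    loss_G = bce(D(fake), torch.ones(batch, 1))
    loss_G.backward()
    opt_G.step()
    return loss_D.item(), loss_G.item()
```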
In essence, we have the following four networks (a minimal sketch of how they fit together follows the list):
- a generator G_AB, a CNN that takes an image from domain A as input and learns to translate it into domain B.
- a discriminator D_AB, a classifier that learns to distinguish images from the real domain B from the fakes generated by G_AB.
- a generator G_BA, a CNN that takes an image from domain B as input (during the cycle pass, the output of G_AB) and learns to translate it into domain A.
- a discriminator D_BA, a classifier that learns to distinguish images from the real domain A from the fakes generated by G_BA.
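Below is a minimal sketch of one generator update with these four networks, assuming the LSGAN loss and an L1 cycle term; the toy convolutional stand-ins are placeholders for the real architectures described in the next section:

```python
import itertools
import torch
import torch.nn as nn

# Toy stand-ins: a real G is the ResNet generator and a real D the 70x70
# PatchGAN described below.
def toy_generator():
    return nn.Sequential(nn.Conv2d(3, 3, 3, padding=1), nn.Tanh())

def toy_discriminator():
    return nn.Sequential(nn.Conv2d(3, 1, 4, stride=2, padding=1))

G_AB, G_BA = toy_generator(), toy_generator()          # A -> B, B -> A
D_AB, D_BA = toy_discriminator(), toy_discriminator()  # judge fakes in B, A

gan_loss, cyc_loss = nn.MSELoss(), nn.L1Loss()  # LSGAN loss + L1 cycle term
opt_G = torch.optim.Adam(
    itertools.chain(G_AB.parameters(), G_BA.parameters()), lr=2e-4)
lambda_cyc = 10.0

def generator_step(real_A, real_B):
    """One generator update; the discriminator updates are analogous."""
    fake_B, fake_A = G_AB(real_A), G_BA(real_B)

    # Adversarial terms: each generator wants its discriminator to say "real".
    pred_B, pred_A = D_AB(fake_B), D_BA(fake_A)
    loss_gan = gan_loss(pred_B, torch.ones_like(pred_B)) + \
               gan_loss(pred_A, torch.ones_like(pred_A))

    # Cycle-consistency: translating there and back should recover the input.
    loss_cyc = cyc_loss(G_BA(fake_B), real_A) + cyc_loss(G_AB(fake_A), real_B)

    loss = loss_gan + lambda_cyc * loss_cyc
    opt_G.zero_grad()
    loss.backward()
    opt_G.step()
    return loss.item()

# e.g. generator_step(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
```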
The generator network architecture consists of three convolutions, several residual blocks, two fractionally-strided convolutions with stride 1/2, and one final convolution. CycleGAN uses 6 residual blocks for 128×128 images and 9 blocks for 256×256 and higher-resolution training images. For the discriminator networks we use 70×70 PatchGANs, which classify whether overlapping 70×70 image patches are real or fake. To train the network, we varied a number of hyper-parameters: batch size, batch sequence, normalization type, optimization algorithm, and the GAN loss (L1, log-loss, or L2). We ran the experiments below on 4 Nvidia Tesla V100 GPUs; each run took around 12 hours for 200 epochs.
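As an illustration, a 70×70 PatchGAN along these lines can be sketched in PyTorch as follows (layer widths follow the CycleGAN paper; exact padding and normalization details may differ from the implementation we actually trained):

```python
import torch
import torch.nn as nn

def patchgan_70():
    """70x70 PatchGAN: maps an image to a grid of real/fake scores,
    one per overlapping 70x70 receptive field."""
    def block(c_in, c_out, stride, norm=True):
        layers = [nn.Conv2d(c_in, c_out, kernel_size=4, stride=stride, padding=1)]
        if norm:
            layers.append(nn.InstanceNorm2d(c_out))
        layers.append(nn.LeakyReLU(0.2, inplace=True))
        return layers

    return nn.Sequential(
        *block(3, 64, stride=2, norm=False),   # C64 (no norm on first layer)
        *block(64, 128, stride=2),             # C128
        *block(128, 256, stride=2),            # C256
        *block(256, 512, stride=1),            # C512
        nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1),  # per-patch score
    )

scores = patchgan_70()(torch.randn(1, 3, 256, 256))
print(scores.shape)  # torch.Size([1, 1, 30, 30]): one score per patch
```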
In the picture below, we present fakes generated through the training process as we progress through the epochs.
In the picture below, we present the various GAN losses and their variation as a function of epochs.
Firstly, we present the fakes generated by our model in both domain A and domain B.
To evaluate our work, we use a manual evaluation in the form of visual inspection of images.
We also use a qualitative metric: a visual study similar to the perceptual studies run on Amazon Mechanical Turk, in which participants are shown a sequence of images and asked to label each as real or fake. Rating and preference judgment is the most common qualitative method: images are often presented in pairs and the human judge is asked which image they prefer, e.g., which image is more realistic. We conducted a survey in which each user was shown 22 randomly ordered images and asked to separate the ground truth from the GAN-generated outputs, with each correct answer earning the user 1 point. We received 100 responses, and the analysis is as follows: of the 2,200 predictions, users guessed wrongly 881 times, an error rate of roughly 40%. In other words, the generated fakes were able to dupe users about 40% of the time, which is impressive considering the discerning sight we humans possess, and validates our model through visual inspection. The survey statistics can be seen in the figure below, where the average user score is 13.24.
We also use a quantitative metric in the form of the FID score. The Fréchet Inception Distance (FID) is a metric for evaluating the quality of generated images, commonly used to assess the performance of generative adversarial networks. FID measures the distance between the distributions of real and generated samples; a lower FID is better, meaning the generated distribution is closer to the real one.
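As one way to compute FID, the torchmetrics implementation can be used as sketched below (an assumption for illustration, not necessarily the tooling we used; it requires the torch-fidelity backend, and the random tensors stand in for batches of real and generated images):

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# FID compares InceptionV3 feature statistics of real vs. generated batches.
fid = FrechetInceptionDistance(feature=2048)

# Placeholder batches: uint8 tensors of shape (N, 3, 299, 299) in [0, 255].
real_images = torch.randint(0, 256, (50, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (50, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))  # lower is better
```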
Through the above training trials, we achieved the lowest FID_AB of 17.07 with a generator G_AB trained with batch size 16, instance normalization, a linear learning-rate policy, and the LSGAN loss. We also observe that the FID_AB score ranges between 17.07 and 32.18. The best score of 17.07, coupled with this small range, suggests that the generator G_AB finds it relatively easy to learn how to apply a fake mask to an unmasked face, irrespective of the variations in the hyper-parameters. Similarly, we achieved the lowest FID_BA of 48.39 with a generator G_BA trained with batch size 32, instance normalization, a linear learning-rate policy, and the vanilla GAN loss. The best score of 48.39 implies that G_BA finds it somewhat difficult to generate the masked area of the face, and the FID_BA range of 48.39 to 261.78 shows that its performance varies significantly with the choice of hyper-parameters.
- https://github.com/cabani/MaskedFace-Net (for masked face images): most of the images in this dataset are not well masked, so we select only the properly masked ones.
- https://github.com/NVlabs/ffhq-dataset (for unmasked images): Flickr-Faces-HQ (FFHQ), a high-quality dataset of 70,000 PNG images of human faces at 1024×1024 resolution.
- https://towardsdatascience.com/demystifying-gans-cc1ac011355
- CycleGAN paper: https://arxiv.org/abs/1703.10593
- https://towardsdatascience.com/cycle-gan-with-pytorch-ebe5db947a99