Implementations of (theoretical) generative adversarial networks and comparison without cherry-picking

kihosuh/tf.gans-comparison

GANs comparison without cherry-picking

Implementations of some theoretical generative adversarial nets: DCGAN, EBGAN, LSGAN, WGAN, WGAN-GP, BEGAN, DRAGAN and CoulombGAN.

I implemented each model with the same architecture as in its paper and compared the models on the CelebA and LSUN datasets without cherry-picking.

Features

  • Model architectures are the same as those proposed in each paper
  • The models were not extensively tuned, so the results can likely be improved
  • Well-structured (that was my goal at the start, but I don't know whether I succeeded!)
    • TensorFlow queue runners are used for the input pipeline
    • A single trainer (and a single evaluator) shared across multiple models
    • Training logs and configurations are recorded in TensorBoard

Models

  • DCGAN
  • LSGAN
  • WGAN
  • WGAN-GP
  • EBGAN
  • BEGAN
  • DRAGAN
  • CoulombGAN

Conditional GANs (CGAN, acGAN, and so on) are excluded.

Dataset

CelebA

http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html

  • All experiments were performed on 64x64 CelebA dataset
  • The dataset has 202599 images
  • 1 epoch consists of about 1.58k iterations for batch size 128

LSUN bedroom

http://lsun.cs.princeton.edu/2017/

  • The dataset has 3033042 images
  • 1 epoch consists of about 23.7k iterations for batch size 128

This dataset is provided in LMDB format. https://github.com/fyu/lsun provides documentation and demo code to use it.
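As a quick check of the iteration counts above, the number of minibatches per epoch is the image count divided by the batch size. This sketch rounds up (counting a final partial batch); whether this repo's input pipeline keeps or drops that partial batch is an assumption not stated here, and either way the counts round to about 1.58k and 23.7k.

```python
import math


def iterations_per_epoch(num_images, batch_size=128):
    """Minibatches needed to see every image once (a final partial batch counts as one)."""
    return math.ceil(num_images / float(batch_size))


celeba_iters = iterations_per_epoch(202599)   # about 1.58k
lsun_iters = iterations_per_epoch(3033042)    # about 23.7k
```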

Results

  • I implemented each model as proposed in its paper, but ignored some details (or the paper did not describe them)
    • Admittedly, small details can make a big difference in the results because GAN training is very unstable
    • So if you get better results, please let me know your settings 🙂
  • Default batch_size=128 and z_dim=100 (from DCGAN)

DCGAN

Radford, Alec, Luke Metz, and Soumith Chintala. "Unsupervised representation learning with deep convolutional generative adversarial networks." arXiv preprint arXiv:1511.06434 (2015).

  • Relatively simple networks
  • The discriminator learning rate (D_lr) is 2e-4; the generator learning rate (G_lr) is either 2e-4 (as proposed in the paper) or 1e-3
G_lr=2e-4 | G_lr=1e-3
50k | 30k
[image: dcgan.G2e-4.50k] | [image: dcgan.G1e-3.30k]

The second row (50k, 30k) gives the number of training iterations for each sample.

A higher generator learning rate (1e-3) produced better results. In this case, however, the generator sometimes collapsed because of the large learning rate. Lowering both learning rates may improve stability, as in https://ajolicoeur.wordpress.com/cats/, which suggests D_lr=5e-5 and G_lr=2e-4.

LSUN
100k
[image: dcgan.100k]

EBGAN

Zhao, Junbo, Michael Mathieu, and Yann LeCun. "Energy-based generative adversarial network." arXiv preprint arXiv:1609.03126 (2016).

  • I like the energy concept, so I find this paper very interesting :)
  • Anyway, the energy concept and the autoencoder-based loss function are impressive, and the results are also fine
  • But I have a question about the pulling-away term (PT), which in theory prevents mode collapse. It is the same idea as minibatch discrimination (T. Salimans et al.).
pt_weight=0.1 | No PT loss
30k | 30k
[image: ebgan.pt.30k] | [image: ebgan.nopt.30k]

The model using PT generates slightly better samples visually. However, these results do not make it clear whether PT prevents mode collapse. Furthermore, repeated experiments did not reveal which setting is better.

pt_weight=0.1 | No PT loss
[image: ebgan.pt.graph] | [image: ebgan.nopt.graph]

pt_loss decreases a little faster on the left (pt_weight=0.1), but the difference is small, and by the end of training the right (no PT loss) even shows a lower pt_loss. So I wonder: does the PT loss really prevent mode collapse as described in the paper?
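For reference, the EBGAN paper defines PT on a batch of encoder representations S as the mean squared cosine similarity over all pairs i ≠ j, so it is 0 when the representations are mutually orthogonal and 1 when they all point the same way. A minimal pure-Python sketch, assuming the embeddings arrive as plain lists (this repo computes it on TensorFlow tensors instead):

```python
import math


def pulling_away_term(embeddings, eps=1e-12):
    """EBGAN pulling-away term.

    embeddings: list of equal-length vectors, one per generated sample
    (in EBGAN these are the encoder outputs of the discriminator's autoencoder).
    Returns the mean squared cosine similarity over all ordered pairs i != j.
    """
    n = len(embeddings)
    if n < 2:
        raise ValueError("PT needs at least two samples in the batch")
    # Normalize each embedding to unit length (eps guards against zero vectors).
    unit = []
    for e in embeddings:
        norm = math.sqrt(sum(v * v for v in e))
        unit.append([v / max(norm, eps) for v in e])
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                cos = sum(a * b for a, b in zip(unit[i], unit[j]))
                total += cos * cos
    return total / (n * (n - 1))
```

In the generator loss it is added with a weight, e.g. `g_loss = energy_fake + pt_weight * pulling_away_term(batch_embeddings)` with pt_weight=0.1 as in the experiment above (`energy_fake` and `batch_embeddings` are hypothetical names).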

LSUN
80k
[image: ebgan.80k]

LSGAN

Mao, Xudong, et al. "Least squares generative adversarial networks." arXiv preprint arXiv:1611.04076 (2016).

  • Unusually, LSGAN uses a large latent dimension (z_dim=1024)
  • But in my experiments, z_dim=100 produced better results than z_dim=1024, the value used in the paper
z_dim=100 | z_dim=1024
30k | 30k
[image: lsgan.100.30k] | [image: lsgan.1024.30k]

LSUN
150k
[image: lsgan.150k]
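LSGAN replaces the sigmoid cross-entropy loss with least squares using target codes a (fake), b (real) and c (what G wants D to output for fakes). A minimal sketch on raw discriminator outputs; the defaults a=0, b=c=1 are one of the settings from the LSGAN paper, and which coding this repo uses is not stated here:

```python
def lsgan_losses(d_real, d_fake, a=0.0, b=1.0, c=1.0):
    """Least-squares GAN losses from lists of raw discriminator outputs."""
    def mean(xs):
        return sum(xs) / float(len(xs))

    # D pushes real outputs toward b and fake outputs toward a.
    d_loss = 0.5 * mean([(d - b) ** 2 for d in d_real]) + 0.5 * mean([(d - a) ** 2 for d in d_fake])
    # G pushes D's outputs on fakes toward c.
    g_loss = 0.5 * mean([(d - c) ** 2 for d in d_fake])
    return d_loss, g_loss
```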

WGAN

Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein gan." arXiv preprint arXiv:1701.07875 (2017).

  • The samples from WGAN are not that impressive compared to its very impressive theory
  • No specific network architecture was proposed, so the DCGAN architecture was used for the experiments
  • In the authors' implementation, a higher n_critic is used in the early stage of training and every 500 iterations
30k | W distance
[image: wgan.30k] | [image: wgan.w_dist]

LSUN
230k
[image: wgan.230k]
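The n_critic schedule mentioned above can be sketched as a function of the generator iteration. The constants (100 critic steps during the first 25 generator iterations and on every 500th, otherwise 5) are my recollection of the authors' reference code, not taken from this repo; the clipping constant c=0.01 is the WGAN paper's default.

```python
def n_critic(gen_iter, default=5, boosted=100, warmup=25, period=500):
    """Critic updates to run before generator update number gen_iter (assumed schedule)."""
    if gen_iter < warmup or gen_iter % period == 0:
        return boosted
    return default


def clip_weights(weights, c=0.01):
    """WGAN weight clipping: clamp every critic weight to [-c, c]."""
    return [min(max(w, -c), c) for w in weights]
```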

WGAN-GP

Gulrajani, Ishaan, et al. "Improved training of wasserstein gans." arXiv preprint arXiv:1704.00028 (2017).

  • I tried two network architectures: the DCGAN architecture and the ResNet architecture from Appendix C
  • The ResNet architecture is more complex and performs better than the DCGAN architecture
  • Interestingly, the visual quality of samples improves very quickly (ResNet WGAN-GP produces its best samples at 7k iterations) and then degrades with further training
  • According to the DRAGAN paper, the WGAN constraints are too restrictive to learn a good generator
DCGAN architecture | ResNet architecture
30k | 7k, batch size = 64
[image: wgan-gp.dcgan.30k] | [image: wgan-gp.good.7k]

LSUN
100k, ResNet architecture
[image: wgan-gp.150k]
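WGAN-GP replaces weight clipping with a penalty λ·E[(‖∇D(x̂)‖₂ − 1)²] evaluated at random interpolates x̂ = εx + (1 − ε)G(z), with λ=10 as the paper's default. A TensorFlow implementation would get the gradient from `tf.gradients`; this dependency-free sketch substitutes central finite differences and treats samples as flat lists, so it only illustrates the formula:

```python
import math
import random


def interpolate(x_real, x_fake, eps):
    """eps * x_real + (1 - eps) * x_fake."""
    return [eps * r + (1.0 - eps) * f for r, f in zip(x_real, x_fake)]


def numerical_grad(critic, x, h=1e-5):
    """Central finite-difference gradient of a scalar critic at x."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += h
        x_minus[i] -= h
        grad.append((critic(x_plus) - critic(x_minus)) / (2.0 * h))
    return grad


def gradient_penalty(critic, reals, fakes, lam=10.0, rng=random):
    """lam * batch mean of (||grad critic(x_hat)||_2 - 1)^2."""
    total = 0.0
    for x_real, x_fake in zip(reals, fakes):
        x_hat = interpolate(x_real, x_fake, rng.random())  # eps ~ U[0, 1)
        norm = math.sqrt(sum(g * g for g in numerical_grad(critic, x_hat)))
        total += (norm - 1.0) ** 2
    return lam * total / len(reals)
```

For example, a linear critic 3·x₀ + 4·x₁ has gradient norm 5 everywhere, so its penalty is 10·(5 − 1)² = 160 regardless of the samples.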

Face collapse phenomenon

WGAN-GP collapsed more than the other models as the number of iterations increased.

DCGAN architecture

10k | 20k | 30k
[image: wgan-gp.dcgan.10k] | [image: wgan-gp.dcgan.20k] | [image: wgan-gp.dcgan.30k]

ResNet architecture

The ResNet architecture produced its best-quality samples at a very early stage (7k iterations, by my judgment). This may be due to the residual architecture.

batch_size=64.

5k | 7k | 10k | 15k
[image: wgan-gp.good.5k] | [image: wgan-gp.good.7k] | [image: wgan-gp.good.10k] | [image: wgan-gp.good.15k]
20k | 25k | 30k | 40k
[image: wgan-gp.good.20k] | [image: wgan-gp.good.25k] | [image: wgan-gp.good.30k] | [image: wgan-gp.good.40k]

Despite the face collapse phenomenon, the Wasserstein distance decreased steadily. This suggests that the critic (discriminator) failed to find the supremum over K-Lipschitz functions, so its distance estimate no longer reflected sample quality.

DCGAN architecture | ResNet architecture
[image: wgan-gp.dcgan.w_dist] | [image: wgan-gp.good.w_dist]
[image: wgan-gp.dcgan.w_dist.expand] | [image: wgan-gp.good.w_dist.expand]

The plots in the last row of the table are zoomed-in versions of the plots in the second row.

Interestingly, W_dist < 0 at the end of training. This indicates that E[D(fake)] > E[D(real)], which, from the original GAN point of view, means the generator dominates the discriminator.
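The plotted W distance is the critic's own estimate, E[D(real)] − E[D(fake)], which is the negative of the critic loss without its penalty term. A minimal sketch on lists of critic outputs shows why a negative value means the critic scores fakes higher than reals; the function names are illustrative, not this repo's:

```python
def mean(xs):
    return sum(xs) / float(len(xs))


def critic_loss(d_real, d_fake, gp=0.0):
    """The critic minimizes E[D(fake)] - E[D(real)] (+ gradient penalty)."""
    return mean(d_fake) - mean(d_real) + gp


def w_distance_estimate(d_real, d_fake):
    """Critic estimate of the Wasserstein distance: E[D(real)] - E[D(fake)]."""
    return mean(d_real) - mean(d_fake)
```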

BEGAN

Berthelot, David, Tom Schumm, and Luke Metz. "Began: Boundary equilibrium generative adversarial networks." arXiv preprint arXiv:1703.10717 (2017).

  • As far as I know, this is the model that generates samples with the best visual quality
  • It also showed the best performance in this project
    • Even though the optional improvements (Section 3.5.1 of the paper) were not implemented
  • However, the samples generated by BEGAN feel slightly different from those of other models: details seem to disappear.
  • So I wonder what the results would be on other datasets

batch_size=16, z_dim=64, gamma=0.5.

30k | 50k | 75k
[image: began.30k] | [image: began.50k] | [image: began.75k]

Convergence measure M
[image: began.M]
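BEGAN balances D and G with a control variable k_t and reports the convergence measure M plotted above. Following the paper's equations, L_D = L(x) − k_t·L(G(z)), L_G = L(G(z)), k_{t+1} = clip(k_t + λ_k(γ·L(x) − L(G(z))), 0, 1) and M = L(x) + |γ·L(x) − L(G(z))|. A sketch on scalar autoencoder reconstruction losses, with λ_k=0.001 from the paper; for simplicity it uses one fake batch where the paper uses separate z_D and z_G:

```python
def began_step(k, loss_real, loss_fake, gamma=0.5, lambda_k=0.001):
    """One BEGAN bookkeeping step.

    loss_real: reconstruction loss L(x) on real images
    loss_fake: reconstruction loss L(G(z)) on generated images
    Returns (d_loss, g_loss, k_next, m_global).
    """
    balance = gamma * loss_real - loss_fake
    d_loss = loss_real - k * loss_fake
    g_loss = loss_fake
    # Proportional control of k, clipped to [0, 1].
    k_next = min(max(k + lambda_k * balance, 0.0), 1.0)
    # Convergence measure M = L(x) + |gamma * L(x) - L(G(z))|.
    m_global = loss_real + abs(balance)
    return d_loss, g_loss, k_next, m_global
```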

BEGAN works terribly on the LSUN dataset. Not only was severe mode collapse observed, but the generated images were also unrealistic.

LSUN | LSUN
100k | 150k
[image: began.100k] | [image: began.150k]
200k | 250k
[image: began.200k] | [image: began.250k]

Image-whitening phenomenon

  • The images become whitish and cloudy as training progresses. I am not sure this is the right way to describe it, but it clearly happens.
  • This phenomenon has also been seen in CycleGAN. CycleGAN is based on LSGAN, but LSGAN itself does not show this whitening.

Same hyperparameters; the only difference is gamma=0.4.

50k | 100k | 150k
[image: began.gm4.50k] | [image: began.gm4.100k] | [image: began.gm4.150k]
200k | 250k | 290k
[image: began.gm4.200k] | [image: began.gm4.250k] | [image: began.gm4.290k]

(There is no particular reason why the last experiment is 290k instead of 300k...)

I also tried to reduce the speck-like artifacts as suggested in Heumi/BEGAN-tensorflow, but they did not go away. Even with gamma=0.4, speck-like artifacts are still visible in the experiments above.

DRAGAN

Kodali, Naveen, et al. "How to Train Your DRAGAN." arXiv preprint arXiv:1705.07215 (2017).

  • Unlike other papers, DRAGAN is motivated by game theory as a way to improve GAN performance
  • This game-theoretic approach is highly unique and interesting
  • But IMHO, there is not much real contribution: the algorithm is similar to WGAN-GP
DCGAN architecture
120k
[image: dragan.30k]

The original paper has some bugs. One is that the image x is perturbed only on the positive side. I applied a two-sided perturbation instead, since the author acknowledged this bug on GitHub.
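The one-sided versus two-sided difference can be sketched on a flattened image. As I recall, the reference DRAGAN code perturbs with 0.5 · std(x) · U[0, 1), which only pushes pixel values upward; the two-sided variant below draws from U[−1, 1] instead, which is an assumption about the fix rather than this repo's exact code. The gradient-norm penalty is then evaluated at a random point between x and its perturbation, in the same form as WGAN-GP's.

```python
import random
import statistics


def perturb(x, scale=0.5, two_sided=True, rng=random):
    """Noisy copy of a flattened real image x (list of floats).

    Noise magnitude is scale * std(x): U[0, 1) when one-sided (values only go up),
    U[-1, 1] when two-sided (assumed form of the fix).
    """
    std = statistics.pstdev(x)
    if two_sided:
        noise = [rng.uniform(-1.0, 1.0) for _ in x]
    else:
        noise = [rng.random() for _ in x]
    return [xi + scale * std * n for xi, n in zip(x, noise)]


def penalty_point(x, x_perturbed, rng=random):
    """Random point on the segment from x to its perturbation."""
    alpha = rng.random()
    return [xi + alpha * (pi - xi) for xi, pi in zip(x, x_perturbed)]
```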

LSUN
200k
[image: dragan.200k]

CoulombGAN

Unterthiner, Thomas, et al. "Coulomb GANs: Provably Optimal Nash Equilibria via Potential Fields." arXiv preprint arXiv:1708.08819 (2017).

  • CoulombGAN also takes a very interesting perspective: the "Coulomb potential".
  • It is very interesting, but I am not sure whether it is really a GAN.
  • CoulombGAN tries to solve the diversity problem (mode collapse)

G_lr=5e-4, D_lr=25e-5, z_dim=32.

DCGAN architecture
200k
[image: coulombgan.200k]

The disadvantage of this model is that it takes a very long time to train despite its simple network architecture. Furthermore, like the original GAN, it has no convergence measure. I expected the potentials of fake samples to serve as a convergence measure, but they did not.
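The "potential of fake samples" can be made concrete with the Plummer kernel k(a, b) = 1 / (‖a − b‖² + ε²)^((d − 2)/2) used in the Coulomb GAN paper: real samples act as positive charges and generated samples as negative ones. With this sign convention, generated samples move toward higher potential, so they are pulled toward real samples and pushed away from one another. The kernel dimension d and ε values below are arbitrary choices for illustration, and in the paper the discriminator learns to approximate this potential rather than computing it exactly.

```python
def plummer_kernel(a, b, dim=3, eps=1.0):
    """Plummer kernel 1 / (||a - b||^2 + eps^2)^((dim - 2) / 2)."""
    sq = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return 1.0 / (sq + eps * eps) ** ((dim - 2) / 2.0)


def potential(a, real, fake, dim=3, eps=1.0):
    """Potential at a: real samples are positive charges, generated samples negative."""
    pos = sum(plummer_kernel(a, r, dim, eps) for r in real) / float(len(real))
    neg = sum(plummer_kernel(a, f, dim, eps) for f in fake) / float(len(fake))
    return pos - neg


def mean_fake_potential(real, fake, dim=3, eps=1.0):
    """Average potential at the generated samples."""
    return sum(potential(f, real, fake, dim, eps) for f in fake) / float(len(fake))
```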

Usage

Download the CelebA and LSUN datasets:

$ python download.py celebA
$ python download.py lsun

Convert images to TFRecords format:
The conversion options are hard-coded, so modify them before running convert.py. Note in particular that the LSUN dataset is provided in LMDB format.

$ python convert.py

Train:
To change the settings of a model, you must modify the code directly.

$ python train.py --help
usage: train.py [-h] [--num_epochs NUM_EPOCHS] [--batch_size BATCH_SIZE]
                [--num_threads NUM_THREADS] --model MODEL [--name NAME]
                --dataset DATASET [--ckpt_step CKPT_STEP] [--renew]

optional arguments:
  -h, --help            show this help message and exit
  --num_epochs NUM_EPOCHS
                        default: 20
  --batch_size BATCH_SIZE
                        default: 128
  --num_threads NUM_THREADS
                        # of data read threads (default: 4)
  --model MODEL         DCGAN / LSGAN / WGAN / WGAN-GP / EBGAN / BEGAN /
                        DRAGAN / CoulombGAN
  --name NAME           default: name=model
  --dataset DATASET, -D DATASET
                        CelebA / LSUN
  --ckpt_step CKPT_STEP
                        # of steps for saving checkpoint (default: 5000)
  --renew               train model from scratch - clean saved checkpoints and
                        summaries

Monitor through TensorBoard:

$ tensorboard --logdir=summary/dataset/name

Evaluate (generate fake samples):

$ python eval.py --help
usage: eval.py [-h] --model MODEL [--name NAME] --dataset DATASET
               [--sample_size SAMPLE_SIZE]

optional arguments:
  -h, --help            show this help message and exit
  --model MODEL         DCGAN / LSGAN / WGAN / WGAN-GP / EBGAN / BEGAN /
                        DRAGAN / CoulombGAN
  --name NAME           default: name=model
  --dataset DATASET, -D DATASET
                        CelebA / LSUN
  --sample_size SAMPLE_SIZE, -N SAMPLE_SIZE
                        # of samples. It should be a square number. (default:
                        16)
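The square-number requirement on --sample_size is presumably there so the samples can be tiled into a √N × √N grid. A minimal sketch of that check and layout (hypothetical helpers, not eval.py's actual code; it avoids `math.isqrt` so it also runs on Python 2.7):

```python
import math


def grid_side(sample_size):
    """Return sqrt(sample_size), or raise if sample_size is not a perfect square."""
    side = int(round(math.sqrt(sample_size)))
    if side * side != sample_size:
        raise ValueError("sample_size must be a square number, got %d" % sample_size)
    return side


def to_grid(images, sample_size):
    """Arrange a flat list of sample_size images into side x side rows."""
    if len(images) != sample_size:
        raise ValueError("expected %d images, got %d" % (sample_size, len(images)))
    side = grid_side(sample_size)
    return [images[r * side:(r + 1) * side] for r in range(side)]
```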

Requirements

  • python 2.7
  • tensorflow >= 1.2 (verified on 1.2 and 1.3)
  • tqdm
  • (optional) pynvml - for automatic GPU selection

Similar works
