Under review as a conference paper at ICLR 2019
DIAGNOSING AND ENHANCING VAE MODELS
Anonymous authors Paper under double-blind review
ABSTRACT
Although variational autoencoders (VAEs) represent a widely influential deep generative model, many aspects of the underlying energy function remain poorly understood. In particular, it is commonly believed that Gaussian encoder/decoder assumptions reduce the effectiveness of VAEs in generating realistic samples. In this regard, we rigorously analyze the VAE objective, differentiating situations where this belief is and is not actually true. We then leverage the corresponding insights to develop a simple VAE enhancement that requires no additional hyperparameters or sensitive tuning. Quantitatively, this proposal produces crisp samples and stable FID scores that are actually competitive with state-of-the-art GAN models, all while retaining desirable attributes of the original VAE architecture.
1 INTRODUCTION
Our starting point is the desire to learn a probabilistic generative model of observable variables x ∈ χ, where χ is an r-dimensional manifold embedded in Rd. Note that if r = d, then this assumption places no restriction on the distribution of x ∈ Rd whatsoever; however, the added formalism is introduced to handle the frequently encountered case where x possesses low-dimensional structure relative to a high-dimensional ambient space, i.e., r ≪ d. In fact, the very utility of generative models, and their attendant low-dimensional representations, often hinges on this assumption (Bengio et al., 2013). It therefore behooves us to explicitly account for this situation.
Beyond this, we assume that χ is a simple Riemannian manifold, which means there exists a diffeomorphism ϕ between χ and Rr, or more explicitly, the mapping ϕ : χ → Rr is invertible and differentiable. Denote a ground-truth probability measure on χ as µgt such that the probability mass of an infinitesimal dx on the manifold is µgt(dx) and ∫_χ µgt(dx) = 1.
The variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) attempts to approximate this ground-truth measure using a parameterized density pθ(x) defined across all of Rd since any underlying generative manifold is unknown in advance. This density is further assumed to admit the latent decomposition pθ(x) = ∫ pθ(x|z) p(z) dz, where z ∈ Rκ serves as a low-dimensional representation, with κ ≥ r and prior p(z) = N(z|0, I).
Ideally we might like to minimize the negative log-likelihood −log pθ(x) averaged across the ground-truth measure µgt, i.e., solve min_θ ∫ −log pθ(x) µgt(dx). Unfortunately though, the required marginalization over z is generally infeasible. Instead the VAE model relies on tractable encoder qφ(z|x) and decoder pθ(x|z) distributions, where φ represents additional trainable parameters. The canonical VAE cost is a bound on the average negative log-likelihood given by

L(θ, φ) ≜ ∫ { −log pθ(x) + KL[qφ(z|x) || pθ(z|x)] } µgt(dx) ≥ ∫ −log pθ(x) µgt(dx),   (1)

where the inequality follows directly from the non-negativity of the KL-divergence. Here φ can be viewed as tuning the tightness of the bound, while θ dictates the actual estimation of µgt. Using a few standard manipulations, this bound can also be expressed as

L(θ, φ) = ∫ { −E_{qφ(z|x)}[log pθ(x|z)] + KL[qφ(z|x) || p(z)] } µgt(dx),   (2)

which explicitly involves the encoder/decoder distributions and is conveniently amenable to SGD optimization of {θ, φ} via a reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014). The first term in (2) can be viewed as a reconstruction cost (or a stochastic analog of a traditional autoencoder), while the second penalizes posterior deviations from the prior p(z). Additionally, for any realizable implementation via SGD, the integration over χ must be approximated via a finite sum across training samples {x(i)}_{i=1}^n drawn from µgt. Nonetheless, examining the true objective L(θ, φ) can lead to important, practically-relevant insights.
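As a concrete point of reference, the cost in (2) is typically estimated with a single reparameterized sample per data point. The following is a minimal sketch of that estimator, not the paper's implementation; `encoder` and `decoder` are hypothetical modules returning diagonal-Gaussian moments and a decoder mean, respectively.

```python
import torch

def vae_cost(x, encoder, decoder):
    # Encoder moments of q(z|x): mean and log-variance of a diagonal Gaussian.
    mu_z, logvar_z = encoder(x)
    # Reparameterization trick: z = mu_z + sigma_z * eps with eps ~ N(0, I).
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
    # Reconstruction term -E_q[log p(x|z)] for a unit-variance Gaussian decoder,
    # dropping additive constants.
    mu_x = decoder(z)
    recon = 0.5 * ((x - mu_x) ** 2).sum(dim=-1)
    # Closed-form KL[q(z|x) || N(0, I)] for diagonal Gaussians.
    kl = 0.5 * (torch.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z).sum(dim=-1)
    # Minibatch Monte Carlo estimate of the integral over mu_gt in (2).
    return (recon + kl).mean()
```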
At least in principle, qφ(z|x) and pθ(x|z) can be arbitrary distributions, in which case we could simply enforce qφ(z|x) = pθ(z|x) ∝ pθ(x|z)p(z) such that the bound from (1) is tight. Unfortunately though, this is essentially always an intractable undertaking. Consequently, largely to facilitate practical implementation, the most commonly adopted distributional assumption is that both qφ(z|x) and pθ(x|z) are Gaussian. This design choice has previously been cited as a key limitation of VAEs (Burda et al., 2015; Kingma et al., 2016), and existing quantitative tests of generative modeling quality thus far dramatically favor contemporary alternatives such as generative adversarial networks (GAN) (Goodfellow et al., 2014b). Regardless, because the VAE possesses certain desirable properties relative to GAN models (e.g., stable training (Tolstikhin et al., 2018), interpretable encoder/inference network (Brock et al., 2016), outlier-robustness (Dai et al., 2018), etc.), it remains a highly influential paradigm worthy of examination and enhancement.
In Section 2 we closely investigate the implications of VAE Gaussian assumptions leading to a number of interesting diagnostic conclusions. In particular, we differentiate the situation where r = d, in which case we prove that recovering the ground-truth distribution is actually possible iff the VAE global optimum is reached, and r < d, in which case the VAE global optimum can be reached by solutions that reflect the ground-truth distribution almost everywhere, but not necessarily uniquely so. In other words, there could exist alternative solutions that both reach the global optimum and yet do not assign the same probability measure as µgt.
Section 3 then further probes this non-uniqueness issue by inspecting necessary conditions of global optima when r < d. This analysis reveals that an optimal VAE parameterization will provide an encoder/decoder pair capable of perfectly reconstructing all x using any z drawn from q(z|x). Moreover, we demonstrate that the VAE accomplishes this using a degenerate latent code whereby only r dimensions are effectively active. Collectively, these results indicate that the VAE global optimum can in fact uniquely learn a mapping to the correct ground-truth manifold when r < d, but not necessarily the correct probability measure within this manifold, a critical distinction.
Next we leverage these analytical results in Section 4 to motivate an almost trivially-simple, two-stage VAE enhancement for addressing typical regimes when r < d. In brief, the first stage just learns the manifold per the allowances from Section 3, and in doing so, provides a mapping to a lower-dimensional intermediate representation with no degenerate dimensions that mirrors the r = d regime. The second (much smaller) stage then only needs to learn the correct probability measure on this intermediate representation, which is possible per the analysis from Section 2. Experiments from Section 5 reveal that this procedure can generate high-quality crisp samples, avoiding the blurriness often attributed to VAE models in the past (Dosovitskiy & Brox, 2016; Larsen et al., 2015). And to the best of our knowledge, this is the first demonstration of a VAE pipeline that can produce stable FID scores, an influential recent metric for evaluating generated sample quality (Heusel et al., 2017), that equal or exceed those of multiple state-of-the-art GAN models. Moreover, this is accomplished without additional penalty functions, cost function modifications, or sensitive tuning parameters.
2 HIGH-LEVEL IMPACT OF VAE GAUSSIAN ASSUMPTIONS
Conventional wisdom suggests that VAE Gaussian assumptions will introduce a gap between L(θ, φ) and the ideal negative log-likelihood ∫ −log pθ(x) µgt(dx), compromising efforts to learn the ground-truth measure. However, we will now argue that this pessimism is in some sense premature. In fact, we will demonstrate that, even with the stated Gaussian distributions, there exist parameters θ and φ that can simultaneously: (i) Globally optimize the VAE objective and, (ii) Recover the ground-truth probability measure in a certain sense described below. This is possible because, at least for some coordinated values of θ and φ, qφ(z|x) and pθ(z|x) can indeed become arbitrarily close. Before presenting the details, we first formalize a κ-simple VAE, which is merely a VAE model with explicit Gaussian assumptions and parameterizations:
Definition 1 A κ-simple VAE is defined as a VAE model with dim[z] = κ latent dimensions, the Gaussian encoder qφ(z|x) = N(z|µz, Σz), and the Gaussian decoder pθ(x|z) = N(x|µx, Σx). Moreover, the encoder moments are defined as µz = fµz(x; φ) and Σz = SzSz⊤ with Sz = fSz(x; φ). Likewise, the decoder moments are µx = fµx(z; θ) and Σx = γI. Here γ > 0 is a tunable scalar, while fµz, fSz and fµx specify parameterized differentiable functional forms that can be arbitrarily complex, e.g., a deep neural network.
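To make the role of γ concrete, the hypothetical sketch below parameterizes the decoder of Definition 1 with a single learnable scalar γ, so that −log pθ(x|z) = ||x − µx||²/(2γ) + (d/2) log(2πγ) is minimized jointly with the network weights. This is only an illustration of the parameterization, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class GammaSimpleDecoder(nn.Module):
    """Gaussian decoder N(x | f_mu_x(z; theta), gamma * I) with learnable scalar gamma."""

    def __init__(self, mean_net: nn.Module):
        super().__init__()
        self.mean_net = mean_net                        # f_mu_x(z; theta)
        self.log_gamma = nn.Parameter(torch.zeros(()))  # gamma = exp(log_gamma) > 0

    def neg_log_likelihood(self, x, z):
        mu_x = self.mean_net(z)
        gamma = torch.exp(self.log_gamma)
        d = x.shape[-1]
        # -log N(x | mu_x, gamma I) = ||x - mu_x||^2 / (2 gamma) + (d/2) log(2 pi gamma)
        sq_err = ((x - mu_x) ** 2).sum(dim=-1)
        return sq_err / (2.0 * gamma) + 0.5 * d * (math.log(2.0 * math.pi) + self.log_gamma)
```

Allowing γ to shrink as training proceeds is precisely the behavior examined in Section 3, where the d log γ term comes to dominate near optimal solutions.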
Equipped with these definitions, we will now demonstrate that a κ-simple VAE, with κ ≥ r, can achieve the optimality criteria (i) and (ii) from above. In doing so, we first consider the simpler case where r = d, followed by the extended scenario with r < d. The distinction between these two cases turns out to be significant, with practical implications to be explored in Section 4.
2.1 MANIFOLD DIMENSION EQUAL TO AMBIENT SPACE DIMENSION (r = d)
We first analyze the specialized situation where r = d. Assuming pgt(x) ≜ µgt(dx)/dx exists everywhere in Rd, then pgt(x) represents the ground-truth probability density with respect to the standard Lebesgue measure in Euclidean space. Given these considerations, the minimal possible value of (1) will necessarily occur if

KL[qφ(z|x) || pθ(z|x)] = 0 and pθ(x) = pgt(x) almost everywhere.   (3)

This follows because by VAE design it must be that L(θ, φ) ≥ −∫ pgt(x) log pgt(x) dx, and in the present context, this lower bound is achievable iff the conditions from (3) hold. Collectively, this implies that the approximate posterior produced by the encoder qφ(z|x) is in fact perfectly matched to the actual posterior pθ(z|x), while the corresponding marginalized data distribution pθ(x) is perfectly matched to the ground-truth density pgt(x) as desired. Perhaps surprisingly, a κ-simple VAE can actually achieve such a solution:
Theorem 1 Suppose that r = d and there exists a density pgt(x) associated with the ground-truth measure µgt that is nonzero everywhere on Rd.1 Then for any κ ≥ r, there is a sequence of κ-simple VAE model parameters {θt*, φt*} such that

lim_{t→∞} KL[qφt*(z|x) || pθt*(z|x)] = 0 and lim_{t→∞} pθt*(x) = pgt(x) almost everywhere.   (4)
All the proofs can be found in the supplementary file. So at least when r = d, the VAE Gaussian assumptions need not actually prevent the optimal ground-truth probability measure from being recovered, as long as the latent dimension is sufficiently large (i.e., κ ≥ r). And contrary to popular notions, a richer class of distributions is not required to achieve this. Of course Theorem 1 only applies to a restricted case that excludes d > r; however, later we will demonstrate that a key consequence of this result can nonetheless be leveraged to dramatically enhance VAE performance.
2.2 MANIFOLD DIMENSION LESS THAN AMBIENT SPACE DIMENSION (r < d)
When r < d, additional subtleties are introduced that will be unpacked both here and in the sequel. To begin, if both qφ(z|x) and pθ(x|z) are arbitrary/unconstrained (i.e., not necessarily Gaussian), then inf_{θ,φ} L(θ, φ) = −∞. To achieve this global optimum, we need only choose φ such that qφ(z|x) = pθ(z|x) (minimizing the KL term from (1)) while selecting θ such that all probability mass collapses to the correct manifold χ. In this scenario the density pθ(x) will become unbounded on χ and zero elsewhere, such that ∫ −log pθ(x) µgt(dx) will approach negative infinity.
But of course the stated Gaussian assumptions from the κ-simple VAE model could ostensibly prevent this from occurring by causing the KL term to blow up, counteracting the negative log-likelihood factor. We will now analyze this case to demonstrate that this need not happen. Before proceeding to this result, we first define a manifold density p̃gt(x) as the probability density (assuming it exists) of µgt with respect to the volume measure of the manifold χ. If d = r then this volume measure reduces to the standard Lebesgue measure in Rd and p̃gt(x) = pgt(x); however, when d > r a density pgt(x) defined in Rd will not technically exist, while p̃gt(x) is still perfectly well-defined. We then have the following:
Theorem 2 Assume r < d and that there exists a manifold density p̃gt(x) associated with the ground-truth measure µgt that is nonzero everywhere on χ. Then for any κ ≥ r, there is a sequence of κ-simple VAE model parameters {θt*, φt*} such that

(i) lim_{t→∞} KL[qφt*(z|x) || pθt*(z|x)] = 0 and lim_{t→∞} ∫ −log pθt*(x) µgt(dx) = −∞,   (5)

(ii) lim_{t→∞} ∫_{x∈A} pθt*(x) dx = µgt(A ∩ χ)   (6)

for all measurable sets A ⊆ Rd with µgt(∂A ∩ χ) = 0, where ∂A is the boundary of A.

1This nonzero assumption can be replaced with a much looser condition. Specifically, if there exists a diffeomorphism between the set {x | pgt(x) ≠ 0} and Rd, then it can be shown that Theorem 1 still holds even if pgt(x) = 0 for some x ∈ Rd.
Technical details notwithstanding, Theorem 2 admits a very intuitive interpretation. First, (5) directly implies that the VAE Gaussian assumptions do not prevent minimization of L(θ, φ) from converging to minus infinity, which can be trivially viewed as a globally optimum solution. Furthermore, based on (6), this solution can be achieved with a limiting density estimate that will assign a probability mass to almost all measurable subsets of Rd that is indistinguishable from the ground-truth measure (which confines all mass to χ). Hence this solution is more-or-less an arbitrarily-good approximation to µgt for all practical purposes.2
Regardless, there is an absolutely crucial distinction between Theorem 2 and the simpler case quantified by Theorem 1. Although both describe conditions whereby the κ-simple VAE can achieve the minimal possible objective, in the r = d case achieving the lower bound (whether the specific parameterization for doing so is unique or not) necessitates that the ground-truth probability measure has been recovered almost everywhere. But the r < d situation is quite different because we have not ruled out the possibility that a different set of parameters {θ, φ} could push L(θ, φ) to −∞ and yet not achieve (6). In other words, the VAE could reach the lower bound but fail to closely approximate µgt. And we stress that this uniqueness issue is not a consequence of the VAE Gaussian assumptions per se; even if qφ(z|x) were unconstrained the same lack of uniqueness can persist.
Rather, the intrinsic difficulty is that, because the VAE model does not have access to the ground-truth low-dimensional manifold, it must implicitly rely on a density pθ(x) defined across all of Rd as mentioned previously. Moreover, if this density converges towards infinity on the manifold during training without increasing the KL term at the same rate, the VAE cost can be unbounded from below, even in cases where (6) is not satisfied, meaning incorrect assignment of probability mass.
To conclude, the key take-home message from this section is that, at least in principle, VAE Gaussian assumptions need not actually be the root cause of any failure to recover ground-truth distributions. Instead we expose a structural deficiency that lies elsewhere, namely, the non-uniqueness of solutions that can optimize the VAE objective without necessarily learning a close approximation to µgt. But to probe this issue further and motivate possible workarounds, it is critical to further disambiguate these optimal solutions and their relationship with ground-truth manifolds. This will be the task of Section 3, where we will explicitly differentiate the problem of locating the correct ground-truth manifold from the task of learning the correct probability measure within the manifold.
Note that the only comparable prior work we are aware of related to the results in this section comes from Doersch (2016), where the implications of adopting Gaussian encoder/decoder pairs in the specialized case of r = d = 1 are briefly considered. Moreover, the analysis there requires additional much stronger assumptions than ours, namely, that pgt(x) should be nonzero and infinitely differentiable everywhere in the requisite 1D ambient space. These requirements of course exclude essentially all practical usage regimes where d = r > 1 or d > r, or when ground-truth densities are not sufficiently smooth.
3 OPTIMAL SOLUTIONS AND THE GROUND TRUTH MANIFOLD
We will now more closely examine the properties of optimal κ-simple VAE solutions, and in particular, the degree to which we might expect them to at least reflect the true χ, even if perhaps not the correct probability measure µgt defined within χ. To do so, we must first consider some necessary conditions for VAE optima:
Theorem 3 Let {θγ*, φγ*} denote an optimal κ-simple VAE solution (with κ ≥ r) where the decoder variance γ is fixed (i.e., it is the sole unoptimized parameter). Moreover, we assume that µgt is not a Gaussian distribution when d = r.3 Then for any γ > 0, there exists a γ′ < γ such that L(θγ′*, φγ′*) < L(θγ*, φγ*).
2Note that (6) is only framed in this technical way to accommodate the difficulty of comparing a measure µgt restricted to χ with the VAE density pθ(x) defined everywhere in Rd. See the supplementary for details.
3This requirement is only included to avoid a practically irrelevant form of non-uniqueness that exists with full, non-degenerate Gaussian distributions.
This result implies that we can always reduce the VAE cost by choosing a smaller value of γ, and hence, if γ is not constrained, it must be that γ → 0 if we wish to minimize (2). Despite this necessary optimality condition, in existing practical VAE applications, it is standard to fix γ ≈ 1 during training. This is equivalent to simply adopting a non-adaptive squared-error loss for the decoder and, at least in part, likely contributes to unrealistic/blurry VAE-generated samples. Regardless, there are more significant consequences of this intrinsic favoritism for γ → 0, in particular as related to reconstructing data drawn from the ground-truth manifold χ:
Theorem 4 Applying the same conditions and definitions as in Theorem 3, then for all x drawn from µgt, we also have that

lim_{γ→0} fµx( fµz(x; φγ*) + fSz(x; φγ*) ε ; θγ* ) = lim_{γ→0} fµx( fµz(x; φγ*); θγ* ) = x,   ∀ ε ∈ Rκ.   (7)
By design any random draw z ∼ qφγ*(z|x) can be expressed as z = fµz(x; φγ*) + fSz(x; φγ*) ε for some ε ∼ N(ε|0, I). From this vantage point then, (7) effectively indicates that any x ∈ χ will be perfectly reconstructed by the VAE encoder/decoder pair at globally optimal solutions, achieving this necessary condition despite any possible stochastic corrupting factor fSz(x; φγ*) ε.
But still further insights can be obtained when we more closely inspect the VAE objective function behavior at arbitrarily small but explicitly nonzero values of γ. In particular, when κ = r (meaning z has no superfluous capacity), Theorem 4 and attendant analyses in the supplementary ultimately imply that the squared eigenvalues of fSz(x; φγ*) will become arbitrarily small at a rate proportional to γ, meaning (1/√γ) fSz(x; φγ*) ≈ O(1) under mild conditions. It then follows that the VAE data term integrand from (2), in the neighborhood around optimal solutions, behaves as

−2 E_{qφγ*(z|x)}[ log pθγ*(x|z) ] = 2 E_{qφγ*(z|x)}[ (1/(2γ)) ||x − fµx(z; θγ*)||²₂ ] + d log 2πγ ≈ E_{qφγ*(z|x)}[O(1)] + d log 2πγ = d log γ + O(1).   (8)

This expression can be derived by excluding the higher-order terms of a Taylor series approximation of fµx( fµz(x; φγ*) + fSz(x; φγ*) ε ; θγ* ) around the point fµz(x; φγ*), which will be relatively tight under the stated conditions. But because 2 E_{qφγ*(z|x)}[ (1/(2γ)) ||x − fµx(z; θγ*)||²₂ ] ≥ 0, a theoretical lower bound on (8) is given by d log 2πγ ≈ d log γ + O(1). So in this sense (8) cannot be significantly lowered further.
This observation is significant when we consider the inclusion of additional latent dimensions by allowing κ > r. Clearly based on the analysis above, adding dimensions to z cannot improve the value of the VAE data term in any meaningful way. However, it can have a detrimental impact on the KL regularization factor in the γ → 0 regime, where

2 KL[qφ(z|x) || p(z)] ≡ trace[Σz] + ||µz||²₂ − log|Σz| − κ ≈ −r̂ log γ + O(1).   (9)

Here r̂ denotes the number of eigenvalues {λj(γ)}_{j=1}^κ of fSz(x; φγ*) (or equivalently Σz) that satisfy λj(γ) → 0 as γ → 0. r̂ can be viewed as an estimate of how many low-noise latent dimensions the VAE model is preserving to reconstruct x. Based on (9), there is obvious pressure to make r̂ as small as possible, at least without disrupting the data fit. The smallest possible value is r̂ = r, since it is not difficult to show that any value below this will contribute consequential reconstruction errors, causing 2 E_{qφγ*(z|x)}[ (1/(2γ)) ||x − fµx(z; θγ*)||²₂ ] to grow at a rate of 1/γ, pushing the entire cost function towards infinity.4
Therefore, in the neighborhood of optimal solutions the VAE will naturally seek to produce perfect reconstructions using the fewest number of clean, low-noise latent dimensions, meaning dimensions whereby qφ(z|x) has negligible variance. For superfluous dimensions that are unnecessary for representing x, the associated encoder variance in these directions can be pushed to one. This will optimize KL[qφ(z|x) || p(z)] along these directions, and the decoder can selectively block the residual randomness to avoid influencing the reconstructions per Theorem 4. So in this sense the VAE is capable of learning a minimal representation of the ground-truth manifold when r < κ.
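A small numeric check of the pressure described above, assuming the diagonal-Gaussian form of Definition 1: a superfluous dimension with posterior mean 0 and variance 1 contributes zero KL in (9), while an active, low-noise dimension with variance on the order of γ contributes roughly −½ log γ. The helper below is a hypothetical illustration, not part of the original pipeline.

```python
import numpy as np

def kl_per_dim(mu, var):
    # Per-dimension KL[N(mu, var) || N(0, 1)] = 0.5 * (var + mu^2 - 1 - log var).
    return 0.5 * (var + mu**2 - 1.0 - np.log(var))

print(kl_per_dim(0.0, 1.0))    # superfluous dimension: 0.0
print(kl_per_dim(0.0, 1e-4))   # low-noise dimension: ~4.1, dominated by -0.5*log(var)
```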
4Note that Cγ⁻¹ + log γ → ∞ as γ → 0 for any fixed C > 0.
But we must emphasize that the VAE can learn χ independently of the actual distribution µgt within χ. Addressing the latter is a completely separate issue from achieving the perfect reconstruction error defined by Theorem 4. This fact can be understood within the context of a traditional PCA-like model, which is perfectly capable of learning a low-dimensional subspace containing some training data without actually learning the distribution of the data within this subspace. The central issue is that there exists an intrinsic bias associated with the VAE objective such that fitting the distribution within the manifold will be completely neglected whenever there exists the chance for even an infinitesimally better approximation of the manifold itself.
Stated differently, if VAE model parameters have learned a near optimal, parsimonious latent mapping onto χ using γ ≈ 0, then the VAE cost will scale as (d − r) log γ regardless of µgt. Hence there remains a huge incentive to reduce the reconstruction error still further, allowing γ to push even closer to zero and the cost closer to −∞. And if we constrain γ to be sufficiently large so as to prevent this from happening, then we risk degrading/blurring the reconstructions and widening the gap between qφ(z|x) and pθ(z|x), which can also compromise estimation of µgt. Fortunately though, as will be discussed next there is a convenient way around this dilemma by exploiting the fact that this dominating (d − r) log γ factor goes away when d = r.
4 FROM THEORY TO PRACTICAL VAE ENHANCEMENTS
Sections 2 and 3 have exposed a collection of VAE properties with useful diagnostic value in and of themselves. But the practical utility of these results, beyond the underappreciated benefit of learning χ, warrants further exploration. In this regard, suppose we wish to develop a generative model of high-dimensional data x where unknown low-dimensional structure is significant (i.e., the r < d case with r unknown). The results from Section 3 indicate that the VAE can partially handle this situation by learning a parsimonious representation of low-dimensional manifolds, but not necessarily the correct probability measure µgt within such a manifold. In quantitative terms, this means that a decoder pθ(x|z) will map all samples from an encoder qφ(z|x) to the correct manifold such that the reconstruction error is negligible for any x ∈ χ. But if the measure µgt on χ has not been accurately estimated, then
q(z) ≜ ∫ qφ(z|x) µgt(dx) ≠ ∫_{Rd} pθ(z|x) pθ(x) dx = ∫_{Rd} pθ(x|z) p(z) dx = N(z|0, I),   (10)
where q(z) is sometimes referred to as the aggregated posterior (Makhzani et al., 2016). In other words, the distribution of the latent samples drawn from the encoder distribution, when averaged across the training data, will have lingering latent structure that is errantly incongruous with the original isotropic Gaussian prior. This then disrupts the pivotal ancestral sampling capability of the VAE, implying that samples drawn from N (z|0, I) and then passed through the decoder p(x|z) will not closely approximate µgt. Fortunately, our analysis suggests the following two-stage remedy:
1. Given n observed samples {x(i)}_{i=1}^n, train a κ-simple VAE, with κ ≥ r, to estimate the unknown r-dimensional ground-truth manifold χ embedded in Rd using a minimal number of active latent dimensions. Generate latent samples {z(i)}_{i=1}^n via z(i) ∼ qφ(z|x(i)). By design, these samples will be distributed as q(z), but likely not N(z|0, I).
2. Train a second κ-simple VAE, with independent parameters {θ′, φ′} and latent representation u, to learn the unknown distribution q(z), i.e., treat q(z) as a new ground-truth distribution and use samples {z(i)}_{i=1}^n to learn it.
3. Samples approximating the original ground-truth µgt can then be formed via the extended ancestral process u ∼ N(u|0, I), z ∼ pθ′(z|u), and finally x ∼ pθ(x|z) (a code sketch of this pipeline appears below).
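The sketch below shows one way the three steps could be wired together; `StageOneVAE` and `StageTwoVAE` are hypothetical κ-simple VAE modules (with `fit`, `encode_sample`, and `decode_sample` methods assumed for illustration), standing in for the actual networks described in the supplementary.

```python
import torch

def train_two_stage(x_train, StageOneVAE, StageTwoVAE, kappa=64, u_dim=64):
    # Step 1: fit a kappa-simple VAE to the data, then encode the training set.
    vae1 = StageOneVAE(latent_dim=kappa)
    vae1.fit(x_train)                       # standard VAE training with cost (2)
    z_train = vae1.encode_sample(x_train)   # z ~ q(z|x), i.e., aggregated posterior samples

    # Step 2: fit a second, much smaller VAE treating q(z) as the new ground truth.
    vae2 = StageTwoVAE(latent_dim=u_dim)
    vae2.fit(z_train)
    return vae1, vae2

def sample_two_stage(vae1, vae2, n, u_dim=64):
    # Step 3: extended ancestral sampling u -> z -> x.
    u = torch.randn(n, u_dim)               # u ~ N(u|0, I)
    z = vae2.decode_sample(u)               # z ~ p_theta'(z|u)
    return vae1.decode_sample(z)            # x ~ p_theta(x|z)
```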
The efficacy of the second-stage VAE from above is based on the following. If the first stage was successful, then even though they will not generally resemble N(z|0, I), samples from q(z) will nonetheless have nonzero measure across the full ambient space Rκ. If κ = r, this occurs because the entire latent space is needed to represent an r-dimensional manifold, and if κ > r, then the extra latent dimensions will be naturally filled in via randomness introduced along dimensions associated with the nonzero eigenvalues of the encoder covariance Σz per the analysis in Section 3.
Consequently, as long as we set κ ≥ r, the operational regime of the second-stage VAE is effectively equivalent to the situation described in Section 2.1 where the manifold dimension is equal to
[Figure 1 panel details: (Left) log γ and squared pixel error versus training iterations, comparing a learnable γ against a fixed γ = 1. (Center) decoded samples under latent perturbations annotated with values 1.00, 0.02, and 0.01 and image variances of 0, 37.7, and 357. (Right) singular value spectra (indices 1-60) of latent samples from the first-stage VAE, the second-stage VAE, and an ideal Gaussian.]
Figure 1: Demonstrating VAE properties. (Left) Validation of Theorem 3 and the influence on image reconstructions. (Center) Validation of Theorem 4. (Right) Motivation for two separate VAE stages by comparing the aggregated posteriors q(z) (1st stage) vs. q′(u) (enhanced 2nd stage).
the ambient dimension.5 And as we have already shown there via Theorem 1, the VAE can readily handle this situation, since in the narrow context of the second-stage VAE, d = r = κ, the troublesome (d − r) log γ factor becomes zero, and any globally minimizing solution is uniquely matched to the new ground-truth distribution q(z). Consequently, the revised aggregated posterior q′(u) produced by the second-stage VAE should now closely resemble N(u|0, I). And finally, because we generally assume that d ≫ r, we have found that the second-stage VAE can be quite small.
5 EMPIRICAL EVALUATION OF VAE TWO-STAGE ENHANCEMENT
We initially describe experiments explicitly designed to corroborate some of our previous analytical results using VAE models trained on CelebA (Liu et al., 2015) data; please see the supplementary for training details and more related experiments. First, the leftmost plot of Figure 1 presents support for Theorem 3, where indeed the decoder variance γ does tend towards zero during training. This then allows for tighter image reconstructions with lower average squared error, i.e., a better manifold fit as expected. The center plot bolsters Theorem 4 and the analysis that follows by showcasing the dissimilar impact of noise factors applied to different directions in the latent space before passage through the decoder mean network fµx. In a direction where an eigenvalue λj of Σz is large (i.e., a superfluous dimension), a random perturbation is completely muted by the decoder as predicted. In contrast, in directions where such eigenvalues are small (i.e., needed for representing the manifold), varying the input causes large changes in the image space reflecting reasonable movement along the correct manifold. Finally, the rightmost plot of Figure 1 displays the singular value spectrum of latent sample matrices drawn from the first- and second-stage VAE models. As expected, the latter is much closer to the spectrum from an analogous i.i.d. N(0, I) matrix. This indicates a superior latent representation, providing high-level support for our two-stage VAE proposal.
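The right-panel diagnostic is easy to reproduce: stack encoded latent samples into a matrix and compare its singular values with those of an i.i.d. N(0, I) matrix of the same shape. A minimal sketch, assuming `z_samples` is an n × κ array of aggregated-posterior samples from either stage:

```python
import numpy as np

def latent_spectrum(z_samples):
    # Singular values (largest first) of the n x kappa matrix of latent samples.
    return np.linalg.svd(np.asarray(z_samples), compute_uv=False)

def gaussian_reference_spectrum(n, kappa, seed=0):
    # Spectrum of an i.i.d. N(0, I) matrix of the same shape, for comparison.
    rng = np.random.default_rng(seed)
    return np.linalg.svd(rng.standard_normal((n, kappa)), compute_uv=False)

# A spectrum close to the Gaussian reference (as observed for the second-stage
# samples of u) indicates an aggregated posterior resembling the isotropic prior.
```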
Next we present quantitative evaluation of novel generated samples using the large-scale testing protocol of GAN models from (Lucic et al., 2018). In this regard, GANs are well-known to dramatically outperform existing VAE approaches in terms of the Fréchet Inception Distance (FID) score (Heusel et al., 2017) and related quantitative metrics. For fair comparison, (Lucic et al., 2018) adopted a common neutral architecture for all models, with generator and discriminator networks based on InfoGAN (Chen et al., 2016a); the point here is standardized comparisons, not tuning arbitrarily-large networks to achieve the lowest possible absolute FID values. We applied the same architecture to our first-stage VAE decoder and encoder networks respectively for direct comparison. For the low-dimensional second-stage VAE we used small, 3-layer networks contributing negligible additional parameters beyond the first stage (see the supplementary for further design details).6
We compared our proposed two-stage VAE pipeline against three baseline VAE models differing only in the decoder output layer: a Gaussian layer with fixed γ, a Gaussian layer with a learned γ, and a cross-entropy layer as has been adopted in several previous applications involving images
5Note that if a regular autoencoder were used to replace the first-stage VAE, then this would no longer be the case, so indeed a VAE is required for both stages.
6It should also be emphasized that concatenating the two stages and jointly training does not improve the performance. If trained jointly the few extra second-stage parameters are simply hijacked by the dominant objective from the first stage and forced to work on an incrementally better fit of the manifold. As expected then, on empirical tests (not shown) we have found that this does not improve upon standard VAE baselines.
Model | MNIST | Fashion | CIFAR-10 | CelebA | Mean
MM GAN | 9.8 ± 0.9 | 29.6 ± 1.6 | 72.7 ± 3.6 | 65.6 ± 4.2 | 44.4 ± 2.6
NS GAN | 6.8 ± 0.5 | 26.5 ± 1.6 | 58.5 ± 1.9 | 55.0 ± 3.3 | 36.7 ± 1.8
LSGAN | 7.8 ± 0.6 | 30.7 ± 2.2 | 87.1 ± 47.5 | 53.9 ± 2.8 | 44.9 ± 13.3
WGAN | 6.7 ± 0.4 | 21.5 ± 1.6 | 55.2 ± 2.3 | 41.3 ± 2.0 | 31.2 ± 1.6
WGAN GP | 20.3 ± 5.0 | 24.5 ± 2.1 | 55.8 ± 0.9 | 30.3 ± 1.0 | 32.7 ± 2.3
DRAGAN | 7.6 ± 0.4 | 27.7 ± 1.2 | 69.8 ± 2.0 | 42.3 ± 3.0 | 36.9 ± 1.7
BEGAN | 13.1 ± 1.0 | 22.9 ± 0.9 | 71.4 ± 1.6 | 38.9 ± 0.9 | 36.6 ± 1.1
VAE (cross-entr.) | 23.8 ± 0.6 | 58.7 ± 1.2 | 155.7 ± 11.6 | 85.7 ± 3.8 | 81 ± 4.3
VAE (fixed γ) | 51.2 ± 0.8 | 104.5 ± 1.3 | 113.0 ± 0.7 | 119.8 ± 0.9 | 97.1 ± 0.9
VAE (learned γ) | 47.0 ± 0.9 | 51.5 ± 1.0 | 80.1 ± 0.6 | 67.4 ± 2.1 | 61.5 ± 1.2
2-Stage VAE | 13.4 ± 1.3 | 22.0 ± 0.6 | 71.0 ± 0.6 | 45.9 ± 1.4 | 38.1 ± 1.0
2-Stage VAE* | 11.2 ± 0.5 | 21.3 ± 0.4 | 68.0 ± 0.8 | 23.8 ± 0.5 | 31.1 ± 0.6
Table 1: FID score comparisons. For all GAN-based models, the reported values represent the best FID obtained across a large-scale hyperparameter search conducted separately for each dataset; default settings are considerably worse (Lucic et al., 2018). Likewise outlier cases (e.g., severe mode collapse) were omitted, which would have otherwise degraded these GAN scores and increased standard deviations still further. In contrast, for the VAE results we used only default training settings across all models and datasets (no tuning), except for the 2-Stage VAE*. Here we simply tested a couple of different values for κ and picked the best result for each dataset. Note that specialized architectures and/or random seed optimization can potentially improve the FID score for all models.
(Chen et al., 2016b). We also present results from (Lucic et al., 2018) involving numerous state-of-the-art GAN models, including MM GAN (Goodfellow et al., 2014a), WGAN (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), NS GAN (Fedus et al., 2017), DRAGAN (Kodali et al., 2017), LS GAN (Mao et al., 2017) and BEGAN (Berthelot et al., 2017). Testing is conducted across four significantly different datasets: MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), CIFAR-10 (Krizhevsky & Hinton, 2009) and CelebA (Liu et al., 2015).
For each dataset we executed 10 independent trials and report the mean and standard deviation of the FID scores in Table 1. Despite the fact that all GAN models benefited from a large-scale hyperparameter search executed independently across each dataset to achieve the best results, our proposed two-stage VAE with minimal tuning is capable of equaling or exceeding the performance of all the GAN models and VAE baselines (see Table 1 caption for more details). This is the first demonstration of a VAE pipeline capable of competing with GANs in the arena of generated sample quality. For example, note the poor performance of VAE baselines relative to GANs in Table 1. Representative samples generated using our two-stage VAE model are in the supplementary.
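For reference, once feature embeddings (e.g., Inception activations) have been extracted for real and generated samples, the FID reduces to the Fréchet distance between the two Gaussian fits, ||µ1 − µ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). The sketch below is a generic rendering of that formula, not the exact evaluation code used for Table 1.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feat_real, feat_fake):
    # Fit Gaussians to the two feature sets (rows = samples, columns = feature dims).
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov1 = np.cov(feat_real, rowvar=False)
    cov2 = np.cov(feat_fake, rowvar=False)
    # Matrix square root of the covariance product; discard tiny imaginary parts.
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * covmean))
```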
As a final point of reference, although not an exact VAE model per se, an autoencoder-based architecture that substitutes a Wasserstein distance measure for the KL regularizer from (2) has also recently been proposed (Tolstikhin et al., 2018). Two variants of this approach, termed WAE-MMD and WAE-GAN (because different MMD and GAN regularization factors are included), were evaluated using FID scores, with penalty weights and encoder/decoder networks specifically adapted for use with the CelebA dataset (FID values were not provided for other datasets). A baseline VAE using these networks achieved an FID of 63, which is somewhat better than our VAE baselines presumably because of this tuning for CelebA data. In contrast, the corresponding WAE-MMD and WAE-GAN scores were 55 and 42 respectively. Although these values represent an improvement over the VAE baseline, they are considerably worse than the absolute score of 23.8 achieved by our generic two-stage VAE model with neutral architecture borrowed from (Lucic et al., 2018).
6 CONCLUSION
It is often assumed that there exists an unavoidable trade-off between the stable training, valuable attendant encoder network, and resistance to mode collapse of VAEs, versus the impressive visual quality of images produced by GANs. While we certainly are not claiming that our two-stage VAE model is necessarily superior to the latest and greatest GAN-based model in terms of the realism of generated samples, we do strongly believe that this work at least narrows that gap substantially such that VAEs are worth considering in a broader range of applications.
REFERENCES
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv:1609.07093, 2016.
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv:1509.00519, 2015.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016a.
Xi Chen, Diederik Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv:1611.02731, 2016b.
Bin Dai, Yu Wang, John Aston, Gang Hua, and David Wipf. Hidden talents of the variational autoencoder. arXiv:1706.05148, 2018.
Carl Doersch. Tutorial on variational autoencoders. arXiv:1606.05908, 2016.
Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666, 2016.
William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. arXiv:1710.08446, 2017.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014a.
I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv:1406.2661, 2014b.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
Diederik Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. arXiv:1705.07215, 2017.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv:1512.09300, 2015.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. arXiv:1711.10337v3, 2018.
Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv:1511.05644, 2016.
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision, pp. 2813–2821, 2017.
D.J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. International Conference on Learning Representations, 2018.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
SUPPLEMENTARY FILE Diagnosing and Enhancing VAE Models
1. Introduction
This document contains companion technical material regarding our ICLR 2019 submission. Note that herein all equation numbers referencing back to the main submission document will be prefixed with an `M-' to avoid confusion, i.e., (M-#) will refer to equation (#) from the main text. Similar notation differentiates sections, tables, and figures, e.g., Section M-#, etc.
2. Contents
The remainder of this document includes the following contents:
· Section 3 - Comparison of novel samples generated from our model.
· Section 4 - Example reconstructions of training data.
· Section 5 - Additional experimental results validating theoretical predictions.
· Section 6 - Network structure and experimental settings.
· Section 7 - Proof of Theorem M.1.
· Section 8 - Proof of Theorem M.2.
· Section 9 - Proof of Theorem M.3.
· Section 10 - Proof of Theorem M.4.
· Section 11 - Further analysis of the VAE cost as γ becomes small.
3. Comparison of Novel Samples Generated from our Model
Generation results for the CelebA, MNIST, Fashion-MNIST and CIFAR-10 datasets are shown in Figures 1-4 respectively. When γ is fixed to be one, the generated samples are very blurry. If a learnable γ is used, the samples become sharper; however, there are many lingering artifacts as expected. In contrast, the proposed 2-Stage VAE can remove these artifacts and generate more realistic samples. For comparison purposes, we also show the results from WAE-MMD, WAE-GAN (Tolstikhin et al., 2018) and WGAN-GP (Gulrajani et al., 2017) for the CelebA dataset.
Figure 1: Randomly generated samples on the CelebA dataset (i.e., no cherry-picking). Panels: (a) WAE-MMD, (b) WAE-GAN, (c) WGAN-GP, (d) VAE (fixed γ = 1), (e) VAE (learnable γ), (f) 2-Stage VAE.
Figure 2: Randomly generated samples on the MNIST dataset (i.e., no cherry-picking). Panels: (a) VAE (fixed γ = 1), (b) VAE (learnable γ), (c) 2-Stage VAE.
Figure 3: Randomly generated samples on the Fashion-MNIST dataset (i.e., no cherry-picking). Panels: (a) VAE (fixed γ = 1), (b) VAE (learnable γ), (c) 2-Stage VAE.
Figure 4: Randomly generated samples on the CIFAR-10 dataset (i.e., no cherry-picking). Panels: (a) VAE (fixed γ = 1), (b) VAE (learnable γ), (c) 2-Stage VAE.
4. Example Reconstructions of Training Data
Reconstruction results for the CelebA, MNIST, Fashion-MNIST and CIFAR-10 datasets are shown in Figures 5-8 respectively. On relatively simple datasets like MNIST and Fashion-MNIST, the VAE with learnable γ achieves almost exact reconstruction because of a better estimate of the underlying manifold, consistent with theory. However, the VAE with fixed γ = 1 produces blurry reconstructions as expected. Note that the reconstruction of a 2-Stage VAE is the same as that of a VAE with learnable γ because the second-stage VAE has nothing to do with facilitating the reconstruction task.
Figure 5: Reconstructions on the CelebA dataset. Panels: (a) ground truth, (b) VAE (fixed γ = 1), (c) VAE (learnable γ).
Figure 6: Reconstructions on the MNIST dataset. Panels: (a) ground truth, (b) VAE (fixed γ = 1), (c) VAE (learnable γ).
Figure 7: Reconstructions on the Fashion-MNIST dataset. Panels: (a) ground truth, (b) VAE (fixed γ = 1), (c) VAE (learnable γ).
Figure 8: Reconstructions on the CIFAR-10 dataset. Panels: (a) ground truth, (b) VAE (fixed γ = 1), (c) VAE (learnable γ).
Figure 9: More examples similar to Figure M.1 (center). Each row perturbs one latent direction, annotated by its eigenvalue λj of Σz and the resulting image variance. (a) MNIST: λj = 0.005 (27.33), 0.005 (27.20), 0.007 (19.64), 0.008 (13.90), 0.010 (12.78), 1.000 (0.000). (b) Fashion-MNIST: λj = 0.005 (63.18), 0.005 (72.89), 0.009 (24.56), 0.011 (21.05), 0.030 (5.243), 1.001 (0.000). (c) CelebA: λj = 0.008 (135.2), 0.009 (115.0), 0.015 (42.41), 0.114 (1.156), 1.013 (0.000).
5. Additional Experimental Results Validating Theoretical Predictions
We first present more examples similar to Figure M.1 (center) from the main paper. Random noise is added to µz along different directions and the result is passed through the decoder network. Each row corresponds to a certain direction in the latent space and 15 samples are shown for each direction. These dimensions/rows are ordered by the eigenvalues λj of Σz. The larger λj is, the less impact a random perturbation along this direction will have, as quantified by the reported image variance values. In the first two or three rows, the noise generates some images from different classes/objects/identities, indicating a significant visual difference. For a slightly larger λj, the corresponding dimensions encode relatively less significant attributes as predicted. For example, the fifth row of both MNIST and Fashion-MNIST contains images from the same class but with a slightly different style. The images in the fourth row of the CelebA dataset have very subtle differences. When λj ≈ 1, the corresponding dimensions become completely inactive and all the output images are exactly the same, as shown in the last rows for all the three datasets.
Additionally, as discussed in the main text and below in Section 11, there are likely to be r eigenvalues of Σz converging to zero and κ − r eigenvalues converging to one. We plot the histogram of λj values for both the MNIST and CelebA datasets in Figure 10. For both datasets, λj approximately converges either to zero or one. However, since CelebA is a more complicated dataset than MNIST, the ground-truth manifold dimension of CelebA is likely to be much larger than that of MNIST. So more eigenvalues are expected to be near zero for the CelebA dataset. This is indeed the case, demonstrating that the VAE has the ability to detect the manifold dimension and select the proper number of latent dimensions in practical environments.

Figure 10: Histograms of λj values (frequency, in units of 10⁶, versus λj from 0 to 1.2) on (a) MNIST and (b) CelebA. There are more values around 0 for CelebA because it is more complicated than MNIST and therefore requires more active dimensions to model the underlying manifold.
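The bimodal histogram suggests a simple diagnostic for the number of active latent dimensions r̂: count how many average encoder variances sit near zero rather than near one. A hypothetical sketch (the 0.5 threshold is an illustrative choice, not a value taken from the paper):

```python
import numpy as np

def estimate_active_dims(encoder_variances, threshold=0.5):
    # encoder_variances: array of shape (n_samples, kappa) holding the diagonal of
    # Sigma_z (equivalently the eigenvalues lambda_j) for each training example.
    mean_var = np.asarray(encoder_variances).mean(axis=0)
    active = mean_var < threshold   # low-noise dimensions used for reconstruction
    return int(active.sum()), mean_var
```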
6. Network Structure and Experimental Settings
We first describe the network and training details used in producing Figure M.1 from the main file, and for generating samples and reconstructions in the supplementary. The first-stage VAE network is shown in Figure 11. Basically we use two Residual Blocks for each resolution scale, and we double the number of channels when downsampling and halve it when upsampling. The specific settings such as the number of channels and the number of scales are specified in the caption. The second VAE is much simpler. Both the encoder and decoder have three 2048-dimensional hidden layers. Finally, the training details are presented below. Note that these settings were not tuned; we simply chose more epochs for more complex datasets and fewer for datasets with larger training samples. For each dataset just a single setting was tested as follows (a schematic sketch of the learning-rate schedule appears after the list):
· MNIST and Fashion-MNIST: The batch size is specified to be 100. We use the ADAM optimizer with the default hyperparameters in TensorFlow. Standard weight decay is set as 5 × 10⁻⁴. The first VAE is trained for 400 epochs. The initial learning rate is 0.0001 and we halve it every 150 epochs. The second VAE is trained for 800 epochs with the same initial learning rate, halved every 300 epochs.
[Figure 11 diagram: a Scale Block consists of two Residual Blocks (each bn+relu followed by conv/fc layers); the encoder stacks an input conv layer, Scale Blocks with downsampling, a flatten step, and fully connected outputs (fc, fc+exp); the decoder stacks an fc layer, reshape, Scale Blocks with upsampling, a final conv, and a sigmoid output.]
Figure 11: Network structure of the first-stage VAE used in producing Figure M.1, and for generating samples and reconstructions. (Left) The basic building block of the network, called a Scale Block, which consists of two Residual Blocks. (Center) The encoder network. For an input image x, we use a convolutional layer to transform it into 32 channels. We then pass it to a Scale Block. After each Scale Block, we downsample using a convolutional layer with stride 2 and double the channels. After N Scale Blocks, the feature map is flattened to a vector. In our experiments, we used N = 4 for the CelebA dataset and 3 for other datasets. The vector is then passed through another Scale Block, the convolutional layers of which are replaced with fully connected layers of 512 dimensions. The output of this Scale Block is used to produce the κ-dimensional latent code, with κ = 64. (Right) The decoder network. A latent code z is first passed through a fully connected layer. The dimension is 4096 for the CelebA dataset and 2048 for other datasets. Then it is reshaped to 2 × 2 resolution. We upsample the feature map using a deconvolution layer and halve the number of channels at the same time. It then goes through some Scale Blocks and upsampling layers until the feature map size becomes the desired value. Then we use a convolutional layer to transform the feature map, which should have 32 channels, to 3 channels for RGB datasets and 1 channel for gray-scale datasets.
· CIFAR-10: Since CIFAR-10 is more complicated than MNIST and Fashion-MNIST, we use more epochs for training. Specifically, we use 1000 and 2000 epochs for the two VAEs respectively, and halve the learning rate every 300 and 600 epochs for the two stages. The other settings are the same as those for MNIST.
· CelebA: Because CelebA has many more examples, in the first stage we train for 120 epochs and halve the learning rate every 48 epochs. In the second stage, we train for 300 epochs and halve the learning rate every 120 epochs. The other settings are the same as those for MNIST, etc.
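As a schematic illustration of the schedules above (using the MNIST setting), one possible rendering follows. `vae.loss` is a hypothetical method standing in for the VAE cost (2), and routing weight decay through Adam is an approximation of the weight-decay setting described above.

```python
import torch

def train_first_stage(vae, train_loader, epochs=400, lr=1e-4, halve_every=150):
    optimizer = torch.optim.Adam(vae.parameters(), lr=lr, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=halve_every, gamma=0.5)
    for _ in range(epochs):
        for x in train_loader:
            optimizer.zero_grad()
            loss = vae.loss(x)     # VAE cost (2) with learnable decoder variance
            loss.backward()
            optimizer.step()
        scheduler.step()           # halves the learning rate every `halve_every` epochs
```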
Finally, to fairly compare against various GAN models and VAE baselines using FID scores on a neutral architecture (i.e., the results from Table M.1), we simply adopt the InfoGAN network structure consistent with the neutral setup from (Lucic et al., 2018) for the first-stage VAE. For the second-stage VAE we just use three 1024-dimensional hidden layers, which contribute less than 5% of the total number of parameters. The only hyperparameter we considered tuning was κ, coarsely testing just a few different values based on dataset complexity. Note that the small number of additional parameters contributed by the second stage does not improve the other VAE baselines when aggregated and trained jointly.
7. Proof of Theorem M.1
We first consider the case where the latent dimension κ equals the manifold dimension r and then extend the proof to allow for κ > r. The intuition is to build a bijection between Rd and Rr that transforms the ground-truth distribution pgt(x) to a normal Gaussian distribution. The way to build such a bijection is shown in Figure 12. We now fill in the details.
Figure 12: The relationship between different variables. [Diagram: in the example, F maps the 2D ground-truth distribution onto [0, 1]², G maps a 2D normal Gaussian onto [0, 1]², and F⁻¹ ∘ G connects the two spaces.]
7.1 Finding a Sequence of Decoders such that pθt(x) Converges to pgt(x)

Define the function F : Rr → [0, 1]r as

F(x) = [F1(x1), F2(x2; x1), ..., Fr(xr; x1:r−1)],   (1)

Fi(xi; x1:i−1) = ∫_{−∞}^{xi} pgt(x′i | x1:i−1) dx′i.   (2)

Per this definition, we have that

dF(x) = pgt(x) dx.   (3)
Also, since pgt(x) is nonzero everywhere, F(·) is invertible. Similarly, we define another differentiable and invertible function G : Rr → [0, 1]r as

G(z) = [G1(z1), G2(z2), ..., Gr(zr)],   (4)

Gi(zi) = ∫_{−∞}^{zi} N(z′i | 0, 1) dz′i.   (5)

Then

dG(z) = p(z) dz = N(z|0, I) dz.   (6)

Now let the decoder be

fµx(z; θt) = F⁻¹ ∘ G(z),   (7)

γt = 1/t.   (8)

Then we have

pθt(x) = ∫_{Rr} pθt(x|z) p(z) dz = ∫_{Rr} N( x | F⁻¹ ∘ G(z), γt I ) dG(z).   (9)

Additionally, let ω = G(z) such that

pθt(x) = ∫_{[0,1]r} N( x | F⁻¹(ω), γt I ) dω,   (10)

and let x′ = F⁻¹(ω) such that dω = dF(x′) = pgt(x′) dx′. Plugging this expression into the previous pθt(x) we obtain

pθt(x) = ∫_{Rr} N( x | x′, γt I ) pgt(x′) dx′.   (11)

As t → ∞, γt becomes infinitely small and N(x | x′, γt I) becomes a Dirac-delta function, resulting in

lim_{t→∞} pθt(x) = ∫ δ(x′ − x) pgt(x′) dx′ = pgt(x).   (12)
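The construction in (7)-(11) is an inverse-CDF (probability integral transform) composition. The 1-D sketch below illustrates it for an assumed ground-truth density (a standard Laplace distribution, chosen purely because it is nonzero everywhere); the construction maps N(0, 1) samples through F⁻¹ ∘ G so that they follow pgt.

```python
import numpy as np
from scipy.stats import norm, laplace

# Illustrative 1-D ground truth: p_gt = Laplace(0, 1), whose CDF plays the role of F.
z = np.random.default_rng(0).standard_normal(100_000)   # z ~ N(0, 1)
omega = norm.cdf(z)                                      # omega = G(z), uniform on [0, 1]
x = laplace.ppf(omega)                                   # x = F^{-1}(omega), distributed as p_gt

# Sanity check: a standard Laplace has mean 0 and standard deviation sqrt(2) ~ 1.414.
print(x.mean(), x.std())
```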
7.2 Finding a Sequence of Encoders such that KL[qφt(z|x) || pθt(z|x)] Converges to 0

Assume the encoder networks satisfy

fµz(x; φt) = G⁻¹ ∘ F(x) = fµx⁻¹(x; θt),   (13)

fSz(x; φt) = √γt [ ∇fµx(fµz(x; φt); θt)⊤ ∇fµx(fµz(x; φt); θt) ]^(−1/2),   (14)

where ∇fµx(·) is the d × r Jacobian matrix of fµx(·). We omit the arguments θt and φt in fµz(·), fSz(·) and fµx(·) hereafter to avoid unnecessary clutter. We first explain why fµx(·) is differentiable. Since fµx(·) is a composition of F⁻¹(·) and G(·) according to (7), we only need to explain that both functions are differentiable. For F⁻¹(·), it is the inverse of a differentiable function F(·). Moreover, the derivative of F(x) is pgt(x), which is nonzero everywhere. So F⁻¹(·) and therefore fµx(·) are both differentiable.
The true posterior pθt(z|x) and the approximate posterior are

pθt(z|x) = N(z|0, I) N(x | fµx(z), γt I) / pθt(x),   (15)

qφt(z|x) = N( z | fµz(x), γt [ ∇fµx(fµz(x))⊤ ∇fµx(fµz(x)) ]⁻¹ ),   (16)

respectively. We now prove that qφt(z|x)/pθt(z|x) converges to a constant not related to z as t goes to ∞. If this is true, the constant must be 1 since both qφt(z|x) and pθt(z|x) are probability distributions. Then the KL divergence between them converges to 0 as t → ∞.

We denote [ ∇fµx(fµz(x))⊤ ∇fµx(fµz(x)) ]⁻¹ as Σ̃z(x) for short. In addition, we define z̄ = fµz(x). Given these definitions, it follows that

qφt(z|x) / pθt(z|x) = N(z | z̄, γt Σ̃z) pθt(x) / [ N(z|0, I) N(x | fµx(z), γt I) ]
= (2π)^{d/2} γt^{(d−r)/2} |Σ̃z|^{−1/2} exp{ −(z − z̄)⊤ Σ̃z⁻¹ (z − z̄)/(2γt) + ||z||²₂/2 + ||x − fµx(z)||²₂/(2γt) } pθt(x).   (17)
At this point, let

z = z̄ + √γt z̃.   (18)

According to Lagrange's mean value theorem, there exists a z′ between z̄ and z such that

fµx(z) = fµx(z̄) + ∇fµx(z′)(z − z̄) = x + √γt ∇fµx(z′) z̃,   (19)

where z′ = z̄ + η √γt z̃ is between z̄ and z and η is a value between 0 and 1 (z′ = z̄ if η = 0 and z′ = z if η = 1). Use C(x) to represent the terms not related to z, i.e., C(x) = (2π)^{d/2} γt^{(d−r)/2} |Σ̃z|^{−1/2} pθt(x).
Plug (18) and (19) into (17) and consider the limit given by

lim_{t→∞} qφt(z|x)/pθt(z|x)
= lim_{t→∞} C(x) exp{ −z̃⊤ Σ̃z⁻¹ z̃/2 + ||z̄ + √γt z̃||²₂/2 + ||∇fµx(z̄ + η √γt z̃) z̃||²₂/2 }
= C(x) exp{ −z̃⊤ Σ̃z⁻¹ z̃/2 + ||z̄||²₂/2 + ||∇fµx(z̄) z̃||²₂/2 }
= C(x) exp{ −z̃⊤ Σ̃z⁻¹ z̃/2 + ||z̄||²₂/2 + z̃⊤ ∇fµx(z̄)⊤ ∇fµx(z̄) z̃/2 }
= C(x) exp{ ||z̄||²₂/2 }.   (20)

The fourth equality comes from the fact that ∇fµx(z̄)⊤ ∇fµx(z̄) = ∇fµx(fµz(x))⊤ ∇fµx(fµz(x)) = Σ̃z(x)⁻¹. This expression is not related to z. Considering both qφt(z|x) and pθt(z|x) are probability distributions, the ratio should be equal to 1. The KL divergence between them thus converges to 0 as t → ∞.
7.3 Generalization to the Case with κ > r

When κ > r, we use the first r latent dimensions to build a projection between z and x and leave the remaining κ − r latent dimensions unused. Specifically, let fµx(z) = f̃µx(z1:r), where f̃µx(z1:r) is defined as in (7) and γt = 1/t. Again consider the case that t → ∞. Then this decoder can also satisfy lim_{t→∞} pθt(x) = pgt(x) because it produces exactly the same distribution as the decoder defined by (7) and (8). The last κ − r dimensions contribute nothing to the generation process.

Now define the encoder as

fµz(x)1:r = f̃µx⁻¹(x),   (21)

fµz(x)r+1:κ = 0,   (22)

fSz(x) = [ f̃Sz(x) ; nr+1⊤ ; ... ; nκ⊤ ],   (23)
where f̃Sz(x) is defined as in (14) and is stacked on top of the row vectors nr+1⊤, ..., nκ⊤. Denote {ni}_{i=r+1}^κ as a set of κ-dimensional column vectors satisfying

f̃Sz(x) ni = 0,   (24)

ni⊤ nj = 1_{i=j}.   (25)

Such a set always exists because f̃Sz(x) is an r × κ matrix, so the dimension of the null space of f̃Sz(x) is at least κ − r. Assuming that {ni}_{i=r+1}^κ are κ − r orthonormal basis vectors of null(f̃Sz), the conditions (24) and (25) will be satisfied. The covariance of the approximate posterior then becomes

Σz = fSz(x) fSz(x)⊤ = [ f̃Sz(x) f̃Sz(x)⊤   0 ;  0   I_{κ−r} ].   (26)
The first r dimensions can exactly match the true posterior as we have already shown. The remaining κ − r dimensions follow a standardized Gaussian distribution. Since these dimensions contribute nothing to generating x, the true posterior should be the same as the prior, i.e., a standardized Gaussian distribution. Moreover, any of these dimensions is independent of all the other dimensions, so the corresponding off-diagonal elements of the covariance of the true posterior should equal 0. Thus the approximate posterior also matches the true posterior for the last κ − r dimensions. As a result, we again have lim_{t→∞} KL[qφt(z|x) || pθt(z|x)] = 0.
8. Proof of Theorem M.2
Similar to Section 7, we also construct a bijection between χ and Rr which transforms the ground-truth measure µgt to a normal Gaussian distribution. But in this construction, we need one more step that bijects between χ and Rr using the diffeomorphism ϕ(·), as shown in Figure 13. We will now go into the details.