Under review as a conference paper at ICLR 2019
DIAGNOSING AND ENHANCING VAE MODELS
Anonymous authors Paper under double-blind review
ABSTRACT
Although variational autoencoders (VAEs) represent a widely influential deep generative model, many aspects of the underlying energy function remain poorly understood. In particular, it is commonly believed that Gaussian encoder/decoder assumptions reduce the effectiveness of VAEs in generating realistic samples. In this regard, we rigorously analyze the VAE objective, differentiating situations where this belief is and is not actually true. We then leverage the corresponding insights to develop a simple VAE enhancement that requires no additional hyperparameters or sensitive tuning. Quantitatively, this proposal produces crisp samples and stable FID scores that are actually competitive with state-of-the-art GAN models, all while retaining desirable attributes of the original VAE architecture.
1 INTRODUCTION
Our starting point is the desire to learn a probabilistic generative model of observable variables x ∈ χ, where χ is an r-dimensional manifold embedded in Rd. Note that if r = d, then this assumption places no restriction on the distribution of x ∈ Rd whatsoever; however, the added formalism is introduced to handle the frequently encountered case where x possesses low-dimensional structure relative to a high-dimensional ambient space, i.e., r ≪ d. In fact, the very utility of generative models, and their attendant low-dimensional representations, often hinges on this assumption (Bengio et al., 2013). It therefore behooves us to explicitly account for this situation.
Beyond this, we assume that χ is a simple Riemannian manifold, which means there exists a diffeomorphism ϕ between χ and Rr, or more explicitly, the mapping ϕ : χ → Rr is invertible and differentiable. Denote a ground-truth probability measure on χ as µgt such that the probability mass of an infinitesimal dx on the manifold is µgt(dx) and ∫_χ µgt(dx) = 1.
The variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) attempts to approximate this ground-truth measure using a parameterized density pθ(x) defined across all of Rd since any underlying generative manifold is unknown in advance. This density is further assumed to admit the latent decomposition pθ(x) = ∫ pθ(x|z) p(z) dz, where z ∈ Rκ serves as a low-dimensional representation, with κ ≥ r and prior p(z) = N(z|0, I).
Ideally we might like to minimize the negative log-likelihood −log pθ(x) averaged across the ground-truth measure µgt, i.e., solve min_θ ∫ −log pθ(x) µgt(dx). Unfortunately though, the required marginalization over z is generally infeasible. Instead the VAE model relies on tractable encoder qφ(z|x) and decoder pθ(x|z) distributions, where φ represents additional trainable parameters. The canonical VAE cost is a bound on the average negative log-likelihood given by

L(θ, φ) ≜ ∫ { −log pθ(x) + KL[qφ(z|x) || pθ(z|x)] } µgt(dx) ≥ ∫ −log pθ(x) µgt(dx),   (1)

where the inequality follows directly from the non-negativity of the KL-divergence. Here φ can be viewed as tuning the tightness of the bound, while θ dictates the actual estimation of µgt. Using a few standard manipulations, this bound can also be expressed as

L(θ, φ) = ∫ { −E_{qφ(z|x)}[log pθ(x|z)] + KL[qφ(z|x) || p(z)] } µgt(dx),   (2)

which explicitly involves the encoder/decoder distributions and is conveniently amenable to SGD optimization of {θ, φ} via a reparameterization trick (Kingma & Welling, 2014; Rezende et al., 2014). The first term in (2) can be viewed as a reconstruction cost (or a stochastic analog of a traditional autoencoder), while the second penalizes posterior deviations from the prior p(z). Additionally, for any realizable implementation via SGD, the integration over χ must be approximated via a finite sum across training samples {x(i)}_{i=1}^n drawn from µgt. Nonetheless, examining the true objective L(θ, φ) can lead to important, practically-relevant insights.
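As a concrete point of reference, the cost in (2) is typically estimated with a single reparameterized sample per data point. The following is a minimal sketch of that estimator, not the paper's implementation; `encoder` and `decoder` are hypothetical modules returning diagonal-Gaussian moments and a decoder mean, respectively.

```python
import torch

def vae_cost(x, encoder, decoder):
    # Encoder moments of q(z|x): mean and log-variance of a diagonal Gaussian.
    mu_z, logvar_z = encoder(x)
    # Reparameterization trick: z = mu_z + sigma_z * eps with eps ~ N(0, I).
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
    # Reconstruction term -E_q[log p(x|z)] for a unit-variance Gaussian decoder,
    # dropping additive constants.
    mu_x = decoder(z)
    recon = 0.5 * ((x - mu_x) ** 2).sum(dim=-1)
    # Closed-form KL[q(z|x) || N(0, I)] for diagonal Gaussians.
    kl = 0.5 * (torch.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z).sum(dim=-1)
    # Minibatch Monte Carlo estimate of the integral over mu_gt in (2).
    return (recon + kl).mean()
```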
At least in principle, qφ(z|x) and pθ(x|z) can be arbitrary distributions, in which case we could simply enforce qφ(z|x) = pθ(z|x) ∝ pθ(x|z)p(z) such that the bound from (1) is tight. Unfortunately though, this is essentially always an intractable undertaking. Consequently, largely to facilitate practical implementation, the most commonly adopted distributional assumption is that both qφ(z|x) and pθ(x|z) are Gaussian. This design choice has previously been cited as a key limitation of VAEs (Burda et al., 2015; Kingma et al., 2016), and existing quantitative tests of generative modeling quality thus far dramatically favor contemporary alternatives such as generative adversarial networks (GAN) (Goodfellow et al., 2014b). Regardless, because the VAE possesses certain desirable properties relative to GAN models (e.g., stable training (Tolstikhin et al., 2018), interpretable encoder/inference network (Brock et al., 2016), outlier-robustness (Dai et al., 2018), etc.), it remains a highly influential paradigm worthy of examination and enhancement.
In Section 2 we closely investigate the implications of VAE Gaussian assumptions leading to a number of interesting diagnostic conclusions. In particular, we differentiate the situation where r = d, in which case we prove that recovering the ground-truth distribution is actually possible iff the VAE global optimum is reached, and r < d, in which case the VAE global optimum can be reached by solutions that reflect the ground-truth distribution almost everywhere, but not necessarily uniquely so. In other words, there could exist alternative solutions that both reach the global optimum and yet do not assign the same probability measure as µgt.
Section 3 then further probes this non-uniqueness issue by inspecting necessary conditions of global optima when r < d. This analysis reveals that an optimal VAE parameterization will provide an encoder/decoder pair capable of perfectly reconstructing all x using any z drawn from q(z|x). Moreover, we demonstrate that the VAE accomplishes this using a degenerate latent code whereby only r dimensions are effectively active. Collectively, these results indicate that the VAE global optimum can in fact uniquely learn a mapping to the correct ground-truth manifold when r < d, but not necessarily the correct probability measure within this manifold, a critical distinction.
Next we leverage these analytical results in Section 4 to motivate an almost trivially-simple, two-stage VAE enhancement for addressing typical regimes when r < d. In brief, the first stage just learns the manifold per the allowances from Section 3, and in doing so, provides a mapping to a lower-dimensional intermediate representation with no degenerate dimensions that mirrors the r = d regime. The second (much smaller) stage then only needs to learn the correct probability measure on this intermediate representation, which is possible per the analysis from Section 2. Experiments from Section 5 reveal that this procedure can generate high-quality crisp samples, avoiding the blurriness often attributed to VAE models in the past (Dosovitskiy & Brox, 2016; Larsen et al., 2015). And to the best of our knowledge, this is the first demonstration of a VAE pipeline that can produce stable FID scores, an influential recent metric for evaluating generated sample quality (Heusel et al., 2017), that equal or exceed those of multiple state-of-the-art GAN models. Moreover, this is accomplished without additional penalty functions, cost function modifications, or sensitive tuning parameters.
2 HIGH-LEVEL IMPACT OF VAE GAUSSIAN ASSUMPTIONS
Conventional wisdom suggests that VAE Gaussian assumptions will introduce a gap between L(θ, φ) and the ideal negative log-likelihood ∫ −log pθ(x) µgt(dx), compromising efforts to learn the ground-truth measure. However, we will now argue that this pessimism is in some sense premature. In fact, we will demonstrate that, even with the stated Gaussian distributions, there exist parameters θ and φ that can simultaneously: (i) Globally optimize the VAE objective and, (ii) Recover the ground-truth probability measure in a certain sense described below. This is possible because, at least for some coordinated values of θ and φ, qφ(z|x) and pθ(z|x) can indeed become arbitrarily close. Before presenting the details, we first formalize a κ-simple VAE, which is merely a VAE model with explicit Gaussian assumptions and parameterizations:
Definition 1 A κ-simple VAE is defined as a VAE model with dim[z] = κ latent dimensions, the Gaussian encoder qφ(z|x) = N(z|µz, Σz), and the Gaussian decoder pθ(x|z) = N(x|µx, Σx). Moreover, the encoder moments are defined as µz = fµz(x; φ) and Σz = SzSz⊤ with Sz = fSz(x; φ). Likewise, the decoder moments are µx = fµx(z; θ) and Σx = γI. Here γ > 0 is a tunable scalar, while fµz, fSz and fµx specify parameterized differentiable functional forms that can be arbitrarily complex, e.g., a deep neural network.
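To make the role of γ concrete, the hypothetical sketch below parameterizes the decoder of Definition 1 with a single learnable scalar γ, so that −log pθ(x|z) = ||x − µx||²/(2γ) + (d/2) log(2πγ) is minimized jointly with the network weights. This is only an illustration of the parameterization, not the authors' code.

```python
import math
import torch
import torch.nn as nn

class GammaSimpleDecoder(nn.Module):
    """Gaussian decoder N(x | f_mu_x(z; theta), gamma * I) with learnable scalar gamma."""

    def __init__(self, mean_net: nn.Module):
        super().__init__()
        self.mean_net = mean_net                        # f_mu_x(z; theta)
        self.log_gamma = nn.Parameter(torch.zeros(()))  # gamma = exp(log_gamma) > 0

    def neg_log_likelihood(self, x, z):
        mu_x = self.mean_net(z)
        gamma = torch.exp(self.log_gamma)
        d = x.shape[-1]
        # -log N(x | mu_x, gamma I) = ||x - mu_x||^2 / (2 gamma) + (d/2) log(2 pi gamma)
        sq_err = ((x - mu_x) ** 2).sum(dim=-1)
        return sq_err / (2.0 * gamma) + 0.5 * d * (math.log(2.0 * math.pi) + self.log_gamma)
```

Allowing γ to shrink as training proceeds is precisely the behavior examined in Section 3, where the d log γ term comes to dominate near optimal solutions.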
Equipped with these definitions, we will now demonstrate that a κ-simple VAE, with κ ≥ r, can achieve the optimality criteria (i) and (ii) from above. In doing so, we first consider the simpler case where r = d, followed by the extended scenario with r < d. The distinction between these two cases turns out to be significant, with practical implications to be explored in Section 4.
2.1 MANIFOLD DIMENSION EQUAL TO AMBIENT SPACE DIMENSION (r = d)
We first analyze the specialized situation where r = d. Assuming pgt(x) ≜ µgt(dx)/dx exists everywhere in Rd, then pgt(x) represents the ground-truth probability density with respect to the standard Lebesgue measure in Euclidean space. Given these considerations, the minimal possible value of (1) will necessarily occur if

KL[qφ(z|x) || pθ(z|x)] = 0 and pθ(x) = pgt(x) almost everywhere.   (3)

This follows because by VAE design it must be that L(θ, φ) ≥ −∫ pgt(x) log pgt(x) dx, and in the present context, this lower bound is achievable iff the conditions from (3) hold. Collectively, this implies that the approximate posterior produced by the encoder qφ(z|x) is in fact perfectly matched to the actual posterior pθ(z|x), while the corresponding marginalized data distribution pθ(x) is perfectly matched to the ground-truth density pgt(x) as desired. Perhaps surprisingly, a κ-simple VAE can actually achieve such a solution:
Theorem 1 Suppose that r = d and there exists a density pgt(x) associated with the ground-truth measure µgt that is nonzero everywhere on Rd.1 Then for any κ ≥ r, there is a sequence of κ-simple VAE model parameters {θt*, φt*} such that

lim_{t→∞} KL[qφt*(z|x) || pθt*(z|x)] = 0 and lim_{t→∞} pθt*(x) = pgt(x) almost everywhere.   (4)
All the proofs can be found in the supplementary file. So at least when r = d, the VAE Gaussian assumptions need not actually prevent the optimal ground-truth probability measure from being recovered, as long as the latent dimension is sufficiently large (i.e., κ ≥ r). And contrary to popular notions, a richer class of distributions is not required to achieve this. Of course Theorem 1 only applies to a restricted case that excludes d > r; however, later we will demonstrate that a key consequence of this result can nonetheless be leveraged to dramatically enhance VAE performance.
2.2 MANIFOLD DIMENSION LESS THAN AMBIENT SPACE DIMENSION (r < d)
When r < d, additional subtleties are introduced that will be unpacked both here and in the sequel. To begin, if both qφ(z|x) and pθ(x|z) are arbitrary/unconstrained (i.e., not necessarily Gaussian), then inf_{θ,φ} L(θ, φ) = −∞. To achieve this global optimum, we need only choose φ such that qφ(z|x) = pθ(z|x) (minimizing the KL term from (1)) while selecting θ such that all probability mass collapses to the correct manifold χ. In this scenario the density pθ(x) will become unbounded on χ and zero elsewhere, such that ∫ −log pθ(x) µgt(dx) will approach negative infinity.
But of course the stated Gaussian assumptions from the κ-simple VAE model could ostensibly prevent this from occurring by causing the KL term to blow up, counteracting the negative log-likelihood factor. We will now analyze this case to demonstrate that this need not happen. Before proceeding to this result, we first define a manifold density p̃gt(x) as the probability density (assuming it exists) of µgt with respect to the volume measure of the manifold χ. If d = r then this volume measure reduces to the standard Lebesgue measure in Rd and p̃gt(x) = pgt(x); however, when d > r a density pgt(x) defined in Rd will not technically exist, while p̃gt(x) is still perfectly well-defined. We then have the following:
Theorem 2 Assume r < d and that there exists a manifold density p̃gt(x) associated with the ground-truth measure µgt that is nonzero everywhere on χ. Then for any κ ≥ r, there is a sequence of κ-simple VAE model parameters {θt*, φt*} such that

(i) lim_{t→∞} KL[qφt*(z|x) || pθt*(z|x)] = 0 and lim_{t→∞} ∫ −log pθt*(x) µgt(dx) = −∞,   (5)

(ii) lim_{t→∞} ∫_{x∈A} pθt*(x) dx = µgt(A ∩ χ)   (6)

for all measurable sets A ⊆ Rd with µgt(∂A ∩ χ) = 0, where ∂A is the boundary of A.

1This nonzero assumption can be replaced with a much looser condition. Specifically, if there exists a diffeomorphism between the set {x | pgt(x) ≠ 0} and Rd, then it can be shown that Theorem 1 still holds even if pgt(x) = 0 for some x ∈ Rd.
Technical details notwithstanding, Theorem 2 admits a very intuitive interpretation. First, (5) directly implies that the VAE Gaussian assumptions do not prevent minimization of L(θ, φ) from converging to minus infinity, which can be trivially viewed as a globally optimum solution. Furthermore, based on (6), this solution can be achieved with a limiting density estimate that will assign a probability mass to almost all measurable subsets of Rd that is indistinguishable from the ground-truth measure (which confines all mass to χ). Hence this solution is more-or-less an arbitrarily-good approximation to µgt for all practical purposes.2
Regardless, there is an absolutely crucial distinction between Theorem 2 and the simpler case quantified by Theorem 1. Although both describe conditions whereby the κ-simple VAE can achieve the minimal possible objective, in the r = d case achieving the lower bound (whether the specific parameterization for doing so is unique or not) necessitates that the ground-truth probability measure has been recovered almost everywhere. But the r < d situation is quite different because we have not ruled out the possibility that a different set of parameters {θ, φ} could push L(θ, φ) to −∞ and yet not achieve (6). In other words, the VAE could reach the lower bound but fail to closely approximate µgt. And we stress that this uniqueness issue is not a consequence of the VAE Gaussian assumptions per se; even if qφ(z|x) were unconstrained the same lack of uniqueness can persist.
Rather, the intrinsic difficulty is that, because the VAE model does not have access to the ground-truth low-dimensional manifold, it must implicitly rely on a density pθ(x) defined across all of Rd as mentioned previously. Moreover, if this density converges towards infinity on the manifold during training without increasing the KL term at the same rate, the VAE cost can be unbounded from below, even in cases where (6) is not satisfied, meaning incorrect assignment of probability mass.
To conclude, the key take-home message from this section is that, at least in principle, VAE Gaussian assumptions need not actually be the root cause of any failure to recover ground-truth distributions. Instead we expose a structural deficiency that lies elsewhere, namely, the non-uniqueness of solutions that can optimize the VAE objective without necessarily learning a close approximation to µgt. But to probe this issue further and motivate possible workarounds, it is critical to further disambiguate these optimal solutions and their relationship with ground-truth manifolds. This will be the task of Section 3, where we will explicitly differentiate the problem of locating the correct ground-truth manifold from the task of learning the correct probability measure within the manifold.
Note that the only comparable prior work we are aware of related to the results in this section comes from Doersch (2016), where the implications of adopting Gaussian encoder/decoder pairs in the specialized case of r = d = 1 are briefly considered. Moreover, the analysis there requires additional much stronger assumptions than ours, namely, that pgt(x) should be nonzero and infinitely differentiable everywhere in the requisite 1D ambient space. These requirements of course exclude essentially all practical usage regimes where d = r > 1 or d > r, or when ground-truth densities are not sufficiently smooth.
3 OPTIMAL SOLUTIONS AND THE GROUND TRUTH MANIFOLD
We will now more closely examine the properties of optimal κ-simple VAE solutions, and in particular, the degree to which we might expect them to at least reflect the true χ, even if perhaps not the correct probability measure µgt defined within χ. To do so, we must first consider some necessary conditions for VAE optima:
Theorem 3 Let {θγ*, φγ*} denote an optimal κ-simple VAE solution (with κ ≥ r) where the decoder variance γ is fixed (i.e., it is the sole unoptimized parameter). Moreover, we assume that µgt is not a Gaussian distribution when d = r.3 Then for any γ > 0, there exists a γ′ < γ such that L(θγ′*, φγ′*) < L(θγ*, φγ*).
2Note that (6) is only framed in this technical way to accommodate the difficulty of comparing a measure µgt restricted to χ with the VAE density pθ(x) defined everywhere in Rd. See the supplementary for details.
3This requirement is only included to avoid a practically irrelevant form of non-uniqueness that exists with full, non-degenerate Gaussian distributions.
This result implies that we can always reduce the VAE cost by choosing a smaller value of γ, and hence, if γ is not constrained, it must be that γ → 0 if we wish to minimize (2). Despite this necessary optimality condition, in existing practical VAE applications, it is standard to fix γ ≈ 1 during training. This is equivalent to simply adopting a non-adaptive squared-error loss for the decoder and, at least in part, likely contributes to unrealistic/blurry VAE-generated samples. Regardless, there are more significant consequences of this intrinsic favoritism for γ → 0, in particular as related to reconstructing data drawn from the ground-truth manifold χ:
Theorem 4 Applying the same conditions and definitions as in Theorem 3, then for all x drawn from µgt, we also have that

lim_{γ→0} fµx( fµz(x; φγ*) + fSz(x; φγ*) ε ; θγ* ) = lim_{γ→0} fµx( fµz(x; φγ*); θγ* ) = x,   ∀ ε ∈ Rκ.   (7)
By design any random draw z ∼ qφγ*(z|x) can be expressed as z = fµz(x; φγ*) + fSz(x; φγ*) ε for some ε ∼ N(ε|0, I). From this vantage point then, (7) effectively indicates that any x ∈ χ will be perfectly reconstructed by the VAE encoder/decoder pair at globally optimal solutions, achieving this necessary condition despite any possible stochastic corrupting factor fSz(x; φγ*) ε.
But still further insights can be obtained when we more closely inspect the VAE objective function behavior at arbitrarily small but explicitly nonzero values of γ. In particular, when κ = r (meaning z has no superfluous capacity), Theorem 4 and attendant analyses in the supplementary ultimately imply that the squared eigenvalues of fSz(x; φγ*) will become arbitrarily small at a rate proportional to γ, meaning (1/√γ) fSz(x; φγ*) ≈ O(1) under mild conditions. It then follows that the VAE data term integrand from (2), in the neighborhood around optimal solutions, behaves as

−2 E_{qφγ*(z|x)}[ log pθγ*(x|z) ] = 2 E_{qφγ*(z|x)}[ (1/(2γ)) ||x − fµx(z; θγ*)||²₂ ] + d log 2πγ ≈ E_{qφγ*(z|x)}[O(1)] + d log 2πγ = d log γ + O(1).   (8)

This expression can be derived by excluding the higher-order terms of a Taylor series approximation of fµx( fµz(x; φγ*) + fSz(x; φγ*) ε ; θγ* ) around the point fµz(x; φγ*), which will be relatively tight under the stated conditions. But because 2 E_{qφγ*(z|x)}[ (1/(2γ)) ||x − fµx(z; θγ*)||²₂ ] ≥ 0, a theoretical lower bound on (8) is given by d log 2πγ ≈ d log γ + O(1). So in this sense (8) cannot be significantly lowered further.
This observation is significant when we consider the inclusion of additional latent dimensions by allowing κ > r. Clearly based on the analysis above, adding dimensions to z cannot improve the value of the VAE data term in any meaningful way. However, it can have a detrimental impact on the KL regularization factor in the γ → 0 regime, where

2 KL[qφ(z|x) || p(z)] ≡ trace[Σz] + ||µz||²₂ − log|Σz| − κ ≈ −r̂ log γ + O(1).   (9)

Here r̂ denotes the number of eigenvalues {λj(γ)}_{j=1}^κ of fSz(x; φγ*) (or equivalently Σz) that satisfy λj(γ) → 0 as γ → 0. r̂ can be viewed as an estimate of how many low-noise latent dimensions the VAE model is preserving to reconstruct x. Based on (9), there is obvious pressure to make r̂ as small as possible, at least without disrupting the data fit. The smallest possible value is r̂ = r, since it is not difficult to show that any value below this will contribute consequential reconstruction errors, causing 2 E_{qφγ*(z|x)}[ (1/(2γ)) ||x − fµx(z; θγ*)||²₂ ] to grow at a rate of 1/γ, pushing the entire cost function towards infinity.4
Therefore, in the neighborhood of optimal solutions the VAE will naturally seek to produce perfect reconstructions using the fewest number of clean, low-noise latent dimensions, meaning dimensions whereby qφ(z|x) has negligible variance. For superfluous dimensions that are unnecessary for representing x, the associated encoder variance in these directions can be pushed to one. This will optimize KL[qφ(z|x) || p(z)] along these directions, and the decoder can selectively block the residual randomness to avoid influencing the reconstructions per Theorem 4. So in this sense the VAE is capable of learning a minimal representation of the ground-truth manifold when r < κ.
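A small numeric check of the pressure described above, assuming the diagonal-Gaussian form of Definition 1: a superfluous dimension with posterior mean 0 and variance 1 contributes zero KL in (9), while an active, low-noise dimension with variance on the order of γ contributes roughly −½ log γ. The helper below is a hypothetical illustration, not part of the original pipeline.

```python
import numpy as np

def kl_per_dim(mu, var):
    # Per-dimension KL[N(mu, var) || N(0, 1)] = 0.5 * (var + mu^2 - 1 - log var).
    return 0.5 * (var + mu**2 - 1.0 - np.log(var))

print(kl_per_dim(0.0, 1.0))    # superfluous dimension: 0.0
print(kl_per_dim(0.0, 1e-4))   # low-noise dimension: ~4.1, dominated by -0.5*log(var)
```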
4Note that Cγ⁻¹ + log γ → ∞ as γ → 0 for any fixed C > 0.
But we must emphasize that the VAE can learn χ independently of the actual distribution µgt within χ. Addressing the latter is a completely separate issue from achieving the perfect reconstruction error defined by Theorem 4. This fact can be understood within the context of a traditional PCA-like model, which is perfectly capable of learning a low-dimensional subspace containing some training data without actually learning the distribution of the data within this subspace. The central issue is that there exists an intrinsic bias associated with the VAE objective such that fitting the distribution within the manifold will be completely neglected whenever there exists the chance for even an infinitesimally better approximation of the manifold itself.
Stated differently, if VAE model parameters have learned a near optimal, parsimonious latent mapping onto χ using γ ≈ 0, then the VAE cost will scale as (d − r) log γ regardless of µgt. Hence there remains a huge incentive to reduce the reconstruction error still further, allowing γ to push even closer to zero and the cost closer to −∞. And if we constrain γ to be sufficiently large so as to prevent this from happening, then we risk degrading/blurring the reconstructions and widening the gap between qφ(z|x) and pθ(z|x), which can also compromise estimation of µgt. Fortunately though, as will be discussed next there is a convenient way around this dilemma by exploiting the fact that this dominating (d − r) log γ factor goes away when d = r.
4 FROM THEORY TO PRACTICAL VAE ENHANCEMENTS
Sections 2 and 3 have exposed a collection of VAE properties with useful diagnostic value in and of themselves. But the practical utility of these results, beyond the underappreciated benefit of learning χ, warrants further exploration. In this regard, suppose we wish to develop a generative model of high-dimensional data x where unknown low-dimensional structure is significant (i.e., the r < d case with r unknown). The results from Section 3 indicate that the VAE can partially handle this situation by learning a parsimonious representation of low-dimensional manifolds, but not necessarily the correct probability measure µgt within such a manifold. In quantitative terms, this means that a decoder pθ(x|z) will map all samples from an encoder qφ(z|x) to the correct manifold such that the reconstruction error is negligible for any x ∈ χ. But if the measure µgt on χ has not been accurately estimated, then
q(z) ≜ ∫ qφ(z|x) µgt(dx) ≠ ∫_{Rd} pθ(z|x) pθ(x) dx = ∫_{Rd} pθ(x|z) p(z) dx = N(z|0, I),   (10)
where q(z) is sometimes referred to as the aggregated posterior (Makhzani et al., 2016). In other words, the distribution of the latent samples drawn from the encoder distribution, when averaged across the training data, will have lingering latent structure that is errantly incongruous with the original isotropic Gaussian prior. This then disrupts the pivotal ancestral sampling capability of the VAE, implying that samples drawn from N (z|0, I) and then passed through the decoder p(x|z) will not closely approximate µgt. Fortunately, our analysis suggests the following two-stage remedy:
1. Given n observed samples {x(i)}_{i=1}^n, train a κ-simple VAE, with κ ≥ r, to estimate the unknown r-dimensional ground-truth manifold χ embedded in Rd using a minimal number of active latent dimensions. Generate latent samples {z(i)}_{i=1}^n via z(i) ∼ qφ(z|x(i)). By design, these samples will be distributed as q(z), but likely not N(z|0, I).
2. Train a second κ-simple VAE, with independent parameters {θ′, φ′} and latent representation u, to learn the unknown distribution q(z), i.e., treat q(z) as a new ground-truth distribution and use samples {z(i)}_{i=1}^n to learn it.
3. Samples approximating the original ground-truth µgt can then be formed via the extended ancestral process u ∼ N(u|0, I), z ∼ pθ′(z|u), and finally x ∼ pθ(x|z) (a code sketch of this pipeline appears below).
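The sketch below shows one way the three steps could be wired together; `StageOneVAE` and `StageTwoVAE` are hypothetical κ-simple VAE modules (with `fit`, `encode_sample`, and `decode_sample` methods assumed for illustration), standing in for the actual networks described in the supplementary.

```python
import torch

def train_two_stage(x_train, StageOneVAE, StageTwoVAE, kappa=64, u_dim=64):
    # Step 1: fit a kappa-simple VAE to the data, then encode the training set.
    vae1 = StageOneVAE(latent_dim=kappa)
    vae1.fit(x_train)                       # standard VAE training with cost (2)
    z_train = vae1.encode_sample(x_train)   # z ~ q(z|x), i.e., aggregated posterior samples

    # Step 2: fit a second, much smaller VAE treating q(z) as the new ground truth.
    vae2 = StageTwoVAE(latent_dim=u_dim)
    vae2.fit(z_train)
    return vae1, vae2

def sample_two_stage(vae1, vae2, n, u_dim=64):
    # Step 3: extended ancestral sampling u -> z -> x.
    u = torch.randn(n, u_dim)               # u ~ N(u|0, I)
    z = vae2.decode_sample(u)               # z ~ p_theta'(z|u)
    return vae1.decode_sample(z)            # x ~ p_theta(x|z)
```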
The efficacy of the second-stage VAE from above is based on the following. If the first stage was successful, then even though they will not generally resemble N(z|0, I), samples from q(z) will nonetheless have nonzero measure across the full ambient space Rκ. If κ = r, this occurs because the entire latent space is needed to represent an r-dimensional manifold, and if κ > r, then the extra latent dimensions will be naturally filled in via randomness introduced along dimensions associated with the nonzero eigenvalues of the encoder covariance Σz per the analysis in Section 3.
Consequently, as long as we set κ ≥ r, the operational regime of the second-stage VAE is effectively equivalent to the situation described in Section 2.1 where the manifold dimension is equal to
[Figure 1 panel details: (Left) log γ and squared pixel error versus training iterations, comparing a learnable γ against a fixed γ = 1. (Center) decoded samples under latent perturbations annotated with values 1.00, 0.02, and 0.01 and image variances of 0, 37.7, and 357. (Right) singular value spectra (indices 1-60) of latent samples from the first-stage VAE, the second-stage VAE, and an ideal Gaussian.]
Figure 1: Demonstrating VAE properties. (Left) Validation of Theorem 3 and the influence on image reconstructions. (Center) Validation of Theorem 4. (Right) Motivation for two separate VAE stages by comparing the aggregated posteriors q(z) (1st stage) vs. q′(u) (enhanced 2nd stage).
the ambient dimension.5 And as we have already shown there via Theorem 1, the VAE can readily handle this situation, since in the narrow context of the second-stage VAE, d = r = κ, the troublesome (d − r) log γ factor becomes zero, and any globally minimizing solution is uniquely matched to the new ground-truth distribution q(z). Consequently, the revised aggregated posterior q′(u) produced by the second-stage VAE should now closely resemble N(u|0, I). And finally, because we generally assume that d ≫ r, we have found that the second-stage VAE can be quite small.
5 EMPIRICAL EVALUATION OF VAE TWO-STAGE ENHANCEMENT
We initially describe experiments explicitly designed to corroborate some of our previous analytical results using VAE models trained on CelebA (Liu et al., 2015) data; please see the supplementary for training details and more related experiments. First, the leftmost plot of Figure 1 presents support for Theorem 3, where indeed the decoder variance γ does tend towards zero during training. This then allows for tighter image reconstructions with lower average squared error, i.e., a better manifold fit as expected. The center plot bolsters Theorem 4 and the analysis that follows by showcasing the dissimilar impact of noise factors applied to different directions in the latent space before passage through the decoder mean network fµx. In a direction where an eigenvalue λj of Σz is large (i.e., a superfluous dimension), a random perturbation is completely muted by the decoder as predicted. In contrast, in directions where such eigenvalues are small (i.e., needed for representing the manifold), varying the input causes large changes in the image space reflecting reasonable movement along the correct manifold. Finally, the rightmost plot of Figure 1 displays the singular value spectrum of latent sample matrices drawn from the first- and second-stage VAE models. As expected, the latter is much closer to the spectrum from an analogous i.i.d. N(0, I) matrix. This indicates a superior latent representation, providing high-level support for our two-stage VAE proposal.
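The right-panel diagnostic is easy to reproduce: stack encoded latent samples into a matrix and compare its singular values with those of an i.i.d. N(0, I) matrix of the same shape. A minimal sketch, assuming `z_samples` is an n × κ array of aggregated-posterior samples from either stage:

```python
import numpy as np

def latent_spectrum(z_samples):
    # Singular values (largest first) of the n x kappa matrix of latent samples.
    return np.linalg.svd(np.asarray(z_samples), compute_uv=False)

def gaussian_reference_spectrum(n, kappa, seed=0):
    # Spectrum of an i.i.d. N(0, I) matrix of the same shape, for comparison.
    rng = np.random.default_rng(seed)
    return np.linalg.svd(rng.standard_normal((n, kappa)), compute_uv=False)

# A spectrum close to the Gaussian reference (as observed for the second-stage
# samples of u) indicates an aggregated posterior resembling the isotropic prior.
```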
Next we present quantitative evaluation of novel generated samples using the large-scale testing protocol of GAN models from (Lucic et al., 2018). In this regard, GANs are well-known to dramatically outperform existing VAE approaches in terms of the Fréchet Inception Distance (FID) score (Heusel et al., 2017) and related quantitative metrics. For fair comparison, (Lucic et al., 2018) adopted a common neutral architecture for all models, with generator and discriminator networks based on InfoGAN (Chen et al., 2016a); the point here is standardized comparisons, not tuning arbitrarily-large networks to achieve the lowest possible absolute FID values. We applied the same architecture to our first-stage VAE decoder and encoder networks respectively for direct comparison. For the low-dimensional second-stage VAE we used small, 3-layer networks contributing negligible additional parameters beyond the first stage (see the supplementary for further design details).6
We compared our proposed two-stage VAE pipeline against three baseline VAE models differing only in the decoder output layer: a Gaussian layer with fixed γ, a Gaussian layer with a learned γ, and a cross-entropy layer as has been adopted in several previous applications involving images
5Note that if a regular autoencoder were used to replace the first-stage VAE, then this would no longer be the case, so indeed a VAE is required for both stages.
6It should also be emphasized that concatenating the two stages and jointly training does not improve the performance. If trained jointly the few extra second-stage parameters are simply hijacked by the dominant objective from the first stage and forced to work on an incrementally better fit of the manifold. As expected then, on empirical tests (not shown) we have found that this does not improve upon standard VAE baselines.
Model | MNIST | Fashion | CIFAR-10 | CelebA | Mean
MM GAN | 9.8 ± 0.9 | 29.6 ± 1.6 | 72.7 ± 3.6 | 65.6 ± 4.2 | 44.4 ± 2.6
NS GAN | 6.8 ± 0.5 | 26.5 ± 1.6 | 58.5 ± 1.9 | 55.0 ± 3.3 | 36.7 ± 1.8
LSGAN | 7.8 ± 0.6 | 30.7 ± 2.2 | 87.1 ± 47.5 | 53.9 ± 2.8 | 44.9 ± 13.3
WGAN | 6.7 ± 0.4 | 21.5 ± 1.6 | 55.2 ± 2.3 | 41.3 ± 2.0 | 31.2 ± 1.6
WGAN GP | 20.3 ± 5.0 | 24.5 ± 2.1 | 55.8 ± 0.9 | 30.3 ± 1.0 | 32.7 ± 2.3
DRAGAN | 7.6 ± 0.4 | 27.7 ± 1.2 | 69.8 ± 2.0 | 42.3 ± 3.0 | 36.9 ± 1.7
BEGAN | 13.1 ± 1.0 | 22.9 ± 0.9 | 71.4 ± 1.6 | 38.9 ± 0.9 | 36.6 ± 1.1
VAE (cross-entr.) | 23.8 ± 0.6 | 58.7 ± 1.2 | 155.7 ± 11.6 | 85.7 ± 3.8 | 81 ± 4.3
VAE (fixed γ) | 51.2 ± 0.8 | 104.5 ± 1.3 | 113.0 ± 0.7 | 119.8 ± 0.9 | 97.1 ± 0.9
VAE (learned γ) | 47.0 ± 0.9 | 51.5 ± 1.0 | 80.1 ± 0.6 | 67.4 ± 2.1 | 61.5 ± 1.2
2-Stage VAE | 13.4 ± 1.3 | 22.0 ± 0.6 | 71.0 ± 0.6 | 45.9 ± 1.4 | 38.1 ± 1.0
2-Stage VAE* | 11.2 ± 0.5 | 21.3 ± 0.4 | 68.0 ± 0.8 | 23.8 ± 0.5 | 31.1 ± 0.6
Table 1: FID score comparisons. For all GAN-based models, the reported values represent the best FID obtained across a large-scale hyperparameter search conducted separately for each dataset; default settings are considerably worse (Lucic et al., 2018). Likewise outlier cases (e.g., severe mode collapse) were omitted, which would have otherwise degraded these GAN scores and increased standard deviations still further. In contrast, for the VAE results we used only default training settings across all models and datasets (no tuning), except for the 2-Stage VAE*. Here we simply tested a couple of different values for κ and picked the best result for each dataset. Note that specialized architectures and/or random seed optimization can potentially improve the FID score for all models.
(Chen et al., 2016b). We also present results from (Lucic et al., 2018) involving numerous state-of-the-art GAN models, including MM GAN (Goodfellow et al., 2014a), WGAN (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), NS GAN (Fedus et al., 2017), DRAGAN (Kodali et al., 2017), LS GAN (Mao et al., 2017) and BEGAN (Berthelot et al., 2017). Testing is conducted across four significantly different datasets: MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), CIFAR-10 (Krizhevsky & Hinton, 2009) and CelebA (Liu et al., 2015).
For each dataset we executed 10 independent trials and report the mean and standard deviation of the FID scores in Table 1. Despite the fact that all GAN models benefited from a large-scale hyperparameter search executed independently across each dataset to achieve the best results, our proposed two-stage VAE with minimal tuning is capable of equaling or exceeding the performance of all the GAN models and VAE baselines (see Table 1 caption for more details). This is the first demonstration of a VAE pipeline capable of competing with GANs in the arena of generated sample quality. For example, note the poor performance of VAE baselines relative to GANs in Table 1. Representative samples generated using our two-stage VAE model are in the supplementary.
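For reference, once feature embeddings (e.g., Inception activations) have been extracted for real and generated samples, the FID reduces to the Fréchet distance between the two Gaussian fits, ||µ1 − µ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). The sketch below is a generic rendering of that formula, not the exact evaluation code used for Table 1.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feat_real, feat_fake):
    # Fit Gaussians to the two feature sets (rows = samples, columns = feature dims).
    mu1, mu2 = feat_real.mean(axis=0), feat_fake.mean(axis=0)
    cov1 = np.cov(feat_real, rowvar=False)
    cov2 = np.cov(feat_fake, rowvar=False)
    # Matrix square root of the covariance product; discard tiny imaginary parts.
    covmean = sqrtm(cov1 @ cov2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2.0 * covmean))
```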
As a final point of reference, although not an exact VAE model per se, an autoencoder-based architecture that substitutes a Wasserstein distance measure for the KL regularizer from (2) has also recently been proposed (Tolstikhin et al., 2018). Two variants of this approach, termed WAE-MMD and WAE-GAN (because different MMD and GAN regularization factors are included), were evaluated using FID scores, with penalty weights and encoder/decoder networks specifically adapted for use with the CelebA dataset (FID values were not provided for other datasets). A baseline VAE using these networks achieved an FID of 63, which is somewhat better than our VAE baselines presumably because of this tuning for CelebA data. In contrast, the corresponding WAE-MMD and WAE-GAN scores were 55 and 42 respectively. Although these values represent an improvement over the VAE baseline, they are considerably worse than the absolute score of 23.8 achieved by our generic two-stage VAE model with neutral architecture borrowed from (Lucic et al., 2018).
6 CONCLUSION
It is often assumed that there exists an unavoidable trade-off between the stable training, valuable attendant encoder network, and resistance to mode collapse of VAEs, versus the impressive visual quality of images produced by GANs. While we certainly are not claiming that our two-stage VAE model is necessarily superior to the latest and greatest GAN-based model in terms of the realism of generated samples, we do strongly believe that this work at least narrows that gap substantially such that VAEs are worth considering in a broader range of applications.
REFERENCES
Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pp. 214–223, 2017.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
David Berthelot, Thomas Schumm, and Luke Metz. BEGAN: Boundary equilibrium generative adversarial networks. arXiv:1703.10717, 2017.
Andrew Brock, Theodore Lim, James M Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv:1609.07093, 2016.
Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv:1509.00519, 2015.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016a.
Xi Chen, Diederik Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv:1611.02731, 2016b.
Bin Dai, Yu Wang, John Aston, Gang Hua, and David Wipf. Hidden talents of the variational autoencoder. arXiv:1706.05148, 2018.
Carl Doersch. Tutorial on variational autoencoders. arXiv:1606.05908, 2016.
Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666, 2016.
William Fedus, Mihaela Rosca, Balaji Lakshminarayanan, Andrew M Dai, Shakir Mohamed, and Ian Goodfellow. Many paths to equilibrium: GANs do not need to decrease a divergence at every step. arXiv:1710.08446, 2017.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014a.
I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. arXiv:1406.2661, 2014b.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637, 2017.
Diederik Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
Diederik Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
Naveen Kodali, Jacob Abernethy, James Hays, and Zsolt Kira. On convergence and stability of GANs. arXiv:1705.07215, 2017.
Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv:1512.09300, 2015.
Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision, pp. 3730–3738, 2015.
Mario Lucic, Karol Kurach, Marcin Michalski, Sylvain Gelly, and Olivier Bousquet. Are GANs created equal? A large-scale study. arXiv:1711.10337v3, 2018.
Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv:1511.05644, 2016.
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision, pp. 2813–2821, 2017.
D.J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, 2014.
Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein autoencoders. International Conference on Learning Representations, 2018.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017.
SUPPLEMENTARY FILE Diagnosing and Enhancing VAE Models
1. Introduction
This document contains companion technical material regarding our ICLR 2019 submission. Note that herein all equation numbers referencing back to the main submission document will be prefixed with an `M-' to avoid confusion, i.e., (M-#) will refer to equation (#) from the main text. Similar notation differentiates sections, tables, and figures, e.g., Section M-#, etc.
2. Contents
The remainder of this document includes the following contents:
· Section 3 - Comparison of novel samples generated from our model.
· Section 4 - Example reconstructions of training data.
· Section 5 - Additional experimental results validating theoretical predictions.
· Section 6 - Network structure and experimental settings.
· Section 7 - Proof of Theorem M.1.
· Section 8 - Proof of Theorem M.2.
· Section 9 - Proof of Theorem M.3.
· Section 10 - Proof of Theorem M.4.
· Section 11 - Further analysis of the VAE cost as γ becomes small.
3. Comparison of Novel Samples Generated from our Model
Generation results for the CelebA, MNIST, Fashion-MNIST and CIFAR-10 datasets are shown in Figures 1-4 respectively. When γ is fixed to be one, the generated samples are very blurry. If a learnable γ is used, the samples become sharper; however, there are many lingering artifacts as expected. In contrast, the proposed 2-Stage VAE can remove these artifacts and generate more realistic samples. For comparison purposes, we also show the results from WAE-MMD, WAE-GAN (Tolstikhin et al., 2018) and WGAN-GP (Gulrajani et al., 2017) for the CelebA dataset.
Figure 1: Randomly generated samples on the CelebA dataset (i.e., no cherry-picking). Panels: (a) WAE-MMD, (b) WAE-GAN, (c) WGAN-GP, (d) VAE (fixed γ = 1), (e) VAE (learnable γ), (f) 2-Stage VAE.
Figure 2: Randomly generated samples on the MNIST dataset (i.e., no cherry-picking). Panels: (a) VAE (fixed γ = 1), (b) VAE (learnable γ), (c) 2-Stage VAE.
Figure 3: Randomly generated samples on the Fashion-MNIST dataset (i.e., no cherry-picking). Panels: (a) VAE (fixed γ = 1), (b) VAE (learnable γ), (c) 2-Stage VAE.
Figure 4: Randomly generated samples on the CIFAR-10 dataset (i.e., no cherry-picking). Panels: (a) VAE (fixed γ = 1), (b) VAE (learnable γ), (c) 2-Stage VAE.
4. Example Reconstructions of Training Data
Reconstruction results for the CelebA, MNIST, Fashion-MNIST and CIFAR-10 datasets are shown in Figures 5-8 respectively. On relatively simple datasets like MNIST and Fashion-MNIST, the VAE with learnable γ achieves almost exact reconstruction because of a better estimate of the underlying manifold, consistent with theory. However, the VAE with fixed γ = 1 produces blurry reconstructions as expected. Note that the reconstruction of a 2-Stage VAE is the same as that of a VAE with learnable γ because the second-stage VAE has nothing to do with facilitating the reconstruction task.
Figure 5: Reconstructions on the CelebA dataset. Panels: (a) ground truth, (b) VAE (fixed γ = 1), (c) VAE (learnable γ).
Figure 6: Reconstructions on the MNIST dataset. Panels: (a) ground truth, (b) VAE (fixed γ = 1), (c) VAE (learnable γ).
Figure 7: Reconstructions on the Fashion-MNIST dataset. Panels: (a) ground truth, (b) VAE (fixed γ = 1), (c) VAE (learnable γ).
Figure 8: Reconstructions on the CIFAR-10 dataset. Panels: (a) ground truth, (b) VAE (fixed γ = 1), (c) VAE (learnable γ).
Figure 9: More examples similar to Figure M.1 (center). Each row perturbs one latent direction, annotated by its eigenvalue λj of Σz and the resulting image variance. (a) MNIST: λj = 0.005 (27.33), 0.005 (27.20), 0.007 (19.64), 0.008 (13.90), 0.010 (12.78), 1.000 (0.000). (b) Fashion-MNIST: λj = 0.005 (63.18), 0.005 (72.89), 0.009 (24.56), 0.011 (21.05), 0.030 (5.243), 1.001 (0.000). (c) CelebA: λj = 0.008 (135.2), 0.009 (115.0), 0.015 (42.41), 0.114 (1.156), 1.013 (0.000).
5. Additional Experimental Results Validating Theoretical Predictions
We first present more examples similar to Figure M.1 (center) from the main paper. Random noise is added to µz along different directions and the result is passed through the decoder network. Each row corresponds to a certain direction in the latent space and 15 samples are shown for each direction. These dimensions/rows are ordered by the eigenvalues λj of Σz. The larger λj is, the less impact a random perturbation along this direction will have, as quantified by the reported image variance values. In the first two or three rows, the noise generates some images from different classes/objects/identities, indicating a significant visual difference. For a slightly larger λj, the corresponding dimensions encode relatively less significant attributes as predicted. For example, the fifth row of both MNIST and Fashion-MNIST contains images from the same class but with a slightly different style. The images in the fourth row of the CelebA dataset have very subtle differences. When λj ≈ 1, the corresponding dimensions become completely inactive and all the output images are exactly the same, as shown in the last rows for all the three datasets.
Additionally, as discussed in the main text and below in Section 11, there are likely to be r eigenvalues of Σz converging to zero and κ − r eigenvalues converging to one. We plot the histogram of λj values for both the MNIST and CelebA datasets in Figure 10. For both datasets, λj approximately converges either to zero or one. However, since CelebA is a more complicated dataset than MNIST, the ground-truth manifold dimension of CelebA is likely to be much larger than that of MNIST. So more eigenvalues are expected to be near zero for the CelebA dataset. This is indeed the case, demonstrating that the VAE has the ability to detect the manifold dimension and select the proper number of latent dimensions in practical environments.

Figure 10: Histograms of λj values (frequency, in units of 10⁶, versus λj from 0 to 1.2) on (a) MNIST and (b) CelebA. There are more values around 0 for CelebA because it is more complicated than MNIST and therefore requires more active dimensions to model the underlying manifold.
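The bimodal histogram suggests a simple diagnostic for the number of active latent dimensions r̂: count how many average encoder variances sit near zero rather than near one. A hypothetical sketch (the 0.5 threshold is an illustrative choice, not a value taken from the paper):

```python
import numpy as np

def estimate_active_dims(encoder_variances, threshold=0.5):
    # encoder_variances: array of shape (n_samples, kappa) holding the diagonal of
    # Sigma_z (equivalently the eigenvalues lambda_j) for each training example.
    mean_var = np.asarray(encoder_variances).mean(axis=0)
    active = mean_var < threshold   # low-noise dimensions used for reconstruction
    return int(active.sum()), mean_var
```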
6. Network Structure and Experimental Settings
We first describe the network and training details used in producing Figure M.1 from the main file, and for generating samples and reconstructions in the supplementary. The first-stage VAE network is shown in Figure 11. Basically we use two Residual Blocks for each resolution scale, and we double the number of channels when downsampling and halve it when upsampling. The specific settings such as the number of channels and the number of scales are specified in the caption. The second VAE is much simpler. Both the encoder and decoder have three 2048-dimensional hidden layers. Finally, the training details are presented below. Note that these settings were not tuned; we simply chose more epochs for more complex datasets and fewer for datasets with larger training samples. For each dataset just a single setting was tested as follows (a schematic sketch of the learning-rate schedule appears after the list):
· MNIST and Fashion-MNIST: The batch size is specified to be 100. We use the ADAM optimizer with the default hyperparameters in TensorFlow. Standard weight decay is set as 5 × 10⁻⁴. The first VAE is trained for 400 epochs. The initial learning rate is 0.0001 and we halve it every 150 epochs. The second VAE is trained for 800 epochs with the same initial learning rate, halved every 300 epochs.
[Figure 11 diagram: a Scale Block consists of two Residual Blocks (each bn+relu followed by conv/fc layers); the encoder stacks an input conv layer, Scale Blocks with downsampling, a flatten step, and fully connected outputs (fc, fc+exp); the decoder stacks an fc layer, reshape, Scale Blocks with upsampling, a final conv, and a sigmoid output.]
Figure 11: Network structure of the first-stage VAE used in producing Figure M.1, and for generating samples and reconstructions. (Left) The basic building block of the network, called a Scale Block, which consists of two Residual Blocks. (Center) The encoder network. For an input image x, we use a convolutional layer to transform it into 32 channels. We then pass it to a Scale Block. After each Scale Block, we downsample using a convolutional layer with stride 2 and double the channels. After N Scale Blocks, the feature map is flattened to a vector. In our experiments, we used N = 4 for the CelebA dataset and 3 for other datasets. The vector is then passed through another Scale Block, the convolutional layers of which are replaced with fully connected layers of 512 dimensions. The output of this Scale Block is used to produce the κ-dimensional latent code, with κ = 64. (Right) The decoder network. A latent code z is first passed through a fully connected layer. The dimension is 4096 for the CelebA dataset and 2048 for other datasets. Then it is reshaped to 2 × 2 resolution. We upsample the feature map using a deconvolution layer and halve the number of channels at the same time. It then goes through some Scale Blocks and upsampling layers until the feature map size becomes the desired value. Then we use a convolutional layer to transform the feature map, which should have 32 channels, to 3 channels for RGB datasets and 1 channel for gray-scale datasets.
· CIFAR-10: Since CIFAR-10 is more complicated than MNIST and Fashion-MNIST, we use more epochs for training. Specifically, we use 1000 and 2000 epochs for the two VAEs respectively, and halve the learning rate every 300 and 600 epochs for the two stages. The other settings are the same as those for MNIST.
· CelebA: Because CelebA has many more examples, in the first stage we train for 120 epochs and halve the learning rate every 48 epochs. In the second stage, we train for 300 epochs and halve the learning rate every 120 epochs. The other settings are the same as those for MNIST, etc.
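As a schematic illustration of the schedules above (using the MNIST setting), one possible rendering follows. `vae.loss` is a hypothetical method standing in for the VAE cost (2), and routing weight decay through Adam is an approximation of the weight-decay setting described above.

```python
import torch

def train_first_stage(vae, train_loader, epochs=400, lr=1e-4, halve_every=150):
    optimizer = torch.optim.Adam(vae.parameters(), lr=lr, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=halve_every, gamma=0.5)
    for _ in range(epochs):
        for x in train_loader:
            optimizer.zero_grad()
            loss = vae.loss(x)     # VAE cost (2) with learnable decoder variance
            loss.backward()
            optimizer.step()
        scheduler.step()           # halves the learning rate every `halve_every` epochs
```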
Finally, to fairly compare against various GAN models and VAE baselines using FID scores on a neutral architecture (i.e., the results from Table M.1), we simply adopt the InfoGAN network structure consistent with the neutral setup from (Lucic et al., 2018) for the first-stage VAE. For the second-stage VAE we just use three 1024-dimensional hidden layers, which contribute less than 5% of the total number of parameters. The only hyperparameter we considered tuning was κ, coarsely testing just a few different values based on dataset complexity. Note that the small number of additional parameters contributed by the second stage does not improve the other VAE baselines when aggregated and trained jointly.
7. Proof of Theorem M.1
We first consider the case where the latent dimension κ equals the manifold dimension r and then extend the proof to allow for κ > r. The intuition is to build a bijection between Rd and Rr that transforms the ground-truth distribution pgt(x) to a normal Gaussian distribution. The way to build such a bijection is shown in Figure 12. We now fill in the details.
Figure 12: The relationship between different variables. [Diagram: in the example, F maps the 2D ground-truth distribution onto [0, 1]², G maps a 2D normal Gaussian onto [0, 1]², and F⁻¹ ∘ G connects the two spaces.]
7.1 Finding a Sequence of Decoders such that pθt(x) Converges to pgt(x)

Define the function F : Rr → [0, 1]r as

F(x) = [F1(x1), F2(x2; x1), ..., Fr(xr; x1:r−1)],   (1)

Fi(xi; x1:i−1) = ∫_{−∞}^{xi} pgt(x′i | x1:i−1) dx′i.   (2)

Per this definition, we have that

dF(x) = pgt(x) dx.   (3)
Also, since pgt(x) is nonzero everywhere, F(·) is invertible. Similarly, we define another differentiable and invertible function G : Rr → [0, 1]r as

G(z) = [G1(z1), G2(z2), ..., Gr(zr)],   (4)

Gi(zi) = ∫_{−∞}^{zi} N(z′i | 0, 1) dz′i.   (5)

Then

dG(z) = p(z) dz = N(z|0, I) dz.   (6)

Now let the decoder be

fµx(z; θt) = F⁻¹ ∘ G(z),   (7)

γt = 1/t.   (8)

Then we have

pθt(x) = ∫_{Rr} pθt(x|z) p(z) dz = ∫_{Rr} N( x | F⁻¹ ∘ G(z), γt I ) dG(z).   (9)

Additionally, let ω = G(z) such that

pθt(x) = ∫_{[0,1]r} N( x | F⁻¹(ω), γt I ) dω,   (10)

and let x′ = F⁻¹(ω) such that dω = dF(x′) = pgt(x′) dx′. Plugging this expression into the previous pθt(x) we obtain

pθt(x) = ∫_{Rr} N( x | x′, γt I ) pgt(x′) dx′.   (11)

As t → ∞, γt becomes infinitely small and N(x | x′, γt I) becomes a Dirac-delta function, resulting in

lim_{t→∞} pθt(x) = ∫ δ(x′ − x) pgt(x′) dx′ = pgt(x).   (12)
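The construction in (7)-(11) is an inverse-CDF (probability integral transform) composition. The 1-D sketch below illustrates it for an assumed ground-truth density (a standard Laplace distribution, chosen purely because it is nonzero everywhere); the construction maps N(0, 1) samples through F⁻¹ ∘ G so that they follow pgt.

```python
import numpy as np
from scipy.stats import norm, laplace

# Illustrative 1-D ground truth: p_gt = Laplace(0, 1), whose CDF plays the role of F.
z = np.random.default_rng(0).standard_normal(100_000)   # z ~ N(0, 1)
omega = norm.cdf(z)                                      # omega = G(z), uniform on [0, 1]
x = laplace.ppf(omega)                                   # x = F^{-1}(omega), distributed as p_gt

# Sanity check: a standard Laplace has mean 0 and standard deviation sqrt(2) ~ 1.414.
print(x.mean(), x.std())
```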
7.2 Finding a Sequence of Encoders such that KL[qφt(z|x) || pθt(z|x)] Converges to 0

Assume the encoder networks satisfy

fµz(x; φt) = G⁻¹ ∘ F(x) = fµx⁻¹(x; θt),   (13)

fSz(x; φt) = √γt [ ∇fµx(fµz(x; φt); θt)⊤ ∇fµx(fµz(x; φt); θt) ]^(−1/2),   (14)

where ∇fµx(·) is the d × r Jacobian matrix of fµx(·). We omit the arguments θt and φt in fµz(·), fSz(·) and fµx(·) hereafter to avoid unnecessary clutter. We first explain why fµx(·) is differentiable. Since fµx(·) is a composition of F⁻¹(·) and G(·) according to (7), we only need to explain that both functions are differentiable. For F⁻¹(·), it is the inverse of a differentiable function F(·). Moreover, the derivative of F(x) is pgt(x), which is nonzero everywhere. So F⁻¹(·) and therefore fµx(·) are both differentiable.
The true posterior pθt(z|x) and the approximate posterior are

pθt(z|x) = N(z|0, I) N(x | fµx(z), γt I) / pθt(x),   (15)

qφt(z|x) = N( z | fµz(x), γt [ ∇fµx(fµz(x))⊤ ∇fµx(fµz(x)) ]⁻¹ ),   (16)

respectively. We now prove that qφt(z|x)/pθt(z|x) converges to a constant not related to z as t goes to ∞. If this is true, the constant must be 1 since both qφt(z|x) and pθt(z|x) are probability distributions. Then the KL divergence between them converges to 0 as t → ∞.

We denote [ ∇fµx(fµz(x))⊤ ∇fµx(fµz(x)) ]⁻¹ as Σ̃z(x) for short. In addition, we define z̄ = fµz(x). Given these definitions, it follows that

qφt(z|x) / pθt(z|x) = N(z | z̄, γt Σ̃z) pθt(x) / [ N(z|0, I) N(x | fµx(z), γt I) ]
= (2π)^{d/2} γt^{(d−r)/2} |Σ̃z|^{−1/2} exp{ −(z − z̄)⊤ Σ̃z⁻¹ (z − z̄)/(2γt) + ||z||²₂/2 + ||x − fµx(z)||²₂/(2γt) } pθt(x).   (17)
At this point, let

z = z̄ + √γt z̃.   (18)

According to Lagrange's mean value theorem, there exists a z′ between z̄ and z such that

fµx(z) = fµx(z̄) + ∇fµx(z′)(z − z̄) = x + √γt ∇fµx(z′) z̃,   (19)

where z′ = z̄ + η √γt z̃ is between z̄ and z and η is a value between 0 and 1 (z′ = z̄ if η = 0 and z′ = z if η = 1). Use C(x) to represent the terms not related to z, i.e., C(x) = (2π)^{d/2} γt^{(d−r)/2} |Σ̃z|^{−1/2} pθt(x).
Plug (18) and (19) into (17) and consider the limit given by

lim_{t→∞} qφt(z|x)/pθt(z|x)
= lim_{t→∞} C(x) exp{ −z̃⊤ Σ̃z⁻¹ z̃/2 + ||z̄ + √γt z̃||²₂/2 + ||∇fµx(z̄ + η √γt z̃) z̃||²₂/2 }
= C(x) exp{ −z̃⊤ Σ̃z⁻¹ z̃/2 + ||z̄||²₂/2 + ||∇fµx(z̄) z̃||²₂/2 }
= C(x) exp{ −z̃⊤ Σ̃z⁻¹ z̃/2 + ||z̄||²₂/2 + z̃⊤ ∇fµx(z̄)⊤ ∇fµx(z̄) z̃/2 }
= C(x) exp{ ||z̄||²₂/2 }.   (20)

The fourth equality comes from the fact that ∇fµx(z̄)⊤ ∇fµx(z̄) = ∇fµx(fµz(x))⊤ ∇fµx(fµz(x)) = Σ̃z(x)⁻¹. This expression is not related to z. Considering both qφt(z|x) and pθt(z|x) are probability distributions, the ratio should be equal to 1. The KL divergence between them thus converges to 0 as t → ∞.
7.3 Generalization to the Case with κ > r

When κ > r, we use the first r latent dimensions to build a projection between z and x and leave the remaining κ − r latent dimensions unused. Specifically, let fµx(z) = f̃µx(z1:r), where f̃µx(z1:r) is defined as in (7) and γt = 1/t. Again consider the case that t → ∞. Then this decoder can also satisfy lim_{t→∞} pθt(x) = pgt(x) because it produces exactly the same distribution as the decoder defined by (7) and (8). The last κ − r dimensions contribute nothing to the generation process.

Now define the encoder as

fµz(x)1:r = f̃µx⁻¹(x),   (21)

fµz(x)r+1:κ = 0,   (22)

fSz(x) = [ f̃Sz(x) ; nr+1⊤ ; ... ; nκ⊤ ],   (23)
where f̃Sz(x) is defined as in (14) and is stacked on top of the row vectors nr+1⊤, ..., nκ⊤. Denote {ni}_{i=r+1}^κ as a set of κ-dimensional column vectors satisfying

f̃Sz(x) ni = 0,   (24)

ni⊤ nj = 1_{i=j}.   (25)

Such a set always exists because f̃Sz(x) is an r × κ matrix, so the dimension of the null space of f̃Sz(x) is at least κ − r. Assuming that {ni}_{i=r+1}^κ are κ − r orthonormal basis vectors of null(f̃Sz), the conditions (24) and (25) will be satisfied. The covariance of the approximate posterior then becomes

Σz = fSz(x) fSz(x)⊤ = [ f̃Sz(x) f̃Sz(x)⊤   0 ;  0   I_{κ−r} ].   (26)
The first r dimensions can exactly match the true posterior as we have already shown. The remaining κ − r dimensions follow a standardized Gaussian distribution. Since these dimensions contribute nothing to generating x, the true posterior should be the same as the prior, i.e., a standardized Gaussian distribution. Moreover, any of these dimensions is independent of all the other dimensions, so the corresponding off-diagonal elements of the covariance of the true posterior should equal 0. Thus the approximate posterior also matches the true posterior for the last κ − r dimensions. As a result, we again have lim_{t→∞} KL[qφt(z|x) || pθt(z|x)] = 0.
8. Proof of Theorem M.2
Similar to Section 7, we also construct a bijection between χ and Rr which transforms the ground-truth measure µgt to a normal Gaussian distribution. But in this construction, we need one more step that bijects between χ and Rr using the diffeomorphism ϕ(·), as shown in Figure 13. We will now go into the details.