Under review as a conference paper at ICLR 2019
FIXING VARIATIONAL BAYES: DETERMINISTIC VARIATIONAL INFERENCE FOR BAYESIAN NEURAL NETWORKS
Anonymous authors
Paper under double-blind review
ABSTRACT
Bayesian neural networks (BNNs) hold great promise as a flexible and principled solution to deal with uncertainty when learning from finite data. Among approaches to realize probabilistic inference in deep neural networks, variational Bayes (VB) is theoretically grounded, generally applicable, and computationally efficient. With wide recognition of potential advantages, why is it that variational Bayes has seen very limited practical use for BNNs in real applications? We argue that variational inference in neural networks is fragile: successful implementations require careful initialization and tuning of prior variances, as well as controlling the variance of Monte Carlo gradient estimates. We fix VB and turn it into a robust inference tool for Bayesian neural networks. We achieve this with two innovations: first, we introduce a novel deterministic method to approximate moments in neural networks, eliminating gradient variance; second, we introduce a hierarchical prior for parameters and a novel Empirical Bayes procedure for automatically selecting prior variances. Combining these two innovations, the resulting method is highly efficient and robust. On the application of heteroscedastic regression we demonstrate strong predictive performance over alternative approaches.
1 INTRODUCTION
Bayesian approaches to neural network training marry the representational flexibility of deep neural networks with principled parameter estimation in probabilistic models. Compared to "standard" parameter estimation by maximum likelihood, the Bayesian framework promises to bring key advantages such as better uncertainty estimates on predictions and automatic model regularization (MacKay, 1992; Graves, 2011). These features are often crucial for informing downstream decision tasks and reducing overfitting, particularly on small datasets. However, despite potential advantages, such Bayesian neural networks (BNNs) are often overlooked due to two limitations: First, posterior inference in deep neural networks is analytically intractable and approximate inference with Monte Carlo (MC) techniques can suffer from crippling variance given only a reasonable computation budget (Kingma et al., 2015; Molchanov et al., 2017; Miller et al., 2017; Zhu et al., 2018). Second, performance of the Bayesian approach is sensitive to the choice of prior (Neal, 1993), and although we may have a priori knowledge concerning the function represented by a neural network, it is generally difficult to translate this into a meaningful prior on neural network weights. Sensitivity to priors and initialization makes BNNs non-robust and thus often irrelevant in practice.
In this paper, we describe a novel approach for inference in feed-forward BNNs that is simple to implement and aims to solve these two limitations. We adopt the paradigm of variational Bayes (VB) for BNNs (Hinton & van Camp, 1993; MacKay, 1995c) which is normally deployed using Monte Carlo variational inference (MCVI) (Graves, 2011; Blundell et al., 2015). Within this paradigm we address the two shortcomings of current practice outlined above: First, we address the issue of high variance in MCVI, by reducing this variance to zero through novel deterministic approximations to variational inference in neural networks. Second, we derive a general and robust Empirical Bayes (EB) approach to prior choice using hierarchical priors. By exploiting conjugacy we derive data-adaptive closed-form variance priors for neural network weights, which we experimentally demonstrate to be remarkably effective.
Combining these two novel ingredients gives us a performant and robust BNN inference scheme that we refer to as "deterministic variational inference" (DVI). We demonstrate robustness and superior predictive performance in the context of non-linear regression models, deriving novel closed-form results for expected log-likelihoods in homoscedastic and heteroscedastic regression (similar derivations for classification can be found in the appendix).
Experiments on standard regression datasets from the UCI repository (Dheeru & Karra Taniskidou, 2017) show that for identical models DVI converges to local optima with better predictive log-likelihoods than existing methods based on MCVI. In direct comparisons, we show that our Empirical Bayes formulation automatically provides better or comparable test performance than manual tuning of the prior, and that heteroscedastic models consistently outperform the homoscedastic models.
Concretely, our contributions are:
· Development of a deterministic procedure for propagating uncertain activations through neural networks with uncertain weights and ReLU or Heaviside activation functions.
· Development of an EB method for principled tuning of weight priors during BNN training.
· Experimental results showing the accuracy and efficiency of our method and its applicability to heteroscedastic and homoscedastic regression on real datasets.
2 VARIATIONAL INFERENCE IN BAYESIAN NEURAL NETWORKS
We start by describing the inference task that our method must solve to successfully train a BNN. Given a model M parameterized by weights w and a dataset D = (x, y), the inference task is to discover the posterior distribution p(w|x, y). A variational approach acknowledges that this posterior generally does not have an analytic form, and introduces a variational distribution q(w; θ) parameterized by θ to approximate p(w|x, y). The approximation is considered optimal within the variational family for the θ* that minimizes the Kullback-Leibler (KL) divergence between q and the true posterior:
θ* = argmin_θ D_KL[q(w; θ) || p(w|x, y)].
Introducing a prior p(w) and applying Bayes rule allows us to rewrite this as optimization of the quantity known as the evidence lower bound (ELBO):
θ* = argmax_θ { E_{w∼q}[log p(y|w, x)] - D_KL[q(w; θ) || p(w)] }.    (1)
Analytic results exist for the KL term in the ELBO for careful choice of prior and variational distributions (e.g. Gaussian families). However, when M is a non-linear neural network, the first term in equation 1 (referred to as the reconstruction term) cannot be computed exactly: this is where
MC approximations with finite sample size S are typically employed:
E_{w∼q}[log p(y|w, x)] ≈ (1/S) Σ_{s=1}^{S} log p(y|w^(s), x),    w^(s) ∼ q(w; θ).    (2)
Our goal in the next section is to develop an explicit and accurate approximation for this expectation, which provides a deterministic, closed-form expectation calculation, stabilizing BNN training by removing all stochasticity due to Monte Carlo sampling.
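To make the contrast with the deterministic approach concrete, here is a minimal sketch of the MCVI estimator in equation 2 (the baseline that DVI replaces); the callables and sample count are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def mcvi_reconstruction_term(log_lik, sample_weights, x, y, n_samples=10):
    """Monte Carlo estimate of E_{w~q}[log p(y | w, x)] from equation 2.

    log_lik(w, x, y) evaluates the data log-likelihood under weights w;
    sample_weights() draws one w^(s) ~ q(w; theta). Both are placeholders
    standing in for a concrete BNN implementation.
    """
    estimates = [log_lik(sample_weights(), x, y) for _ in range(n_samples)]
    # unbiased, but the estimator (and its gradient) has variance O(1/S)
    return float(np.mean(estimates))
```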
3 DETERMINISTIC VARIATIONAL APPROXIMATION
Figure 1 shows the architecture of the computation of E_{w∼q}[log p(D|w)] for a feed-forward neural network. The computation can be divided into two parts: first, propagation of activations through parameterized layers and second, evaluation of an unparameterized log-likelihood function (L). In this section, we describe how each of these stages is handled in our deterministic framework.
3.1 MOMENT PROPAGATION
We begin by considering activation propagation (figure 1(a)), with the aim of deriving the form of an approximation q̃(a^L) to the final-layer activation distribution q(a^L) that will be passed to the likelihood computation.
Figure 1: Architecture of a Bayesian neural network. Computation is divided into (a) propagation of activations (a) from an input x and (b) computation of a log-likelihood function L for outputs y. Weights are represented as high dimensional variational distributions (blue) that induce distributions over activations (yellow). MCVI computes using samples (dots); our method propagates a full distribution.
We compute a^L by sequentially computing the distributions for the activations in the preceding layers. Concretely, we define the action of the l-th layer that maps a^(l-1) to a^l as follows:
h^l = f(a^(l-1)),    a^l = h^l W^l + b^l,
where f is a non-linearity and {W^l, b^l} ⊂ w are random variables representing the weights and biases of the l-th layer that are assumed independent from weights in other layers. For notational clarity, in the following we will suppress the explicit layer index l, and use primed symbols to denote variables from the (l-1)-th layer, e.g. a' = a^(l-1). Note that we have made the non-conventional choice to draw the boundaries of the layers such that the linear transform is applied after the non-linearity. This is to emphasize that a^l is constructed by linear combination of many distinct elements of h, and in the limit of vanishing correlation between terms in this combination, we can appeal to the central limit theorem (CLT). Under the CLT, for a large enough hidden dimension, elements a_i will be normally distributed regardless of the potentially complicated distribution for h_j induced by f¹. We empirically observe that this claim is approximately valid even when (weak) correlations appear between the elements of h during training (see section 3.1.1).

Having argued that a adopts a Gaussian form, it remains to compute the first and second moments. In general, these cannot be computed exactly, so we develop an approximate expression. An overview of this derivation is presented here with more details in appendix A. First, we model W, b and h as independent random variables, allowing us to write:
⟨a_i⟩ = ⟨h_j⟩⟨W_ji⟩ + ⟨b_i⟩,
Cov(a_i, a_k) = ⟨h_j h_l⟩ Cov(W_ji, W_lk) + ⟨W_ji⟩ Cov(h_j, h_l) ⟨W_lk⟩ + Cov(b_i, b_k),    (3)
where we have employed the Einstein summation convention and used angle brackets ⟨·⟩ to indicate expectation over q. If we choose a variational family with analytic forms for weight means and covariances (e.g. Gaussian with variational parameters ⟨W_ji⟩ and Cov(W_ji, W_lk)), then the only difficult terms are the moments of h:
⟨h_j⟩ ∝ ∫ f(α_j) exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j,    (4)

⟨h_j h_l⟩ ∝ ∫∫ f(α_j) f(α_l) exp[ -½ (α_j - ⟨a'_j⟩, α_l - ⟨a'_l⟩) ( Σ'_jj  Σ'_jl ; Σ'_lj  Σ'_ll )^(-1) (α_j - ⟨a'_j⟩, α_l - ⟨a'_l⟩)^T ] dα_j dα_l,    (5)
where we have used the Gaussian form of a' parameterized by mean ⟨a'⟩ and covariance Σ', and for brevity we have omitted the normalizing constants. Closed form solutions for the integral in equation 4 exist for Heaviside or ReLU choices of non-linearity f (see appendix A). Furthermore, for these non-linearities, the ⟨a'_j⟩ → ±∞ and ⟨a'_l⟩ → ±∞ asymptotes of the integral in equation 5 have closed form. Figure 2 shows schematically how these asymptotes can be used as a first approximation for equation 5. This approximation is improved by considering that (by definition) the residual decays to zero far from the origin in the (⟨a'_j⟩, ⟨a'_l⟩) plane, and so is well modelled by a decaying function
1We are also required to choose a Gaussian variational approximation for b to preserve the Gaussian distribution of a.
              A(µ1, µ2, ρ)                        Q(µ1, µ2, ρ)
Heaviside     Φ(µ1)Φ(µ2)                          -log(g_h/2π) + [ρ/(2 g_h ρ̄)](µ1² + µ2²) - [2ρ/(1 + ρ̄)]µ1µ2 + O(µ⁴)
ReLU          SR(µ1)SR(µ2) + ρΦ(µ1)Φ(µ2)         -log(g_r/2π) + [ρ/(2 g_r(1 + ρ̄))](µ1² + µ2²) - [(arcsin ρ - ρ)/(ρ g_r)]µ1µ2 + O(µ⁴)

Table 1: Forms for the components of the approximation in equation 6 for Heaviside and ReLU non-linearities. Φ is the CDF of a standard Gaussian, SR is a "soft ReLU" that we define as SR(x) = φ(x) + xΦ(x) where φ is the standard Gaussian density, ρ̄ = √(1 - ρ²), g_h = arcsin ρ and g_r = g_h + ρ/(1 + ρ̄).
exp[-Q(⟨a'_j⟩, ⟨a'_l⟩)], where Q is a polynomial in ⟨a'⟩ with a dominant positive even term. In practice we truncate Q at the quadratic term, and calculate the polynomial coefficients by matching the moments of the resulting Gaussian with the analytic moments of the residual. Specifically, using dimensionless variables µ_i = ⟨a'_i⟩/√Σ'_ii and ρ_jl = Σ'_jl/√(Σ'_jj Σ'_ll), this improved approximation takes the form
⟨h_j h_l⟩ = (Σ'_jl / ρ_jl) { A(µ_j, µ_l, ρ_jl) + exp[ -Q(µ_j, µ_l, ρ_jl) ] },    (6)
where the expressions for the asymptote A and quadratic Q are given in table 1 and derived in appendices A.2.1 and A.2.2. Using equation 6 in equation 3 gives a closed form approximation for the moments of a as a function of moments of a'. Since a is approximately normally distributed by the CLT, this is sufficient information to sequentially propagate moments all the way through the network to compute the mean and covariances of q̃(a^L), our explicit multivariate Gaussian approximation to q(a^L). Any deep learning framework supporting the special functions arcsin and Φ will immediately support backpropagation through the deterministic expressions we have presented. Below we briefly empirically verify the presented approximation, and in section 3.2 we will show how it is used to compute an approximate log-likelihood and posterior predictive distribution for regression and classification tasks.
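To make the moment-propagation recursion concrete, the following is a minimal sketch of a single ReLU layer in the diagonal (dDVI) setting, where only the variances Var(h_j) are tracked. It uses the closed forms ⟨h_j⟩ = √Σ'_jj · SR(µ_j) and ⟨h_j²⟩ = Σ'_jj[µ_jφ(µ_j) + (1 + µ_j²)Φ(µ_j)] from appendix A together with the diagonal of equation 3; function and variable names are ours, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def soft_relu(x):
    # SR(x) = phi(x) + x * Phi(x), with phi/Phi the standard Gaussian pdf/cdf
    return norm.pdf(x) + x * norm.cdf(x)

def ddvi_relu_layer(a_mean, a_var, W_mean, W_var, b_mean, b_var):
    """Propagate a diagonal Gaussian over pre-activations a' through a ReLU
    non-linearity followed by an uncertain linear map (diagonal-DVI sketch)."""
    mu = a_mean / np.sqrt(a_var)                        # dimensionless mean
    h_mean = np.sqrt(a_var) * soft_relu(mu)             # <h_j>
    h_sq = a_var * (mu * norm.pdf(mu) + (1 + mu**2) * norm.cdf(mu))  # <h_j^2>
    h_var = h_sq - h_mean**2                            # Var(h_j)
    out_mean = h_mean @ W_mean + b_mean                 # <a_i>, equation 3
    out_var = h_sq @ W_var + h_var @ W_mean**2 + b_var  # diagonal of equation 3
    return out_mean, out_var

# toy usage: 4 inputs -> 3 outputs, made-up means and variances just to show shapes
m, v = ddvi_relu_layer(np.zeros(4), np.ones(4),
                       np.full((4, 3), 0.1), np.full((4, 3), 0.01),
                       np.zeros(3), np.full(3, 0.01))
```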
Figure 2: Approximation of ⟨h_j h_l⟩ using an asymptote and Gaussian correction for (a) Heaviside and (b) ReLU non-linearities. Yellow functions have closed forms, and blue indicates residuals. The examples are plotted for -6 < µ < 6 and ρ_jl = 0.5, and the relative magnitude of each correction term is indicated on the vertical axis.
3.1.1 EMPIRICAL VERIFICATION
Approximation accuracy  The approximation derived above relies on three assumptions. First, that some form of CLT holds for the hidden units during training where the iid assumption of the classic CLT is not strictly enforced; second, that a quadratic truncation of Q is sufficient²; and third, that there are only weak correlations between layers so that they can be represented using independent variables in the variational distribution. To provide evidence that these assumptions hold in practice, we train a small ReLU network with two hidden layers each of 128 units to perform 1D heteroscedastic regression on a toy dataset of 500 points drawn from the distribution shown in figure 3(b). The training objective is taken from section 4, and the only detail required here is that a^L is a 2-element vector whose elements are labelled as (m, ℓ). We use a diagonal Gaussian variational family to represent the weights, but we preserve the full covariance of a during propagation. Using an input x = 0.25 (see arrow, figure 3(b)) we compute the distributions for m and ℓ both at the start of training (where we expect the iid assumption to hold) and at convergence (where iid does not necessarily hold). Figure 3(c) shows the comparison between a^L distributions reported by our deterministic approximation and MC evaluation using 20k samples from q(w; θ). This comparison is qualitatively excellent for all cases considered.
2Additional Taylor expansion terms can be computed if this assumption fails.
Figure 3: Empirical accuracy of our approximation on toy 1-dimensional data. (a) We train a 2-layer ReLU network to perform heteroscedastic regression on the dataset shown in (b) and obtain the fit shown in blue. (c) The output distributions for the activation units m and ℓ evaluated at x = 0.25 are in excellent agreement with Monte Carlo (MC) integration with a large number (20k) of samples both before and after training.

Figure 4: Runtime performance of VI methods. We show the time to propagate a batch of 10 activation vectors through a single d × d layer. For MCVI we label curves with the number of samples used, and we show quadratic and cubic scaling guides-to-the-eye (black). Black dots indicate where our implementation runs out of memory (16GB).
Computational efficiency  In traditional MCVI, propagation of S samples of d-dimensional activations through a layer containing a d × d-dimensional transformation requires O(Sd²) compute and O(Sd) memory. Our DVI method approximates the S → ∞ limit, while only demanding O(d³) compute and O(d²) memory (the additional factor of d arises from manipulation of the quadratically large covariance matrix Cov(h_j, h_l)). Whereas MCVI can always trade compute and memory for accuracy by choosing a small value for S, the inherent scaling of DVI with d could potentially limit its practical use for networks with large hidden size. To avoid this limitation, we also consider the case where only the diagonal entries Cov(h_j, h_j) are computed and stored at each layer. We refer to this method as "diagonal-DVI" (dDVI), and in section 6 we show the surprising result that the strong test performance of DVI is largely retained by dDVI across a range of datasets. Figure 4 shows the time required to propagate activations through a single layer using the MCVI, DVI and dDVI methods on a Tesla V100 GPU. As a rough rule of thumb (on this hardware), for layer sizes of practical relevance, we see that absolute DVI runtimes roughly equate to MCVI with S = 300 and dDVI runtime equates to S = 1.
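As a rough illustration of these scaling arguments (not a measurement from the paper), the toy per-layer cost models below compare MCVI, DVI and dDVI; the constant factors are arbitrary.

```python
def mcvi_cost(d, S):
    # MCVI: S sampled activation vectors through a d x d layer
    return {"compute": S * d**2, "memory": S * d}

def dvi_cost(d):
    # DVI: propagates a full d x d activation covariance matrix
    return {"compute": d**3, "memory": d**2}

def ddvi_cost(d):
    # diagonal-DVI: keeps only the variance of each activation
    return {"compute": d**2, "memory": d}

# e.g. at d = 512, DVI compute matches MCVI with roughly S = d samples
print(mcvi_cost(512, 512), dvi_cost(512), ddvi_cost(512))
```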
3.2 LOG-LIKELIHOOD EVALUATION
To use the moment propagation procedure derived above for training BNNs, we need to build a function L that maps final layer activations a^L to the expected log-likelihood term in equation 1 (see figure 1(b)). In appendix B.1 we show the intuitive result that this expected log-likelihood over q(w) can be rewritten as an expectation over q̃(a^L):
E_{w∼q}[log p(y|x, w)] = E_{a^L∼q̃(a^L)}[log p(y|a^L)].    (7)
With this form we can derive closed forms for specific tasks; for brevity we focus on the regression case and refer the reader to appendices B.4 and B.5 for the classification case.
Regression Case  For simplicity we consider scalar y and a Gaussian noise model parameterized by mean m(x; w) and heteroscedastic log-variance log σ_y²(x) = ℓ(x; w). The parameters of this Gaussian are read off as the elements of a 2-dimensional output layer a^L = (m, ℓ) so that p(y|a^L) = N(y | m, e^ℓ). Recall that these parameters themselves are uncertain and the statistics ⟨a^L⟩ and Σ^L can be computed following section 3.1. Inserting the Gaussian forms for p(y|a^L) and q̃(a^L) into equation 7 and performing the integral (see appendix B.2) gives a closed form expression
for the ELBO reconstruction term:

E_{a^L∼q̃(a^L)}[log p(y|a^L)] = -½ { log 2π + ⟨ℓ⟩ + [Σ_mm + (⟨m⟩ - Σ_mℓ - y)²] e^{Σ_ℓℓ/2 - ⟨ℓ⟩} }.    (8)
This heteroscedastic model can be made homoscedastic by setting ⟨ℓ⟩ = Σ_ℓℓ = Σ_ℓm = 0. The expression in equation 8 completes the derivations required to implement the closed form approximation to the ELBO reconstruction term for training a network. In addition, we can also compute a closed form approximation to the predictive distribution that is used at test-time to produce predictions that incorporate all parameter uncertainties. By approximating the moments of the posterior predictive and assuming normality (see appendix B.3), we find:
p(y) ≈ ∫ p(y|a^L) q̃(a^L) da^L ≈ N( y | ⟨m⟩, Σ_mm + e^{⟨ℓ⟩ + Σ_ℓℓ/2} ).    (9)
4 EMPIRICAL BAYES FOR VARIATIONAL BNNS
So far, we have described methods for deterministic approximation of the reconstruction term in the ELBO. We now turn to the KL term. For a d-dimensional Gaussian prior p(w) = N(µ_p, Σ_p), the KL divergence with the Gaussian variational distribution q = N(µ_q, Σ_q) has closed form:
D_KL[q || p] = ½ [ log(|Σ_p| / |Σ_q|) - d + Tr(Σ_p^{-1} Σ_q) + (µ_p - µ_q)^T Σ_p^{-1} (µ_p - µ_q) ].    (10)
However, this requires selection of (µ_p, Σ_p) for which there is usually little intuition beyond arguing µ_p = 0 by symmetry and choosing Σ_p to preserve the expected magnitude of the propagated activations (Glorot & Bengio, 2010; He et al., 2015). In practice, variational Bayes for neural network parameters is sensitive to the choice of prior variance parameters, and we will demonstrate this problem empirically in section 6 (figure 5).
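For reference, a small sketch of equation 10 specialized to the diagonal (mean-field) Gaussians used in the experiments; this helper is ours, and it is the quantity the empirical Bayes prior variance derived below would be plugged into.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL[ N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ]: the diagonal
    special case of equation 10."""
    return 0.5 * np.sum(np.log(var_p / var_q) - 1.0
                        + var_q / var_p
                        + (mu_p - mu_q) ** 2 / var_p)

# e.g. KL between a factorized posterior and a zero-mean, unit-variance prior
print(kl_diag_gaussians(np.array([0.1, -0.2]), np.array([0.05, 0.08]),
                        np.zeros(2), np.ones(2)))
```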
To make variational Bayes robust we parameterize the prior hierarchically, retaining a conditional diagonal Gaussian prior and variational distribution on the weights. The hierarchical prior takes the form s ∼ p(s); w ∼ p(w|s), using an inverse gamma distribution on s as the conjugate prior to the elements of the diagonal Gaussian variance. We partition the weights into sets {λ} that typically coincide with the layer partitioning³, and assign a single element s_λ to each set:

s_λ ∼ Inv-Gamma(α, β),    w_λi ∼ N(0, s_λ),    (11)

for shape α and scale β, and where w_λi is the ith weight in set λ.
Rather than taking the fully Bayesian approach, we adopt an empirical Bayes approach (Type-2 MAP), optimizing s, assuming that the integral is dominated by a contribution from this optimal value s = s*. We use the data to inform the optimal setting of s to produce the tightest ELBO:
ELBO = E_{w∼q}[log p(y|h^L(w))] - D_KL[q(w; θ) || p(w|s)p(s)] |_{s = s*},
s* = argmin_s { D_KL[q(w; θ) || p(w|s)] - log p(s) }.    (12)
Writing out the integral for the KL in equation 12, substituting in the forms of the distributions in
equation 11 and differentiating to find the optimum gives
s*_λ = [ Tr(Σ_q^λ + µ_q^λ (µ_q^λ)^T) + 2β ] / ( Ω_λ + 2α + 2 ),    (13)
where Ω_λ is the number of weights in the set λ. The influence of the data on the choice of s is made explicit here through dependence on the learned variational parameters Σ_q and µ_q. Using s* to populate the elements of the diagonal prior variance Σ_p, we can evaluate the KL in equation 10 under the empirical Bayes prior. Optimization of the resulting ELBO then simultaneously tunes the variational distribution and prior.
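A minimal sketch of the empirical Bayes update in equation 13 for one weight set (e.g. one layer), assuming a factorized Gaussian posterior; alpha and beta are the Inv-Gamma hyperparameters (the experiments in section 6 use α = 1, β = 10).

```python
import numpy as np

def empirical_bayes_prior_variance(mu_q, var_q, alpha=1.0, beta=10.0):
    """Closed-form optimal prior variance s* for one weight set (equation 13).

    mu_q, var_q -- variational means and variances of every weight in the set
    """
    n_weights = mu_q.size                       # Omega_lambda
    trace_term = np.sum(var_q + mu_q ** 2)      # Tr(Sigma_q + mu_q mu_q^T)
    return (trace_term + 2.0 * beta) / (n_weights + 2.0 * alpha + 2.0)

# usage: recompute s* from the current variational parameters of a layer
s_star = empirical_bayes_prior_variance(np.random.randn(50 * 13) * 0.1,
                                        np.full(50 * 13, 0.05))
```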
In the experiments we will demonstrate that the proposed empirical Bayes approach works well; however, it only approximates the full Bayesian solution, and it could fail if we were to allow too many degrees of freedom. To see this, assume we were to use one prior per weight element, and we would also define a hyperprior for each prior mean. Then, adjusting both the prior variance and prior mean using empirical Bayes would always lead to a KL-divergence of zero and the ELBO objective would degenerate into maximum likelihood.
3In general, any arbitrary partitioning can be used
5 RELATED WORK
Bayesian neural networks have a rich history. In a 1992 landmark paper David MacKay demonstrated the many potential benefits of a Bayesian approach to neural network learning (MacKay, 1992); in particular, this work contained a convincing demonstration of naturally accounting for model flexibility in the form of the Bayesian Occam's razor, facilitating comparison between different models, accurately calibrating predictive uncertainty, and performing learning robust to overfitting. However, at the time Bayesian inference was achieved only for small and shallow neural networks using a comparatively crude Laplace approximation. Another early review article summarizing advantages and challenges in Bayesian neural network learning is (MacKay, 1995c).
This initial excitement around Bayesian neural networks led to two main methods being developed; First, Hinton & van Camp (1993) and MacKay (1995b) developed the variational Bayes (VB) approach for posterior inference. Whereas Hinton & van Camp (1993) were motivated from a minimum description length (MDL) compression perspective, MacKay (1995b) motivated his equivalent ensemble learning method from a statistical physics perspective of variational free energy minimization. Barber & Bishop (1998) extended the methodology for two-layer neural networks to use general multivariate Normal variational distributions. Second, Neal (1993) developed efficient gradient-based Monte Carlo methods in the form of "hybrid Monte Carlo", now known as Hamiltonian Monte Carlo, and also raised the question of prior design and limiting behaviour of Bayesian neural networks.
Rebirth of Bayesian neural networks. After more than a decade of no further work on Bayesian neural networks, Graves (2011) revived the field by using Monte Carlo variational inference (MCVI) to make VB practical and scalable, demonstrating gains in predictive performance on real world tasks.
Since 2015 the VB approach to Bayesian neural networks has been mainstream (Blundell et al., 2015); key research drivers since then are the problems of high variance in MCVI and the search for useful variational families. One approach to reduce variance in feedforward networks is the local reparameterization trick (Kingma et al., 2015) (see appendix D). To enhance the variational families, more complicated distributions such as matrix Gaussian posteriors (Louizos & Welling, 2016), multiplicative posteriors (Kingma et al., 2015), and hierarchical posteriors (Louizos & Welling, 2017) are used. Both our methods, the deterministic moment approximation and the empirical Bayes estimation, can potentially be extended to these richer families.
Prior choice. Choosing priors in Bayesian neural networks remains an open issue. The hierarchical priors for feedforward neural networks that we use have been investigated before by Neal (1993) and MacKay (1995a), the latter proposing a "cheap and cheerful" heuristic, alternating optimization of weights and inverse variance parameters. Barber & Bishop (1998) also used a hierarchical prior and an efficient closed-form factored VB approximation; our approach can be seen as a point estimate to their approach in order to enable use of our closed-form moment approximation. Graves (2011) also used hierarchical Gaussian priors with flat hyperpriors, deriving a closed-form update for the prior mean and variance. Compared to these prior works our approach is rigorous and with sufficient data accurately approximates the Bayesian approach of integrating over the prior parameters.
Alternative inference procedures. As an alternative to variational Bayes, probabilistic backpropagation (PBP) (Hernández-Lobato & Adams, 2015) applies approximate inference in the form of assumed density filtering (ADF) to refine a Gaussian posterior approximation. Like in our work, each update to the approximate posterior requires propagating means and variances of activations through the network. Hernández-Lobato & Adams (2015) only consider the diagonal propagation case and regression. Since the original work, PBP has been generalized to classification (Ghosh et al., 2016) and richer posterior families such as the matrix variate Normal posteriors (Sun et al., 2017). Our moment approximation could be used to improve the inference accuracy of PBP.
Gaussianity in neural networks. Our demonstration of Gaussianity of ReLU network activations is also directly relevant to recent work on Gaussian process interpretations of deep neural networks (Matthews et al., 2018; Lee et al., 2017), validating the insight that activations in deep neural networks are closely approximated by Gaussian processes. Two recent works derived deterministic moment approximations for deep neural networks: Bibi et al. (2018), using Price's theorem, derived exact first and second moment expressions for ReLU activations but limit themselves to the case of zero-mean Gaussian activations. Kandemir et al. (2018) also derive closed-form solutions to the
ELBO for the case of diagonal Gaussian variational families. However, their approach is limited to linear layers without bias.
Markov chain Monte Carlo approaches. Another rich class of approximate inference methods for Bayesian neural networks are stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods. These methods allow for approximate posterior parameter inference using unbiased log-likelihood estimates. Stochastic gradient Langevin dynamics (SGLD) was the first method in this class (Welling & Teh, 2011). SGLD is particularly simple and efficient to implement, but recent methods increase efficiency in the case of correlated posteriors by estimating the Fisher information matrix (Ahn et al., 2012) and extend Hamiltonian Monte Carlo to the stochastic gradient case (Chen et al., 2014). A complete characterization of SG-MCMC methods is given by (Ma et al., 2015; Gong et al., 2018). However, despite this progress, important theoretical questions regarding approximation guarantees for practical computational budgets remain (Nagapetyan et al., 2017). Moreover, while SG-MCMC methods work robustly in practice, they remain computationally inefficient, especially because evaluation of the posterior predictive requires evaluating an ensemble of models.
Wild approximations. The above methods are principled but often require sophisticated implementations; recently, a few methods aim to provide "cheap" approximations to the Bayes posterior. Dropout has been interpreted by Gal & Ghahramani (2016) to approximately correspond to variational inference. Likewise, Bootstrap posteriors (Lakshminarayanan et al., 2017; Fushiki et al., 2005; Harris, 1989) have been proposed as a general, robust, and accurate method for posterior inference. However, obtaining a bootstrap posterior ensemble of size k is computationally intense at k times the computation of training a single model.
6 EXPERIMENTS
We implement4 deterministic variational inference (DVI) as described above to train small ReLU networks on UCI regression datasets (Dheeru & Karra Taniskidou, 2017). The experiments address the claims that our methods for eliminating gradient variance and automatic tuning of the prior improve the performance of the final trained model. In Appendix C we present extended results to demonstrate that our method is competitive against a variety of models and inference schemes.
Dataset   |D|     d_x   DVI            dDVI           MCVI           hoDVI
bost      506     13    -2.41 ± 0.02   -2.42 ± 0.02   -2.46 ± 0.02   -2.58 ± 0.04
conc      1030    8     -3.06 ± 0.01   -3.07 ± 0.02   -3.07 ± 0.01   -3.23 ± 0.01
ener      768     8     -1.01 ± 0.06   -1.06 ± 0.06   -1.03 ± 0.04   -2.09 ± 0.06
kin8      8192    8      1.13 ± 0.00    1.13 ± 0.00    1.14 ± 0.00    1.01 ± 0.01
nava      11934   16     6.29 ± 0.04    6.22 ± 0.06    5.94 ± 0.05    5.84 ± 0.06
powe      9568    4     -2.80 ± 0.00   -2.80 ± 0.00   -2.80 ± 0.00   -2.82 ± 0.00
prot      45730   9     -2.85 ± 0.01   -2.84 ± 0.01   -2.87 ± 0.01   -2.94 ± 0.00
wine      1588    11    -0.90 ± 0.01   -0.91 ± 0.02   -0.92 ± 0.01   -0.96 ± 0.01
yach      308     6     -0.47 ± 0.03   -0.47 ± 0.03   -0.68 ± 0.03   -1.41 ± 0.03

Table 2: Average test log-likelihood on UCI datasets. |D| is the dataset size, and d_x is the input dimension.
Deterministic vs. Stochastic  We compare DVI with MCVI from equation 2 with S = 10 samples. The same model is used for each inference method: a single hidden layer of 50 units for each dataset considered, extending this to 100 units in the special case of the larger protein structure dataset, prot. Additionally, both methods use the same EB prior from equation 13 with a broad inverse Gamma hyperprior (α = 1, β = 10) and an independent s_λ for each linear transformation. Each dataset is split into random training and test sets with 90% and 10% of the data respectively. This splitting process is repeated 20 times and the average test performance of each method at convergence is reported in table 2 (see also learning curves in appendix E). We see that DVI consistently outperforms MCVI, by up to 0.35 nats per data point on some datasets. The computationally efficient diagonal-DVI (dDVI) surprisingly retains much of this performance. By default we use the heteroscedastic
4Our implementation in TensorFlow is available at redacted for anonymity
model, and we observe that this uniformly delivers better results than a homoscedastic model (hoDVI; rightmost column in table 2) on these datasets with no overfitting issues5.
Empirical Bayes In Figure 5 we compare the performance of networks trained with manual tuning of a fixed Gaussian prior to networks trained with the automatic EB tuning. We find that the EB method consistently finds priors that produce models with competitive or significantly improved test log-likelihood relative to the best manual setting. Since this observation holds across all datasets considered, we say that our method is "robust". Note that the EB method can outperform manual tuning because it automatically finds different prior variances for each weight matrix, whereas in the manual tuning case we search over a single hyperparameter controlling all prior variances.
Figure 5: Comparison of converged test log-likelihood with a manually tuned prior variance (orange) or empirical Bayes (blue).
7 CONCLUSION
We introduced two innovations to make variational inference for neural networks more robust: 1. an effective deterministic approximation to the moments of activations of a neural network; and 2. a simple empirical Bayes hyperparameter update. We demonstrate that together these innovations make variational Bayes a competitive method for Bayesian inference in neural heteroscedastic regression models.
Beside the challenge of efficient posterior inference, for Bayesian neural networks two major issues remain open. First, how to design suitable priors for functions represented by neural network parameters? And second, what structure do the posterior distributions in neural network models have and how can this be used to improve approximate inference (Watanabe, 2009)?
REFERENCES
Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv preprint arXiv:1206.6380, 2012.
David Barber and Christopher M Bishop. Ensemble learning in Bayesian neural networks. NATO ASI Series F Computer and Systems Sciences, 168:215-238, 1998.
Adel Bibi, Modar Alfadly, and Bernard Ghanem. Analytic expressions for probabilistic moments of PL-DNN with Gaussian input. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
Thang Bui, Daniel Hernández-Lobato, José Hernández-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472-1481, 2016.
Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683-1691, 2014.
5Note that this result is non-trivial because heteroscedastic models are more complex and could result in poorer approximate inference leading to worse test performance
Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Tadayoshi Fushiki, Fumiyasu Komaki, Kazuyuki Aihara, et al. Nonparametric bootstrap prediction. Bernoulli, 11(2):293-307, 2005.
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050-1059, 2016.
Soumya Ghosh, Francesco Maria Delle Fave, and Jonathan S Yedidia. Assumed density filtering methods for learning Bayesian neural networks. In AAAI, pp. 1589-1595, 2016.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.
Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta-learning for stochastic gradient MCMC. arXiv preprint arXiv:1806.04522, 2018.
Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348-2356, 2011.
Ian R Harris. Predictive fit for natural exponential families. Biometrika, 76(4):675-684, 1989.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026-1034, 2015.
José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861-1869, 2015.
GE Hinton and Drew van Camp. Keeping neural networks simple by minimising the description length of weights. In Proceedings of COLT-93, pp. 5-13, 1993.
Melih Kandemir, Manuel Haussmann, and Fred A Hamprecht. Sampling-free variational inference of Bayesian neural nets. arXiv preprint arXiv:1805.07654, 2018.
Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575-2583, 2015.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402-6413, 2017.
Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pp. 1708-1716, 2016.
Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2917-2925, 2015.
David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.
David JC MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73-80, 1995a.
David JC MacKay. Developments in probabilistic modelling with neural networks - ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications, pp. 191-198. Springer, 1995b.
David JC MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469-505, 1995c.
Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
Andrew Miller, Nick Foti, Alexander D'Amour, and Ryan P Adams. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems, pp. 3708-3718, 2017.
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.
Radford M Neal. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems, pp. 475-482, 1993.
Christopher G. Small. Expansions and Asymptotics for Statistics. CRC Press, 2010.
Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight uncertainty in Bayesian neural networks. In Artificial Intelligence and Statistics, pp. 1283-1292, 2017.
Jarno Vanhatalo and Aki Vehtari. MCMC methods for MLP-network and Gaussian process and stuff - a documentation for Matlab toolbox MCMCstuff. Laboratory of Computational Engineering, Helsinki University of Technology, 2006.
Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory, volume 25. Cambridge University Press, 2009.
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688, 2011.
Zhanxing Zhu, Ruosi Wan, and Mingjun Zhong. Neural control variates for variance reduction. arXiv preprint arXiv:1806.00159, 2018.
APPENDIX
A MOMENTS OF THE ACTIVATION VARIABLES a
Under assumption of independence of h, W and b, we can write:
⟨a_i⟩ = ⟨h_j W_ji + b_i⟩ = ⟨h_j⟩⟨W_ji⟩ + ⟨b_i⟩    (14)

Cov(a_i, a_k) = Cov(h_j W_ji, h_l W_lk) + Cov(b_i, b_k)
             = ⟨h_j W_ji h_l W_lk⟩ - ⟨h_j W_ji⟩⟨h_l W_lk⟩ + Cov(b_i, b_k)
             = ⟨h_j h_l⟩⟨W_ji W_lk⟩ - ⟨h_j⟩⟨h_l⟩⟨W_ji⟩⟨W_lk⟩ + Cov(b_i, b_k)
             = ⟨h_j h_l⟩[Cov(W_ji, W_lk) + ⟨W_ji⟩⟨W_lk⟩] - ⟨h_j⟩⟨h_l⟩⟨W_ji⟩⟨W_lk⟩ + Cov(b_i, b_k)
             = ⟨h_j h_l⟩ Cov(W_ji, W_lk) + ⟨W_ji⟩ Cov(h_j, h_l) ⟨W_lk⟩ + Cov(b_i, b_k),    (15)
which is seen in the main text as equation 3. For Heaviside and ReLU activation functions, closed forms exist for ⟨h_j⟩ in equation 14:
Heaviside:    ⟨h_j⟩ = (1/√(2πΣ'_jj)) ∫_0^∞ exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j = Φ(µ_j)

ReLU:    ⟨h_j⟩ = (1/√(2πΣ'_jj)) ∫_0^∞ α_j exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j = √Σ'_jj · SR(µ_j),
where SR(x) := φ(x) + xΦ(x) is a "soft ReLU", φ and Φ represent the standard Gaussian PDF and CDF, and we have introduced the dimensionless variables µ_j = ⟨a'_j⟩/√Σ'_jj. These results are sufficient to evaluate equation 14, so in the following sections we turn to each term from equation 15.
A.1 EVALUATION OF TERM 1: ⟨h_j h_l⟩ Cov(W_ji, W_lk)
In the general case, we can use the results from section A.2 to evaluate off-diagonal ⟨h_j h_l⟩. However, in our experiments we always consider the special case where Cov(W_ji, W_lk) is diagonal. In this case we can write the first term in equation 15 as (reintroducing the explicit summation):
Σ_{jl} ⟨h_j h_l⟩ Cov(W_ji, W_lk) = Σ_{jl} ⟨h_j h_l⟩ δ_jl δ_ik Var(W_ji) = δ_ik Σ_j ⟨h_j h_j⟩ Var(W_ji) = diag[ v Var(W) ],
i.e. this term is a diagonal matrix with the diagonal given by the left product of the vector v_j = ⟨h_j h_j⟩ with the matrix Var(W_ki). Note that ⟨h_j h_j⟩ can be evaluated analytically for Heaviside and ReLU activation functions:
Heaviside:    ⟨h_j h_j⟩ = (1/√(2πΣ'_jj)) ∫_0^∞ exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j = Φ(µ_j)

ReLU:    ⟨h_j h_j⟩ = (1/√(2πΣ'_jj)) ∫_0^∞ α_j² exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j = Σ'_jj [ µ_j φ(µ_j) + (1 + µ_j²) Φ(µ_j) ]
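As a quick numerical sanity check (ours, not from the paper), the closed forms above can be compared against Monte Carlo estimates of the ReLU moments:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a_mean, a_var = 0.4, 2.0                       # <a'_j> and Sigma'_jj
mu = a_mean / np.sqrt(a_var)

# closed forms from appendix A
h_mean = np.sqrt(a_var) * (norm.pdf(mu) + mu * norm.cdf(mu))
h_sq = a_var * (mu * norm.pdf(mu) + (1 + mu**2) * norm.cdf(mu))

# Monte Carlo reference
samples = np.maximum(0.0, rng.normal(a_mean, np.sqrt(a_var), size=1_000_000))
print(h_mean, samples.mean())      # first moment <h_j>
print(h_sq, (samples**2).mean())   # second moment <h_j h_j>
```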
A.2 EVALUATION OF TERM 2: ⟨W_ji⟩ Cov(h_j, h_l) ⟨W_lk⟩
Evaluation of Cov(h_j, h_l) requires an expression for ⟨h_j h_l⟩. From equation 5, we write:
⟨h_j h_l⟩ ∝ ∫∫ f(α_j) f(α_l) exp[ -½ P(α_j, α_l; ⟨a'⟩, Σ') ] dα_j dα_l,
where P is the quadratic form:
P(α_j, α_l; ⟨a'⟩, Σ') = (α_j - ⟨a'_j⟩, α_l - ⟨a'_l⟩) ( Σ'_jj  Σ'_jl ; Σ'_lj  Σ'_ll )^(-1) (α_j - ⟨a'_j⟩, α_l - ⟨a'_l⟩)^T
                      = (η_j - µ_j, η_l - µ_l) ( 1  ρ_jl ; ρ_lj  1 )^(-1) (η_j - µ_j, η_l - µ_l)^T.    (16)
Here we have introduced further dimensionless variables η_j = α_j/√Σ'_jj, η_l = α_l/√Σ'_ll and ρ_jl = Σ'_jl/√(Σ'_jj Σ'_ll). We can then rewrite equation 16 in terms of a dimensionless integral I:
⟨h_j h_l⟩ = (Σ'_jl / ρ_jl) I(µ_j, µ_l,