Under review as a conference paper at ICLR 2019
FIXING VARIATIONAL BAYES: DETERMINISTIC VARIATIONAL INFERENCE FOR BAYESIAN NEURAL NETWORKS
Anonymous authors
Paper under double-blind review
ABSTRACT
Bayesian neural networks (BNNs) hold great promise as a flexible and principled solution to deal with uncertainty when learning from finite data. Among approaches to realize probabilistic inference in deep neural networks, variational Bayes (VB) is theoretically grounded, generally applicable, and computationally efficient. With wide recognition of potential advantages, why is it that variational Bayes has seen very limited practical use for BNNs in real applications? We argue that variational inference in neural networks is fragile: successful implementations require careful initialization and tuning of prior variances, as well as controlling the variance of Monte Carlo gradient estimates. We fix VB and turn it into a robust inference tool for Bayesian neural networks. We achieve this with two innovations: first, we introduce a novel deterministic method to approximate moments in neural networks, eliminating gradient variance; second, we introduce a hierarchical prior for parameters and a novel Empirical Bayes procedure for automatically selecting prior variances. Combining these two innovations, the resulting method is highly efficient and robust. On the application of heteroscedastic regression we demonstrate strong predictive performance over alternative approaches.
1 INTRODUCTION
Bayesian approaches to neural network training marry the representational flexibility of deep neural networks with principled parameter estimation in probabilistic models. Compared to "standard" parameter estimation by maximum likelihood, the Bayesian framework promises to bring key advantages such as better uncertainty estimates on predictions and automatic model regularization (MacKay, 1992; Graves, 2011). These features are often crucial for informing downstream decision tasks and reducing overfitting, particularly on small datasets. However, despite potential advantages, such Bayesian neural networks (BNNs) are often overlooked due to two limitations: First, posterior inference in deep neural networks is analytically intractable and approximate inference with Monte Carlo (MC) techniques can suffer from crippling variance given only a reasonable computation budget (Kingma et al., 2015; Molchanov et al., 2017; Miller et al., 2017; Zhu et al., 2018). Second, performance of the Bayesian approach is sensitive to the choice of prior (Neal, 1993), and although we may have a priori knowledge concerning the function represented by a neural network, it is generally difficult to translate this into a meaningful prior on neural network weights. Sensitivity to priors and initialization makes BNNs non-robust and thus often irrelevant in practice.
In this paper, we describe a novel approach for inference in feed-forward BNNs that is simple to implement and aims to solve these two limitations. We adopt the paradigm of variational Bayes (VB) for BNNs (Hinton & van Camp, 1993; MacKay, 1995c) which is normally deployed using Monte Carlo variational inference (MCVI) (Graves, 2011; Blundell et al., 2015). Within this paradigm we address the two shortcomings of current practice outlined above: First, we address the issue of high variance in MCVI, by reducing this variance to zero through novel deterministic approximations to variational inference in neural networks. Second, we derive a general and robust Empirical Bayes (EB) approach to prior choice using hierarchical priors. By exploiting conjugacy we derive data-adaptive closed-form variance priors for neural network weights, which we experimentally demonstrate to be remarkably effective.
Combining these two novel ingredients gives us a performant and robust BNN inference scheme that we refer to as "deterministic variational inference" (DVI). We demonstrate robustness and superior predictive performance in the context of non-linear regression models, deriving novel closed-form results for expected log-likelihoods in homoscedastic and heteroscedastic regression (similar derivations for classification can be found in the appendix).
Experiments on standard regression datasets from the UCI repository (Dheeru & Karra Taniskidou, 2017) show that for identical models DVI converges to local optima with better predictive log-likelihoods than existing methods based on MCVI. In direct comparisons, we show that our Empirical Bayes formulation automatically provides better or comparable test performance than manual tuning of the prior, and that heteroscedastic models consistently outperform the homoscedastic models.
Concretely, our contributions are:
· Development of a deterministic procedure for propagating uncertain activations through neural networks with uncertain weights and ReLU or Heaviside activation functions.
· Development of an EB method for principled tuning of weight priors during BNN training.
· Experimental results showing the accuracy and efficiency of our method and its applicability to heteroscedastic and homoscedastic regression on real datasets.
2 VARIATIONAL INFERENCE IN BAYESIAN NEURAL NETWORKS
We start by describing the inference task that our method must solve to successfully train a BNN. Given a model M parameterized by weights w and a dataset D = (x, y), the inference task is to discover the posterior distribution p(w|x, y). A variational approach acknowledges that this posterior generally does not have an analytic form, and introduces a variational distribution q(w; θ) parameterized by θ to approximate p(w|x, y). The approximation is considered optimal within the variational family for the θ* that minimizes the Kullback-Leibler (KL) divergence between q and the true posterior:
θ* = argmin_θ D_KL[q(w; θ) || p(w|x, y)].
Introducing a prior p(w) and applying Bayes rule allows us to rewrite this as optimization of the quantity known as the evidence lower bound (ELBO):
θ* = argmax_θ { E_{w∼q}[log p(y|w, x)] - D_KL[q(w; θ) || p(w)] }.    (1)
Analytic results exist for the KL term in the ELBO for careful choice of prior and variational distributions (e.g. Gaussian families). However, when M is a non-linear neural network, the first term in equation 1 (referred to as the reconstruction term) cannot be computed exactly: this is where
MC approximations with finite sample size S are typically employed:
E_{w∼q}[log p(y|w, x)] ≈ (1/S) Σ_{s=1}^{S} log p(y|w^(s), x),    w^(s) ∼ q(w; θ).    (2)
Our goal in the next section is to develop an explicit and accurate approximation for this expectation, which provides a deterministic, closed-form expectation calculation, stabilizing BNN training by removing all stochasticity due to Monte Carlo sampling.
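To make the contrast with the deterministic approach concrete, here is a minimal sketch of the MCVI estimator in equation 2 (the baseline that DVI replaces); the callables and sample count are illustrative placeholders, not the paper's implementation.

```python
import numpy as np

def mcvi_reconstruction_term(log_lik, sample_weights, x, y, n_samples=10):
    """Monte Carlo estimate of E_{w~q}[log p(y | w, x)] from equation 2.

    log_lik(w, x, y) evaluates the data log-likelihood under weights w;
    sample_weights() draws one w^(s) ~ q(w; theta). Both are placeholders
    standing in for a concrete BNN implementation.
    """
    estimates = [log_lik(sample_weights(), x, y) for _ in range(n_samples)]
    # unbiased, but the estimator (and its gradient) has variance O(1/S)
    return float(np.mean(estimates))
```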
3 DETERMINISTIC VARIATIONAL APPROXIMATION
Figure 1 shows the architecture of the computation of E_{w∼q}[log p(D|w)] for a feed-forward neural network. The computation can be divided into two parts: first, propagation of activations through parameterized layers and second, evaluation of an unparameterized log-likelihood function (L). In this section, we describe how each of these stages is handled in our deterministic framework.
3.1 MOMENT PROPAGATION
We begin by considering activation propagation (figure 1(a)), with the aim of deriving the form of an approximation q̃(a^L) to the final-layer activation distribution q(a^L) that will be passed to the likelihood computation.
Figure 1: Architecture of a Bayesian neural network. Computation is divided into (a) propagation of activations (a) from an input x and (b) computation of a log-likelihood function L for outputs y. Weights are represented as high dimensional variational distributions (blue) that induce distributions over activations (yellow). MCVI computes using samples (dots); our method propagates a full distribution.
We compute a^L by sequentially computing the distributions for the activations in the preceding layers. Concretely, we define the action of the l-th layer that maps a^(l-1) to a^l as follows:
h^l = f(a^(l-1)),    a^l = h^l W^l + b^l,
where f is a non-linearity and {W^l, b^l} ⊂ w are random variables representing the weights and biases of the l-th layer that are assumed independent from weights in other layers. For notational clarity, in the following we will suppress the explicit layer index l, and use primed symbols to denote variables from the (l-1)-th layer, e.g. a' = a^(l-1). Note that we have made the non-conventional choice to draw the boundaries of the layers such that the linear transform is applied after the non-linearity. This is to emphasize that a^l is constructed by linear combination of many distinct elements of h, and in the limit of vanishing correlation between terms in this combination, we can appeal to the central limit theorem (CLT). Under the CLT, for a large enough hidden dimension, elements a_i will be normally distributed regardless of the potentially complicated distribution for h_j induced by f¹. We empirically observe that this claim is approximately valid even when (weak) correlations appear between the elements of h during training (see section 3.1.1).

Having argued that a adopts a Gaussian form, it remains to compute the first and second moments. In general, these cannot be computed exactly, so we develop an approximate expression. An overview of this derivation is presented here with more details in appendix A. First, we model W, b and h as independent random variables, allowing us to write:
⟨a_i⟩ = ⟨h_j⟩⟨W_ji⟩ + ⟨b_i⟩,
Cov(a_i, a_k) = ⟨h_j h_l⟩ Cov(W_ji, W_lk) + ⟨W_ji⟩ Cov(h_j, h_l) ⟨W_lk⟩ + Cov(b_i, b_k),    (3)
where we have employed the Einstein summation convention and used angle brackets ⟨·⟩ to indicate expectation over q. If we choose a variational family with analytic forms for weight means and covariances (e.g. Gaussian with variational parameters ⟨W_ji⟩ and Cov(W_ji, W_lk)), then the only difficult terms are the moments of h:
⟨h_j⟩ ∝ ∫ f(α_j) exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j,    (4)

⟨h_j h_l⟩ ∝ ∫∫ f(α_j) f(α_l) exp[ -½ (α_j - ⟨a'_j⟩, α_l - ⟨a'_l⟩) ( Σ'_jj  Σ'_jl ; Σ'_lj  Σ'_ll )^(-1) (α_j - ⟨a'_j⟩, α_l - ⟨a'_l⟩)^T ] dα_j dα_l,    (5)
where we have used the Gaussian form of a' parameterized by mean ⟨a'⟩ and covariance Σ', and for brevity we have omitted the normalizing constants. Closed form solutions for the integral in equation 4 exist for Heaviside or ReLU choices of non-linearity f (see appendix A). Furthermore, for these non-linearities, the ⟨a'_j⟩ → ±∞ and ⟨a'_l⟩ → ±∞ asymptotes of the integral in equation 5 have closed form. Figure 2 shows schematically how these asymptotes can be used as a first approximation for equation 5. This approximation is improved by considering that (by definition) the residual decays to zero far from the origin in the (⟨a'_j⟩, ⟨a'_l⟩) plane, and so is well modelled by a decaying function
1We are also required to choose a Gaussian variational approximation for b to preserve the Gaussian distribution of a.
              A(µ1, µ2, ρ)                        Q(µ1, µ2, ρ)
Heaviside     Φ(µ1)Φ(µ2)                          -log(g_h/2π) + [ρ/(2 g_h ρ̄)](µ1² + µ2²) - [2ρ/(1 + ρ̄)]µ1µ2 + O(µ⁴)
ReLU          SR(µ1)SR(µ2) + ρΦ(µ1)Φ(µ2)         -log(g_r/2π) + [ρ/(2 g_r(1 + ρ̄))](µ1² + µ2²) - [(arcsin ρ - ρ)/(ρ g_r)]µ1µ2 + O(µ⁴)

Table 1: Forms for the components of the approximation in equation 6 for Heaviside and ReLU non-linearities. Φ is the CDF of a standard Gaussian, SR is a "soft ReLU" that we define as SR(x) = φ(x) + xΦ(x) where φ is the standard Gaussian density, ρ̄ = √(1 - ρ²), g_h = arcsin ρ and g_r = g_h + ρ/(1 + ρ̄).
exp[-Q(⟨a'_j⟩, ⟨a'_l⟩)], where Q is a polynomial in ⟨a'⟩ with a dominant positive even term. In practice we truncate Q at the quadratic term, and calculate the polynomial coefficients by matching the moments of the resulting Gaussian with the analytic moments of the residual. Specifically, using dimensionless variables µ_i = ⟨a'_i⟩/√Σ'_ii and ρ_jl = Σ'_jl/√(Σ'_jj Σ'_ll), this improved approximation takes the form
⟨h_j h_l⟩ = (Σ'_jl / ρ_jl) { A(µ_j, µ_l, ρ_jl) + exp[ -Q(µ_j, µ_l, ρ_jl) ] },    (6)
where the expressions for the asymptote A and quadratic Q are given in table 1 and derived in appendices A.2.1 and A.2.2. Using equation 6 in equation 3 gives a closed form approximation for the moments of a as a function of moments of a'. Since a is approximately normally distributed by the CLT, this is sufficient information to sequentially propagate moments all the way through the network to compute the mean and covariances of q̃(a^L), our explicit multivariate Gaussian approximation to q(a^L). Any deep learning framework supporting the special functions arcsin and Φ will immediately support backpropagation through the deterministic expressions we have presented. Below we briefly empirically verify the presented approximation, and in section 3.2 we will show how it is used to compute an approximate log-likelihood and posterior predictive distribution for regression and classification tasks.
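To make the moment-propagation recursion concrete, the following is a minimal sketch of a single ReLU layer in the diagonal (dDVI) setting, where only the variances Var(h_j) are tracked. It uses the closed forms ⟨h_j⟩ = √Σ'_jj · SR(µ_j) and ⟨h_j²⟩ = Σ'_jj[µ_jφ(µ_j) + (1 + µ_j²)Φ(µ_j)] from appendix A together with the diagonal of equation 3; function and variable names are ours, not the paper's code.

```python
import numpy as np
from scipy.stats import norm

def soft_relu(x):
    # SR(x) = phi(x) + x * Phi(x), with phi/Phi the standard Gaussian pdf/cdf
    return norm.pdf(x) + x * norm.cdf(x)

def ddvi_relu_layer(a_mean, a_var, W_mean, W_var, b_mean, b_var):
    """Propagate a diagonal Gaussian over pre-activations a' through a ReLU
    non-linearity followed by an uncertain linear map (diagonal-DVI sketch)."""
    mu = a_mean / np.sqrt(a_var)                        # dimensionless mean
    h_mean = np.sqrt(a_var) * soft_relu(mu)             # <h_j>
    h_sq = a_var * (mu * norm.pdf(mu) + (1 + mu**2) * norm.cdf(mu))  # <h_j^2>
    h_var = h_sq - h_mean**2                            # Var(h_j)
    out_mean = h_mean @ W_mean + b_mean                 # <a_i>, equation 3
    out_var = h_sq @ W_var + h_var @ W_mean**2 + b_var  # diagonal of equation 3
    return out_mean, out_var

# toy usage: 4 inputs -> 3 outputs, made-up means and variances just to show shapes
m, v = ddvi_relu_layer(np.zeros(4), np.ones(4),
                       np.full((4, 3), 0.1), np.full((4, 3), 0.01),
                       np.zeros(3), np.full(3, 0.01))
```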
Figure 2: Approximation of ⟨h_j h_l⟩ using an asymptote and Gaussian correction for (a) Heaviside and (b) ReLU non-linearities. Yellow functions have closed forms, and blue indicates residuals. The examples are plotted for -6 < µ < 6 and ρ_jl = 0.5, and the relative magnitude of each correction term is indicated on the vertical axis.
3.1.1 EMPIRICAL VERIFICATION
Approximation accuracy  The approximation derived above relies on three assumptions. First, that some form of CLT holds for the hidden units during training where the iid assumption of the classic CLT is not strictly enforced; second, that a quadratic truncation of Q is sufficient²; and third, that there are only weak correlations between layers so that they can be represented using independent variables in the variational distribution. To provide evidence that these assumptions hold in practice, we train a small ReLU network with two hidden layers each of 128 units to perform 1D heteroscedastic regression on a toy dataset of 500 points drawn from the distribution shown in figure 3(b). The training objective is taken from section 4, and the only detail required here is that a^L is a 2-element vector whose elements are labelled as (m, ℓ). We use a diagonal Gaussian variational family to represent the weights, but we preserve the full covariance of a during propagation. Using an input x = 0.25 (see arrow, figure 3(b)) we compute the distributions for m and ℓ both at the start of training (where we expect the iid assumption to hold) and at convergence (where iid does not necessarily hold). Figure 3(c) shows the comparison between a^L distributions reported by our deterministic approximation and MC evaluation using 20k samples from q(w; θ). This comparison is qualitatively excellent for all cases considered.
2Additional Taylor expansion terms can be computed if this assumption fails.
Figure 3: Empirical accuracy of our approximation on toy 1-dimensional data. (a) We train a 2-layer ReLU network to perform heteroscedastic regression on the dataset shown in (b) and obtain the fit shown in blue. (c) The output distributions for the activation units m and ℓ evaluated at x = 0.25 are in excellent agreement with Monte Carlo (MC) integration with a large number (20k) of samples both before and after training.

Figure 4: Runtime performance of VI methods. We show the time to propagate a batch of 10 activation vectors through a single d × d layer. For MCVI we label curves with the number of samples used, and we show quadratic and cubic scaling guides-to-the-eye (black). Black dots indicate where our implementation runs out of memory (16GB).
Computational efficiency  In traditional MCVI, propagation of S samples of d-dimensional activations through a layer containing a d × d-dimensional transformation requires O(Sd²) compute and O(Sd) memory. Our DVI method approximates the S → ∞ limit, while only demanding O(d³) compute and O(d²) memory (the additional factor of d arises from manipulation of the quadratically large covariance matrix Cov(h_j, h_l)). Whereas MCVI can always trade compute and memory for accuracy by choosing a small value for S, the inherent scaling of DVI with d could potentially limit its practical use for networks with large hidden size. To avoid this limitation, we also consider the case where only the diagonal entries Cov(h_j, h_j) are computed and stored at each layer. We refer to this method as "diagonal-DVI" (dDVI), and in section 6 we show the surprising result that the strong test performance of DVI is largely retained by dDVI across a range of datasets. Figure 4 shows the time required to propagate activations through a single layer using the MCVI, DVI and dDVI methods on a Tesla V100 GPU. As a rough rule of thumb (on this hardware), for layer sizes of practical relevance, we see that absolute DVI runtimes roughly equate to MCVI with S = 300 and dDVI runtime equates to S = 1.
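As a rough illustration of these scaling arguments (not a measurement from the paper), the toy per-layer cost models below compare MCVI, DVI and dDVI; the constant factors are arbitrary.

```python
def mcvi_cost(d, S):
    # MCVI: S sampled activation vectors through a d x d layer
    return {"compute": S * d**2, "memory": S * d}

def dvi_cost(d):
    # DVI: propagates a full d x d activation covariance matrix
    return {"compute": d**3, "memory": d**2}

def ddvi_cost(d):
    # diagonal-DVI: keeps only the variance of each activation
    return {"compute": d**2, "memory": d}

# e.g. at d = 512, DVI compute matches MCVI with roughly S = d samples
print(mcvi_cost(512, 512), dvi_cost(512), ddvi_cost(512))
```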
3.2 LOG-LIKELIHOOD EVALUATION
To use the moment propagation procedure derived above for training BNNs, we need to build a function L that maps final layer activations a^L to the expected log-likelihood term in equation 1 (see figure 1(b)). In appendix B.1 we show the intuitive result that this expected log-likelihood over q(w) can be rewritten as an expectation over q̃(a^L):
E_{w∼q}[log p(y|x, w)] = E_{a^L∼q̃(a^L)}[log p(y|a^L)].    (7)
With this form we can derive closed forms for specific tasks; for brevity we focus on the regression case and refer the reader to appendices B.4 and B.5 for the classification case.
Regression Case  For simplicity we consider scalar y and a Gaussian noise model parameterized by mean m(x; w) and heteroscedastic log-variance log σ_y²(x) = ℓ(x; w). The parameters of this Gaussian are read off as the elements of a 2-dimensional output layer a^L = (m, ℓ) so that p(y|a^L) = N(y | m, e^ℓ). Recall that these parameters themselves are uncertain and the statistics ⟨a^L⟩ and Σ^L can be computed following section 3.1. Inserting the Gaussian forms for p(y|a^L) and q̃(a^L) into equation 7 and performing the integral (see appendix B.2) gives a closed form expression
for the ELBO reconstruction term:

E_{a^L∼q̃(a^L)}[log p(y|a^L)] = -½ { log 2π + ⟨ℓ⟩ + [Σ_mm + (⟨m⟩ - Σ_mℓ - y)²] e^{Σ_ℓℓ/2 - ⟨ℓ⟩} }.    (8)
This heteroscedastic model can be made homoscedastic by setting ⟨ℓ⟩ = Σ_ℓℓ = Σ_ℓm = 0. The expression in equation 8 completes the derivations required to implement the closed form approximation to the ELBO reconstruction term for training a network. In addition, we can also compute a closed form approximation to the predictive distribution that is used at test-time to produce predictions that incorporate all parameter uncertainties. By approximating the moments of the posterior predictive and assuming normality (see appendix B.3), we find:
p(y) ≈ ∫ p(y|a^L) q̃(a^L) da^L ≈ N( y | ⟨m⟩, Σ_mm + e^{⟨ℓ⟩ + Σ_ℓℓ/2} ).    (9)
4 EMPIRICAL BAYES FOR VARIATIONAL BNNS
So far, we have described methods for deterministic approximation of the reconstruction term in the ELBO. We now turn to the KL term. For a d-dimensional Gaussian prior p(w) = N(µ_p, Σ_p), the KL divergence with the Gaussian variational distribution q = N(µ_q, Σ_q) has closed form:
D_KL[q || p] = ½ [ log(|Σ_p| / |Σ_q|) - d + Tr(Σ_p^{-1} Σ_q) + (µ_p - µ_q)^T Σ_p^{-1} (µ_p - µ_q) ].    (10)
However, this requires selection of (µ_p, Σ_p) for which there is usually little intuition beyond arguing µ_p = 0 by symmetry and choosing Σ_p to preserve the expected magnitude of the propagated activations (Glorot & Bengio, 2010; He et al., 2015). In practice, variational Bayes for neural network parameters is sensitive to the choice of prior variance parameters, and we will demonstrate this problem empirically in section 6 (figure 5).
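For reference, a small sketch of equation 10 specialized to the diagonal (mean-field) Gaussians used in the experiments; this helper is ours, and it is the quantity the empirical Bayes prior variance derived below would be plugged into.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL[ N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)) ]: the diagonal
    special case of equation 10."""
    return 0.5 * np.sum(np.log(var_p / var_q) - 1.0
                        + var_q / var_p
                        + (mu_p - mu_q) ** 2 / var_p)

# e.g. KL between a factorized posterior and a zero-mean, unit-variance prior
print(kl_diag_gaussians(np.array([0.1, -0.2]), np.array([0.05, 0.08]),
                        np.zeros(2), np.ones(2)))
```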
To make variational Bayes robust we parameterize the prior hierarchically, retaining a conditional diagonal Gaussian prior and variational distribution on the weights. The hierarchical prior takes the form s ∼ p(s); w ∼ p(w|s), using an inverse gamma distribution on s as the conjugate prior to the elements of the diagonal Gaussian variance. We partition the weights into sets {λ} that typically coincide with the layer partitioning³, and assign a single element s_λ to each set:

s_λ ∼ Inv-Gamma(α, β),    w_λi ∼ N(0, s_λ),    (11)

for shape α and scale β, and where w_λi is the ith weight in set λ.
Rather than taking the fully Bayesian approach, we adopt an empirical Bayes approach (Type-2 MAP), optimizing s, assuming that the integral is dominated by a contribution from this optimal value s = s*. We use the data to inform the optimal setting of s to produce the tightest ELBO:
ELBO = E_{w∼q}[log p(y|h^L(w))] - D_KL[q(w; θ) || p(w|s)p(s)] |_{s = s*},
s* = argmin_s { D_KL[q(w; θ) || p(w|s)] - log p(s) }.    (12)
Writing out the integral for the KL in equation 12, substituting in the forms of the distributions in
equation 11 and differentiating to find the optimum gives
s*_λ = [ Tr(Σ_q^λ + µ_q^λ (µ_q^λ)^T) + 2β ] / ( Ω_λ + 2α + 2 ),    (13)
where Ω_λ is the number of weights in the set λ. The influence of the data on the choice of s is made explicit here through dependence on the learned variational parameters Σ_q and µ_q. Using s* to populate the elements of the diagonal prior variance Σ_p, we can evaluate the KL in equation 10 under the empirical Bayes prior. Optimization of the resulting ELBO then simultaneously tunes the variational distribution and prior.
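A minimal sketch of the empirical Bayes update in equation 13 for one weight set (e.g. one layer), assuming a factorized Gaussian posterior; alpha and beta are the Inv-Gamma hyperparameters (the experiments in section 6 use α = 1, β = 10).

```python
import numpy as np

def empirical_bayes_prior_variance(mu_q, var_q, alpha=1.0, beta=10.0):
    """Closed-form optimal prior variance s* for one weight set (equation 13).

    mu_q, var_q -- variational means and variances of every weight in the set
    """
    n_weights = mu_q.size                       # Omega_lambda
    trace_term = np.sum(var_q + mu_q ** 2)      # Tr(Sigma_q + mu_q mu_q^T)
    return (trace_term + 2.0 * beta) / (n_weights + 2.0 * alpha + 2.0)

# usage: recompute s* from the current variational parameters of a layer
s_star = empirical_bayes_prior_variance(np.random.randn(50 * 13) * 0.1,
                                        np.full(50 * 13, 0.05))
```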
In the experiments we will demonstrate that the proposed empirical Bayes approach works well; however, it only approximates the full Bayesian solution, and it could fail if we were to allow too many degrees of freedom. To see this, assume we were to use one prior per weight element, and we would also define a hyperprior for each prior mean. Then, adjusting both the prior variance and prior mean using empirical Bayes would always lead to a KL-divergence of zero and the ELBO objective would degenerate into maximum likelihood.
3In general, any arbitrary partitioning can be used
5 RELATED WORK
Bayesian neural networks have a rich history. In a 1992 landmark paper David MacKay demonstrated the many potential benefits of a Bayesian approach to neural network learning (MacKay, 1992); in particular, this work contained a convincing demonstration of naturally accounting for model flexibility in the form of the Bayesian Occam's razor, facilitating comparison between different models, accurately calibrating predictive uncertainty, and performing learning robust to overfitting. However, at the time Bayesian inference was achieved only for small and shallow neural networks using a comparatively crude Laplace approximation. Another early review article summarizing advantages and challenges in Bayesian neural network learning is (MacKay, 1995c).
This initial excitement around Bayesian neural networks led to two main methods being developed; First, Hinton & van Camp (1993) and MacKay (1995b) developed the variational Bayes (VB) approach for posterior inference. Whereas Hinton & van Camp (1993) were motivated from a minimum description length (MDL) compression perspective, MacKay (1995b) motivated his equivalent ensemble learning method from a statistical physics perspective of variational free energy minimization. Barber & Bishop (1998) extended the methodology for two-layer neural networks to use general multivariate Normal variational distributions. Second, Neal (1993) developed efficient gradient-based Monte Carlo methods in the form of "hybrid Monte Carlo", now known as Hamiltonian Monte Carlo, and also raised the question of prior design and limiting behaviour of Bayesian neural networks.
Rebirth of Bayesian neural networks. After more than a decade of no further work on Bayesian neural networks, Graves (2011) revived the field by using Monte Carlo variational inference (MCVI) to make VB practical and scalable, demonstrating gains in predictive performance on real world tasks.
Since 2015 the VB approach to Bayesian neural networks has been mainstream (Blundell et al., 2015); key research drivers since then are the problems of high variance in MCVI and the search for useful variational families. One approach to reduce variance in feedforward networks is the local reparameterization trick (Kingma et al., 2015) (see appendix D). To enhance the variational families, more complicated distributions such as matrix Gaussian posteriors (Louizos & Welling, 2016), multiplicative posteriors (Kingma et al., 2015), and hierarchical posteriors (Louizos & Welling, 2017) are used. Both our methods, the deterministic moment approximation and the empirical Bayes estimation, can potentially be extended to these richer families.
Prior choice. Choosing priors in Bayesian neural networks remains an open issue. The hierarchical priors for feedforward neural networks that we use have been investigated before by Neal (1993) and MacKay (1995a), the latter proposing a "cheap and cheerful" heuristic, alternating optimization of weights and inverse variance parameters. Barber & Bishop (1998) also used a hierarchical prior and an efficient closed-form factored VB approximation; our approach can be seen as a point estimate to their approach in order to enable use of our closed-form moment approximation. Graves (2011) also used hierarchical Gaussian priors with flat hyperpriors, deriving a closed-form update for the prior mean and variance. Compared to these prior works our approach is rigorous and with sufficient data accurately approximates the Bayesian approach of integrating over the prior parameters.
Alternative inference procedures. As an alternative to variational Bayes, probabilistic backpropagation (PBP) (Hernández-Lobato & Adams, 2015) applies approximate inference in the form of assumed density filtering (ADF) to refine a Gaussian posterior approximation. Like in our work, each update to the approximate posterior requires propagating means and variances of activations through the network. Hernández-Lobato & Adams (2015) only consider the diagonal propagation case and regression. Since the original work, PBP has been generalized to classification (Ghosh et al., 2016) and richer posterior families such as the matrix variate Normal posteriors (Sun et al., 2017). Our moment approximation could be used to improve the inference accuracy of PBP.
Gaussianity in neural networks. Our demonstration of Gaussianity of ReLU network activations is also directly relevant to recent work on Gaussian process interpretations of deep neural networks (Matthews et al., 2018; Lee et al., 2017), validating the insight that activations in deep neural networks are closely approximated by Gaussian processes. Two recent works derived deterministic moment approximations for deep neural networks: Bibi et al. (2018), using Price's theorem, derived exact first and second moment expressions for ReLU activations but limit themselves to the case of zero-mean Gaussian activations. Kandemir et al. (2018) also derive closed-form solutions to the
ELBO for the case of diagonal Gaussian variational families. However, their approach is limited to linear layers without bias.
Markov chain Monte Carlo approaches. Another rich class of approximate inference methods for Bayesian neural networks are stochastic gradient Markov chain Monte Carlo (SG-MCMC) methods. These methods allow for approximate posterior parameter inference using unbiased log-likelihood estimates. Stochastic gradient Langevin dynamics (SGLD) was the first method in this class (Welling & Teh, 2011). SGLD is particularly simple and efficient to implement, but recent methods increase efficiency in the case of correlated posteriors by estimating the Fisher information matrix (Ahn et al., 2012) and extend Hamiltonian Monte Carlo to the stochastic gradient case (Chen et al., 2014). A complete characterization of SG-MCMC methods is given by (Ma et al., 2015; Gong et al., 2018). However, despite this progress, important theoretical questions regarding approximation guarantees for practical computational budgets remain (Nagapetyan et al., 2017). Moreover, while SG-MCMC methods work robustly in practice, they remain computationally inefficient, especially because evaluation of the posterior predictive requires evaluating an ensemble of models.
Wild approximations. The above methods are principled but often require sophisticated implementations; recently, a few methods aim to provide "cheap" approximations to the Bayes posterior. Dropout has been interpreted by Gal & Ghahramani (2016) to approximately correspond to variational inference. Likewise, Bootstrap posteriors (Lakshminarayanan et al., 2017; Fushiki et al., 2005; Harris, 1989) have been proposed as a general, robust, and accurate method for posterior inference. However, obtaining a bootstrap posterior ensemble of size k is computationally intense at k times the computation of training a single model.
6 EXPERIMENTS
We implement4 deterministic variational inference (DVI) as described above to train small ReLU networks on UCI regression datasets (Dheeru & Karra Taniskidou, 2017). The experiments address the claims that our methods for eliminating gradient variance and automatic tuning of the prior improve the performance of the final trained model. In Appendix C we present extended results to demonstrate that our method is competitive against a variety of models and inference schemes.
Dataset   |D|     d_x   DVI            dDVI           MCVI           hoDVI
bost      506     13    -2.41 ± 0.02   -2.42 ± 0.02   -2.46 ± 0.02   -2.58 ± 0.04
conc      1030    8     -3.06 ± 0.01   -3.07 ± 0.02   -3.07 ± 0.01   -3.23 ± 0.01
ener      768     8     -1.01 ± 0.06   -1.06 ± 0.06   -1.03 ± 0.04   -2.09 ± 0.06
kin8      8192    8      1.13 ± 0.00    1.13 ± 0.00    1.14 ± 0.00    1.01 ± 0.01
nava      11934   16     6.29 ± 0.04    6.22 ± 0.06    5.94 ± 0.05    5.84 ± 0.06
powe      9568    4     -2.80 ± 0.00   -2.80 ± 0.00   -2.80 ± 0.00   -2.82 ± 0.00
prot      45730   9     -2.85 ± 0.01   -2.84 ± 0.01   -2.87 ± 0.01   -2.94 ± 0.00
wine      1588    11    -0.90 ± 0.01   -0.91 ± 0.02   -0.92 ± 0.01   -0.96 ± 0.01
yach      308     6     -0.47 ± 0.03   -0.47 ± 0.03   -0.68 ± 0.03   -1.41 ± 0.03

Table 2: Average test log-likelihood on UCI datasets. |D| is the dataset size, and d_x is the input dimension.
Deterministic vs. Stochastic  We compare DVI with MCVI from equation 2 with S = 10 samples. The same model is used for each inference method: a single hidden layer of 50 units for each dataset considered, extending this to 100 units in the special case of the larger protein structure dataset, prot. Additionally, both methods use the same EB prior from equation 13 with a broad inverse Gamma hyperprior (α = 1, β = 10) and an independent s_λ for each linear transformation. Each dataset is split into random training and test sets with 90% and 10% of the data respectively. This splitting process is repeated 20 times and the average test performance of each method at convergence is reported in table 2 (see also learning curves in appendix E). We see that DVI consistently outperforms MCVI, by up to 0.35 nats per data point on some datasets. The computationally efficient diagonal-DVI (dDVI) surprisingly retains much of this performance. By default we use the heteroscedastic
4Our implementation in TensorFlow is available at redacted for anonymity
model, and we observe that this uniformly delivers better results than a homoscedastic model (hoDVI; rightmost column in table 2) on these datasets with no overfitting issues5.
Empirical Bayes In Figure 5 we compare the performance of networks trained with manual tuning of a fixed Gaussian prior to networks trained with the automatic EB tuning. We find that the EB method consistently finds priors that produce models with competitive or significantly improved test log-likelihood relative to the best manual setting. Since this observation holds across all datasets considered, we say that our method is "robust". Note that the EB method can outperform manual tuning because it automatically finds different prior variances for each weight matrix, whereas in the manual tuning case we search over a single hyperparameter controlling all prior variances.
Figure 5: Comparison of converged test log-likelihood with a manually tuned prior variance (orange) or empirical Bayes (blue).
7 CONCLUSION
We introduced two innovations to make variational inference for neural networks more robust: 1. an effective deterministic approximation to the moments of activations of a neural network; and 2. a simple empirical Bayes hyperparameter update. We demonstrate that together these innovations make variational Bayes a competitive method for Bayesian inference in neural heteroscedastic regression models.
Beside the challenge of efficient posterior inference, for Bayesian neural networks two major issues remain open. First, how to design suitable priors for functions represented by neural network parameters? And second, what structure do the posterior distributions in neural network models have and how can this be used to improve approximate inference (Watanabe, 2009)?
REFERENCES
Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. arXiv preprint arXiv:1206.6380, 2012.
David Barber and Christopher M Bishop. Ensemble learning in Bayesian neural networks. NATO ASI Series F Computer and Systems Sciences, 168:215-238, 1998.
Adel Bibi, Modar Alfadly, and Bernard Ghanem. Analytic expressions for probabilistic moments of PL-DNN with Gaussian input. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.
Thang Bui, Daniel Hernández-Lobato, José Hernández-Lobato, Yingzhen Li, and Richard Turner. Deep Gaussian processes for regression using approximate expectation propagation. In International Conference on Machine Learning, pp. 1472-1481, 2016.
Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pp. 1683-1691, 2014.
5Note that this result is non-trivial because heteroscedastic models are more complex and could result in poorer approximate inference leading to worse test performance
Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
Tadayoshi Fushiki, Fumiyasu Komaki, Kazuyuki Aihara, et al. Nonparametric bootstrap prediction. Bernoulli, 11(2):293-307, 2005.
Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pp. 1050-1059, 2016.
Soumya Ghosh, Francesco Maria Delle Fave, and Jonathan S Yedidia. Assumed density filtering methods for learning Bayesian neural networks. In AAAI, pp. 1589-1595, 2016.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.
Wenbo Gong, Yingzhen Li, and José Miguel Hernández-Lobato. Meta-learning for stochastic gradient MCMC. arXiv preprint arXiv:1806.04522, 2018.
Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pp. 2348-2356, 2011.
Ian R Harris. Predictive fit for natural exponential families. Biometrika, 76(4):675-684, 1989.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1026-1034, 2015.
José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pp. 1861-1869, 2015.
GE Hinton and Drew van Camp. Keeping neural networks simple by minimising the description length of weights. In Proceedings of COLT-93, pp. 5-13, 1993.
Melih Kandemir, Manuel Haussmann, and Fred A Hamprecht. Sampling-free variational inference of Bayesian neural nets. arXiv preprint arXiv:1805.07654, 2018.
Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575-2583, 2015.
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pp. 6402-6413, 2017.
Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. arXiv preprint arXiv:1711.00165, 2017.
Christos Louizos and Max Welling. Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pp. 1708-1716, 2016.
Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. arXiv preprint arXiv:1703.01961, 2017.
Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pp. 2917-2925, 2015.
David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448-472, 1992.
David JC MacKay. Bayesian neural networks and density networks. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 354(1):73-80, 1995a.
David JC MacKay. Developments in probabilistic modelling with neural networks - ensemble learning. In Neural Networks: Artificial Intelligence and Industrial Applications, pp. 191-198. Springer, 1995b.
David JC MacKay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. Network: Computation in Neural Systems, 6(3):469-505, 1995c.
Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271, 2018.
Andrew Miller, Nick Foti, Alexander D'Amour, and Ryan P Adams. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems, pp. 3708-3718, 2017.
Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. arXiv preprint arXiv:1701.05369, 2017.
Tigran Nagapetyan, Andrew B Duncan, Leonard Hasenclever, Sebastian J Vollmer, Lukasz Szpruch, and Konstantinos Zygalakis. The true cost of stochastic gradient Langevin dynamics. arXiv preprint arXiv:1706.02692, 2017.
Radford M Neal. Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems, pp. 475-482, 1993.
Christopher G. Small. Expansions and Asymptotics for Statistics. CRC Press, 2010.
Shengyang Sun, Changyou Chen, and Lawrence Carin. Learning structured weight uncertainty in Bayesian neural networks. In Artificial Intelligence and Statistics, pp. 1283-1292, 2017.
Jarno Vanhatalo and Aki Vehtari. MCMC methods for MLP-network and Gaussian process and stuff - a documentation for Matlab toolbox MCMCstuff. Laboratory of Computational Engineering, Helsinki University of Technology, 2006.
Sumio Watanabe. Algebraic Geometry and Statistical Learning Theory, volume 25. Cambridge University Press, 2009.
Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 681-688, 2011.
Zhanxing Zhu, Ruosi Wan, and Mingjun Zhong. Neural control variates for variance reduction. arXiv preprint arXiv:1806.00159, 2018.
APPENDIX
A MOMENTS OF THE ACTIVATION VARIABLES a
Under assumption of independence of h, W and b, we can write:
⟨a_i⟩ = ⟨h_j W_ji + b_i⟩ = ⟨h_j⟩⟨W_ji⟩ + ⟨b_i⟩    (14)

Cov(a_i, a_k) = Cov(h_j W_ji, h_l W_lk) + Cov(b_i, b_k)
             = ⟨h_j W_ji h_l W_lk⟩ - ⟨h_j W_ji⟩⟨h_l W_lk⟩ + Cov(b_i, b_k)
             = ⟨h_j h_l⟩⟨W_ji W_lk⟩ - ⟨h_j⟩⟨h_l⟩⟨W_ji⟩⟨W_lk⟩ + Cov(b_i, b_k)
             = ⟨h_j h_l⟩[Cov(W_ji, W_lk) + ⟨W_ji⟩⟨W_lk⟩] - ⟨h_j⟩⟨h_l⟩⟨W_ji⟩⟨W_lk⟩ + Cov(b_i, b_k)
             = ⟨h_j h_l⟩ Cov(W_ji, W_lk) + ⟨W_ji⟩ Cov(h_j, h_l) ⟨W_lk⟩ + Cov(b_i, b_k),    (15)
which is seen in the main text as equation 3. For Heaviside and ReLU activation functions, closed forms exist for ⟨h_j⟩ in equation 14:
Heaviside:    ⟨h_j⟩ = (1/√(2πΣ'_jj)) ∫_0^∞ exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j = Φ(µ_j)

ReLU:    ⟨h_j⟩ = (1/√(2πΣ'_jj)) ∫_0^∞ α_j exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j = √Σ'_jj · SR(µ_j),
where SR(x) := φ(x) + xΦ(x) is a "soft ReLU", φ and Φ represent the standard Gaussian PDF and CDF, and we have introduced the dimensionless variables µ_j = ⟨a'_j⟩/√Σ'_jj. These results are sufficient to evaluate equation 14, so in the following sections we turn to each term from equation 15.
A.1 EVALUATION OF TERM 1: ⟨h_j h_l⟩ Cov(W_ji, W_lk)
In the general case, we can use the results from section A.2 to evaluate off-diagonal ⟨h_j h_l⟩. However, in our experiments we always consider the special case where Cov(W_ji, W_lk) is diagonal. In this case we can write the first term in equation 15 as (reintroducing the explicit summation):
Σ_{jl} ⟨h_j h_l⟩ Cov(W_ji, W_lk) = Σ_{jl} ⟨h_j h_l⟩ δ_jl δ_ik Var(W_ji) = δ_ik Σ_j ⟨h_j h_j⟩ Var(W_ji) = diag[ v Var(W) ],
i.e. this term is a diagonal matrix with the diagonal given by the left product of the vector v_j = ⟨h_j h_j⟩ with the matrix Var(W_ki). Note that ⟨h_j h_j⟩ can be evaluated analytically for Heaviside and ReLU activation functions:
Heaviside:    ⟨h_j h_j⟩ = (1/√(2πΣ'_jj)) ∫_0^∞ exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j = Φ(µ_j)

ReLU:    ⟨h_j h_j⟩ = (1/√(2πΣ'_jj)) ∫_0^∞ α_j² exp[ -(α_j - ⟨a'_j⟩)² / (2Σ'_jj) ] dα_j = Σ'_jj [ µ_j φ(µ_j) + (1 + µ_j²) Φ(µ_j) ]
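As a quick numerical sanity check (ours, not from the paper), the closed forms above can be compared against Monte Carlo estimates of the ReLU moments:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
a_mean, a_var = 0.4, 2.0                       # <a'_j> and Sigma'_jj
mu = a_mean / np.sqrt(a_var)

# closed forms from appendix A
h_mean = np.sqrt(a_var) * (norm.pdf(mu) + mu * norm.cdf(mu))
h_sq = a_var * (mu * norm.pdf(mu) + (1 + mu**2) * norm.cdf(mu))

# Monte Carlo reference
samples = np.maximum(0.0, rng.normal(a_mean, np.sqrt(a_var), size=1_000_000))
print(h_mean, samples.mean())      # first moment <h_j>
print(h_sq, (samples**2).mean())   # second moment <h_j h_j>
```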
A.2 EVALUATION OF TERM 2: ⟨W_ji⟩ Cov(h_j, h_l) ⟨W_lk⟩
Evaluation of Cov(h_j, h_l) requires an expression for ⟨h_j h_l⟩. From equation 5, we write:
⟨h_j h_l⟩ ∝ ∫∫ f(α_j) f(α_l) exp[ -½ P(α_j, α_l; ⟨a'⟩, Σ') ] dα_j dα_l,
where P is the quadratic form:
P(α_j, α_l; ⟨a'⟩, Σ') = (α_j - ⟨a'_j⟩, α_l - ⟨a'_l⟩) ( Σ'_jj  Σ'_jl ; Σ'_lj  Σ'_ll )^(-1) (α_j - ⟨a'_j⟩, α_l - ⟨a'_l⟩)^T
                      = (η_j - µ_j, η_l - µ_l) ( 1  ρ_jl ; ρ_lj  1 )^(-1) (η_j - µ_j, η_l - µ_l)^T.    (16)
Here we have introduced further dimensionless variables η_j = α_j/√Σ'_jj, η_l = α_l/√Σ'_ll and ρ_jl = Σ'_jl/√(Σ'_jj Σ'_ll). We can then rewrite equation 16 in terms of a dimensionless integral I:
⟨h_j h_l⟩ = (Σ'_jl / ρ_jl) I(µ_j, µ_l,