Published as a conference paper at ICLR 2019

DETERMINISTIC VARIATIONAL INFERENCE FOR ROBUST BAYESIAN NEURAL NETWORKS

Anqi Wu¹∗, Sebastian Nowozin²†, Edward Meeds⁴, Richard E. Turner³,⁴, José Miguel Hernández-Lobato³,⁴, Alexander L. Gaunt⁴

¹ Princeton Neuroscience Institute, Princeton University
² Google AI Berlin
³ Department of Engineering, University of Cambridge
⁴ Microsoft Research, Cambridge

anqiw@princeton.edu, nowozin@google.com, {ret26,jmh233}@cam.ac.uk, {ted.meeds, algaunt}@microsoft.com

ABSTRACT

Bayesian neural networks (BNNs) hold great promise as a ﬂexible and principled

solution to deal with uncertainty when learning from ﬁnite data. Among approaches

to realize probabilistic inference in deep neural networks, variational Bayes (VB)

is theoretically grounded, generally applicable, and computationally efﬁcient. With

wide recognition of potential advantages, why is it that variational Bayes has seen

very limited practical use for BNNs in real applications? We argue that variational

inference in neural networks is fragile: successful implementations require careful

initialization and tuning of prior variances, as well as controlling the variance of

Monte Carlo gradient estimates. We provide two innovations that aim to turn VB

into a robust inference tool for Bayesian neural networks: ﬁrst, we introduce a novel

deterministic method to approximate moments in neural networks, eliminating

gradient variance; second, we introduce a hierarchical prior for parameters and

a novel Empirical Bayes procedure for automatically selecting prior variances.

Combining these two innovations, the resulting method is highly efﬁcient and

robust. On the application of heteroscedastic regression we demonstrate good

predictive performance over alternative approaches.

1 INTRODUCTION

Bayesian approaches to neural network training marry the representational ﬂexibility of deep neural

networks with principled parameter estimation in probabilistic models. Compared to “standard”

parameter estimation by maximum likelihood, the Bayesian framework promises to bring key

advantages such as better uncertainty estimates on predictions and automatic model regularization

(MacKay, 1992; Graves, 2011). These features are often crucial for informing downstream decision

tasks and reducing overﬁtting, particularly on small datasets. However, despite potential advantages,

such Bayesian neural networks (BNNs) are often overlooked due to two limitations: First, posterior

inference in deep neural networks is analytically intractable and approximate inference with Monte

Carlo (MC) techniques can suffer from crippling variance given only a reasonable computation

budget (Kingma et al., 2015; Molchanov et al., 2017; Miller et al., 2017; Zhu et al., 2018). Second,

performance of the Bayesian approach is sensitive to the choice of prior (Neal, 1993), and although

we may have a priori knowledge concerning the function represented by a neural network, it is

generally difﬁcult to translate this into a meaningful prior on neural network weights. Sensitivity to

priors and initialization makes BNNs non-robust and thus often irrelevant in practice.

In this paper, we describe a novel approach for inference in feed-forward BNNs that is simple to implement and aims to solve these two limitations. We adopt the paradigm of variational Bayes (VB) for BNNs (Hinton & van Camp, 1993; MacKay, 1995c) which is normally deployed using Monte Carlo variational inference (MCVI) (Graves, 2011; Blundell et al., 2015). Within this paradigm we address the two shortcomings of current practice outlined above: First, we address the issue of high variance in MCVI, by reducing this variance to zero through novel deterministic approximations to variational inference in neural networks. Second, we derive a general and robust Empirical Bayes (EB) approach to prior choice using hierarchical priors. By exploiting conjugacy we derive data-adaptive closed-form variance priors for neural network weights, which we experimentally demonstrate to be remarkably effective.

∗ Work done during an internship at Microsoft Research, Cambridge.

† Work done while at Microsoft Research, Cambridge.

Combining these two novel ingredients gives us a performant and robust BNN inference scheme

that we refer to as “deterministic variational inference” (DVI). We demonstrate robustness and

improved predictive performance in the context of non-linear regression models, deriving novel

closed-form results for expected log-likelihoods in homoscedastic and heteroscedastic regression

(similar derivations for classiﬁcation can be found in the appendix).

Experiments on standard regression datasets from the UCI repository (Dheeru & Karra Taniskidou, 2017) show that, for identical models, DVI converges to local optima with better predictive log-likelihoods than existing methods based on MCVI. In direct comparisons, we show that our Empirical Bayes formulation automatically provides test performance better than or comparable to manual tuning of the prior, and that heteroscedastic models consistently outperform the homoscedastic models.

Concretely, our contributions are:

• Development of a deterministic procedure for propagating uncertain activations through neural networks with uncertain weights and ReLU or Heaviside activation functions.

• Development of an EB method for principled tuning of weight priors during BNN training.

• Experimental results showing the accuracy and efficiency of our method and applicability to heteroscedastic and homoscedastic regression on real datasets.

2 VARIATIONAL INFERENCE IN BAYESIAN NEURAL NETWORKS

We start by describing the inference task that our method must solve to successfully train a BNN. Given a model $\mathcal{M}$ parameterized by weights $w$ and a dataset $\mathcal{D} = (x, y)$, the inference task is to discover the posterior distribution $p(w|x, y)$. A variational approach acknowledges that this posterior generally does not have an analytic form, and introduces a variational distribution $q(w; \theta)$ parameterized by $\theta$ to approximate $p(w|x, y)$. The approximation is considered optimal within the variational family for $\theta^*$ that minimizes the Kullback-Leibler (KL) divergence between $q$ and the true posterior.

$$\theta^* = \operatorname*{argmin}_{\theta} \; D_{\mathrm{KL}}\left[q(w; \theta)\,\|\,p(w|x, y)\right].$$

Introducing a prior $p(w)$ and applying Bayes rule allows us to rewrite this as optimization of the quantity known as the evidence lower bound (ELBO):

$$\theta^* = \operatorname*{argmax}_{\theta} \left\{ \mathbb{E}_{w \sim q}\left[\log p(y|w, x)\right] - D_{\mathrm{KL}}\left[q(w; \theta)\,\|\,p(w)\right] \right\}. \tag{1}$$

Analytic results exist for the KL term in the ELBO for careful choice of prior and variational distributions (e.g. Gaussian families). However, when $\mathcal{M}$ is a non-linear neural network, the first term in equation 1 (referred to as the reconstruction term) cannot be computed exactly: this is where MC approximations with finite sample size $S$ are typically employed:

$$\mathbb{E}_{w \sim q}\left[\log p(y|w, x)\right] \approx \frac{1}{S} \sum_{s=1}^{S} \log p(y|w^{(s)}, x), \qquad w^{(s)} \sim q(w; \theta). \tag{2}$$

Our goal in the next section is to develop an explicit and accurate approximation for this expectation, which provides a deterministic, closed-form expectation calculation, stabilizing BNN training by removing all stochasticity due to Monte Carlo sampling.
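To make the role of the sample size $S$ in equation 2 concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy linear-Gaussian model (chosen only because its reconstruction term has a closed form to compare against), a factorized Gaussian $q$, and a zero-mean isotropic Gaussian prior; all function names and numerical values are hypothetical.

```python
import numpy as np

def kl_diag_gaussian_to_prior(mu, var, prior_var):
    """KL[N(mu, diag(var)) || N(0, prior_var * I)], available in closed form."""
    return 0.5 * np.sum(var / prior_var + mu**2 / prior_var - 1.0 - np.log(var / prior_var))

def gaussian_loglik(y, pred, noise_var):
    """log N(y | pred, noise_var), summed over data points (last axis)."""
    return np.sum(-0.5 * np.log(2 * np.pi * noise_var)
                  - 0.5 * (y - pred) ** 2 / noise_var, axis=-1)

def mc_reconstruction(mu, var, x, y, noise_var, num_samples, rng):
    """Equation 2: average the log-likelihood over S weight samples w ~ q."""
    w = mu + np.sqrt(var) * rng.standard_normal((num_samples, mu.size))  # (S, d)
    preds = w @ x.T                                                       # (S, n)
    return gaussian_loglik(y, preds, noise_var).mean()

def exact_reconstruction(mu, var, x, y, noise_var):
    """Closed form for the linear toy model only (not for a neural network):
    E[(y - x.w)^2] = (y - x.mu)^2 + sum_j x_j^2 var_j."""
    mean_sq_err = (y - x @ mu) ** 2 + (x**2) @ var
    return np.sum(-0.5 * np.log(2 * np.pi * noise_var) - 0.5 * mean_sq_err / noise_var)

rng = np.random.default_rng(0)
x = rng.standard_normal((50, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(50)
mu, var = np.array([0.9, -1.8, 0.4]), np.full(3, 0.05)   # hypothetical q(w; theta)

exact = exact_reconstruction(mu, var, x, y, noise_var=0.01)
for S in (1, 10, 100):
    estimates = [mc_reconstruction(mu, var, x, y, 0.01, S, rng) for _ in range(200)]
    print(f"S={S:4d}  spread (std) of the MC reconstruction term: {np.std(estimates):.2f}")

# ELBO of equation 1 with the noise-free reconstruction term.
elbo = exact - kl_diag_gaussian_to_prior(mu, var, prior_var=1.0)
print("exact reconstruction term:", exact, " ELBO:", elbo)
```

The spread of the estimate shrinks only as $1/\sqrt{S}$, which is the gradient-noise problem that a deterministic evaluation of the reconstruction term avoids.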

3 DETERMINISTIC VARIATIONAL APPROXIMATION

Figure 1 shows the architecture of the computation of $\mathbb{E}_{w \sim q}[\log p(\mathcal{D}|w)]$ for a feed-forward neural network. The computation can be divided into two parts: first, propagation of activations through parameterized layers and second, evaluation of an unparameterized log-likelihood function ($\mathcal{L}$). In this section, we describe how each of these stages is handled in our deterministic framework.

Figure 1: Architecture of a Bayesian neural network. Computation is divided into (a) propagation of activations ($a$) from an input $x$ and (b) computation of a log-likelihood function $\mathcal{L}$ for outputs $y$. Weights are represented as high dimensional variational distributions (blue) that induce distributions over activations (yellow). MCVI computes using samples (dots); our method propagates a full distribution.

3.1 MOMENT PROPAGATION

We begin by considering activation propagation (figure 1(a)), with the aim of deriving the form of an approximation $\tilde{q}(a^L)$ to the final layer activation distribution $q(a^L)$ that will be passed to the likelihood computation. We compute $a^L$ by sequentially computing the distributions for the activations in the preceding layers. Concretely, we define the action of the $l$-th layer that maps $a^{(l-1)}$ to $a^l$ as follows:

$$h^l = f(a^{(l-1)}), \qquad a^l = h^l W^l + b^l,$$

where $f$ is a non-linearity and $\{W^l, b^l\} \subset w$ are random variables representing the weights and biases of the $l$-th layer that are assumed independent from weights in other layers. For notational clarity, in the following we will suppress the explicit layer index $l$, and use primed symbols to denote variables from the $(l-1)$-th layer, e.g. $a' = a^{(l-1)}$. Note that we have made the non-conventional choice to draw the boundaries of the layers such that the linear transform is applied after the non-linearity. This is to emphasize that $a$ is constructed by linear combination of many distinct elements of $h$, and in the limit of vanishing correlation between terms in this combination, we can appeal to the central limit theorem (CLT). Under the CLT, for a large enough hidden dimension and for variational distributions with finite first and second moments, elements $a_i$ will be normally distributed regardless of the potentially complicated distribution for $h_j$ induced by $f$. We empirically observe that this claim is approximately valid even when (weak) correlations appear between the elements of $h$ during training (see section 3.1.1).
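The CLT argument can be illustrated numerically. The sketch below is not taken from the paper: it assumes independent standard Gaussian inputs $a'$, ReLU units, and independent Gaussian weights with hypothetical mean 0.2 and unit standard deviation, and it measures how quickly one output unit $a_i = \sum_j h_j W_{ji}$ loses its excess kurtosis as the hidden dimension grows.

```python
import numpy as np

def excess_kurtosis(samples):
    """Sample excess kurtosis; 0 for an exactly Gaussian variable."""
    centered = samples - samples.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0

def sample_preactivation(hidden_dim, num_samples, rng, w_mean=0.2, w_std=1.0):
    """Draw samples of one output unit a_i = sum_j h_j W_ji with h = ReLU(a')."""
    a_prev = rng.standard_normal((num_samples, hidden_dim))   # Gaussian a' (assumed)
    h = np.maximum(a_prev, 0.0)                                # strongly non-Gaussian h
    w = w_mean + w_std * rng.standard_normal((num_samples, hidden_dim))
    # 1/sqrt(d) only rescales a_i; it does not change the kurtosis.
    return np.sum(h * w, axis=1) / np.sqrt(hidden_dim)

rng = np.random.default_rng(0)
for d in (1, 8, 128):
    a = sample_preactivation(d, 20_000, rng)
    print(f"hidden_dim={d:4d}  excess kurtosis of a_i: {excess_kurtosis(a):+.3f}")
```

With a single hidden unit the output is far from Gaussian, while for 128 units the excess kurtosis is close to zero, consistent with treating $a$ as Gaussian for wide layers.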

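We are also required to choose a Gaussian variational approximation for $b$ to preserve the Gaussian distribution of $a$. Having argued that $a$ adopts a Gaussian form, it remains to compute the first and second moments. In general, these cannot be computed exactly, so we develop an approximate expression. An overview of this derivation is presented here with more details in appendix A. First, we model $W$, $b$ and $h$ as independent random variables, allowing us to write:

$$\langle a_i \rangle = \langle h_j \rangle \langle W_{ji} \rangle + \langle b_i \rangle$$

$$\mathrm{Cov}(a_i, a_k) = \langle h_j h_l \rangle\, \mathrm{Cov}(W_{ji}, W_{lk}) + \langle W_{ji} \rangle\, \mathrm{Cov}(h_j, h_l)\, \langle W_{lk} \rangle + \mathrm{Cov}(b_i, b_k), \tag{3}$$

where we have employed the Einstein summation convention and used angle brackets to indicate expectation over $q$. If we choose a variational family with analytic forms for weight means and covariances (e.g. Gaussian with variational parameters $\langle W_{ji} \rangle$ and $\mathrm{Cov}(W_{ji}, W_{lk})$), then the only difficult terms are the moments of $h$:

$$\langle h_j \rangle \propto \int f(\alpha_j) \exp\left[ -\frac{(\alpha_j - \langle a'_j \rangle)^2}{2\Sigma'_{jj}} \right] d\alpha_j, \tag{4}$$

$$\langle h_j h_l \rangle \propto \int f(\alpha_j) f(\alpha_l) \exp\left[ -\frac{1}{2} \begin{pmatrix} \alpha_j - \langle a'_j \rangle \\ \alpha_l - \langle a'_l \rangle \end{pmatrix}^{\top} \begin{pmatrix} \Sigma'_{jj} & \Sigma'_{jl} \\ \Sigma'_{lj} & \Sigma'_{ll} \end{pmatrix}^{-1} \begin{pmatrix} \alpha_j - \langle a'_j \rangle \\ \alpha_l - \langle a'_l \rangle \end{pmatrix} \right] d\alpha_j\, d\alpha_l, \tag{5}$$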

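| Non-linearity | $A(\mu_1, \mu_2, \rho)$ | $Q(\mu_1, \mu_2, \rho)$ |
|---|---|---|
| Heaviside | $\Phi(\mu_1)\Phi(\mu_2)$ | $-\log\left(\frac{g_h}{2\pi}\right) + \frac{\rho}{2 g_h \bar{\rho}}\left[\mu_1^2 + \mu_2^2 - \frac{2\rho}{1+\bar{\rho}}\,\mu_1\mu_2\right] + O(\mu^4)$ |
| ReLU | $\mathrm{SR}(\mu_1)\,\mathrm{SR}(\mu_2) + \rho\,\Phi(\mu_1)\Phi(\mu_2)$ | $-\log\left(\frac{\rho\, g_r}{2\pi}\right) + \frac{\rho}{2 g_r (1+\bar{\rho})}\left[\mu_1^2 + \mu_2^2\right] - \frac{\arcsin\rho - \rho}{\rho\, g_r}\,\mu_1\mu_2 + O(\mu^4)$ |

Table 1: Forms for the components of the approximation in equation 6 for Heaviside and ReLU non-linearities. $\Phi$ is the CDF of a standard Gaussian, SR is a “soft ReLU” that we define as $\mathrm{SR}(x) = \phi(x) + x\Phi(x)$ where $\phi$ is a standard Gaussian, $\bar{\rho} = \sqrt{1 - \rho^2}$, $g_h = \arcsin\rho$ and $g_r = g_h - \frac{\rho}{1+\bar{\rho}}$.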

where we have used the Gaussian form of $a'$ parameterized by mean $\langle a' \rangle$ and covariance $\Sigma'$, and for brevity we have omitted the normalizing constants. Closed form solutions for the integral in equation 4 exist for Heaviside or ReLU choices of non-linearity $f$ (see appendix A). Furthermore, for these non-linearities, the $\langle a'_j \rangle \to \pm\infty$ and $\langle a'_l \rangle \to \pm\infty$ asymptotes of the integral in equation 5 have closed form. Figure 2 shows schematically how these asymptotes can be used as a first approximation for equation 5. This approximation is improved by considering that (by definition) the residual decays to zero far from the origin in the $(\langle a'_j \rangle, \langle a'_l \rangle)$ plane, and so is well modelled by a decaying function $\exp[-Q(\langle a'_j \rangle, \langle a'_l \rangle, \Sigma')]$, where $Q$ is a polynomial in $\langle a' \rangle$ with a dominant positive even term. In practice we truncate $Q$ at the quadratic term, and calculate the polynomial coefficients by matching the moments of the resulting Gaussian with the analytic moments of the residual. Specifically, using dimensionless variables $\mu'_j = \langle a'_j \rangle / \sqrt{\Sigma'_{jj}}$ and $\rho'_{jl} = \Sigma'_{jl} / \sqrt{\Sigma'_{jj}\Sigma'_{ll}}$, this improved approximation takes the form

$$\langle h_j h_l \rangle = S'_{jl} \left[ A(\mu'_j, \mu'_l, \rho'_{jl}) + \exp\left( -Q(\mu'_j, \mu'_l, \rho'_{jl}) \right) \right], \tag{6}$$

Figure 2: Approximation of $\langle h_j h_l \rangle$ using an asymptote and Gaussian correction for (a) Heaviside and (b) ReLU non-linearities. Yellow functions have closed-forms, and blue indicates residuals. The examples are plotted for $-6 < \mu' < 6$ and $\rho' = 0.5$, and the relative magnitude of each correction term is indicated on the vertical axis. (Panels: asymptote and Gaussian approximation; horizontal axes $\mu'_j$ and $\mu'_l$.)

where the expressions for the dimensionless asymptote $A$ and quadratic $Q$ are given in table 1 and derived in appendix A.2.1 and A.2.2. The dimensionful scale factor $S'_{jl}$ is 1 for a Heaviside non-linearity or $\Sigma'_{jl}/\rho'_{jl}$ for ReLU. Using equation 6 in equation 3 gives a closed form approximation for the moments of $a$ as a function of moments of $a'$. Since $a$ is approximately normally distributed by the CLT, this is sufficient information to sequentially propagate moments all the way through the network to compute the mean and covariances of $\tilde{q}(a^L)$, our explicit multivariate Gaussian approximation to $q(a^L)$. Any deep learning framework supporting special functions $\arcsin$ and $\Phi$ will immediately support backpropagation through the deterministic expressions we have presented. Below we briefly empirically verify the presented approximation, and in section 3.2 we will show how it is used to compute an approximate log-likelihood and posterior predictive distribution for regression and classification tasks.
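The second-moment approximation of equation 6 can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: it works with the dimensionless variables only (unit variances, so the scale factor is 1), it requires $0 < |\rho| < 1$, and the ReLU residual is normalized so that the expression is exact at $\mu_1 = \mu_2 = 0$, where the second moment is known in closed form. The function names and the Monte Carlo reference are hypothetical helpers.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def phi(x):
    """Standard Gaussian density."""
    return np.exp(-0.5 * np.square(x)) / np.sqrt(2.0 * np.pi)

def Phi(x):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + _erf(np.asarray(x, dtype=float) / np.sqrt(2.0)))

def soft_relu(x):
    """SR(x) = phi(x) + x * Phi(x)."""
    return phi(x) + x * Phi(x)

def heaviside_second_moment(mu1, mu2, rho):
    """Approximate <h_j h_l> for Heaviside units (scale factor 1)."""
    rho_bar = np.sqrt(1.0 - rho**2)
    g_h = np.arcsin(rho)
    asymptote = Phi(mu1) * Phi(mu2)
    quad = rho / (2.0 * g_h * rho_bar) * (
        mu1**2 + mu2**2 - 2.0 * rho / (1.0 + rho_bar) * mu1 * mu2)
    # exp(-Q) written as prefactor * exp(-quadratic), which also covers rho < 0.
    return asymptote + g_h / (2.0 * np.pi) * np.exp(-quad)

def relu_second_moment(mu1, mu2, rho):
    """Approximate <h_j h_l> / sqrt(Sigma_jj Sigma_ll) for ReLU units."""
    rho_bar = np.sqrt(1.0 - rho**2)
    g_r = np.arcsin(rho) - rho / (1.0 + rho_bar)
    asymptote = soft_relu(mu1) * soft_relu(mu2) + rho * Phi(mu1) * Phi(mu2)
    quad = (rho / (2.0 * g_r * (1.0 + rho_bar)) * (mu1**2 + mu2**2)
            - (np.arcsin(rho) - rho) / (rho * g_r) * mu1 * mu2)
    return asymptote + rho * g_r / (2.0 * np.pi) * np.exp(-quad)

def heaviside(a):
    return (a > 0).astype(float)

def relu(a):
    return np.maximum(a, 0.0)

def monte_carlo_reference(f, mu1, mu2, rho, num_samples=400_000, seed=0):
    """Sample two unit-variance Gaussians with correlation rho and average f(a1) f(a2)."""
    rng = np.random.default_rng(seed)
    z1 = rng.standard_normal(num_samples)
    z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal(num_samples)
    return np.mean(f(mu1 + z1) * f(mu2 + z2))

mu1, mu2, rho = 0.5, -0.5, 0.5
print("Heaviside: approx", heaviside_second_moment(mu1, mu2, rho),
      " MC", monte_carlo_reference(heaviside, mu1, mu2, rho))
print("ReLU:      approx", relu_second_moment(mu1, mu2, rho),
      " MC", monte_carlo_reference(relu, mu1, mu2, rho))
```

For non-unit variances, the ReLU result would be multiplied by $\sqrt{\Sigma'_{jj}\Sigma'_{ll}}$, and the diagonal entries ($\rho = 1$) are better handled with the exact one-dimensional moments.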

3.1.1 EMPIRICAL VERIFICATION

Figure 3: Empirical accuracy of our approximation on toy 1-dimensional data. (a) We train a 2 layer ReLU network to perform heteroscedastic regression on the dataset shown in (b) and obtain the fit shown in blue. (c) The output distributions for the activation units $m$ and $\ell$ evaluated at $x = 0.25$ are in excellent agreement with Monte Carlo (MC) integration with a large number (20k) of samples both before and after training. (Legend: data samples, data 1-std, model 1-std; panels: before training, after training.)

Approximation accuracy. The approximation derived above relies on three assumptions. First, that some form of CLT holds for the hidden units during training where the iid assumption of the classic CLT is not strictly enforced; second, that a quadratic truncation of $Q$ is sufficient (additional Taylor expansion terms can be computed if this assumption fails); and third, that there are only weak correlations between layers so that they can be represented using independent variables in the variational distribution. To provide evidence that these assumptions hold in practice, we train a small ReLU network with two hidden layers each of 128 units to perform 1D heteroscedastic regression on a toy dataset of 500 points drawn from the distribution shown in figure 3(b). Deeper networks and skip connections are considered in appendix C. The training objective is taken from section 4, and the only detail required here is that $a^L$ is a 2-element vector where the elements are labelled as $(m, \ell)$. We use a diagonal Gaussian variational family to represent the weights, but we preserve the full covariance of $a$ during propagation. Using an input $x = 0.25$ (see arrow, Figure 3(b)) we compute the distributions for $m$ and $\ell$ both at the start of training (where we expect the iid assumption to hold) and at convergence (where iid does not necessarily hold). Figure 3(c) shows the comparison between $a^L$ distributions reported by our deterministic approximation and MC evaluation using 20k samples from $q(w; \theta)$. This comparison is qualitatively excellent for all cases considered.
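A much smaller version of this kind of check can be run for a single ReLU non-linearity, where the first moment (equation 4) and the variance are available exactly. The sketch below assumes elementwise Gaussian inputs with hypothetical means and variances, and compares the closed forms with 20k Monte Carlo samples; it does not reproduce the trained network of figure 3.

```python
import math
import numpy as np

def relu_moments(mean, var):
    """Exact mean and variance of ReLU(a') for a' ~ N(mean, var), elementwise."""
    std = np.sqrt(var)
    mu = mean / std                                        # dimensionless mean
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(mu / np.sqrt(2.0)))
    pdf = np.exp(-0.5 * mu**2) / np.sqrt(2.0 * np.pi)
    h_mean = std * (pdf + mu * cdf)                        # std * SR(mu)
    h_second = var * ((1.0 + mu**2) * cdf + mu * pdf)      # <h^2>
    return h_mean, h_second - h_mean**2

rng = np.random.default_rng(0)
mean = np.array([-1.0, 0.0, 0.7])    # hypothetical pre-activation means
var = np.array([0.5, 1.0, 2.0])      # hypothetical pre-activation variances
samples = np.maximum(mean + np.sqrt(var) * rng.standard_normal((20_000, 3)), 0.0)

det_mean, det_var = relu_moments(mean, var)
print("deterministic mean:", det_mean, " MC mean:", samples.mean(axis=0))
print("deterministic var: ", det_var, " MC var: ", samples.var(axis=0))
```

The deterministic values carry no sampling noise, whereas the MC columns change from seed to seed at roughly the second decimal place.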

Figure 4: Runtime performance of VI methods. We show the time to propagate a batch of 10 activation vectors through a single $d \times d$ layer. For MCVI we label curves with the number of samples used, and we show quadratic and cubic scaling guides-to-the-eye (black). Black dots indicate where our implementation runs out of memory (16GB). (Axes: hidden dimension $d$ against per-batch runtime in seconds; curves: MCVI, dDVI, DVI.)

Computational efficiency. In traditional MCVI, propagation of $S$ samples of $d$-dimensional activations through a layer containing a $d \times d$-dimensional transformation requires $O(Sd^2)$ compute and $O(Sd)$ memory. Our DVI method approximates the $S \to \infty$ limit, while only demanding $O(d^3)$ compute and $O(d^2)$ memory (the additional factor of $d$ arises from manipulation of the quadratically large covariance matrix $\mathrm{Cov}[h_j, h_l]$). Whereas MCVI can always trade compute and memory for accuracy by choosing a small value for $S$, the inherent scaling of DVI with $d$ could potentially limit its practical use for networks with large hidden size. To avoid this limitation, we also consider the case where only the diagonal entries $\mathrm{Cov}(h_j, h_j)$ are computed and stored at each layer. We refer to this method as “diagonal-DVI” (dDVI), and in section 6 we show the surprising result that the strong test performance of DVI is largely retained by dDVI across a range of datasets. Figure 4 shows the time required to propagate activations through a single layer using the MCVI, DVI and dDVI methods on a Tesla V100 GPU. As a rough rule of thumb (on this hardware), for layer sizes of practical relevance, we see that absolute DVI runtimes roughly equate to MCVI with $S = 300$ and dDVI runtime equates to $S = 1$.
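The difference between the full and diagonal variants shows up in the linear part of equation 3. The following sketch assumes a factorized Gaussian $q(W)$ and $q(b)$, so that $\mathrm{Cov}(W_{ji}, W_{lk})$ is non-zero only for $j = l$, $i = k$; the function names and test values are hypothetical, and $h$ is sampled from a Gaussian in the check purely for convenience, since equation 3 depends only on the first two moments of $h$.

```python
import numpy as np

def linear_moments_full(h_mean, h_cov, w_mean, w_var, b_mean, b_var):
    """Equation 3 for factorized q(W), q(b), keeping the full covariance of h."""
    a_mean = h_mean @ w_mean + b_mean
    h_second = np.diag(h_cov) + h_mean**2            # <h_j^2>
    a_cov = w_mean.T @ h_cov @ w_mean                # O(d^3) matrix products
    a_cov += np.diag(h_second @ w_var + b_var)       # weight and bias noise (diagonal)
    return a_mean, a_cov

def linear_moments_diag(h_mean, h_var, w_mean, w_var, b_mean, b_var):
    """Diagonal variant: uses only Var(h_j), returns only Var(a_i). O(d^2)."""
    a_mean = h_mean @ w_mean + b_mean
    a_var = (h_var + h_mean**2) @ w_var + h_var @ w_mean**2 + b_var
    return a_mean, a_var

rng = np.random.default_rng(0)
d_in, d_out, n = 4, 3, 200_000
h_mean = rng.uniform(0.0, 1.0, d_in)
mix = rng.standard_normal((d_in, d_in))
h_cov = mix @ mix.T / 16.0 + 0.05 * np.eye(d_in)     # correlated hidden units
w_mean = 0.5 * rng.standard_normal((d_in, d_out))
w_var = np.full((d_in, d_out), 0.05)
b_mean, b_var = np.zeros(d_out), np.full(d_out, 0.01)

a_mean, a_cov = linear_moments_full(h_mean, h_cov, w_mean, w_var, b_mean, b_var)
_, a_var_diag = linear_moments_diag(h_mean, np.diag(h_cov), w_mean, w_var, b_mean, b_var)

# Monte Carlo reference: sample h, W and b independently and form a = hW + b.
h = rng.multivariate_normal(h_mean, h_cov, size=n)
w = w_mean + np.sqrt(w_var) * rng.standard_normal((n, d_in, d_out))
b = b_mean + np.sqrt(b_var) * rng.standard_normal((n, d_out))
a = np.einsum("nj,nji->ni", h, w) + b
mc_cov = np.cov(a, rowvar=False)

print("max |full - MC| covariance error:", np.abs(a_cov - mc_cov).max())
print("variance missed by the diagonal variant:", np.diag(a_cov) - a_var_diag)
```

The diagonal variant drops the cross terms $\langle W_{ji} \rangle \mathrm{Cov}(h_j, h_l) \langle W_{li} \rangle$ for $j \neq l$, which is what removes the extra factor of $d$ in compute and memory.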

3.2 LOG-LIKELIHOOD EVALUATION

To use the moment propagation procedure derived above for training BNNs, we need to build a function $\mathcal{L}$ that maps final layer activations $a^L$ to the expected log-likelihood term in equation 1 (see figure 1(b)). In appendix B.1 we show the intuitive result that this expected log-likelihood over $q(w)$
