Scalable Deep Gaussian Markov Random Fields for General Graphs
Deep GMRFs (DGMRFs; Sidén & Lindsten, 2020) combine
GMRFs with Convolutional Neural Networks (CNNs) for
the special case of image-structured data. DGMRFs can be
trained efficiently while retaining the useful properties of GMRFs,
such as exact Bayesian inference for the latent field. In this
paper we extend and generalize the DGMRF framework
to the general graph setting. This requires us to design a
new layer construct for DGMRFs based on local operations
over node neighborhoods. Without making any assumptions
on the graph structure, we propose methods that allow the
model training to scale to massive graphs.
Our main contributions are: 1) We extend the DGMRF
framework to general graphs by designing a new type of
layer construct based on GNNs. 2) We adapt the DGMRF
training to this new setting, making use of an improved
variational distribution. 3) We propose scalable methods
for performing the log-determinant computations central to
the training. 4) We demonstrate properties of the resulting
model on synthetic data. 5) We experiment on multiple real-
world datasets, for which our model outperforms existing
methods.
2. Background
2.1. GMRFs
In graphical models, a set of random variables is associated
with the nodes of a graph (Koller & Friedman, 2009;
Bishop, 2006). GMRFs are undirected graphical models
where the nodes jointly follow a Gaussian distribution. More
specifically, let $G$ be an undirected graph with $N$ nodes,
concatenated in the random vector $x \in \mathbb{R}^N$. We say that
$x \sim \mathcal{N}(\mu, Q^{-1})$ is a GMRF with mean $\mu$ and precision matrix
$Q$ w.r.t. the graph $G$ iff $Q_{i,j} \neq 0 \Leftrightarrow j \in n(i), \forall i \neq j$,
where $n(i)$ is the exclusive neighborhood of node $i$ ($i \notin n(i)$).
A GMRF is thus a multivariate Gaussian with a precision
matrix as sparse as the graph. Note, however, that the
covariance matrix can still be fully dense, enabling dependencies
between all nodes in the graph.
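To make this relationship concrete, the following minimal sketch (our illustration, not from the paper) builds a tridiagonal precision matrix for a chain graph and shows that its inverse, the covariance matrix, is fully dense:

```python
import numpy as np

# Chain graph on N = 5 nodes: Q is tridiagonal (non-zeros only between
# graph neighbors) and chosen here to be positive definite. All values
# are illustrative assumptions.
N = 5
Q = (np.diag(np.full(N, 2.0))
     + np.diag(np.full(N - 1, -0.9), k=1)
     + np.diag(np.full(N - 1, -0.9), k=-1))

Sigma = np.linalg.inv(Q)          # covariance matrix
print(np.count_nonzero(Q))        # 13: sparse, mirrors the chain graph
print(np.all(np.abs(Sigma) > 0))  # True: covariance is fully dense
```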
Consider now the common situation where we observe
$y = x + \varepsilon$, $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$. A GMRF prior on $x$ is conjugate
to this Gaussian likelihood. We will mainly consider
the application of GMRFs to problems where $y$ is observed
only for some nodes. Let $m \in \{0, 1\}^N$ be a mask vector
with ones in positions corresponding to the observed nodes,
$y_m = y \odot m$ and $I_m = \operatorname{diag}(m)$. The posterior for $x$ in
this setting is then given by $x \mid y_m \sim \mathcal{N}(\tilde{\mu}, \tilde{Q}^{-1})$, where
$$\tilde{Q} = Q + \frac{1}{\sigma^2} I_m \tag{1a}$$
$$\tilde{\mu} = \tilde{Q}^{-1}\left(Q\mu + \frac{1}{\sigma^2} y_m\right). \tag{1b}$$
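As a concrete illustration of Eq. 1, the following sketch (ours; the graph, mask, and parameter values are illustrative assumptions) computes the posterior precision and mean for a small GMRF:

```python
import numpy as np

# Toy chain-graph GMRF prior; all values are illustrative assumptions.
N = 5
Q = (np.diag(np.full(N, 2.0))
     + np.diag(np.full(N - 1, -0.9), k=1)
     + np.diag(np.full(N - 1, -0.9), k=-1))
mu = np.zeros(N)
sigma = 0.5

m = np.array([1., 0., 1., 1., 0.])        # nodes 0, 2, 3 are observed
y = np.array([1.2, 0., -0.4, 0.8, 0.])    # entries at unobserved nodes unused
y_m = y * m                               # masked observations
I_m = np.diag(m)

Q_post = Q + I_m / sigma**2                                  # Eq. (1a)
mu_post = np.linalg.solve(Q_post, Q @ mu + y_m / sigma**2)   # Eq. (1b)
```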
While the posterior is analytically tractable, and again a
GMRF, explicitly computing the involved quantities can be a
significant computational challenge for large $N$.
2.2. DGMRFs
Sidén & Lindsten (2020) note that for an affine map
$g: \mathbb{R}^N \to \mathbb{R}^N$, a GMRF $x$ can be defined by
$$z = g(x) = Gx + b, \quad z \sim \mathcal{N}(0, I) \tag{2}$$
where $G \in \mathbb{R}^{N \times N}$ is a matrix and $b$ some offset vector. This
results in a GMRF with mean $\mu = -G^{-1}b$ and precision
matrix $Q = G^\top G$. Note how the direction of the mapping
in Eq. 2 makes $x$ implicitly defined, a different setup from
other generative models mapping Gaussian noise to data.
The affine map $g$ can in turn be defined as a composition of
$L$ simpler layers, $g = g^{(L)} \circ g^{(L-1)} \circ \cdots \circ g^{(1)}$, adding
the depth to the Deep GMRF. The value of considering the
layers separately is that they can be implemented implicitly,
using some operation that is known to be affine. Thus,
multiple such operations can be chained without performing
the expensive matrix multiplications needed to create $G$.
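A minimal sketch of this chaining (our illustration; the sparse layer matrices and offsets are placeholders), where each layer is applied as a cheap sparse operation and the full matrix $G$ is never materialized:

```python
import numpy as np
import scipy.sparse as sp

# Sketch (ours): apply g = g^(L) ∘ ... ∘ g^(1) implicitly. Each layer l is
# represented by a sparse matrix G_l and offset b_l; chaining matrix-vector
# products avoids ever forming the dense product G = G_L ... G_1.
def apply_layers(x, layers):
    z = x
    for G_l, b_l in layers:   # layers ordered g^(1), ..., g^(L)
        z = G_l @ z + b_l     # one implicit affine layer: z -> G_l z + b_l
    return z                  # z = g(x), modeled as N(0, I)

# Usage with two random sparse layers (illustrative only):
N = 100
layers = [(sp.random(N, N, density=0.05, format="csr") + sp.identity(N),
           np.zeros(N)) for _ in range(2)]
z = apply_layers(np.random.randn(N), layers)
```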
Sidén & Lindsten (2020) consider the special case where
the entries of $x$ are associated with pixels in an image. They
then define each $g^{(l)}$ as a 2-dimensional convolution with
a filter containing trainable parameters. Such a DGMRF
is a GMRF w.r.t. a lattice graph (Rue & Held, 2005), a
graph where each pixel is connected to neighboring pixels
within a window determined by the filter size. The resulting
model shares much of its structure with CNNs. This allows
for utilizing existing deep learning frameworks for efficient
convolution computations, automatic differentiation, and
GPU support.
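To illustrate, a single such layer can be sketched as a plain 2-D convolution (our simplified illustration; the actual parametrization in Sidén & Lindsten (2020) is more specific):

```python
import numpy as np
from scipy.signal import convolve2d

# Sketch (ours): one DGMRF layer g^(l) on an H x W image as a 3x3
# convolution plus bias. Flattening the image, this is an affine map
# z = Gx + b where G has the sparsity pattern of the lattice graph.
H, W = 8, 8
w = np.random.randn(3, 3)   # filter with trainable parameters (illustrative)
b = 0.1                     # bias (illustrative)

x = np.random.randn(H, W)
z = convolve2d(x, w, mode="same") + b   # layer applied without forming G
```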
After observing some data $y_m$, inference for the latent field
$x$ follows from Eq. 1. To avoid inverting $\tilde{Q}$, the posterior
mean $\tilde{\mu}$ can be computed using the Conjugate Gradient
(CG) method (Hestenes & Stiefel, 1952; Shewchuk, 1994).
Often, also the posterior marginal variances $\operatorname{Var}[x_i \mid y_m]$
are of interest. To avoid computing the covariance matrix
explicitly, an alternative is to use a Monte Carlo estimate
based on a set of samples from the posterior. It is possible
to use the CG method also for efficiently drawing posterior
samples (Papandreou & Yuille, 2010).
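As a sketch of the CG approach (ours, under the assumption that $\tilde{Q}$ is only accessed through matrix-vector products), the posterior mean of Eq. 1b can be computed with SciPy's CG solver:

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

# Sketch (ours): solve Q_tilde mu_tilde = Q mu + (1/sigma^2) y_m with CG,
# never inverting Q_tilde. Only matrix-vector products with Q are needed.
def posterior_mean(Q, mu, y_m, m, sigma):
    N = Q.shape[0]
    matvec = lambda v: Q @ v + (m / sigma**2) * v   # v -> Q_tilde v, Eq. (1a)
    Q_tilde = LinearOperator((N, N), matvec=matvec)
    rhs = Q @ mu + y_m / sigma**2                   # right-hand side of Eq. (1b)
    mu_tilde, info = cg(Q_tilde, rhs)
    assert info == 0, "CG did not converge"
    return mu_tilde
```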
3. DGMRFs on Graphs
We extend the DGMRF framework of Sidén & Lindsten
(2020) to general graphs, removing the assumption of a
lattice graph to match the general definition of a GMRF.
To achieve this, we design a new type of layer $g^{(l)}$ without
any assumptions on the graph structure. We then propose
a way to train this new type of DGMRF using scalable
log-determinant computations and a new, more flexible,
variational distribution. An overview of our model is shown
in Figure 2.