D-VAE: A Variational Autoencoder for Directed
Acyclic Graphs
Muhan Zhang, Shali Jiang, Zhicheng Cui, Roman Garnett, Yixin Chen
Department of Computer Science and Engineering
Washington University in St. Louis
{muhan, jiang.s, z.cui, garnett}@wustl.edu, chen@cse.wustl.edu
Abstract
Graph structured data are abundant in the real world. Among different graph types,
directed acyclic graphs (DAGs) are of particular interest to machine learning re-
searchers, as many machine learning models are realized as computations on DAGs,
including neural networks and Bayesian networks. In this paper, we study deep
generative models for DAGs, and propose a novel DAG variational autoencoder
(D-VAE). To encode DAGs into the latent space, we leverage graph neural networks.
We propose an asynchronous message passing scheme that allows encoding the
computations defined by DAGs, rather than using existing simultaneous message
passing schemes to encode local graph structures. We demonstrate the effectiveness
of our proposed D-VAE through two tasks: neural architecture search and Bayesian
network structure learning. Experiments show that our model not only generates
novel and valid DAGs, but also produces a smooth latent space that facilitates
searching for DAGs with better performance through Bayesian optimization.
1 Introduction
Many real-world problems can be posed as optimizing a directed acyclic graph (DAG) representing
some computational task. In machine learning, deep neural networks are DAGs. Although they have
achieved remarkable performance on a wide range of learning tasks, tremendous efforts need to be
devoted to designing their architectures, which is essentially a DAG optimization task. Similarly,
optimizing the connection structures of Bayesian networks is also a critical problem in learning
graphical models [1]. DAG optimization is pervasive in other fields as well. For example, in
electronic circuit design, engineers need to optimize directed network structures not only to realize
target functions, but also to meet specifications such as power usage and operating temperature.
DAGs, as well as other discrete structures, cannot be optimized directly with traditional gradient-
based techniques, as gradients are not available. Bayesian optimization, a state-of-the-art black-box
optimization technique, requires a kernel to measure the similarity between discrete structures as
well as a method to explore the design space and extrapolate to new points. Principled solutions to
these problems are still lacking. Recently, there has been increased interest in training generative
models for discrete data types such as molecules [2, 3], arithmetic expressions [4], source code [5], general graphs [6], etc. In particular, Kusner et al. [3] developed a grammar variational autoencoder (GVAE) for molecules, which is able to encode and decode molecules into and from a continuous latent space, allowing one to optimize molecule properties by searching in this well-behaved space
instead of a discrete space. Inspired by this work, we propose to train variational autoencoders for
DAGs, and optimize DAG structures in the latent space based on Bayesian optimization.
Existing graph generative models can be classified into three categories: token-based, adjacency-
matrix-based, and graph-based approaches. Token-based graph generative models [2, 3, 7] represent a graph as a sequence of tokens (e.g., characters, grammar production rules) and model these sequences
based on established RNN modules such as the gated recurrent unit (GRU) [8]. Adjacency-matrix-based models [9, 10, 11, 12, 13] generate columns/entries of the adjacency matrix of a graph sequentially or generate the entire adjacency matrix in one shot. Graph-based models [6, 14, 15, 16] iteratively
add new nodes or new edges to a graph based on the existing graph state and node hidden states.
Among the three types, token-based models require task-specific graph grammars such as SMILES for molecules [17], and are thus less general. Adjacency-matrix-based models leverage proxy matrix representations of graphs, and generate graphs through multi-layer perceptrons (MLPs) or RNNs, which could be less expressive and less suitable for graphs. On the other hand, graph-based models seem to be more general and natural, since they operate directly on graph structures instead of proxy representations. In addition, the iterative process is driven by the current graph and nodes' states as represented by graph neural networks (GNNs), which have already shown their powerful graph
feature learning ability on various tasks [18, 19, 20, 21, 22, 23, 24].
A GNN extracts local features around nodes by passing neighboring nodes’ messages to the center.
Traditionally, such message passing happens at all nodes simultaneously to extract local substructure features for all nodes [25]. Then, these feature vectors are summed to be the graph embedding.
Although such symmetric message passing works for undirected graphs, it fails to capture the
computation dependencies defined by DAGs – nodes within a DAG naturally have some ordering
based on its dependency structures, which is ignored by existing GNNs but crucial for performing
the computation on the DAG. If we encode a DAG using existing GNNs, the embedding may only
capture its local structures but fail to encode the computation.
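For concreteness, below is an illustrative sketch of this conventional simultaneous scheme (not taken from any cited implementation): every node aggregates its neighbors' states in the same round, and node states are then summed into a graph embedding. The weight matrices `W_self` and `W_neigh` are hypothetical parameters.

```python
import numpy as np

def simultaneous_mp_round(adj, H, W_self, W_neigh):
    """One round of conventional (simultaneous) message passing:
    all nodes aggregate their neighbors' current states at the same time."""
    # adj: (n, n) symmetric 0/1 adjacency matrix of an undirected graph
    # H:   (n, d) current node states; W_self, W_neigh: illustrative weights
    messages = adj @ H                       # sum of neighbors' states, for all nodes at once
    return np.tanh(H @ W_self + messages @ W_neigh)

def graph_embedding(adj, X, W_self, W_neigh, rounds=2):
    """Run a few simultaneous rounds, then sum node states into one graph vector."""
    H = X
    for _ in range(rounds):
        H = simultaneous_mp_round(adj, H, W_self, W_neigh)
    return H.sum(axis=0)                     # sum readout -> graph embedding
```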
To encode the computation on a DAG, we propose an asynchronous message passing scheme, which no longer passes messages for all nodes simultaneously, but instead respects the computation dependencies among the nodes. For example, suppose node A has two parent nodes, B and C, in a DAG. Our scheme does not perform feature learning for A until the feature learning on both B and C is finished.
Then, the aggregated message from B and C is passed to A to trigger A’s feature learning.
We incorporate this feature learning scheme in both our encoder and decoder, and propose the DAG
variational autoencoder (D-VAE). D-VAE has an excellent theoretical property for modeling DAGs –
we prove that D-VAE can injectively encode computations on DAGs. This means we can build a mapping from the discrete space to the continuous latent space so that every DAG computation has its unique embedding in the latent space, which justifies performing optimization in the latent space
instead of the original design space.
We validate the proposed D-VAE on neural networks and Bayesian networks. We show that our model
not only generates novel, realistic, and valid DAGs, but also produces smooth latent spaces effective
for searching better neural architectures and Bayesian networks through Bayesian optimization.
Our contributions in this paper are: 1) We propose D-VAE, a variational autoencoder for DAGs
with a novel asynchronous message passing scheme. 2) We apply, for the first time, graph generative models to optimizing the architectures of learning algorithms. 3) We theoretically prove that D-VAE encodes
computations injectively, thus justifying optimizing discrete architectures in a continuous latent space.
2 Related Work
2.1 Variational Autoencoder
Variational autoencoder (VAE) [26, 27] provides a framework to learn both a probabilistic generative model $p_\theta(x|z)$ (the decoder) and an approximate posterior distribution $q_\phi(z|x)$ (the encoder). VAE is trained by maximizing the evidence lower bound (ELBO)

$$\mathcal{L}(\phi, \theta; x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log p_\theta(x|z)\right] - \mathrm{KL}\left[q_\phi(z|x) \,\|\, p(z)\right]. \tag{1}$$

The prior distribution $p(z)$ is typically taken to be $\mathcal{N}(0, I)$. The posterior approximation $q_\phi(z|x)$ is usually a multivariate Gaussian distribution whose mean and covariance matrix are parameterized by the encoder network; the covariance matrix is often constrained to be diagonal. The generative model $p_\theta(x|z)$ can in principle take arbitrary parametric forms whose parameters are output by the decoder network. After learning $p_\theta(x|z)$, we can generate new data by decoding latent vectors $z$ sampled from the prior distribution $p(z)$. For generating discrete data types, $p_\theta(x|z)$ is often decomposed into a series of decision steps.
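For reference, the ELBO in (1) can be computed as in the following sketch for a diagonal-Gaussian posterior and a (here assumed) Bernoulli decoder, with a single reparameterized sample approximating the expectation. This is a generic illustration, not the exact loss implementation of this paper.

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    """Sample z ~ q_phi(z|x) with the reparameterization trick."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)

def elbo(recon_logits, x, mu, logvar):
    """ELBO of Eq. (1), assuming a Bernoulli decoder and a diagonal-Gaussian
    posterior q_phi(z|x) against the prior N(0, I)."""
    # E_{z~q}[log p_theta(x|z)], approximated with one reparameterized sample
    log_px_z = -F.binary_cross_entropy_with_logits(recon_logits, x, reduction="sum")
    # KL[q_phi(z|x) || N(0, I)] in closed form for diagonal Gaussians
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return log_px_z - kl  # maximize this (or minimize its negative)
```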
Figure 1: Examples of computations on DAGs.
2.2 Neural Architecture Search
Neural architecture search (NAS) aims to automate the design of neural network architectures by automatically searching over candidate architectures in a search space. It has seen major advances in recent years [28, 29, 30, 31, 32, 33]. See Hutter et al. [34] for an overview. Methods for NAS can be
categorized according to their search space, search strategy, performance estimation strategy, etc., which all greatly affect the performance of the final architecture. Based on search strategy, there are 1) reinforcement learning based methods [28, 31, 33], which train controllers to generate architectures with high rewards in terms of validation accuracy; 2) Bayesian optimization based methods [35], which define kernels to measure architecture similarities and extrapolate the architecture space heuristically; 3) evolutionary approaches [29, 36, 37], which use evolutionary algorithms to optimize the neural architectures; and 4) differentiable methods [32, 38], which use continuous relaxation/embedding of the search space to enable gradient-based optimization. In Appendix A we include more discussion of the most closely related works. In this paper, we introduce a new NAS method different from existing approaches. Based on the proposed D-VAE, we learn a compact encoding space and apply principled Bayesian optimization methods over this continuous latent space to select latent points that are most promising to decode into better neural architectures.
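To make this search loop concrete, a rough sketch is shown below. It is a simplified illustration rather than the exact procedure used in our experiments: a Gaussian-process surrogate is fit on already-evaluated latent points, an expected-improvement acquisition selects the next point, and the hypothetical `decode` and `evaluate` functions stand in for the trained decoder and the architecture performance measurement.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expected_improvement(mu, sigma, best):
    """EI acquisition for maximization, with a small floor on sigma."""
    sigma = np.maximum(sigma, 1e-9)
    gamma = (mu - best) / sigma
    return (mu - best) * norm.cdf(gamma) + sigma * norm.pdf(gamma)

def latent_space_bo(Z, y, decode, evaluate, n_iter=10, n_cand=1000):
    """Bayesian optimization over a continuous latent space.
    Z: (n, d) latent codes of already-evaluated designs; y: their scores.
    decode/evaluate are hypothetical stand-ins for the trained decoder
    and the performance measurement of a decoded DAG."""
    for _ in range(n_iter):
        gp = GaussianProcessRegressor(kernel=RBF(), normalize_y=True).fit(Z, y)
        cand = np.random.randn(n_cand, Z.shape[1])   # candidates sampled from the prior N(0, I)
        mu, sigma = gp.predict(cand, return_std=True)
        z_next = cand[np.argmax(expected_improvement(mu, sigma, y.max()))]
        score = evaluate(decode(z_next))              # decode to a DAG and measure its performance
        Z, y = np.vstack([Z, z_next]), np.append(y, score)
    return Z, y
```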
2.3 Bayesian Network Structure Learning
Bayesian network (BN) is an important type of graphical model with wide applications [1]. Bayesian network structure learning (BNSL) is the problem of learning the structure of the underlying Bayesian network from observed data. BNSL dates at least back to Chow and Liu [39], where the Chow-Liu tree algorithm was developed for learning tree-structured models. Much effort has been devoted to this area in recent decades (see Chapter 18 of Koller and Friedman [1] for a detailed discussion), and there is still a great deal of research on this topic [40, 41, 42]. One of the main approaches to BNSL is score-based search. That is, we define some "goodness-of-fit" score for a given network structure, and search for one with the optimal score. Commonly used scores include BIC, BDeu, etc., mostly based on marginal likelihood [1]. Which score to use is by itself an ongoing research topic [43]. It is well known that finding an optimally scored BN with indegree at most $d$ is NP-hard for $d \geq 2$ [44], so exact algorithms such as dynamic programming [45] or shortest-path approaches [46, 47] can only solve small-scale problems. We have to resort to heuristic methods such as local search, simulated annealing, etc. [48]. All these approaches optimize the network structure in the discrete space. In this paper, we propose to embed the discrete space into a continuous Euclidean space, and show that model scores such as BIC can be modeled smoothly in the embedded space, hence approximating the NP-hard combinatorial search problem by a continuous optimization problem.
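For illustration, the BIC score of a candidate structure can be computed from count statistics as in the following sketch (a generic textbook-style computation, not tied to any particular BNSL package). Here `data` is assumed to be an integer-coded sample matrix, `parents` lists each node's parent set, and `cards` gives the variable cardinalities.

```python
import numpy as np
from itertools import product

def bic_score(data, parents, cards):
    """BIC of a discrete Bayesian network structure:
    log-likelihood under maximum-likelihood CPTs minus (log N / 2) * #parameters.
    data: (N, n) integer matrix with data[:, i] in {0, ..., cards[i]-1}
    parents[i]: list of parent indices of node i; cards[i]: cardinality of node i."""
    N, n = data.shape
    score = 0.0
    for i in range(n):
        pa, r_i = parents[i], cards[i]
        q_i = int(np.prod([cards[p] for p in pa])) if pa else 1
        # log-likelihood term: sum over parent configs j and states k of N_ijk * log(N_ijk / N_ij)
        for pa_config in product(*[range(cards[p]) for p in pa]):
            mask = np.ones(N, dtype=bool)
            for p, v in zip(pa, pa_config):
                mask &= data[:, p] == v
            n_ij = mask.sum()
            if n_ij == 0:
                continue
            counts = np.bincount(data[mask, i], minlength=r_i)
            nz = counts[counts > 0]
            score += np.sum(nz * np.log(nz / n_ij))
        # complexity penalty: (r_i - 1) * q_i free parameters for node i
        score -= 0.5 * np.log(N) * (r_i - 1) * q_i
    return score
```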
3 DAG Variational Autoencoder (D-VAE)
In this section, we describe our proposed DAG variational autoencoder (D-VAE). D-VAE uses an
asynchronous message passing scheme to encode and decode DAGs respecting the computational
dependencies. In contrast to the simultaneous message passing in traditional GNNs, D-VAE allows
encoding computations rather than encoding local structures. We first define computation below.
Definition 1. Given a set of elementary operations $O$, a computation $C$ is the composition of a finite number of operations $o \in O$ applied to an input signal $x$, with the output of each operation being the input to its following operations.
The set of elementary operations $O$ depends on specific applications. For example, when we are interested in computations given by a calculator, the elementary operation set will be all the operations defined on the functional buttons of the calculator, such as $+$, $-$, $\times$, $\div$, etc. When modeling neural
networks, the set of elementary operations can be a predefined set of basic layers, such as $3 \times 3$ convolution, $5 \times 5$ convolution, $7 \times 7$ convolution, $2 \times 2$ max pooling, etc. Note that a computation is not restricted to be a linear chain of operations in which each operation takes only the previous operation's output as input. An operation is allowed to take arbitrary previous operations' outputs as input, with these operations forming a directed acyclic graph (DAG). The graph must be acyclic, since otherwise it would represent an infinite number of operations and the computation would never stop. Figure 1 shows two examples.

Figure 2: An illustration of the encoding procedure for neural architectures. A node does not update its hidden state until all of its predecessors' hidden states have been computed. Nodes in blue boxes have already updated their hidden states, and nodes in red boxes are the ones being updated.
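As a small illustration of Definition 1 (an illustrative sketch, not code from our implementation), a computation DAG can be represented as a list of nodes, each with an operation type and the indices of its predecessors, and evaluated in a dependency-respecting order. The tiny example below, in the spirit of Figure 1, computes sin(x) + x.

```python
import math

# Each node is (operation type, indices of the predecessor nodes it consumes).
OPS = {
    "input": lambda inputs, x: x,              # the input signal
    "sin":   lambda inputs, x: math.sin(inputs[0]),
    "add":   lambda inputs, x: sum(inputs),
}

def evaluate(nodes, x):
    """Evaluate a computation DAG whose nodes are given in a topological order."""
    values = []
    for op, preds in nodes:
        values.append(OPS[op]([values[p] for p in preds], x))
    return values[-1]                          # output of the last (ending) node

# sin(x) + x written as a DAG rather than a chain of operations
dag = [("input", []), ("sin", [0]), ("add", [0, 1])]
print(evaluate(dag, 1.0))                      # 1.8414709848078965
```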
Having defined computation, we are ready to introduce how D-VAE encodes and decodes DAGs.
3.1 Encoding
We first describe how the D-VAE encoder works. The D-VAE encoder can be seen as a graph neural
network (GNN) using an asynchronous message passing scheme. Given a DAG, we assume there is
a single starting node which does not have any predecessors (if there are multiple such nodes, we
add a virtual starting node connecting to all of them). In a neural architecture, this starting node
corresponds to the input layer. We use a function $U$ to update the hidden state of each node based on its neighbors' hidden states. The update function $U$ takes two input variables: 1) $x_v$, the one-hot encoding of the current vertex $v$'s type, and 2) $h_v^{\text{in}}$, the aggregated message from $v$'s predecessors, given by:

$$h_v^{\text{in}} = A\big(\{h_u : u \to v\}\big), \tag{2}$$

where $u \to v$ denotes that there is a directed edge from $u$ to $v$, and $A$ is an aggregation function on the (ordered) multiset of $v$'s predecessors' hidden states. For the starting node without predecessors, $A$ outputs an all-zero initialization vector. Having $x_v$ and $h_v^{\text{in}}$, the hidden state of $v$ is updated by

$$h_v = U(x_v, h_v^{\text{in}}). \tag{3}$$
Compared to the traditional simultaneous message passing, the message passing for a node in D-VAE
must wait until all of its predecessors’ hidden states have already been computed. To make sure all
the predecessors’ hidden states are available when a new node comes, we can feed in nodes following
a topological ordering of the DAG.
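For illustration, a minimal PyTorch sketch of this asynchronous encoder is given below, under simplifying assumptions: a plain sum stands in for the aggregation function $A$ and a GRU cell for the update function $U$ (this is an illustrative sketch, not our released implementation). Nodes are fed in a topological order so that every predecessor's state is ready before a node is updated.

```python
import torch
import torch.nn as nn

class AsyncDAGEncoder(nn.Module):
    """Asynchronous message passing over a DAG (Eqs. (2)-(3)), sketched with
    a sum aggregator A and a GRU cell as the update function U."""
    def __init__(self, n_types, hidden_size):
        super().__init__()
        self.n_types = n_types
        self.hidden_size = hidden_size
        self.update = nn.GRUCell(n_types, hidden_size)        # U(x_v, h_v^in)

    def forward(self, node_types, predecessors):
        """node_types: list of type indices given in a topological order;
        predecessors[v]: indices of v's predecessors (all earlier in the order)."""
        h = []
        for v, t in enumerate(node_types):
            x_v = torch.zeros(1, self.n_types)
            x_v[0, t] = 1.0                                   # one-hot encoding of v's type
            if predecessors[v]:                               # Eq. (2): aggregate predecessor states
                h_in = torch.stack([h[u] for u in predecessors[v]]).sum(dim=0)
            else:                                             # starting node: all-zero initialization
                h_in = torch.zeros(1, self.hidden_size)
            h.append(self.update(x_v, h_in))                  # Eq. (3): h_v = U(x_v, h_v^in)
        return h[-1]                                          # hidden state of the ending node
```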
Finally, after all nodes’ hidden states are computed, we use
h
v
n
, the hidden state of
v
n
(the ending
node without any successors) as the output of the encoder. In Figure 2, we use a real neural architecture
to illustrate the encoding process. Then we feed
h
v
n
to two multi-layer perceptrons (MLPs) to get
the mean and variance parameters of
q
φ
in (1). If there are multiple nodes without successors, we
again add a virtual ending node connecting from all these “loose ends”.
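Continuing the sketch above, the ending node's state can be mapped to the posterior parameters and a latent sample as follows; the layer sizes are hypothetical, and single linear layers stand in for the two MLPs.

```python
import torch
import torch.nn as nn

hidden_size, latent_size = 128, 32                 # hypothetical sizes
fc_mu = nn.Linear(hidden_size, latent_size)        # head for the posterior mean
fc_logvar = nn.Linear(hidden_size, latent_size)    # head for the posterior log-variance

h_vn = torch.randn(1, hidden_size)                 # stand-in for the ending node's hidden state
mu, logvar = fc_mu(h_vn), fc_logvar(h_vn)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample z ~ q_phi(z|x)
```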
Note that although topological orderings are usually not unique for a DAG, we can take any one of them as the input node order while ensuring the encoding result is always the same, as formalized by the following theorem. We include all theorem proofs in the appendix.
Theorem 1. If the aggregation function $A$ is invariant to the order of its inputs, then the D-VAE encoder is permutation-invariant.
Theorem 1 means that we can get the same encoding result for isomorphic DAGs, no matter what node ordering/indexing is used. This property is highly desirable, since real-world DAGs often do not have a canonical indexing; we do not want a DAG to be encoded into a different latent vector after permuting its nodes. The next theorem shows another property of D-VAE that is crucial for its success in modeling computation graphs, i.e., that it is able to injectively encode computations on DAGs.