Pre-training Graph Neural Networks
Weihua Hu¹*, Bowen Liu²*, Joseph Gomes³, Marinka Zitnik¹,
Percy Liang¹, Vijay S. Pande³, Jure Leskovec¹
¹Department of Computer Science, ²Department of Chemistry, ³Department of Bioengineering
Stanford University
{weihuahu,liubowen,joegomes,pande}@stanford.edu, {marinka,pliang,jure}@cs.stanford.edu
Abstract
Many applications of machine learning in science and medicine, including molecu-
lar property and protein function prediction, can be cast as problems of predicting
some properties of graphs, where having good graph representations is critical.
However, two key challenges in these domains are (1) extreme scarcity of labeled
data due to expensive lab experiments, and (2) needing to extrapolate to test graphs
that are structurally different from those seen during training. In this paper, we
explore pre-training to address both of these challenges. In particular, working with
Graph Neural Networks (GNNs) for representation learning of graphs, we wish to
obtain node representations that (1) capture similarity of nodes’ network neighbor-
hood structure, (2) can be composed to give accurate graph-level representations,
and (3) capture domain-knowledge. To achieve these goals, we propose a series of
methods to pre-train GNNs at both the node-level and the graph-level, using both
unlabeled data and labeled data from related auxiliary supervised tasks. We perform
extensive evaluation on two applications, molecular property and protein function
prediction. We observe that performing only graph-level supervised pre-training
often leads to marginal performance gains or can even worsen the performance com-
pared to non-pre-trained models. On the other hand, effectively combining both
node- and graph-level pre-training techniques significantly improves generalization
to out-of-distribution graphs, consistently outperforming non-pre-trained GNNs
across 8 datasets in molecular property prediction (resp. 40 tasks in protein function
prediction), with an average ROC-AUC improvement of 7.2% (resp. 11.7%).
1 Introduction
Many problems in scientific domains, such as chemistry and biology, can be cast as the prediction of
some property of a graph. For example, in chemistry, predicting chemical properties such as toxicity
of molecules is important to accelerate drug discovery, where molecules are naturally represented
by molecular graphs [9, 23, 13, 21, 52, 56]. In biology, identifying the functionality of proteins is important to find proteins that associate with a certain disease, where proteins are represented by local protein-protein interaction (PPI) graphs [57, 55]. Supervised learning of graphs, especially with Graph Neural Networks (GNNs) [26, 15, 59, 53], has shown promising results in these domains [64, 56, 13, 60].
Despite the promise, there remain two key challenges in applying GNNs to these scientific domains:
(1) the extreme scarcity of labeled data, and (2) out-of-distribution prediction, where the graphs in the
training set can have very different structural properties from those in the test set. First, task-specific
data labeling is a costly and time-consuming procedure typically performed in wet lab environments.
Consequently, conventional GNNs can easily overfit to the small training datasets. Second, many
∗ The first two authors made equal contributions.
Preprint. Under review.
arXiv:1905.12265v1 [cs.LG] 29 May 2019
[Figure 1 schematic: panel (a) arranges the pre-training methods by level, with Masking (domain knowledge) and structure Context Prediction at the node level, supervised and self-supervised graph classification at the graph level, and node embeddings pooled into graph embeddings; panels (b.i)-(b.iii) show embeddings under node-level pre-training only, graph-level pre-training only, and node-level + graph-level pre-training.]
Figure 1: (a) Categorization of the pre-training methods for GNNs. Crucially, our methods, i.e., Context Prediction, Masking, and graph-level supervised pre-training, cover both node-level and graph-level pre-training. (b) Node and graph embeddings obtained by different pre-training strategies. (b.i) When only node-level pre-training is used, nodes of different shapes (semantically different nodes) can be well separated; however, the node embeddings are not composable, and thus the resulting graph embeddings (denoted by their classes, + and −) that are created by pooling embeddings of individual nodes are not separable. (b.ii) With graph-level pre-training only, graph embeddings are well separated; however, the embeddings of individual nodes do not necessarily capture their domain-specific semantics. (b.iii) High-quality node embeddings are such that nodes of different types are well separated, while at the same time the embedding space is also composable. This allows for accurate and robust representations of entire graphs and enables robust transfer of pre-trained models to a variety of downstream tasks.
scientific applications naturally involve out-of-distribution prediction. For example, one may want to
predict chemical properties of newly-synthesized molecules which are often structurally different
from the training molecules, or one may want to predict functionality of proteins from a new species
that has different PPI network structure than previously studied species. Unfortunately, deep learning
models are known to be extremely poor at out-of-distribution prediction [20, 16].
One promising approach to address the above two challenges is to pre-train GNNs using large
amounts of easily-accessible unlabeled data as well as relatively easily-accessible labeled data that
comes from related auxiliary tasks. For example, to perform a variety of downstream molecular
property prediction tasks (e.g., predicting toxicity or enzyme binding), one could use large amounts of
easily-accessible molecule data to pre-train a model to capture chemistry domain knowledge, such as
valency and chemical properties of different functional groups. Afterwards, very little hard-to-obtain
labeled data would be needed to specialize the pre-trained model to the given downstream prediction
task. Beyond its benefit of increasing data efficiency, pre-training could also improve predictive
performance in out-of-distribution samples [16]. Therefore, pre-training could provide an attractive
solution to the above two challenges. However, currently there exists no systematic investigation of
potential strategies for pre-training GNNs and their effectiveness. In fact, as we see in our experiments,
naïve pre-training of GNNs often gives only a marginal increase in generalization performance on
downstream tasks, and sometimes even worsen the performance compared to non-pre-trained models.
In this paper, we examine effective pre-training approaches to graph representation learning using
GNNs. Our key observation is that GNNs obtain a representation of an entire graph by combining
the following two steps [13]: (1) recursively aggregating neighboring information to obtain low-
dimensional node embeddings that capture neighborhood structure, and (2) pooling/composing node
embeddings to obtain a representation of the entire graph. Based on this observation, our goals for
pre-training GNNs are to produce node embeddings that:
1. capture structural similarity of nodes’ network neighborhoods.
2. are composable so that node embeddings can be pooled into an accurate graph-level representation.
3. capture domain-knowledge at the level of individual nodes and entire graphs.
Our approach to achieve these goals, which we briefly summarize below, is categorized in Figure 1 (a).
Importantly, we aim to pre-train GNNs both at the level of individual nodes as well as entire graphs,
which provides composability of embeddings as it builds a bridge between local node embeddings and
the global graph embeddings, as illustrated in Figure 1 (b.iii). This is in contrast to naïve approaches
to pre-train GNNs, i.e., either only applying an (off-the-shelf) unsupervised node representation learning technique, as illustrated in Figure 1 (b.i), or only performing supervised pre-training to predict auxiliary properties of entire graphs, as illustrated in Figure 1 (b.ii).
Context Prediction.
Most of the existing off-the-shelf unsupervised node representation learning methods are designed for node classification [14, 37, 49, 15, 25, 51] and enforce nearby nodes to have similar embeddings. This is not suited for representation learning of an entire graph, where capturing the structural similarity of local neighborhoods is more important [61, 42, 57]. To learn node embeddings that capture local graph structure, we introduce Context Prediction, which is a novel self-supervised node-level pre-training method that applies the distributional hypothesis [44, 31] to the graph domain. In particular, we use node embeddings to predict surrounding graph structure, so nodes that have similar surrounding graph structure will be mapped into similar representations.
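As a concrete illustration (a minimal sketch of the idea, not the authors' exact objective or code), the snippet below trains node embeddings to agree with an embedding of each node's surrounding subgraph through a binary positive/negative-pair loss; the gnn and context_encoder modules and all batch fields are hypothetical placeholders.

import torch
import torch.nn.functional as F

def context_prediction_loss(gnn, context_encoder, batch):
    # Embed every node with the main GNN.
    h = gnn(batch.x, batch.edge_index)                        # [num_nodes, dim]
    # Embed the context subgraph extracted around each node.
    c = context_encoder(batch.ctx_x, batch.ctx_edge_index)    # [num_nodes, dim]
    # Positive pairs: a node with its own context; negatives: a shuffled context.
    pos_logit = (h * c).sum(dim=-1)
    neg_logit = (h * c[torch.randperm(c.size(0))]).sum(dim=-1)
    logits = torch.cat([pos_logit, neg_logit])
    labels = torch.cat([torch.ones_like(pos_logit), torch.zeros_like(neg_logit)])
    # Nodes whose surrounding structure looks alike are pushed toward similar embeddings.
    return F.binary_cross_entropy_with_logits(logits, labels)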
Masking.
To learn node embeddings that capture domain knowledge, we propose a novel self-
supervised node-level pre-training method called Masking. In Masking, we randomly mask input
node/edge attributes and let GNNs predict the masked attributes from the surrounding structure. For
example, in the chemistry application, we can use node embeddings to predict atom types of masked
atoms, as illustrated in Figure 2 (b). This forces the model to capture chemistry domain knowledge,
such as valency and the electronic or steric properties of functional groups [30].
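A minimal sketch of this idea in PyTorch is shown below; the 15% mask rate, the mask token id, and the assumption that the first attribute column holds the atom type are illustrative choices, not the paper's settings.

import torch
import torch.nn.functional as F

def masking_loss(gnn, prediction_head, x, edge_index, mask_rate=0.15, mask_token=0):
    # Randomly choose which nodes to mask.
    masked = torch.rand(x.size(0)) < mask_rate
    target = x[masked, 0].clone().long()         # e.g., the atom-type attribute
    # Corrupt the input by overwriting the masked attribute with a special token.
    x_corrupt = x.clone()
    x_corrupt[masked, 0] = mask_token
    # The GNN must infer the masked attribute from the surrounding structure.
    h = gnn(x_corrupt, edge_index)
    logits = prediction_head(h[masked])           # linear layer over attribute classes
    return F.cross_entropy(logits, target)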
Graph-level Prediction.
To learn composable node embeddings that are useful for downstream
tasks, we can either perform (1) supervised graph-level pre-training on domain-specific auxiliary
tasks, or (2) self-supervised pre-training to predict structural properties of the graphs. Here, to directly
encode domain knowledge into graph embeddings, we take the first approach and combine our novel
Context Prediction and Masking methods with graph-level supervised pre-training. This ensures that
individual node embeddings are easily composed to obtain domain-specific representations of an
entire graph, as illustrated in Figure 1 (b.iii).
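To show how the pieces fit together, here is a rough sketch of the combined recipe: node-level self-supervised pre-training first (using, e.g., the masking_loss sketch above), followed by graph-level supervised pre-training on auxiliary labels, where node embeddings are pooled into a graph embedding. The data loaders, heads, batch fields, and the use of mean pooling are assumptions for illustration only, not the paper's exact training loop.

import torch
import torch.nn.functional as F

def pretrain(gnn, mask_head, aux_head, optimizer, unlabeled_loader, auxiliary_loader):
    # Stage 1: node-level self-supervised pre-training (e.g., attribute Masking).
    for batch in unlabeled_loader:
        loss = masking_loss(gnn, mask_head, batch.x, batch.edge_index)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Stage 2: graph-level supervised pre-training on auxiliary tasks.
    for batch in auxiliary_loader:
        h = gnn(batch.x, batch.edge_index)                    # node embeddings
        # Compose node embeddings into one graph embedding per graph (mean pooling).
        num_graphs = int(batch.graph_id.max()) + 1
        h_sum = torch.zeros(num_graphs, h.size(1)).index_add_(0, batch.graph_id, h)
        h_graph = h_sum / torch.bincount(batch.graph_id).clamp(min=1).unsqueeze(1)
        loss = F.binary_cross_entropy_with_logits(aux_head(h_graph), batch.aux_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

After these two stages, the pre-trained GNN would be fine-tuned on the small labeled downstream dataset in the usual supervised way.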
We extensively evaluate the above pre-training methods and their combinations (one from node-level
and another from graph-level) on two scientific applications of graph classification: molecular property
prediction in chemistry, and protein function prediction in biology. First, on many downstream tasks,
performing only graph-level supervised pre-training gives marginal performance gains or sometimes even worsens the generalization performance compared to a non-pre-trained model. This phenomenon is referred to as negative transfer [35, 43], which poses a significant problem when deploying pre-trained models to real-world applications, and has been previously observed for multi-task learning of molecular property prediction tasks [39, 40, 54, 22]. When our node-level self-supervised pre-
]. When our node-level self-supervised pre-
training is combined with the graph-level supervised pre-training, negative transfer is completely
avoided across all the 8 downstream datasets of molecular prediction and all the 40 downstream
tasks of protein function prediction; thus, robustly transferable pre-trained models are achieved.
Furthermore, on these downstream tasks, GNNs pre-trained with such combined strategies achieve
significantly better out-of-distribution generalization performance than GNNs pre-trained with a single
type of (or no) pre-training method. Specifically, on molecular property (resp. protein function)
prediction tasks, our combined pre-training methods give 7.2% (resp. 11.7%) higher average ROC-
AUC scores compared to the non-pre-trained GNNs, 4.1% (resp. 6.1%) higher average ROC-AUC
scores compared to GNNs pre-trained only with graph-level supervised auxiliary tasks, and 3.1%
(resp. 9.8%) higher average ROC-AUC scores compared to GNNs pre-trained only with node-level
self-supervised tasks.
2 Preliminaries on Graph Neural Networks
We begin by formalizing the task of supervised learning of graphs, and review the basic components
of GNNs [13]. Then, we review existing methods for unsupervised representation learning on graphs.
Supervised learning of graphs.
Let $G = (V, E)$ denote a graph with node feature vectors $X_v$ for $v \in V$ and edge feature vectors $e_{uv}$ for $(u, v) \in E$. Given a set of graphs $\{G_1, \ldots, G_N\}$ and their labels $\{y_1, \ldots, y_N\}$, the task of graph supervised learning is to learn a representation vector $h_G$ that helps predict the label of an entire graph, $y_G = g(h_G)$.
Graph Neural Networks (GNNs).
GNNs use the graph structure as well as node features and edge features to learn a representation vector of a node, $h_v$, and of the entire graph, $h_G$. Modern GNNs follow a neighborhood aggregation strategy, where we iteratively update the representation of a node by aggregating representations of its neighboring nodes and edges [13]. After $k$ iterations of aggregation, a node's representation captures the structural information within its $k$-hop network neighborhood.
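To make the neighborhood-aggregation scheme concrete, the sketch below implements one generic message-passing layer in PyTorch. It is an illustrative layer under the assumption that node states and edge features share the same dimension; it is not the specific GNN architecture used in the paper.

import torch
import torch.nn as nn

class NeighborhoodAggregationLayer(nn.Module):
    # One aggregation iteration: each node collects messages from its neighbors
    # (and the connecting edge features), then updates its own representation.
    def __init__(self, dim):
        super().__init__()
        self.message = nn.Linear(2 * dim, dim)   # input: [neighbor state, edge feature]
        self.update = nn.Linear(2 * dim, dim)    # input: [own state, aggregated messages]

    def forward(self, h, edge_index, edge_attr):
        src, dst = edge_index                     # edges run from src to dst
        m = torch.relu(self.message(torch.cat([h[src], edge_attr], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, m)   # sum messages per node
        return torch.relu(self.update(torch.cat([h, agg], dim=-1)))

Stacking $k$ such layers gives each node a view of its $k$-hop neighborhood, matching the description above.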