Stochastic Training of Graph Convolutional Networks with Variance Reduction
Jianfei Chen¹, Jun Zhu¹, Le Song²,³
Abstract
Graph convolutional networks (GCNs) are powerful deep neural networks for graph-structured data. However, GCN computes the representation of a node recursively from its neighbors, making the receptive field size grow exponentially with the number of layers. Previous attempts to reduce the receptive field size by subsampling neighbors do not have a convergence guarantee, and their receptive field size per node is still in the order of hundreds. In this paper, we develop control variate based algorithms which allow sampling an arbitrarily small neighbor size. Furthermore, we prove new theoretical guarantees for our algorithms to converge to a local optimum of GCN. Empirical results show that our algorithms enjoy similar convergence to the exact algorithm using only two neighbors per node. The runtime of our algorithms on a large Reddit dataset is only one seventh of that of previous neighbor sampling algorithms.
1. Introduction
Graph convolution networks (GCNs) (Kipf & Welling, 2017) generalize convolutional neural networks (CNNs) (LeCun et al., 1995) to graph-structured data. The “graph convolution” operation applies the same linear transformation to all the neighbors of a node, followed by mean pooling and a nonlinearity. By stacking multiple graph convolution layers, GCNs can learn node representations by utilizing information from distant neighbors. GCNs and their variants (Hamilton et al., 2017a; Veličković et al., 2017) have been applied to semi-supervised node classification (Kipf & Welling, 2017), inductive node embedding (Hamilton et al., 2017a), link prediction (Kipf & Welling, 2016; Berg et al., 2017) and knowledge graphs (Schlichtkrull et al., 2017), outperforming multi-layer perceptron (MLP) models that do not use the graph structure and graph embedding approaches (Perozzi et al., 2014; Tang et al., 2015; Grover & Leskovec, 2016) that do not use node features.

¹ Dept. of Comp. Sci. & Tech., TNList Lab, State Key Lab for Intell. Tech. & Sys., Tsinghua University, Beijing, 100084, China. ² Georgia Institute of Technology. ³ Ant Financial. Correspondence to: Jun Zhu <dcszj@mail.tsinghua.edu.cn>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
However, the graph convolution operation makes GCNs difficult to train efficiently. The representation of a node at layer L is computed recursively from the representations of all its neighbors at layer L − 1. Therefore, the receptive field of a single node grows exponentially with respect to the number of layers, as illustrated in Fig. 1(a). Due to the large receptive field size, Kipf & Welling (2017) propose to train GCN with a batch algorithm, which computes the representations of all the nodes altogether. However, batch algorithms cannot handle large-scale datasets because of their slow convergence and the requirement to fit the entire dataset in GPU memory.
Hamilton et al. (2017a) make an initial attempt to develop stochastic training algorithms for GCNs via a scheme of neighbor sampling (NS). Instead of considering all the neighbors, they randomly subsample D^(l) neighbors at the l-th layer. Therefore, they reduce the receptive field size to ∏_l D^(l), as shown in Fig. 1(b). They find that for two-layer GCNs, keeping D^(1) = 10 and D^(2) = 25 neighbors can achieve comparable performance with the original model. However, there is no theoretical guarantee on the convergence of the stochastic training algorithm with NS. Moreover, the time complexity of NS is still D^(1) D^(2) = 250 times larger than training an MLP, which is unsatisfactory.
In this paper, we develop novel control variate-based stochastic approximation algorithms for GCN. We utilize the historical activations of nodes as a control variate. We show that while the variance of the NS estimator depends on the magnitude of the activation, the variance of our algorithms only depends on the difference between the activation and its historical value. Furthermore, our algorithms bring new theoretical guarantees. At testing time, our algorithms give exact and zero-variance predictions, and at training time, our algorithms converge to a local optimum of GCN regardless of the neighbor sampling size D^(l). The theoretical results allow us to significantly reduce the time complexity by sampling only two neighbors per node, yet still retain the quality of the model.
We empirically test our algorithms on six graph datasets, and show that our techniques significantly reduce the bias and variance of the gradient from NS with the same receptive field size. Despite sampling only D^(l) = 2 neighbors, our algorithms achieve the same predictive performance as the exact algorithm in a comparable number of epochs on all the datasets, i.e., we reduce the time complexity while having almost no loss on the speed of convergence, which is the best we can expect. On the largest Reddit dataset, the training time of our algorithm is 7 times shorter than that of the best-performing competitor among the exact algorithm (Kipf & Welling, 2017), neighbor sampling (Hamilton et al., 2017a) and importance sampling (Chen et al., 2018) algorithms.

Figure 1. Two-layer graph convolutional networks, and the receptive field of a single vertex. Panels: (a) Exact; (b) Neighbour sampling; (c) Control variate, distinguishing latest and historical activations; (d) the CVD network (Input → GraphConv → Dropout → GraphConv → Dropout, with hidden activations H^(1) and H^(2)).

Table 1. Number of vertices, edges, and average number of 1-hop and 2-hop neighbors per node for each dataset. Undirected edges are counted twice and self-loops are counted once.

Dataset    V        E           Degree  Degree 2
Citeseer   3,327    12,431      4       15
Cora       2,708    13,264      5       37
PubMed     19,717   108,365     6       60
NELL       65,755   318,135     5       1,597
PPI        14,755   458,973     31      970
Reddit     232,965  23,446,803  101     10,858
2. Background
We now briefly review graph convolutional networks
(GCNs), stochastic training, and the neighbor sampling (NS)
and importance sampling (IS) algorithms.
2.1. Graph Convolutional Networks
We present our algorithm with a GCN for semi-supervised node classification (Kipf & Welling, 2017). However, the algorithm is not limited to this task or model. Our algorithm is applicable to other models (Hamilton et al., 2017a) and tasks (Kipf & Welling, 2016; Berg et al., 2017; Schlichtkrull et al., 2017; Hamilton et al., 2017b) that involve computing the average activation of neighbors.
In the node classification task, we have an undirected graph G = (V, E) with V = |V| vertices and E = |E| edges, where each vertex v consists of a feature vector x_v and a label y_v. We observe the labels of some vertices V_L. The goal is to predict the labels of the remaining vertices V_U := V \ V_L. The edges are represented as a symmetric V × V adjacency matrix A, where A_uv is the weight of the edge between u and v, and the propagation matrix P is a normalized version of A: Ã = A + I, D̃_uu = Σ_v Ã_uv, and P = D̃^{-1/2} Ã D̃^{-1/2}.
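As a concrete reference, here is a minimal SciPy sketch of how the propagation matrix P can be built from an adjacency matrix; the function name and the sparse-matrix representation are our choices, not the paper's.

```python
import numpy as np
import scipy.sparse as sp

def build_propagation_matrix(A: sp.spmatrix) -> sp.csr_matrix:
    """P = D~^{-1/2} A~ D~^{-1/2}, where A~ = A + I and D~ is the diagonal degree matrix of A~."""
    A_tilde = (A + sp.eye(A.shape[0])).tocsr()            # add self-loops
    deg = np.asarray(A_tilde.sum(axis=1)).ravel()         # D~_uu = sum_v A~_uv (never zero, thanks to I)
    d_inv_sqrt = sp.diags(1.0 / np.sqrt(deg))
    return (d_inv_sqrt @ A_tilde @ d_inv_sqrt).tocsr()
```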
A graph convolution layer is defined as

Z^(l+1) = P H^(l) W^(l),    H^(l+1) = σ(Z^(l+1)),    (1)
where H^(l) is the activation matrix of the l-th layer, each row of which is the activation of a graph node, H^(0) = X is the input feature matrix, W^(l) is a trainable weight matrix, and σ(·) is an activation function. Denote by |·| the cardinality of a set. The training loss is defined as

L = (1/|V_L|) Σ_{v∈V_L} f(y_v, z^(L)_v),    (2)

where f(·, ·) is a loss function. A graph convolution layer propagates information to nodes from their neighbors by computing the neighbor averaging P H^(l). Let n(u) be the set of neighbors of node u; we overload n(u) to also denote its cardinality |n(u)|. The neighbor averaging of node u,

(P H^(l))_u = Σ_{v=1}^{V} P_uv h^(l)_v = Σ_{v∈n(u)} P_uv h^(l)_v,

is a weighted sum of the neighbors' activations. Then, a fully-connected layer is applied on all the nodes, with a weight matrix W^(l) shared across all the nodes.
We denote the receptive field of node u at layer l as all the activations h^(l)_v on layer l needed for computing z^(L)_u. If the layer is not explicitly mentioned, it means layer 0. The receptive field of node u is then all its L-hop neighbors, i.e., the nodes that are reachable from u within L hops, as illustrated in Fig. 1(a). When P = I, GCN reduces to a multi-layer perceptron (MLP) model which does not use the graph structure. For MLP, the receptive field of a node u is just the node itself.
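To make Eqs. (1)–(2) concrete, below is a minimal NumPy sketch of the exact forward pass of a two-layer GCN; the ReLU activation and the softmax cross-entropy loss are our illustrative choices, since the paper keeps σ(·) and f(·, ·) generic.

```python
import numpy as np
import scipy.sparse as sp

def gcn_forward(P: sp.csr_matrix, X: np.ndarray, W0: np.ndarray, W1: np.ndarray) -> np.ndarray:
    """Eq. (1): Z^(l+1) = P H^(l) W^(l), H^(l+1) = sigma(Z^(l+1)); returns the logits Z^(2)."""
    H1 = np.maximum(P @ X @ W0, 0.0)      # H^(1) = sigma(Z^(1)) with sigma = ReLU
    return P @ H1 @ W1                    # Z^(2), i.e. z_v^(L) for every node v

def training_loss(Z: np.ndarray, y: np.ndarray, labeled: np.ndarray) -> float:
    """Eq. (2): average of f(y_v, z_v^(L)) over V_L, with f = softmax cross-entropy."""
    logits = Z[labeled] - Z[labeled].max(axis=1, keepdims=True)   # shift for numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labeled)), y[labeled]].mean())
```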
2.2. Stochastic Training
It is generally expensive to compute the batch gradient ∇L = (1/|V_L|) Σ_{v∈V_L} ∇f(y_v, z^(L)_v), which involves iterating over the entire labeled set of nodes. A possible solution is to approximate the batch gradient by the stochastic gradient

(1/|V_B|) Σ_{v∈V_B} ∇f(y_v, z^(L)_v),    (3)

where V_B ⊂ V_L is a minibatch of labeled nodes. However, this gradient is still expensive to compute, due to the large receptive field size. For instance, as shown in Table 1, the average number of 2-hop neighbors on the NELL dataset is 1,597, which means that computing the gradient for a single node in a 2-layer GCN involves touching 1,597/65,755 ≈ 2.4% of the nodes of the entire graph.
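The 2-hop counts in Table 1 (and the 2.4% figure above) can be reproduced with a small sketch like the following, assuming the graph is available as an adjacency list; the names are ours.

```python
def two_hop_receptive_field(adj: dict, u) -> set:
    """All nodes whose input features are needed to compute z_u^(L) in a 2-layer GCN."""
    field = {u} | set(adj[u])             # u and its 1-hop neighbors
    for v in adj[u]:
        field |= set(adj[v])              # plus the neighbors' neighbors
    return field

# e.g. the fraction of the graph touched by one node's gradient:
# len(two_hop_receptive_field(adj, u)) / num_nodes   # averages to roughly 2.4% on NELL
```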
In subsequent sections, two other sources of stochasticity will be introduced besides the random selection of the minibatch: the random sampling of neighbors (Sec. 2.3) and the random dropout of features (Sec. 5).
2.3. Neighbor Sampling
To reduce the receptive field size, Hamilton et al. (2017a) propose a neighbor sampling (NS) algorithm. NS randomly chooses D^(l) neighbors for each node at layer l and develops an estimator NS^(l)_u of (P H^(l))_u based on Monte-Carlo approximation:

(P H^(l))_u ≈ NS^(l)_u := (n(u)/D^(l)) Σ_{v∈n̂^(l)(u)} P_uv h^(l)_v,

where n̂^(l)(u) ⊂ n(u) is a subset of D^(l) random neighbors. Therefore, NS reduces the receptive field size from all the L-hop neighbors to the number of sampled neighbors, ∏_{l=1}^{L} D^(l). We refer to NS^(l)_u as the NS estimator of (P H^(l))_u, and to (P H^(l))_u itself as the exact estimator.
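A minimal NumPy sketch of the NS estimator for a single node u; P_row holds the u-th row of P, H the activation matrix H^(l), and all names are ours.

```python
import numpy as np

def ns_estimate(neighbors: np.ndarray, P_row: np.ndarray, H: np.ndarray,
                D: int, rng: np.random.Generator) -> np.ndarray:
    """Monte-Carlo estimate of (P H^(l))_u from D neighbors sampled without replacement."""
    sampled = rng.choice(neighbors, size=D, replace=False)
    return (len(neighbors) / D) * (P_row[sampled] @ H[sampled])   # rescale by n(u)/D^(l)
```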
Neighbor sampling can also be written in a matrix form as

Z^(l+1) = P̂^(l) H^(l) W^(l),    H^(l+1) = σ(Z^(l+1)),    (4)

where the propagation matrix P is replaced by a sparser unbiased estimator P̂^(l), i.e., E P̂^(l) = P, with P̂^(l)_uv = (n(u)/D^(l)) P_uv if v ∈ n̂^(l)(u), and P̂^(l)_uv = 0 otherwise. Hamilton et al. (2017a) propose to perform the approximate forward propagation as in Eq. (4), and do stochastic gradient descent (SGD) with the auto-differentiation gradient. The approximated gradient has two sources of randomness: the random selection of the minibatch V_B ⊂ V_L, and the random selection of neighbors.
Though P̂^(l) is an unbiased estimator of P, σ(P̂^(l) H^(l) W^(l)) is not an unbiased estimator of σ(P H^(l) W^(l)), due to the non-linearity of σ(·). Consequently, both the prediction Z^(L) and the gradient ∇f(y_v, z^(L)_v) obtained by NS are biased, and the convergence of SGD is not guaranteed unless the sample size D^(l) goes to infinity. Because of the biased gradient, the sample size D^(l) needs to be large for NS to keep predictive performance comparable with the exact algorithm. Hamilton et al. (2017a) choose D^(1) = 10 and D^(2) = 25, and the receptive field size D^(1) × D^(2) = 250 is much larger than that of MLP, which is 1, so the training is still expensive.
2.4. Importance Sampling
FastGCN (Chen et al., 2018) is another sampling-based algorithm similar to NS. Instead of sampling neighbors for each node, FastGCN directly subsamples the receptive field for each layer altogether. Formally, it approximates (P H^(l))_u with S samples v_1, ..., v_S ∈ V as

(P H^(l))_u = V Σ_{v=1}^{V} (1/V) P_uv h^(l)_v ≈ (V/S) Σ_{v_s ∼ q(v)} P_{u v_s} h^(l)_{v_s} / q(v_s),

where the importance distribution q(v) ∝ Σ_{u=1}^{V} P²_uv = (1/n(v)) Σ_{(u,v)∈E} 1/n(u), according to the definition of P in Sec. 2.1. We refer to this estimator as importance sampling (IS). Chen et al. (2018) show that IS performs better than using a uniform sample distribution q(v) ∝ 1. NS can be viewed as an IS estimator with the importance distribution q(v) ∝ Σ_{(u,v)∈E} 1/n(u), because each node u has probability 1/n(u) of choosing the neighbor v. Though IS may have a smaller variance than NS, it still only guarantees convergence as the sample size S goes to infinity. Empirically, we find IS to work even worse than NS, because sometimes it can select many neighbors for one node and no neighbor for another, in which case the activation of the latter node is just a meaningless zero.
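For reference, a small sketch of FastGCN's importance distribution q(v) ∝ Σ_u P²_uv, computed directly from the propagation matrix; the helper name is ours.

```python
import numpy as np
import scipy.sparse as sp

def importance_distribution(P: sp.csr_matrix) -> np.ndarray:
    """q(v) proportional to the squared column norms of P."""
    q = np.asarray(P.multiply(P).sum(axis=0)).ravel()   # sum_u P_uv^2 for every column v
    return q / q.sum()

# S layer-wide samples v_1, ..., v_S drawn from q:
# samples = np.random.default_rng(0).choice(P.shape[1], size=S, p=importance_distribution(P))
```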
3. Control Variate Based Algorithm
We present a novel control variate based algorithm that utilizes historical activations to reduce the estimator variance.
3.1. Control Variate Based Estimator
While computing the neighbor average Σ_{v∈n(u)} P_uv h^(l)_v, we cannot afford to evaluate all the h^(l)_v terms, because they need to be computed recursively, i.e., we again need the activations h^(l−1)_w of all of v's neighbors w.
Our idea is to maintain the history h̄^(l)_v for each h^(l)_v as an affordable approximation. Each time h^(l)_v is computed, we update h̄^(l)_v with h^(l)_v. We expect h̄^(l)_v and h^(l)_v to be similar if the model weights do not change too fast during training. Formally, let Δh^(l)_v = h^(l)_v − h̄^(l)_v; we approximate

(P H^(l))_u = Σ_{v∈n(u)} P_uv Δh^(l)_v + Σ_{v∈n(u)} P_uv h̄^(l)_v
            ≈ CV^(l)_u := (n(u)/D^(l)) Σ_{v∈n̂^(l)(u)} P_uv Δh^(l)_v + Σ_{v∈n(u)} P_uv h̄^(l)_v,    (5)
where we represent h^(l)_v as the sum of Δh^(l)_v and h̄^(l)_v, and we only apply the Monte-Carlo approximation to the Δh^(l)_v term. Averaging over all the h̄^(l)_v's is still affordable because they do not need to be computed recursively. Since we expect h^(l)_v and h̄^(l)_v to be close, Δh^(l)_v will be small and CV^(l)_u should have a smaller variance than NS^(l)_u. In particular, if the model weights are kept fixed, h̄^(l)_v eventually equals h^(l)_v, so that CV^(l)_u = 0 + Σ_{v∈n(u)} P_uv h̄^(l)_v = Σ_{v∈n(u)} P_uv h^(l)_v = (P H^(l))_u, i.e., the estimator has zero variance. This estimator is referred to as CV. We compare the variance of the NS and CV estimators in Sec. 3.2, and show in Sec. 4 that the variance of CV eventually becomes zero during training. The term

CV^(l)_u − NS^(l)_u = Σ_{v∈n(u)} P_uv h̄^(l)_v − (n(u)/D^(l)) Σ_{v∈n̂^(l)(u)} P_uv h̄^(l)_v

is a control variate (Ripley, 2009, Chapter 5) added to the neighbor sampling estimator NS^(l)_u to reduce its variance.
In matrix form, let H̄^(l) be the matrix formed by stacking the h̄^(l)_v; then CV can be written as

Z^(l+1) = ( P̂^(l) (H^(l) − H̄^(l)) + P H̄^(l) ) W^(l).    (6)
3.2. Variance Analysis
We analyze the variance of the estimators assuming all the features are 1-dimensional. The analysis can be extended to multiple dimensions by treating each dimension separately. We further assume that n̂^(l)(u) is created by sampling D^(l) neighbors without replacement from n(u). The following proposition is proven in Appendix A:
Proposition 1. If n̂^(l)(u) contains D^(l) samples from n(u) drawn without replacement, then

Var_{n̂^(l)(u)} [ (n(u)/D^(l)) Σ_{v∈n̂^(l)(u)} x_v ] = (C^(l)_u / (2D^(l))) Σ_{v1∈n(u)} Σ_{v2∈n(u)} (x_{v1} − x_{v2})²,

where C^(l)_u = 1 − (D^(l) − 1)/(n(u) − 1).
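Proposition 1 can be checked numerically; the following short Monte-Carlo sketch (with arbitrary toy values chosen by us) compares the empirical variance of the subsampled sum against the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                       # 1-dimensional activations of n(u) = 8 neighbors
n, D = len(x), 2

# Empirical variance of (n/D) * sum of D values sampled without replacement.
est = np.array([(n / D) * rng.choice(x, size=D, replace=False).sum() for _ in range(100000)])

# Closed form: C_u/(2D) * sum_{v1} sum_{v2} (x_{v1} - x_{v2})^2 with C_u = 1 - (D-1)/(n-1).
C = 1.0 - (D - 1) / (n - 1)
closed_form = C / (2 * D) * sum((a - b) ** 2 for a in x for b in x)
print(est.var(), closed_form)                # the two values agree up to Monte-Carlo error
```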
By Proposition 1, we have

Var_{n̂^(l)(u)} [ NS^(l)_u ] = (C^(l)_u / (2D^(l))) Σ_{v1∈n(u)} Σ_{v2∈n(u)} (P_{u v1} h^(l)_{v1} − P_{u v2} h^(l)_{v2})²,

which is the total distance between the weighted activations of all pairs of neighbors, and is zero iff P_uv h^(l)_v is identical for all neighbors, in which case any single neighbor contains all the information of the entire neighborhood.
The variance of the CV estimator is

Var_{n̂^(l)(u)} [ CV^(l)_u ] = (C^(l)_u / (2D^(l))) Σ_{v1∈n(u)} Σ_{v2∈n(u)} (P_{u v1} Δh^(l)_{v1} − P_{u v2} Δh^(l)_{v2})²,

which replaces h^(l)_v by Δh^(l)_v. Since Δh^(l)_v is usually much smaller than h^(l)_v, the CV estimator enjoys a much smaller variance than the NS estimator. Furthermore, as we will show in Sec. 4.2, Δh^(l)_v converges to zero during training, so we achieve not only variance reduction but variance elimination, as the variance vanishes eventually.
3.3. Implementation Details and Time Complexity
Training with the CV estimator is similar to training with the NS estimator (Hamilton et al., 2017a). In particular, each iteration of the algorithm involves the following steps (a sketch of one iteration is given after the list):

1. Randomly select a minibatch V_B ⊂ V_L of nodes;
2. Build a computation graph that only contains the activations h^(l)_v and h̄^(l)_v needed for the current minibatch;
3. Get the predictions by forward propagation as in Eq. (6);
4. Get the gradients by backward propagation, and update the parameters by SGD;
5. Update the historical activations.
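Below is a minimal sketch of one such iteration for a two-layer GCN. It assumes the receptive field r^(1) and the sampled layer-1 matrix P̂^(1) for this minibatch have already been built (see the construction sketch further below), keeps P and the histories as dense arrays for brevity, and omits Step 4's gradient computation, which an autodiff framework would handle; all variable names are ours.

```python
import numpy as np

def cv_iteration_two_layer(P, X, H1_bar, W0, W1, batch, r1, P_hat1):
    """Steps 2, 3 and 5 for one minibatch (Step 4, backprop/SGD, is omitted).

    P       : (V, V) propagation matrix; X: (V, K) input features.
    H1_bar  : (V, A) stored histories h_bar_v^(1), updated in place.
    batch   : minibatch node ids (= r^(2)); r1: node ids in the receptive field r^(1).
    P_hat1  : (|batch|, V) sampled propagation matrix P_hat^(1) (nonzeros only on columns in r1).
    """
    # Layer 0: H^(0) = X never changes, so its history equals X and the CV estimator is exact:
    # Z^(1) = P H_bar^(0) W^(0) = P X W^(0); we only need the rows of P for nodes in r^(1).
    H1_new = np.maximum(P[r1] @ X @ W0, 0.0)                 # h_v^(1) for v in r^(1), sigma = ReLU

    # Layer 1, Eq. (6): Z^(2) = (P_hat^(1)(H^(1) - H_bar^(1)) + P H_bar^(1)) W^(1) on minibatch rows.
    H1 = H1_bar.copy()
    H1[r1] = H1_new                                          # latest activations where computed
    Z2 = (P_hat1 @ (H1 - H1_bar) + P[batch] @ H1_bar) @ W1

    # Step 5: refresh the histories of the nodes whose activations were just computed.
    H1_bar[r1] = H1_new
    return Z2                                                # predictions z_v^(2) for v in batch
```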
Steps 3 and 4 are handled automatically by frameworks such as TensorFlow (Abadi et al., 2016). The computational graph at Step 2 is defined by the receptive field r^(l) and the propagation matrices P̂^(l) at each layer. The receptive field r^(l) specifies the nodes whose activations h^(l)_v should be computed for the current minibatch, according to Eq. (6). We can construct r^(l) and P̂^(l) from top to bottom, by randomly adding D^(l) neighbors for each node in r^(l+1), starting with r^(L) = V_B. We assume h^(l)_v is always needed to compute h^(l+1)_v, i.e., v is always selected as a neighbor of itself. The receptive fields are illustrated in Fig. 1(c), where red nodes are in the receptive fields, whose activations h^(l)_v are needed, and the histories h̄^(l)_v of the blue nodes are also needed. Finally, in Step 5, we update h̄^(l)_v with h^(l)_v for each v ∈ r^(l).
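A small sketch of this top-down construction, assuming the graph is stored as an adjacency list with self-loops already added (so that a node is its own neighbor); filling the entries of P̂^(l) from the returned samples follows the rescaled definition in Sec. 2.3. The names are ours.

```python
import numpy as np

def sample_receptive_fields(adj, batch, D, L, rng):
    """Build r^(L) = V_B down to r^(0), sampling D neighbors per node plus the node itself."""
    r = {L: sorted(set(batch))}
    sampled = {}                                   # sampled[l][u] = n_hat^(l)(u), used to fill P_hat^(l)
    for l in range(L - 1, -1, -1):                 # layers L-1, ..., 0
        sampled[l] = {}
        lower = set()
        for u in r[l + 1]:
            nbrs = adj[u]
            pick = set(rng.choice(nbrs, size=min(D, len(nbrs)), replace=False).tolist())
            pick.add(u)                            # v is always selected as a neighbor of itself
            sampled[l][u] = pick
            lower |= pick
        r[l] = sorted(lower)
    return r, sampled
```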
The full pseudocode for training is given in Appendix D.
GCN has two main types of computation, namely, sparse-dense matrix multiplication (SPMM), such as P H^(l), and dense-dense matrix multiplication (GEMM), such as U W^(l). We assume that the node features are K-dimensional and the first hidden layer is A-dimensional.
For batch GCN, the time complexity is O(EK) for SPMM and O(V K A) for GEMM. For our stochastic training algorithm with control variates, the dominant SPMM computation is the average of neighbor histories P H̄^(0) for the nodes in r^(1), whose size is O(|V_B| ∏_{l=2}^{L} D^(l)). For example, in a 2-layer GCN where we sample D^(l) = 2 neighbors for each node, ∏_{l=2}^{L} D^(l) = 2. Therefore, the time complexity of SPMM is O(V D K ∏_{l=2}^{L} D^(l)) per epoch, where D is the average degree of the nodes in r^(1).¹ The dominant GEMM computation is the first fully-connected layer on all the nodes in r^(1), whose time complexity is O(V K A ∏_{l=2}^{L} D^(l)) per epoch.

¹ V D ≠ E because the probability of each node appearing in r^(1) is different: nodes of higher degree have a larger probability of appearing. We can also subsample the neighbors' histories if D is large.
4. Theoretical Results
Besides having smaller variance, CV also has stronger theoretical guarantees than NS. In this section, we present two theorems. One states that if the model parameters are fixed, e.g., during testing, CV produces exact predictions after L epochs; the other establishes convergence towards a local optimum regardless of the neighbor sampling size.
In this section, we assume that the algorithm is run by epochs. In each epoch, we randomly partition the vertex set V into I minibatches V_1, ..., V_I, and in the i-th iteration, we run a forward pass to compute the predictions for the nodes in V_i, an optional backward pass to compute the gradients, and an update of the history. Note that in each epoch we scan all the nodes instead of just the training nodes, to ensure that the history of each node is updated at least once per epoch.
We denote the model parameters in the i-th iteration as W_i. At training time, W_i is updated by SGD over time; at testing time, W_i is kept fixed. To distinguish, the activations produced by CV at iteration i are denoted as Z^(l)_{CV,i} and H^(l)_{CV,i}, and the activations produced by the exact algorithm (Eq. 1) are denoted as Z^(l)_i and H^(l)_i. At iteration i, the network computes the predictions and gradients for the minibatch V_i, where

g_{CV,i}(W_i) := (1/|V_i|) Σ_{v∈V_i} ∇f(y_v, z^(L)_{CV,i,v})   and   g_i(W_i) := (1/|V_i|) Σ_{v∈V_i} ∇f(y_v, z^(L)_{i,v})

are the stochastic gradients computed by CV and by the exact algorithm, and

∇L(W_i) = (1/|V_L|) Σ_{v∈V_L} ∇f(y_v, z^(L)_v)

is the deterministic batch gradient computed by the exact algorithm. The subscript i may be omitted for the exact algorithm if W_i is a constant sequence. We let [L] = {0, ..., L} and [L]⁺ = {1, ..., L}. The gradient g_{CV,i}(W_i) has two sources of randomness: the random selection of the minibatch V_i and the randomness of the neighbors P̂, so we may take the expectation of g_{CV,i}(W_i) w.r.t. either V_i or P̂, or both.
4.1. Exact Testing
The following theorem reveals the connection between the exact predictions and the approximate predictions made by CV.

Theorem 1. For a constant sequence W_i = W and any i > LI (i.e., after L epochs), the activations computed by CV are exact, i.e., Z^(l)_{CV,i} = Z^(l) for each l ∈ [L] and H^(l)_{CV,i} = H^(l) for each l ∈ [L − 1].
Theorem 1 shows that at testing time, we can run forward propagation with CV for L epochs and obtain exact predictions. This outperforms NS, which cannot recover the exact predictions unless the neighbor sample size goes to infinity. Compared with directly making exact predictions by an exact batch algorithm, CV is more scalable because it does not need to load the entire graph into memory. The proof can be found in Appendix B.
4.2. Convergence Guarantee
The following theorem shows that SGD training with the approximate gradients g_{CV,i}(W_i) still converges to a local optimum, regardless of the neighbor sampling size D^(l). Therefore, we can choose an arbitrarily small D^(l) without worrying about convergence.
Theorem 2. Assume that (1) the activation σ(·) is ρ-Lipschitz; (2) the gradient of the cost function ∇_z f(y, z) is ρ-Lipschitz and bounded; (3) ‖g_{CV,V}(W)‖_∞, ‖g(W)‖_∞, and ‖∇L(W)‖_∞ are all bounded by G > 0 for all P̂, V and W; and (4) the loss L(W) is ρ-smooth, i.e.,

|L(W_2) − L(W_1) − ⟨∇L(W_1), W_2 − W_1⟩| ≤ (ρ/2) ‖W_2 − W_1‖²_F,   ∀ W_1, W_2,

where ⟨A, B⟩ = tr(A⊤B) is the inner product of matrices A and B. Then, there exists K > 0, such that for all N > LI, if we run SGD for R ≤ N iterations, where R is chosen uniformly from [N]⁺, we have

E_R ‖∇L(W_R)‖²_F ≤ 2 (L(W_1) − L(W*) + K + ρK) / √N,

for the updates W_{i+1} = W_i − γ g_{CV,i}(W_i) and the step size γ = min{1/ρ, 1/√N}.
In particular, lim_{N→∞} E_R ‖∇L(W_R)‖² = 0. Therefore, our algorithm converges to a local optimum as the maximum number of iterations N goes to infinity. The full proof is in Appendix C. In short, we show that g_{CV,i}(W_i) is asymptotically unbiased as i → ∞, and then show that SGD with such asymptotically unbiased gradients converges to a local optimum.
5. Handling Dropout of Features
In this section, we consider introducing a third source of randomness, the random dropout of features (Srivastava et al., 2014). Let Dropout_p(X) = M ∘ X be the dropout operation, where the M_ij ∼ Bern(p) are i.i.d. Bernoulli random variables and ∘ is the element-wise product. Let E_M be the expectation over dropout masks.
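For completeness, a one-function sketch of this dropout operation exactly as defined above (note that in this notation p is the keep probability and no 1/p rescaling is applied, unlike the "inverted dropout" used by some frameworks).

```python
import numpy as np

def dropout_p(X: np.ndarray, p: float, rng: np.random.Generator) -> np.ndarray:
    """Dropout_p(X) = M * X with i.i.d. entries M_ij ~ Bern(p)."""
    M = rng.binomial(1, p, size=X.shape)
    return M * X
```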
With dropout, all the activations h^(l)_v are random variables whose randomness comes from dropout, even in the exact algorithm of Eq. (1). We want to design a cheap estimator of the random variable (P H^(l))_u = Σ_{v∈n(u)} P_uv h^(l)_v, based on a stochastic neighborhood n̂^(l)(u). An ideal estimator should have the same distribution as (P H^(l))_u. However, such an estimator is difficult to design. Instead, we develop an estimator CVD^(l)_u that eventually has the same mean and variance as (P H^(l))_u, i.e., E_{n̂^(l)(u)} E_M CVD^(l)_u = E_M (P H^(l))_u and Var_{n̂^(l)(u),M} CVD^(l)_u = Var_M (P H^(l))_u.