In this way, we can still recover a rich class of convolutional filter functions by stacking multiple
such layers, but we are not limited to the explicit parameterization given by, e.g., the Chebyshev
polynomials. We intuitively expect that such a model can alleviate the problem of overfitting on
local neighborhood structures for graphs with very wide node degree distributions, such as social
networks, citation networks, knowledge graphs and many other real-world graph datasets. Additionally,
for a fixed computational budget, this layer-wise linear formulation allows us to build deeper
models, a practice that is known to improve modeling capacity on a number of domains (He et al.,
2016).
In this linear formulation of a GCN we further approximate $\lambda_{\max} \approx 2$, as we can expect that neural
network parameters will adapt to this change in scale during training. Under these approximations
Eq. 5 simplifies to:
$$ g_{\theta'} \star x \approx \theta'_0 x + \theta'_1 \left(L - I_N\right) x = \theta'_0 x - \theta'_1 D^{-\frac{1}{2}} A D^{-\frac{1}{2}} x \,, \qquad (6) $$
with two free parameters $\theta'_0$ and $\theta'_1$. The filter parameters can be shared over the whole graph.
Successive application of filters of this form then effectively convolves the $k^{\text{th}}$-order neighborhood of
a node, where $k$ is the number of successive filtering operations or convolutional layers in the neural
network model.
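The second equality in Eq. 6 follows directly from the definition of the symmetric normalized Laplacian $L = I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ used earlier in the paper; as a brief sanity check:
$$ \theta'_1 \left(L - I_N\right) x = \theta'_1 \left(I_N - D^{-\frac{1}{2}} A D^{-\frac{1}{2}} - I_N\right) x = -\theta'_1\, D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\, x \,. $$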
In practice, it can be beneficial to constrain the number of parameters further to address overfitting
and to minimize the number of operations (such as matrix multiplications) per layer. This leaves us
with the following expression:
$$ g_{\theta} \star x \approx \theta \left(I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}\right) x \,, \qquad (7) $$
with a single parameter $\theta = \theta'_0 = -\theta'_1$. Note that $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ now has eigenvalues in
the range [0, 2]. Repeated application of this operator can therefore lead to numerical instabilities
and exploding/vanishing gradients when used in a deep neural network model. To alleviate this
problem, we introduce the following renormalization trick: $I_N + D^{-\frac{1}{2}} A D^{-\frac{1}{2}} \rightarrow \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$, with
$\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$.
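As an illustration, the following is a minimal NumPy/SciPy sketch of this renormalization; the helper name normalize_adjacency and the choice of scipy.sparse matrices are our own, not part of the paper.

```python
import numpy as np
import scipy.sparse as sp

def normalize_adjacency(adj):
    """Renormalization trick: return D~^{-1/2} (A + I_N) D~^{-1/2} as a sparse matrix.

    adj: scipy.sparse adjacency matrix of shape (N, N), binary or weighted.
    """
    adj_tilde = adj + sp.eye(adj.shape[0])                 # A~ = A + I_N
    deg_tilde = np.asarray(adj_tilde.sum(axis=1)).ravel()  # D~_ii = sum_j A~_ij
    d_inv_sqrt = sp.diags(np.power(deg_tilde, -0.5))       # D~^{-1/2}
    return (d_inv_sqrt @ adj_tilde @ d_inv_sqrt).tocsr()   # D~^{-1/2} A~ D~^{-1/2}
```

Since the result depends only on the graph, it can be computed once and reused by every layer.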
We can generalize this definition to a signal $X \in \mathbb{R}^{N \times C}$ with $C$ input channels (i.e. a $C$-dimensional
feature vector for every node) and $F$ filters or feature maps as follows:
$$ Z = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} X \Theta \,, \qquad (8) $$
where $\Theta \in \mathbb{R}^{C \times F}$ is now a matrix of filter parameters and $Z \in \mathbb{R}^{N \times F}$ is the convolved signal
matrix. This filtering operation has complexity $\mathcal{O}(|\mathcal{E}| F C)$, as $\tilde{A} X$ can be efficiently implemented
as a product of a sparse matrix with a dense matrix.
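For concreteness, a minimal sketch of the layer in Eq. 8, assuming a_hat is the sparse renormalized adjacency from the sketch above and that x and theta are dense NumPy arrays (the names are illustrative):

```python
def gcn_filter(a_hat, x, theta):
    """Eq. 8: Z = D~^{-1/2} A~ D~^{-1/2} X Theta.

    a_hat: sparse (N, N) renormalized adjacency, x: dense (N, C) node features,
    theta: dense (C, F) filter parameters; returns the dense (N, F) convolved signal.
    """
    return (a_hat @ x) @ theta  # sparse-dense product, then dense projection onto F feature maps
```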
3 SEMI-SUPERVISED NODE CLASSIFICATION
Having introduced a simple, yet flexible model f(X, A) for efficient information propagation on
graphs, we can return to the problem of semi-supervised node classification. As outlined in the
introduction, we can relax certain assumptions typically made in graph-based semi-supervised
learning by conditioning our model f(X, A) both on the data X and on the adjacency matrix A of
the underlying graph structure. We expect this setting to be especially powerful in scenarios where
the adjacency matrix contains information not present in the data X, such as citation links between
documents in a citation network or relations in a knowledge graph. The overall model, a multi-layer
GCN for semi-supervised learning, is schematically depicted in Figure 1.
3.1 EXAMPLE
In the following, we consider a two-layer GCN for semi-supervised node classification on a graph
with a symmetric adjacency matrix A (binary or weighted). We first calculate
$\hat{A} = \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}}$ in
a pre-processing step. Our forward model then takes the simple form:
$$ Z = f(X, A) = \operatorname{softmax}\!\left( \hat{A}\, \operatorname{ReLU}\!\left( \hat{A} X W^{(0)} \right) W^{(1)} \right) . \qquad (9) $$
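A minimal sketch of this forward pass under the same assumptions as above (w0 and w1 stand in for W^{(0)} and W^{(1)}; the training procedure, i.e. gradient descent on a cross-entropy loss over the labeled nodes, is not shown):

```python
import numpy as np

def two_layer_gcn(a_hat, x, w0, w1):
    """Forward pass of Eq. 9: Z = softmax(A_hat ReLU(A_hat X W0) W1)."""
    h = np.maximum((a_hat @ x) @ w0, 0.0)                # first GCN layer + ReLU
    logits = (a_hat @ h) @ w1                            # second GCN layer
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)          # row-wise softmax over classes
```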