ically the kernels are designed to exploit random walks [10,
15, 5, 28], shortest paths [4], cyclic patterns [12], subtrees [22,
21, 19, 24] and subgraphs [25, 17, 26]. Another class of graph
kernels, e.g., the diffusion kernel [16], deals with the similarity between nodes of a single graph. However, our focus in
this paper is on kernels between different graphs, which we
discuss in more detail below.
Random Walk Kernels: The similarity of two graphs G_i, G_j ∈ D can be quantified by counting labeled walks that are common to both of them. The random walk kernel [10], one of the first graph kernels, is based on this idea.
The kernels in [15, 5] are also based on random walks over
labeled graphs. Computing the pair-wise kernel values has worst-case O(n^6) complexity, where n denotes the number of nodes in G_i and G_j. A more efficient version of the random walk kernel was proposed in [28], reducing the complexity to O(n^3) per pair of graphs. One potential problem with these kernels is that artificially high kernel values may be obtained by repeatedly visiting the same nodes and edges
[18]. We refer to [29] for a recent overview of random walk
based graph kernels.
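To make the walk-counting idea concrete, the computation can be sketched on the direct product graph, whose nodes are label-matched node pairs and whose walks correspond one-to-one to common labeled walks in the two graphs. The sketch below (function names are ours) truncates the geometric series at a fixed walk length rather than computing the matrix inverse used by the O(n^3) method of [28]:

```python
def product_adjacency(A1, lab1, A2, lab2):
    """Adjacency matrix of the direct product graph: nodes are pairs (u, v)
    with matching labels; (u1, v1) ~ (u2, v2) iff u1 ~ u2 and v1 ~ v2."""
    pairs = [(u, v) for u in range(len(A1)) for v in range(len(A2))
             if lab1[u] == lab2[v]]
    return [[1 if A1[u1][u2] and A2[v1][v2] else 0 for (u2, v2) in pairs]
            for (u1, v1) in pairs]

def rw_kernel(A1, lab1, A2, lab2, lam=0.1, max_len=10):
    """Truncated geometric random walk kernel: sum over l of lam^l times the
    number of common labeled walks of length l (walks may revisit nodes)."""
    Ax = product_adjacency(A1, lab1, A2, lab2)
    n = len(Ax)
    vec = [1.0] * n          # walk counts of length 0 ending at each node
    total, weight = float(n), 1.0
    for _ in range(max_len):
        # one step of walk counting: vec <- Ax * vec
        vec = [sum(Ax[i][j] * vec[j] for j in range(n)) for i in range(n)]
        weight *= lam
        total += weight * sum(vec)
    return total
```

For two identical triangles with a single node label, the product graph is 4-regular on 9 nodes, so the truncated series rapidly approaches the closed-form value 9/(1 − 4λ) = 15 at λ = 0.1.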
Shortest Path Kernels: The shortest-path graph kernel [4] first computes the shortest-path graph S = (V_S, E_S) for each graph G = (V, E) ∈ D. Here V_S = V, and a weighted edge (v_a, v_b) exists in E_S if v_a and v_b are connected by a path in G, with the edge weight representing the shortest-path length between v_a and v_b (infinity if they are not reachable). Given the shortest-path graphs S_i and S_j for two input graphs G_i and G_j, the kernel is defined as the sum over all pairs of edges from S_i and S_j, using any suitable positive definite kernel on the edges. The all-pairs shortest-path graphs can be computed in O(n^3) time, and the kernel can then be computed in O(n^4) time, since S_i and S_j each have O(n^2) edges. Other variants of the shortest path kernel include equal-length shortest paths, k shortest paths, k shortest disjoint paths, and so on [4].
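A minimal version of this pipeline, using Floyd-Warshall for the all-pairs distances and a delta kernel on path lengths as the edge kernel (node labels omitted for brevity; function names are ours):

```python
from itertools import combinations

INF = float("inf")

def floyd_warshall(A):
    """All-pairs shortest-path lengths for an unweighted adjacency matrix."""
    n = len(A)
    D = [[0 if i == j else (1 if A[i][j] else INF) for j in range(n)]
         for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if D[i][k] + D[k][j] < D[i][j]:
                    D[i][j] = D[i][k] + D[k][j]
    return D

def sp_kernel(A1, A2):
    """Shortest-path kernel with a delta edge kernel: counts pairs of
    shortest-path-graph edges (one per graph) of equal finite length."""
    def edge_lengths(A):
        D = floyd_warshall(A)
        return [D[i][j] for i, j in combinations(range(len(A)), 2)
                if D[i][j] < INF]
    e1, e2 = edge_lengths(A1), edge_lengths(A2)
    return sum(1 for a in e1 for b in e2 if a == b)
```

The O(n^4) cost of the kernel evaluation is visible in the final double loop over the O(n^2) edges of each shortest-path graph.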
Cyclic Pattern Kernels: The cyclic pattern kernel [12] is
based on counting the number of common cycles that occur in both graphs. Since there is no known polynomial-time
algorithm to find all the cycles in a graph, sampling and
time-bounded enumeration of cycles are used to measure
the similarity of the graphs.
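One way to realize the time-bounded variant is to enumerate simple cycles only up to a length bound, canonicalize each cycle's label sequence (over rotation and reflection), and intersect the resulting pattern sets. The sketch below is a simplification of this idea; the function names and the set-intersection kernel are ours, not the exact construction of [12]:

```python
def simple_cycles_upto(adj, max_len):
    """Enumerate simple cycles of length 3..max_len as node tuples, each
    reported once (smallest node first, smaller traversal direction kept)."""
    cycles = set()
    def dfs(start, path, seen):
        for w in adj[path[-1]]:
            if w == start and len(path) >= 3:
                rev = (start,) + tuple(reversed(path[1:]))
                cycles.add(min(tuple(path), rev))
            elif w not in seen and w > start and len(path) < max_len:
                dfs(start, path + [w], seen | {w})
    for s in adj:
        dfs(s, [s], {s})
    return cycles

def cyclic_pattern_kernel(adj1, lab1, adj2, lab2, max_len=6):
    """Counts cyclic label patterns common to both graphs."""
    def patterns(adj, lab):
        pats = set()
        for cyc in simple_cycles_upto(adj, max_len):
            labs = [lab[v] for v in cyc]
            k = len(labs)
            # canonical form: lexicographic minimum over all rotations
            # of the label sequence and of its reversal
            variants = [tuple(labs[i:] + labs[:i]) for i in range(k)]
            rlabs = labs[::-1]
            variants += [tuple(rlabs[i:] + rlabs[:i]) for i in range(k)]
            pats.add(min(variants))
        return pats
    return len(patterns(adj1, lab1) & patterns(adj2, lab2))
```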
Subtree Kernels: Subtree kernels are based on common
subtrees in the graphs [22]. The main idea is to consider
pairs of nodes from G_i and G_j and see if they share common tree-like neighborhoods, i.e., to count the pairs of identical subtrees of height h rooted at vertices v_a ∈ G_i and v_b ∈ G_j. The kernel is defined as the sum over all pairs of vertices of a suitably defined vertex pair kernel. The complexity of this approach is O(n^2 h 4^d), where d denotes the maximum degree. Another subtree kernel was proposed in [21], based on
a path representation of the trees obtained via a depth-first
search on the input graphs. The kernel function is computed
on these paths (e.g., the ratio of the longest common path).
The recently proposed Weisfeiler-Lehman kernel [24] is a fast subtree kernel that scales up to large, labeled graphs. It uses the Weisfeiler-Lehman isomorphism test, which performs iterative multiset-label determination, label compression, and relabeling steps. The isomorphism test terminates after a pre-specified number of iterations h. If the resulting sets of node labels are not identical, the two graphs are declared non-isomorphic; otherwise the test declares them isomorphic (though it may fail to distinguish some non-isomorphic pairs). The WL graph kernel counts the matching multiset labels for the two graphs G_i and G_j in each iteration of the WL isomorphism test. The WL kernel has O(mh) complexity, where m is the number of edges in the graphs.
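The iteration can be sketched compactly. In this sketch (ours, not the implementation of [24]), label compression is done implicitly by using the (label, sorted neighbor labels) tuples themselves as hashable compressed labels:

```python
from collections import Counter

def wl_features(adj, labels, h):
    """Multiset of node labels accumulated over h WL refinement iterations."""
    labels = dict(labels)
    feats = Counter(labels.values())
    for _ in range(h):
        # multiset-label determination + compression: a node's new label is
        # its old label together with the sorted labels of its neighbors
        labels = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
                  for v in adj}
        feats.update(labels.values())
    return feats

def wl_kernel(adj1, lab1, adj2, lab2, h=2):
    """WL subtree kernel: dot product of the accumulated label count vectors."""
    f1, f2 = wl_features(adj1, lab1, h), wl_features(adj2, lab2, h)
    return sum(c * f2[lab] for lab, c in f1.items())
```

Each iteration touches every edge once when building the neighbor multisets, which is where the O(mh) total cost comes from.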
Graphlet and Subgraph Kernels: Similar graphs should
have similar subgraphs. Graphlet kernels measure the simi-
larity of two graphs by the dot product of count vectors of all possible connected subgraphs of order k (i.e., the graphlets, also called k-minors) [25, 17]. For any k (usually set to 3, 4, or 5), there are 2^(k(k−1)/2) possible graphlets of size k, but many of them are isomorphic. Usually, to avoid the dependence on graph size, the count vector is normalized into a probability vector, and the graphlet kernel is re-defined as the dot product of the normalized count vectors for the two graphs. Exhaustive enumeration of all graphlets has complexity O(n^k). For a graph with bounded degree d, the connected graphlets can be enumerated in O(nd^(k−1)) [25].
Frequent subgraph mining can also be used to define a
kernel between two graphs [26]. Let F = {s_1, . . . , s_p} denote the set of p frequent and discriminative subgraph patterns mined from D. Each graph G_i ∈ D is then represented as a binary feature vector in {0, 1}^p, where feature j is set to 1 if and only if s_j is isomorphic to a subgraph in G_i. The kernel between G_i and G_j can be defined over their binary feature vectors. CORK [26] implements this approach; it uses gSpan [31] to mine the subgraphs, and selects near-optimal features (subgraphs) from that set that are most discriminative for classification.
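Once the pattern set F and the subgraph-isomorphism checks are available (e.g., from a miner such as gSpan), the kernel itself is just a linear kernel on indicator vectors. A sketch, assuming the per-graph occurrence sets are precomputed (names are ours):

```python
def subgraph_feature_kernel(patterns, occurs_i, occurs_j):
    """Linear kernel on binary subgraph-indicator vectors: the number of
    mined patterns that occur in both graphs. occurs_i / occurs_j are the
    sets of patterns found (by subgraph isomorphism) in each graph."""
    x_i = [1 if s in occurs_i else 0 for s in patterns]
    x_j = [1 if s in occurs_j else 0 for s in patterns]
    return sum(a * b for a, b in zip(x_i, x_j))
```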
In our experiments in Section 4, we compare with the following graph kernel methods: fast geometric Random-walk (RW) kernel [28], Shortest-path (SP) kernel [4], Graphlet (GK) kernel [25], Ramon-Gärtner (RG) subtree kernel [22],
and Weisfeiler-Lehman (WL) subtree kernel [24]. We also
compare with CORK [26].
3. GRAPH ATTRIBUTES FOR CLASSIFICATION
As we have seen above, while many sophisticated graph kernels have been proposed, efficiency and scalability remain challenges for large graph datasets. Our basic idea is to
compute several topological and label attributes for each
graph in the dataset, and to use the derived feature-vector
attributes for classification. Like most of the graph kernel
work, we use a Support Vector Machine (SVM) as the classi-
fier of choice. The graph attributes we use are listed below.
Figure 1: A triangle with its three triples
1. Average degree: The degree of a node is defined as
the number of its neighboring edges. Average degree
is the average value of the degree of all nodes in the
graph, i.e., d̄(G) = Σ_{i=1}^{n} d(u_i)/n, where d(u_i) denotes the degree of node u_i.
2. Average clustering coefficient: For a node u, the
clustering coefficient c(u) represents the likelihood that
any two neighbors of u are connected. More formally,
the clustering coefficient of a node u is defined as: