as new structure information becomes available, the classification
model should remain effective on the updated dataset.
• Several parallel works have been proposed to solve the single-graph case [23–25]. However, we focus on the case of mining graph transactions, i.e., mining a large collection of graphs. The work in [14] employs MapReduce for mining a large collection of graphs. However, the large number of duplicate subgraphs still creates significant performance problems.
After mining graph patterns, building an effective classification model is another issue. An effective classification model should have a fast learning speed and a high classification accuracy. Support Vector Machines (SVM) [26] are adopted in previous works [13,27,28]. SVM is an effective classifier with good classification accuracy. However, its slow training speed and the non-trivial human intervention it requires become severe problems. In recent years, a generalized single-hidden-layer feedforward network, named the Extreme Learning Machine (ELM), has been proposed in [29]. Applications based on ELM show that it has a much faster training speed and provides better generalization performance than traditional learning algorithms for feedforward neural networks. Furthermore, ELM is easy and efficient to implement [30].
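To make the ELM training step concrete, the following is a minimal NumPy sketch of a basic single-hidden-layer ELM; the function names, layer size, and activation are illustrative assumptions, not the implementation of [29] or of this paper. Input weights and biases are assigned randomly and never tuned, and the output weights are solved in closed form with the Moore-Penrose pseudo-inverse, which is the source of ELM's fast training speed.

```python
import numpy as np

# Minimal single-hidden-layer ELM sketch (illustrative; names and sizes are
# our assumptions, not the implementation of [29] or of this paper).
def elm_train(X, Y, n_hidden=128, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))  # random input weights, never tuned
    b = rng.standard_normal(n_hidden)                # random hidden biases
    H = np.tanh(X @ W + b)                           # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ Y                     # output weights via Moore-Penrose pseudo-inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# Usage: X holds support graph vectors (one row per graph), Y one-hot class labels.
```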
To address the shortcomings of existing approaches, in this paper we make the following contributions:
• We present a novel compression-based graph pattern mining algorithm. Since there are many common structures among graphs with the same class label, the original graph dataset can be compressed to reduce the number of graphs and the required memory space. We merge graphs with the same class label into one compressed graph by discovering their common structures, and mine graph patterns in the compressed dataset.
• We propose an incremental graph pattern mining algorithm that can adapt to situations where the graph dataset and graph structures are highly dynamic and frequently updated over time. When an update G arrives, it may influence the detected patterns, and there may be new patterns in the updated dataset. In the incremental mining algorithm, we mine new patterns and update the detected patterns based on G instead of recomputing over the entire updated dataset.
• We develop a new distributed graph pattern mining algorithm to handle complex graphs that contain repetitive edges with the same edge label and vertex labels. MapReduce jobs are used to construct subgraphs and count their frequencies. We devise a graph encoding method to encode subgraphs so as to optimize subgraph counting and to prevent constructing duplicate subgraphs (a toy illustration of such an encoding follows this list).
• We extend ELM and its variants to our proposed large graph mining solutions and implement three graph classification frameworks. ELM is an efficient classifier and shows outstanding generalization performance on big graph datasets.
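As a toy illustration of the subgraph encoding idea referenced in the third contribution (the actual method is presented in Section 6), the following sketch assigns a canonical code to a small labeled subgraph by sorting its labeled edges; the function name and tuple representation are our own assumptions.

```python
# Toy subgraph encoding (our illustration only; the encoding of Section 6 may
# differ, and a label-only code like this one cannot distinguish subgraphs
# that repeat the same vertex and edge labels).
def encode(edges):
    # edges: iterable of (u_label, edge_label, v_label) tuples of an undirected subgraph.
    # Sorting endpoint labels, then the edge list, makes the code independent
    # of the order in which the subgraph was grown, so duplicates collapse to one key.
    return tuple(sorted(tuple(sorted((u, v))) + (w,) for (u, w, v) in edges))

# Two growth orders of the same triangle produce the same key:
g1 = [("A", "x", "B"), ("B", "y", "C"), ("C", "z", "A")]
g2 = [("C", "z", "A"), ("A", "x", "B"), ("B", "y", "C")]
assert encode(g1) == encode(g2)
```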
The remainder of this paper is organized as follows: Section 2 briefly introduces some basic concepts and formalizes the graph classification problem. In Section 3, we give a brief introduction to ELM. The compression-based graph classification framework is proposed in Section 4. Section 5 introduces the incremental graph classification framework. Distributed graph classification is described in Section 6. Extensive experiments are conducted on a series of real-life datasets and the performance is evaluated in Section 7. Section 8 discusses related work. Finally, we give our conclusions and future work in Sections 9 and 10.
2. Problem definition
2.1. Graph data and graph isomorphism
Definition 1 (Undirected graph). For an undirected graph $g = (V(g), E(g), L_v, L_e, L_c)$, $V(g)$ is the set of vertices and $E(g)$ is the set of edges, $E(g) \subseteq V(g) \times V(g)$; $L_v$ and $L_e$ are the labels of vertices and edges, respectively. Each edge can be represented by a 3-tuple, $e = (u, w, v)$, where $u$ and $v$ are vertices in $V(g)$ and $w$ is the label of $e$ in $L_e$. $L_c$ is the class label of $g$.
For the graph dataset $D = \{G_1, G_2, \ldots, G_n\}$ in Fig. 1, $G_1, G_2, G_3, G_4$ are labeled with $C_1$, while $G_5, G_6, G_7, G_8$ are labeled with $C_2$.
Definition 2 (Subgraph isomorphism). For two graphs $g$ and $g'$, $l$ is a label function mapping a vertex or an edge to a label. A subgraph isomorphism is an injective function $f: V(g) \rightarrow V(g')$, such that (1) $\forall v \in V(g)$, $l(v) = l'(f(v))$, and (2) $\forall (u, v) \in E(g)$, $(f(u), f(v)) \in E(g')$ and $l(u, v) = l'(f(u), f(v))$, where $l$ and $l'$ are the labeling functions of $g$ and $g'$, respectively. Then $g$ is a subgraph of $g'$, denoted as $g \subseteq g'$.
2.2. Graph pattern mining and graph classification
Definition 3 (Frequency). Given a dataset $D = \{G_1, G_2, \ldots, G_n\}$ and a graph pattern $g$, the supporting dataset of $g$ is $D_g = \{G_i \mid g \subseteq G_i, G_i \in D\}$. The frequency of $g$ is $|D_g| / |D|$, denoted as $freq(g)$.
Definition 4 (Support graph vector). Given a set of patterns $p_1, p_2, \ldots, p_l$, a graph $g$ can be represented as a 0–1 vector $x = [x_1, x_2, \ldots, x_l]$, where $x_i = 1$ if $p_i \subseteq g$; otherwise, $x_i = 0$. The vector $x$ is called the support graph vector of the graph $g$.
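Building on Definitions 3 and 4, both the frequency of a pattern and the support graph vector of a graph follow directly from a subgraph test. The self-contained sketch below repeats the monomorphism check from the sketch after Definition 2; the helper names are our own.

```python
from networkx.algorithms import isomorphism as iso

def is_subgraph(g, G):
    # Same monomorphism test as the sketch after Definition 2.
    m = iso.GraphMatcher(G, g,
                         node_match=iso.categorical_node_match("label", None),
                         edge_match=iso.categorical_edge_match("label", None))
    return m.subgraph_is_monomorphic()

def freq(g, D):
    # Definition 3: freq(g) = |D_g| / |D|.
    return sum(is_subgraph(g, G) for G in D) / len(D)

def support_vector(g, patterns):
    # Definition 4: x_i = 1 if p_i is a subgraph of g, else 0.
    return [1 if is_subgraph(p, g) else 0 for p in patterns]
```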
Definition 5 (Graph classification). Given a graph $g$ and a set of class labels $C_1, C_2, \ldots, C_m$, the goal of the graph classification function $GC(g)$ is to predict the most probable class label for $g$, modeled as $GC(g) = C_i$, where $i \in [1, m]$.
Definition 6 (Graph compression classification). For a graph dataset $D$ and a graph compression function $R$, $D_r = R(D)$ is a graph dataset computed from $D$ by $R$, referred to as the compressed dataset of $D$, such that $|D_r| \ll |D|$, satisfying $GC(D) = GC(D_r)$, where $GC$ is the graph classification function.
The graph compression function reduces the size of the graph dataset, while the structural information of the graphs is still preserved. Fig. 2(c) is a compressed graph constructed by the compression function from the graphs with class label $C_1$. We do not lose any information and obtain a smaller dataset, thus saving memory space.
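For intuition only, a toy compression function might overlay the labeled edges of all graphs in one class and record how many member graphs contain each edge; the actual compression function $R$ of Section 4 discovers common substructures and may differ substantially.

```python
from collections import Counter

# Toy compression (intuition only; not the compression function R of Section 4).
def compress(graphs_of_one_class):
    # Each graph is a list of (u_label, edge_label, v_label) tuples.
    merged = Counter()
    for edges in graphs_of_one_class:
        for e in set(edges):
            merged[e] += 1   # edges shared across graphs collapse into one entry
    # The "compressed graph": one entry per distinct labeled edge, with the
    # number of member graphs that contain it.
    return dict(merged)
```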
Definition 7 (Incremental graph classification). For a graph dataset $D$ and an update $G$, incremental graph classification satisfies $GC(D + G) = GC(D) + GC(G)$.
Incremental graph classification can be adopted to handle dynamic graphs whose structures are frequently updated over time. Dynamic information can be combined with previous results, and the classification model can adjust to the latest status.
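A minimal sketch of the incremental idea follows; it is our own simplification, not the incremental mining algorithm of Section 5. Per-pattern supporting counts are kept, and when an update $G$ arrives, only those counts are adjusted instead of re-mining the whole dataset.

```python
# Incremental maintenance of pattern frequencies (a simplification; names and
# structure are ours, not the incremental mining algorithm of Section 5).
class IncrementalMiner:
    def __init__(self, patterns, is_subgraph):
        self.counts = {p: 0 for p in patterns}  # |D_g| per detected pattern
        self.n = 0                               # |D|
        self.is_subgraph = is_subgraph           # e.g. the test from Definition 2

    def add(self, G):
        # Only the update G is examined; nothing is recomputed over D.
        self.n += 1
        for p in self.counts:
            if self.is_subgraph(p, G):
                self.counts[p] += 1

    def freq(self, p):
        return self.counts[p] / self.n
```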
Definition 8 (Distributed graph classification). For a graph dataset $D$, a distributed graph classification function $DC$ satisfies $GC(D) = DC(D)$.
Distributed storage and computing is another efficient way to handle big graph datasets. We mine graph patterns and train classification models through the MapReduce framework. Distributed solutions can satisfy the demands on computing time and memory.
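As a local sketch of the distributed counting step (our own simulation; Section 6 describes the real MapReduce jobs), each mapper emits a canonical code per constructed subgraph, and the reducer counts the distinct graphs that support each code. The `encode` parameter is assumed to be a canonical subgraph code such as the toy one sketched in Section 1.

```python
from collections import defaultdict

# Local simulation of the MapReduce counting step (illustrative only).
def map_phase(graph_id, subgraphs, encode):
    # Emitting a set of (code, graph_id) pairs collapses duplicate subgraphs
    # constructed within the same graph.
    return {(encode(s), graph_id) for s in subgraphs}

def reduce_phase(all_pairs):
    support = defaultdict(set)
    for code, graph_id in all_pairs:
        support[code].add(graph_id)   # supporting dataset D_g per code
    return {code: len(ids) for code, ids in support.items()}
```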