appropriate way of modeling non-consecutive and long-distance
semantics is expected for text topical classification.
In this paper, we propose a Hierarchically Regularized Deep
Graph-CNN (HR-DGCNN) framework to tackle the above problems
with the following considerations.
Input.
Instead of viewing long documents as sequences, we first
convert them to graphs. A natural way to construct the graph is
based on word co-occurrence, i.e., if two words co-occur within a small
window of text, we build an edge between them. Given a constructed
graph, any sub-graph can be regarded as a long-distance n-gram [34].
For each node of the graph, we use a pre-trained word2vec [32] vector
as input features. In this way, our input can
be regarded as a graph of vectors. Although word2vec optimization
has been proven to be identical to co-occurrence matrix factorization
under mild conditions [26], it is still preferable to explicitly represent
documents as graphs, since for upper-level convolution, longer-distance
word co-occurrences (which correspond to convolution over sub-graphs)
can be explicitly computed and evaluated.
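To make the construction concrete, the following is a minimal sketch (assuming a tokenized document, an illustrative window size of 3, and a pre-trained word2vec lookup `w2v` mapping words to vectors; these names are hypothetical, not part of our released system):

```python
import networkx as nx
import numpy as np

def build_word_graph(tokens, w2v, window=3):
    """Build a word co-occurrence graph whose nodes carry word2vec features.

    tokens: list of word strings from one document.
    w2v:    mapping from word to its pre-trained embedding vector.
    window: co-occurrence window size (illustrative default).
    """
    g = nx.Graph()
    for i, w in enumerate(tokens):
        if w not in w2v:                      # skip out-of-vocabulary words
            continue
        g.add_node(w, x=np.asarray(w2v[w]))   # node feature = word embedding
        # connect the word to every in-vocabulary word inside the window
        for u in tokens[i + 1:i + window]:
            if u in w2v and u != w:
                g.add_edge(w, u)
    return g
```

Every connected sub-graph of the resulting graph then plays the role of a long-distance n-gram, and the node attribute `x` supplies the input channels for the convolution layers described next.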
Convolution Layers.
For the lower intermediate layers, we follow the graph normalization
approach [33] to make the subsequent convolution operators possible. This
graph normalization can be regarded as a local operator that converts a
graph into a sorted sequence of nodes, where the order is based on the
importance of each node in the graph. Other graph convolution approaches
are discussed in the related work (Section 2). For the upper intermediate
layers, we generally follow the
well-defined AlexNet [22] and VGG [37] networks for ImageNet
classification. Different from image data, which have at most three
channels (i.e., RGB values), word embeddings have many more channels;
a typical word embedding can have 50 dimensions. The input tensor for
convolution is therefore slightly different from that of images, and we
accordingly modify the configuration of all the following convolution
layers to make the feature representation more effective.
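A simplified stand-in for this normalization step is sketched below, using degree as the node-importance measure and breadth-first search to gather a fixed-size receptive field per node; the sizes (64 nodes, field size 5, 50 channels) are illustrative assumptions rather than the exact configuration used in our experiments:

```python
import numpy as np
import networkx as nx

def normalize_graph(g, num_nodes=64, field_size=5, dim=50):
    """Convert a word graph into a (num_nodes, field_size, dim) input tensor.

    Nodes are ranked by degree (a simple proxy for node importance); for each
    selected node a fixed-size neighborhood is collected by breadth-first
    search and the remainder is zero-padded.
    """
    ranked = sorted(g.nodes, key=lambda n: g.degree(n), reverse=True)
    tensor = np.zeros((num_nodes, field_size, dim), dtype=np.float32)
    for i, node in enumerate(ranked[:num_nodes]):
        # BFS gathers the closest neighbors of the selected node first
        field = [node] + [v for _, v in nx.bfs_edges(g, node)][:field_size - 1]
        for j, v in enumerate(field):
            tensor[i, j] = g.nodes[v]["x"][:dim]
    return tensor  # fed into the AlexNet/VGG-style convolution layers
```

Because each of the `dim` embedding dimensions becomes an input channel, the first convolution layer sees far more channels than the three of an RGB image, which is why the layer configuration has to be adapted as described above.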
Output.
For large-scale hierarchical text classification, there have been many
existing studies on designing better output cost functions [12, 46]. Here,
we use the cross-entropy objective function to determine labels and adopt
the simple but effective recursive regularization framework proposed
in [13]. The idea is that if two labels are parent and child in the
hierarchy, we assume that their classifiers, which separate them from the
other labels, should be similar. From the global view of the hierarchy,
this means the child label classifiers should inherit from the parent
classifier. To handle large-scale label sets, we also use a tree-cut
algorithm to automatically divide the hierarchy into parts, and conquer
the regularized models for the different parts.
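As a sketch of how the output layer combines these two ingredients (written in PyTorch for concreteness; the binary cross-entropy formulation, the parameter names, and the strength `lam` are illustrative assumptions, not the exact objective of the paper):

```python
import torch
import torch.nn.functional as F

def hierarchically_regularized_loss(logits, targets, W, edges, lam=1e-3):
    """Cross-entropy plus recursive regularization over the label hierarchy.

    logits:  (batch, num_labels) scores from the final layer.
    targets: (batch, num_labels) multi-label 0/1 indicators (float).
    W:       (num_labels, hidden) weight rows of the final layer.
    edges:   list of (parent, child) label-index pairs in the hierarchy.
    lam:     regularization strength (illustrative value).
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets)
    # pull each child classifier toward its parent classifier
    reg = sum((W[p] - W[c]).pow(2).sum() for p, c in edges)
    return ce + 0.5 * lam * reg
```

Penalizing the squared difference between parent and child weight vectors over every edge of the hierarchy is what makes child classifiers inherit from their parents; the tree-cut step simply partitions the edge set so that each part can be trained separately.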
In the experiments, we compare our proposed approach with
state-of-the-art methods, including traditional algorithms and deep
learning approaches. We use two benchmark datasets to demonstrate
both effectiveness and efficiency. The RCV1 dataset [27] contains 23,149
training news articles and 784,446 testing news articles in 103 classes.
The NYTimes dataset (https://catalog.ldc.upenn.edu/ldc2008t19) contains
1,855,658 news articles in 2,318 categories. The results show that our
approach is very promising for large-scale hierarchical text topical
classification problems.
The contributions of this paper can be highlighted as follows.
•
First, we introduce a Deep Graph-CNN approach to text
classification. It has been shown that bag-of-graph
representation [34] and CNN representation [8] are effective for
text topic classification. However, this is the first attempt to
show that a Graph-CNN is even more effective.
•
Second, for large-scale hierarchical text classification, we
demonstrate that recursive regularization can also be applied to
deep learning. This can serve as a general framework for deep
learning on classification problems where data are classified into
a hierarchy of labels.
•
Third, we use two benchmark datasets to demonstrate the
efficiency and effectiveness of our algorithm. They feature either
a large test set, a large label set, or a large training set.
The rest of the paper is organized as follows. We first review the
related work in Section 2. We then introduce the detailed input and
architecture of our algorithm in Sections 3 and 4, and present the
experiments in Section 5. Finally, we conclude this paper in
Section 6. Our system is publicly available at https://github.com/
HKUST-KnowComp/DeepGraphCNNforTexts.
2 RELATED WORK
In this section, we briefly review the related work in the following two
categories.
2.1 Traditional Text Classification
Traditional text classification uses feature engineering (e.g., extracting
features beyond BOW) and feature selection to obtain good features for
text classification [1]. Dimensionality reduction can also be used to
reduce the feature space. For example, Latent Dirichlet Allocation [4]
has been used to extract “topics” from a corpus and then represent
documents in the topic space. It can be better than BOW when the number
of features is small. However, as the vocabulary size increases, it shows
no advantage over BOW on text classification tasks [4]. There is also
existing work
on converting texts to graphs [34, 42]. Similar to us, they used
co-occurrence to construct graphs from texts, and then either applied
similarity measures on the graphs to define new document similarities [42]
or applied graph mining algorithms to find frequent sub-graphs in the
corpus to define new features for text [34]. Both showed positive results
on classification problems with small label spaces, but the cost of graph
mining is higher than that of our approach, which simply performs
breadth-first search.
For hierarchical classification with a large label space, many efforts
have been devoted to leveraging the label hierarchy to reduce time
complexity or improve classification results. For example, top-down
classification has been shown to be more efficient than bottom-up
classification (or flat classification, which treats each leaf label as a
category) when there are many labels [30, 44]. Moreover, the parent-child
dependency of labels in the hierarchy can also be used to improve the
model. For example, hierarchical cost-sensitive losses have been
developed [5]. The idea of transfer learning can also be borrowed to
improve each of the classifiers in the hierarchy [43]. Recently, a simpler
recursive regularization of the weight vectors of linear classifiers in the
hierarchical model has been developed and shown to be the state of the art
for large-scale hierarchical text classification problems [13, 14].