lation among labels to leverage such priors for better perfor-
mance? Secondly, given input X, the common practice for
predicting its labels can be formulated as a two-stage map-
ping y = F
1
◦ F
0
(X), where F
0
: X 7→ f denotes the CNN
feature extraction process and F
1
: f 7→ y is the mapping
from feature space to label space. Labels are only explicitly
involved in the last stage as supervision in the training phase.
Therefore, the further question is, for a specific multi-label
classification task, whether and how the mutual-related label
space can explicitly help the feature learning process F
0
?
To take into account the label correlations, some ap-
proaches have been proposed. For example, probabilistic
graph model was used in (Li et al. 2016; Li, Zhao, and
Guo 2014) and RNN was used in (Wang et al. 2016a) to
capture dependencies among labels. However, probabilis-
tic graph models may suffer from scalability issues given
their computational cost. RNN model relies on predefined
or learned label sequential order and fails to well capture
the global dependencies. Recently, graph convolutional net-
work (Kipf and Welling 2016), aka GCN, has witnessed
prevailing success in modeling relationship among vertices
of a graph. Such a tool was leveraged to model the rela-
tion of the label system for multi-label recognition in (Chen
et al. 2019). Meanwhile, the label graph was built simply
by utilizing the frequency of label co-occurrence. Another
direction is to implicitly model label correlations via local
image regions attention, as was done in (Wang et al. 2017;
Zhu et al. 2017a). In addition, all the aforementioned solu-
tions follow the conventional practice of two-stage mapping
and the whole structure of label system is ignored in learning
the feature space.
In this paper, we attempt to find possible answers for the
two questions. We propose a label graph superimposed deep
convolution network called KSSNet for this task. The super-
imposing means the following two folds in our framework:
(1) to model the priors of co-occurrence of labels follow-
ing the GCN paradigm, instead of using statistics of label
co-occurrence alone to build the relation graph of the label
system, we propose to superimpose knowledge based graph
into statistics based graph for constructing the final one. (2)
In order to learn better feature representations for a specific
multi-label recognition task anchored on its label structures,
we design a novel superimposed CNN and GCN network to
extract label structure aware descriptors. Specifically, we
first construct two adjacency matrices A
S
∈ R
N×N
and
A
K
∈ R
N×N
to denote correlation graphs of labels, which
is constructed by co-occurrence statistics and a knowledge
graph named ConceptNet (Speer, Chin, and Havasi 2017)
respectively. The initial embedding of all nodes (namely, la-
bels) is extracted from ConceptNet. The final adjacency ma-
trix is a superimposed version. Then we apply multi-layer
graph convolution on the final superimposed graph to model
the label correlation. Besides, different from conventional
graph augmented CNN solutions which utilize information
of label system at the final recognition stage, we add lat-
eral connections between CNN and GCN at shallow, middle
and deep layers to inject information of the label system into
backbone CNN for the purpose of labels awareness in fea-
ture learning. We have carried out extensive experiments
on MS-COCO dataset (Lin et al. 2014) for multi-label im-
age recognition and Charades (Sigurdsson et al. 2016) for
multi-label video classification. Results show that our solu-
tion obtains absolute mAP improvement of 6.4% and 12.0%
in MS-COCO and Charades with very limited computation
cost overhead, when compared to its plain CNN counter-
part. Our model achieves new state-of-the-art and outper-
forms current state-of-the-art solution by 1.3% and 2.4% in
mAP on MS-COCO and Charades, respectively.
Related Work
State-of-the-art image or video classification frameworks
(He et al. 2016a; Carreira and Zisserman 2017; Feichten-
hofer et al. 2018; He et al. 2019; Wu et al. 2019) can be
directly applied for multi-label classification by replacing
the cross-entropy loss with multi-binary classification loss.
The straightforward extension leaves label correlation unex-
plored thus degrading the recognition performance. We pro-
pose our solution to alleviate this problem and it is closely
related with the following jobs.
Many existing works on multi-label classification pro-
posed to capture label relationship for performance improve-
ment. The co-occurrence of labels can be well formulated
by probabilistic graph models, in the literature, there have
many methods based on such mathematical theory to model
the labels (Li et al. 2016; Li, Zhao, and Guo 2014). To
tackle the problem of computation cost burden of proba-
bilistic graph models, the neural network based solution is
becoming prevalence recently. In (Wang et al. 2016a), re-
current network was used to encode labels into embedding
vectors for label correlation modeling purpose. Context gat-
ing strategy was utilized in (Lin, Xiao, and Fan 2018) to inte-
grate the post processing of label re-ranking into the whole
network architecture. There are also works done by lever-
aging the attention mechanism in order for modeling label
relationship. In (Wang et al. 2017) and (Zhu et al. 2017a),
either image region-level spatial attention map or attentive
semantic-level label correlation modeling was used to boost
the final recognition performance. (Wang, Jia, and Breckon
2019) proposed to improve the performance by model en-
semble.
Graph has been proved to be more effective for label
structure modeling. Tree-structure label graph built with
maximum spanning tree algorithm in (Li, Zhao, and Guo
2014) and knowledge graph for describing label dependency
in (Lee et al. 2018) are two typical label graph solutions.
Recently, GCN was introduced in (Kipf and Welling 2016)
and it has been successfully utilized for non-grid structured
data modeling. Researchers have leveraged GCN for many
computer vision tasks and great performance was achieved.
For instance, it was leveraged in (Yan, Xiong, and Lin 2018;
Gao et al. 2018) to model the relationship of skeletons of hu-
mans bodies for human action recognition and knowledge-
aware GCN was applied for zero-shot video classification
in (Gao, Zhang, and Xu 2019). Our work mostly relates to
the one proposed in (Chen et al. 2019), which used GCN to
propagate information among labels and merges label infor-
mation with CNN features at the final classification stage.
Differently, our work builds GCN by superimposing the