which cannot explicitly model the interactions between objects, SGC can generate sentences based on the graph-based formulation by leveraging high-level concepts and the attention clustering region.
Consider the image of a train in Fig. 1. To accurately describe the image content, any captioning model needs to identify the visual elements, such as the train, the tracks, and the workers, and then combine them into a coherent sentence. Whereas previous captioning models ignore the topological structure, which is an essential representation of an image, we first generate the scene graph (inside the blue dotted grid), and then the proposed SGC composes a caption by understanding the complex elements and the interactions between objects, e.g., near and on, as well as their localizations.
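As a rough illustration only (the node names, relationship triples, and bounding boxes below are hypothetical and not the paper's data format), such a scene graph can be pictured as a set of grounded nodes plus relationship triples:

```python
# Hypothetical sketch of a scene graph for the train image in Fig. 1.
scene_graph = {
    "nodes": ["train", "tracks", "workers"],
    "edges": [
        ("train", "on", "tracks"),      # <subject, relationship, object>
        ("workers", "near", "train"),
    ],
    # Each node is grounded in an image region, given here as (x, y, w, h).
    "regions": {
        "train":   (34, 60, 420, 210),
        "tracks":  (0, 200, 640, 120),
        "workers": (450, 150, 120, 160),
    },
}
```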
Our contributions are threefold: (1) We propose a novel framework to embed the scene graph into a compact representation that captures both the explicit semantic concepts and the graph topology. Specifically, the prediction probabilities of the concepts in the graph are aggregated to construct a multi-label semantic vector, while shallow network features are extracted from the extended adjacency matrix to construct a topology vector. (2) We develop a scene-graph-driven method for attention graph generation by exploiting the high internal homogeneity and external inhomogeneity among the nodes of the scene graph. The attention graph is then used to crop the visual region that assists in describing the image. (3) The encoded image, attention, concept, and topology information are fed into an LSTM that translates them into text; the framework is evaluated on the held-out MSCOCO dataset.
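To make contribution (1) more concrete, the following minimal sketch shows one plausible way a multi-label semantic vector and a topology vector could be assembled from a scene graph; the vocabulary size, the max aggregation, the relationship-id encoding of the extended adjacency matrix, and the toy linear layer standing in for the shallow network are all illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

def semantic_vector(concept_probs, vocab_size):
    """Aggregate per-node concept prediction probabilities into a
    multi-label semantic vector over the concept vocabulary."""
    v = np.zeros(vocab_size, dtype=np.float32)
    for idx, probs in concept_probs.items():
        v[idx] = max(probs)  # keep the strongest evidence for each concept
    return v

def topology_vector(adjacency, relation_ids, embed_fn):
    """Encode relationship types into an extended adjacency matrix and
    embed it with a shallow network `embed_fn`."""
    extended = adjacency.astype(np.float32)
    for (i, j), rel in relation_ids.items():
        extended[i, j] = rel  # store a relationship id instead of a plain 1
    return embed_fn(extended.flatten())

# Toy usage with a 3-node graph and a random linear "shallow network".
adj = np.array([[0, 1, 0],
                [0, 0, 0],
                [1, 0, 0]])
rels = {(0, 1): 2, (2, 0): 5}            # e.g. 2 = "on", 5 = "near"
W = np.random.randn(9, 16)               # hypothetical projection weights
topo = topology_vector(adj, rels, lambda x: np.tanh(x @ W))
sem = semantic_vector({10: [0.8, 0.6], 42: [0.9]}, vocab_size=1000)
```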
2. Related work
Image Captioning. The task of image captioning is to generate a sentence describing the visual content of a given image. Early template-based approaches [24–26] generated sentences under syntactic and semantic constraints using objects detected in the images. More recently, researchers have adopted a CNN-RNN framework [8,27,5,1]: Mao et al. [8] proposed a multimodal RNN
(m-RNN) to estimate the probability distribution of the next word
given previous words and the CNN feature of an image at each time
step. Chen et al. [27] learned a bi-directional mapping between images and their descriptions, which allows the image description to be reconstructed from the visual feature. Others [28,20,29–31]
explicitly exploited latent intermediate knowledge. For example, Xu et al. [28] proposed a visual attention model to automatically fixate on salient objects when generating the corresponding words
in the output sequence. Wu et al. [20] incorporated high-level
semantic concepts into the CNN-RNN framework. Further, Hen-
dricks et al. [9] proposed the Deep Compositional Captioner
(DCC) to generate descriptions of novel objects which were not
present in the paired image-sentence dataset.
However, the aforementioned methods ignore the topological structure, which is an essential representation of an image. Our framework generates and exploits the topological properties of the scene graph of an image, which captures the complex elements and the interactions between objects as well as their localizations, to construct a logically coherent image description.
Structural Representations. Graph-structural representations
have attained widespread use in computer graphics to represent compositional scenes. Fisher et al. [32] used graph kernels
to compare 3D scenes, and Chang et al. [33] generated 3D scenes
from natural language descriptions using the scene graphs. Parse
graphs obtained in scene parsing [34,35] were typically the result
of applying a grammar designed for a particular domain (such as
indoor scenes [34]).
Recent work by Lin et al. [36] constructed semantic graphs from
text queries using hand-defined rules to transform parse trees and
used these semantic graphs to retrieve videos in the context of
autonomous driving. However, their system is constrained to the
six object classes from KITTI [37].
Other semi-structured representations of visual scenes explicitly encode certain types of properties, such as attributes [38–40], object co-occurrence [41], or spatial relationships between objects [42–45]. In contrast, we generate the scene graph as a full structural representation of the image. Each node is explicitly grounded in an image region, avoiding the inherent uncertainty of text-based representations. Meanwhile, our system uses a much larger, open-world vocabulary, which makes the generated scene graphs considerably richer.
Real-World Scene Graph Datasets. Datasets have been a driving force behind computer vision algorithms, and several scene graph datasets have been published recently. Sadeghi et al. [42] proposed a dataset for phrasal recognition, in which visual phrases are formed either by interactions between objects or by activities of a single object. However, it contains only 17 visual phrases built from 8 object classes selected from the Pascal VOC 2008 dataset [46].
The scene graph dataset [18] takes {object, attribute} and {object, relationship, object} tuples into consideration. It uses an open vocabulary to label the most meaningful information in each image instead of being constrained to a fixed set of predefined classes. In addition, Lu et al. [47] relabeled the dataset to increase the number of instances per relationship class, which gives it an advantage in visual relationship detection. However, natural language descriptions of the images are not provided in this dataset, which limits the application of such graphs to the image captioning task.
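For illustration, the two annotation types of [18] can be pictured as simple tuples; the entries below are invented examples in that spirit, not actual dataset content.

```python
# Hypothetical annotations in the style of the scene graph dataset [18].
attribute_tuples = [
    ("train", "red"),                  # {object, attribute}
    ("tracks", "rusty"),
]
relationship_tuples = [
    ("train", "on", "tracks"),         # {object, relationship, object}
    ("workers", "near", "train"),
]
```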
The Visual Genome dataset [48] represents each image as a scene graph where all objects, attributes, and relationships are concatenated and canonicalized to their corresponding WordNet [49] ID
Fig. 1. Existing deep captioning methods are unable to consider the topological structure within an image. However, our framework can infer the scene graph of the visual content and incorporate the interactions between objects to generate logical descriptions.