Image Captioning with Scene-graph Based
Semantic Concepts
Lizhao Gao, Bo Wang* and Wenmin Wang
School of ECE, Peking University Shenzhen Graduate School
Shenzhen, China
gaolz@pku.edu.cn, wangbo@pkusz.edu.cn, wangwm@ece.pku.edu.cn
ABSTRACT
Different from existing approaches to image captioning, in this
paper we explore the co-occurrence dependency of high-level
semantic concepts and propose a novel method with a scene-graph
based semantic representation for image captioning. To embed the
scene graph as an intermediate state, we divide the task of image
captioning into two phases, called concept cognition and sentence
construction respectively. We build a vocabulary of semantic
concepts and propose a CNN-RNN-SVM framework to generate
the scene-graph based sequence, which is then transformed into a
bit vector that serves as the input to the RNN in the next phase. We
evaluate our method on the MS COCO dataset. Experimental results
show that our approach obtains competitive or superior results
compared to the state of the art.
CCS Concepts
• Computing methodologies → Artificial intelligence → Computer
vision → Computer vision problems → Object recognition
• Computing methodologies → Artificial intelligence → Computer
vision → Computer vision tasks → Visual content-based indexing
and retrieval
Keywords
Image captioning, LSTM, CNN, Semantic representation, Scene
graph
1. INTRODUCTION
For human beings it is not a difficult task to generate a natural
language description of an image, but it is a challenging problem
in computer vision because it requires translation between two
different forms of information. Not only must image captioning
models be able to solve the vision challenge of determining which
objects are in an image, but they must also be powerful enough to
capture and express their relationships in natural language.
Recent work has significantly improved the quality of caption
generation by using a convolutional neural network (CNN) as an
image encoder and a recurrent neural network (RNN) as a text
decoder. [12] uses long short-term memory (LSTM) units and only
feeds image features into the RNN at the beginning. [7] proposes to
learn to score sentence and image similarity as a function of R-CNN
object detections combined with the outputs of a bidirectional RNN.
[15] adds an attention mechanism to the image captioning task and
performs well. In [14], a high-level semantic representation is
extracted from a shared CNN, which differs from the raw features
used by other methods. However, this method cannot take the
association between semantic concepts into account.
When we were babies, we learned new semantic concepts by
observing the visual world and listening to the natural language
descriptions of our parents, which became the source of our own
descriptions of the visual world [2]. Inspired by this process of
infant learning, we propose a new method for image captioning. We
divide the task into two phases, called concept cognition and
sentence construction respectively. In the first phase, we use a
novel CNN-RNN-SVM framework to generate a high-level semantic
representation, which serves as the input of an RNN that generates
the caption in the second phase. Furthermore, an attention
mechanism is present in the human visual system [15]. We cannot
focus on the whole field of view simultaneously when observing
something. Our attention shifts from one place to another
dynamically and sequentially. Usually, large, brightly colored
objects are noticed first, followed by smaller ones in turn. So, in
the captioning task, we build a scene-graph based representation to
mimic the attention mechanism of human beings. Extensive
experiments demonstrate that our model yields competitive results
compared to the state of the art in image captioning. It
achieves 26.2% BLEU-4, 22.4% METEOR, and 76% CIDEr on the
MS COCO dataset. Compared to other methods, our contributions
are as follows:
(1) We utilize a graph-based semantic representation, the scene
graph, as a bridge between images and their corresponding natural
language descriptions.
(2) We model the co-occurrence dependency of semantic concepts
with the help of recurrent neurons.
(3) Compared with other methods for generating attribute features,
our method is simpler and does not require complicated image
pre-processing.
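To make the two-phase pipeline concrete, the following is a minimal Python (PyTorch-style) sketch of how it might be implemented. The module names, the use of a linear layer as a stand-in for per-concept SVM classifiers, the fixed number of concept steps, and the way the bit vector V initializes the decoder state are our own illustrative assumptions, not details specified in this paper.

# Hypothetical sketch of the two-phase pipeline; names and design choices are illustrative.
import torch
import torch.nn as nn

class ConceptCognition(nn.Module):
    # Phase 1: CNN feature -> RNN over concept steps -> SVM-style scores -> bit vector V.
    def __init__(self, feat_dim, hidden_dim, num_concepts, max_steps=10):
        super().__init__()
        self.rnn = nn.LSTMCell(feat_dim, hidden_dim)
        self.scorer = nn.Linear(hidden_dim, num_concepts)  # stand-in for per-concept SVMs
        self.num_concepts = num_concepts
        self.max_steps = max_steps

    def forward(self, cnn_feat):
        # cnn_feat: (batch, feat_dim) global feature from a pre-trained CNN encoder
        h = cnn_feat.new_zeros(cnn_feat.size(0), self.rnn.hidden_size)
        c = torch.zeros_like(h)
        V = cnn_feat.new_zeros(cnn_feat.size(0), self.num_concepts)  # concept bit vector
        for _ in range(self.max_steps):          # unroll the scene-graph based concept sequence
            h, c = self.rnn(cnn_feat, (h, c))
            scores = self.scorer(h)              # one margin-like score per concept
            V = torch.maximum(V, (scores > 0).float())  # set bits for predicted concepts
        return V

class SentenceConstruction(nn.Module):
    # Phase 2: LSTM language model conditioned on the concept bit vector V.
    def __init__(self, num_concepts, word_vocab, embed_dim, hidden_dim):
        super().__init__()
        self.init_proj = nn.Linear(num_concepts, hidden_dim)
        self.embed = nn.Embedding(word_vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, word_vocab)

    def forward(self, V, captions):
        # captions: (batch, T) ground-truth word indices used for teacher forcing
        h0 = torch.tanh(self.init_proj(V)).unsqueeze(0)  # V initializes the decoder state
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(out)                     # per-step word logits

In such a sketch the two modules would be trained separately, one per phase: the concept-cognition network against multi-label concept annotations, and the sentence-construction LSTM with a standard cross-entropy loss over caption words.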
The paper is organized as follows. Section 2 describes the details of
our model. The following section presents the results and analysis of
our experiments on the MS COCO dataset. Finally, Section 4 concludes
this paper.
2. OUR MODEL
Our approach is summarized in Figure 1. We utilize a graph-based
semantic representation, the scene graph, as a bridge between images
and their corresponding captions. To embed the scene graph as an
intermediate state, we divide the task of image captioning into two
phases, called concept cognition and sentence construction. In the
first phase, we build a vocabulary of semantic concepts and propose
a CNN-RNN-SVM framework to generate the scene-graph based sequence,
which is then transformed into a bit vector V. In the second phase,