Augmenting Image Descriptions Using Structured Prediction Output
Yahong Han, Xingxing Wei, Xiaochun Cao, Yi Yang, and Xiaofang Zhou, Senior Member, IEEE
Abstract—The need for richer descriptions of images arises in a wide spectrum of applications ranging from image understanding to image retrieval. While Automatic Image Annotation (AIA) has been extensively studied, image descriptions composed of its output labels lack sufficient information. This paper proposes to augment image descriptions using structured prediction output. We define a hierarchical tree-structured semantic unit to describe images, from which we can obtain not only the class and subclass an image belongs to, but also the attributes it has. After defining a new feature map function for the structured SVM, we decompose the loss function over every node of the hierarchical tree-structured semantic unit and then predict the tree-structured semantic unit for testing images. In the experiments, we evaluate the performance of the proposed method on two open benchmark datasets and compare it with state-of-the-art methods. Experimental results show the better prediction performance of the proposed method and demonstrate the strength of augmenting image descriptions.
Index Terms—Image descriptions, image annotation, structured learning, tree-structured semantic unit.
I. INTRODUCTION
INFORMATIVE descriptions of images are important for either image understanding [1], [2] or image retrieval [3], [4], [5]. Because images are visually polysemous [1], multiple concepts at multiple levels may be associated with each image. On one hand, people want to know not only the class (e.g., animal, vehicle, human, thing) and subclass (e.g., cat, horse, sofa, tree) an image belongs to, but also the attributes (e.g., is furry, has legs, can fly) it has. These informative descriptions can help people understand the image more accurately.
Manuscript received July 17, 2013; revised January 02, 2014; accepted April 24, 2014. Date of publication May 02, 2014; date of current version September 15, 2014. This work was supported in part by the National Program on Key Basic Research Project (973 Program) under Grant 2013CB329301, the NSFC under Grants 61202166 and 61332012, the Doctoral Fund of Ministry of Education of China under Grant 20120032120042, and the 100 Talents Programme of the Chinese Academy of Sciences. The work of Y. Yang was supported in part by the ARC DECRA project under Grant DE130101311. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Alan Hanjalic.
Y. Han is with the School of Computer Science and Technology and the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, China (e-mail: yahong@tju.edu.cn).
X. Wei is with the School of Computer Science and Technology, Tianjin University, Tianjin, China (e-mail: xwei@tju.edu.cn).
X. Cao is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: caoxiaochun@iie.ac.cn).
Y. Yang and X. Zhou are with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia (e-mail: yee.i.yang@gmail.com; zxf@itee.uq.edu.au).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2014.2321530
On the other hand, in order to bridge the semantic gap between users’ intention and the low-level visual features, and thereby improve the quality of image search results, we should try to augment the semantic descriptions of images, e.g., to progress beyond “labels”.
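To make this multi-level description concrete, the following is a minimal sketch of a tree-structured semantic unit whose root is the class, whose child is the subclass, and whose leaves are attributes. The Python class and method names here are hypothetical illustrations, not code from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticUnit:
    """One node of a tree-structured semantic unit (class, subclass, or attribute)."""
    label: str
    children: List["SemanticUnit"] = field(default_factory=list)

    def describe(self, depth: int = 0) -> str:
        """Flatten the tree into an indented, human-readable description."""
        lines = ["  " * depth + self.label]
        for child in self.children:
            lines.append(child.describe(depth + 1))
        return "\n".join(lines)

# Class -> subclass -> attributes, as in the "animal / cat / is furry" example above.
unit = SemanticUnit("animal", [
    SemanticUnit("cat", [
        SemanticUnit("is furry"),
        SemanticUnit("has legs"),
    ]),
])
print(unit.describe())
```

Such a unit carries strictly more information than a flat label set, since the parent-child edges record which attributes belong to which subclass.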
In previous works, methods for generating image descriptions have evolved from image classification (i.e., single-label) methods [6], [7] to multi-label image annotation [8], [9] and region tagging [10] aided by visual saliency estimation techniques [11]. Because more semantic labels are associated with one image, or with each object in one image, the descriptions output by multi-label image annotation and region tagging are more informative than those of image classification. To be even more informative, some works try to generate sentences [12] or structured semantic triples [13] from images. As these two methods need additional natural language processing techniques [12] or manually labeled semantic triples for training images [13], they obtain better results for images in some specific applications. Very recently, as a new middle-level semantic cue, image attributes have been well explored to describe objects both within and across semantic categories [14], [15], and can therefore be utilized to generate more informative descriptions for images [16].
In describing images, concepts [13] or events [17] organized in a structure are more informative. For example, the structured semantic triples [13] and the categorized structured events [17] are more informative for improving image understanding. For semantic descriptions, ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences (see http://en.wikipedia.org/wiki/Ontology). The psycholinguistic theory behind WordNet (http://wordnet.princeton.edu/) attributes a hierarchical semantics to human memory. As a visual counterpart of the online word ontology of WordNet, ImageNet (http://www.image-net.org) organizes images according to the semantic hierarchy of WordNet. Specifically, in ImageNet, all images are covered by 12 top-level categories, which constitute 12 subtrees. Thus, each image is indexed according to its path in the hierarchy, and all “synsets” on this path can be used to describe the correlations among images within a hierarchy. As image data are usually visually polysemous, they are usually associated with multiple levels of semantic categories. In this paper, motivated by ontologies, WordNet, and ImageNet, we propose to define a hierarchical semantic unit to describe each image, which provides better semantic descriptions of images. As illustrated in Fig. 1 and Fig. 2, we show two specific examples of images with descriptions of hierarchical tree-structured semantic units.
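As a rough illustration of this path-based indexing (a toy sketch; the hierarchy and function below are hypothetical and not ImageNet’s actual interface), an image indexed at a leaf synset inherits every synset on its root path as part of its description:

```python
# Toy "is-a" hierarchy in the spirit of WordNet/ImageNet; child -> parent edges.
TOY_HIERARCHY = {
    "tabby": "cat",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}

def path_to_root(synset: str) -> list:
    """Walk parent links upward; all synsets on the path can describe the image."""
    path = [synset]
    while path[-1] in TOY_HIERARCHY:
        path.append(TOY_HIERARCHY[path[-1]])
    return path

print(path_to_root("tabby"))
# ['tabby', 'cat', 'feline', 'carnivore', 'animal']
```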