Augmenting Image Descriptions Using Structured Prediction Output
Yahong Han, Xingxing Wei, Xiaochun Cao, Yi Yang, and Xiaofang Zhou, Senior Member, IEEE
Abstract—The need for richer descriptions of images arises in a wide spectrum of applications ranging from image understanding to image retrieval. While Automatic Image Annotation (AIA) has been extensively studied, image descriptions composed of its output labels lack sufficient information. This paper proposes to augment image descriptions using structured prediction output. We define a hierarchical tree-structured semantic unit to describe images, from which we can obtain not only the class and subclass an image belongs to, but also the attributes it has. After defining a new feature map function for the structured SVM, we decompose the loss function over every node of the hierarchical tree-structured semantic unit and then predict the tree-structured semantic unit for testing images. In the experiments, we evaluate the performance of the proposed method on two open benchmark datasets and compare it with state-of-the-art methods. Experimental results show the better prediction performance of the proposed method and demonstrate the strength of augmenting image descriptions.
Index Terms—Image descriptions, image annotation, structured learning, tree-structured semantic unit.
I. INTRODUCTION
INFORMATIVE descriptions of images are important for either image understanding [1], [2] or image retrieval [3], [4], [5]. Because images are visually polysemous [1], multiple concepts at multiple levels may be associated with each image. On one hand, people want to know not only the class (e.g., animal, vehicle, human, thing) and subclass (e.g., cat, horse, sofa, tree) an image belongs to, but also the attributes (e.g., is furry, has legs, can fly) it has. These informative descriptions can help people understand the image more accurately.
Manuscript received July 17, 2013; revised January 02, 2014; accepted April 24, 2014. Date of publication May 02, 2014; date of current version September 15, 2014. This work was supported in part by the National Program on Key Basic Research Project (973 Program) under Grant 2013CB329301, the NSFC under Grants 61202166 and 61332012, the Doctoral Fund of Ministry of Education of China under Grant 20120032120042, and the 100 Talents Programme of the Chinese Academy of Sciences. The work of Y. Yang was supported in part by the ARC DECRA project under Grant DE130101311. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Alan Hanjalic.
Y. Han is with the School of Computer Science and Technology and the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, China (e-mail: yahong@tju.edu.cn).
X. Wei is with the School of Computer Science and Technology, Tianjin University, Tianjin, China (e-mail: xwei@tju.edu.cn).
X. Cao is with the State Key Laboratory of Information Security, Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China (e-mail: caoxiaochun@iie.ac.cn).
Y. Yang and X. Zhou are with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia (e-mail: yee.i.yang@gmail.com; zxf@itee.uq.edu.au).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2014.2321530
On the other hand, in order to bridge the semantic gap between users’ intention and the low-level visual features, and thereby improve the quality of image search results, we should try to augment the semantic descriptions of images, e.g., to progress beyond “labels”.
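To make this multi-level description concrete, the following is a minimal sketch of a tree-structured semantic unit whose root is the class, whose child is the subclass, and whose leaves are attributes. The Python class and method names here are hypothetical illustrations, not code from the paper.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SemanticUnit:
    """One node of a tree-structured semantic unit (class, subclass, or attribute)."""
    label: str
    children: List["SemanticUnit"] = field(default_factory=list)

    def describe(self, depth: int = 0) -> str:
        """Flatten the tree into an indented, human-readable description."""
        lines = ["  " * depth + self.label]
        for child in self.children:
            lines.append(child.describe(depth + 1))
        return "\n".join(lines)

# Class -> subclass -> attributes, as in the "animal / cat / is furry" example above.
unit = SemanticUnit("animal", [
    SemanticUnit("cat", [
        SemanticUnit("is furry"),
        SemanticUnit("has legs"),
    ]),
])
print(unit.describe())
```

Such a unit carries strictly more information than a flat label set, since the parent-child edges record which attributes belong to which subclass.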
In previous works, methods for generating image descriptions have evolved from image classification (i.e., single-label) methods [6], [7] to multi-label image annotation [8], [9] and region tagging [10] aided by visual saliency estimation techniques [11]. Because more semantic labels are associated with one image, or with each object in one image, the descriptions output by multi-label image annotation and region tagging are more informative than those of image classification. To be even more informative, some works try to generate sentences [12] or structured semantic triples [13] from images. As these two methods need additional natural language processing techniques [12] or manually labeled semantic triples for training images [13], they obtain better results for images in some specific applications. Very recently, as a new middle-level semantic cue, image attributes have been well explored to describe objects both within and across semantic categories [14], [15], and can therefore be utilized to generate more informative descriptions for images [16].
In describing images, concepts [13] or events [17] organized in a structure are more informative. For example, the structured semantic triples [13] and the categorized structured events [17] are more informative for improving image understanding. For semantic descriptions, ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences (see http://en.wikipedia.org/wiki/Ontology). The psycholinguistic theory behind WordNet (http://wordnet.princeton.edu/) attributes a hierarchical semantics to human memory. As a visual counterpart of the online word ontology of WordNet, ImageNet (http://www.image-net.org) organizes images according to the semantic hierarchy of WordNet. Specifically, in ImageNet, all images are covered by 12 top-level categories, which constitute 12 subtrees. Thus, each image is indexed according to its path in the hierarchy, and all “synsets” on this path can be used to describe the correlations among images within a hierarchy. As image data are usually visually polysemous, they are usually associated with multiple levels of semantic categories. In this paper, motivated by ontologies, WordNet, and ImageNet, we propose to define a hierarchical semantic unit to describe each image, which provides better semantic descriptions of images. As illustrated in Fig. 1 and Fig. 2, we show two specific examples of images with descriptions of hierarchical tree-structured semantic units.
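As a rough illustration of this path-based indexing (a toy sketch; the hierarchy and function below are hypothetical and not ImageNet’s actual interface), an image indexed at a leaf synset inherits every synset on its root path as part of its description:

```python
# Toy "is-a" hierarchy in the spirit of WordNet/ImageNet; child -> parent edges.
TOY_HIERARCHY = {
    "tabby": "cat",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "animal",
}

def path_to_root(synset: str) -> list:
    """Walk parent links upward; all synsets on the path can describe the image."""
    path = [synset]
    while path[-1] in TOY_HIERARCHY:
        path.append(TOY_HIERARCHY[path[-1]])
    return path

print(path_to_root("tabby"))
# ['tabby', 'cat', 'feline', 'carnivore', 'animal']
```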