which cannot explicitly model the interactions between objects, SGC can generate sentences based on the graph-based formulation by leveraging high-level concepts and the attention clustering region.
Consider the image of a train in Fig. 1. To accurately describe the image content, any captioning model needs to identify the visual elements, such as the train, the tracks, and the workers, and then combine them into a coherent sentence. Whereas previous captioning models ignore the topological structure, which is an essential representation of an image, we first generate the scene graph (inside the blue dotted grid), and then the proposed SGC composes a caption by understanding the complex elements and the interactions between objects, e.g., near and on, as well as their localizations.
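As a rough illustration only (the node names, relationship triples, and bounding boxes below are hypothetical and not the paper's data format), such a scene graph can be pictured as a set of grounded nodes plus relationship triples:

```python
# Hypothetical sketch of a scene graph for the train image in Fig. 1.
scene_graph = {
    "nodes": ["train", "tracks", "workers"],
    "edges": [
        ("train", "on", "tracks"),      # <subject, relationship, object>
        ("workers", "near", "train"),
    ],
    # Each node is grounded in an image region, given here as (x, y, w, h).
    "regions": {
        "train":   (34, 60, 420, 210),
        "tracks":  (0, 200, 640, 120),
        "workers": (450, 150, 120, 160),
    },
}
```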
Our contributions are threefold: (1) We propose a novel framework to embed the scene graph into a compact representation that captures both the explicit semantic concepts and the graph topology. Specifically, the prediction probabilities of the concepts in the graph are aggregated to construct a multi-label semantic vector, while shallow network features are extracted from the extended adjacency matrix to construct a topology vector. (2) We develop a scene-graph-driven method for attention graph generation by exploiting the high internal homogeneity and external inhomogeneity among the nodes of the scene graph. The attention graph is then used to crop the visual region that assists in describing the image. (3) The encoded image, attention, concept, and topology information are fed into an LSTM that translates them into text; the framework is evaluated on the held-out MSCOCO dataset.
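To make contribution (1) more concrete, the following minimal sketch shows one plausible way a multi-label semantic vector and a topology vector could be assembled from a scene graph; the vocabulary size, the max aggregation, the relationship-id encoding of the extended adjacency matrix, and the toy linear layer standing in for the shallow network are all illustrative assumptions rather than the paper's exact design.

```python
import numpy as np

def semantic_vector(concept_probs, vocab_size):
    """Aggregate per-node concept prediction probabilities into a
    multi-label semantic vector over the concept vocabulary."""
    v = np.zeros(vocab_size, dtype=np.float32)
    for idx, probs in concept_probs.items():
        v[idx] = max(probs)  # keep the strongest evidence for each concept
    return v

def topology_vector(adjacency, relation_ids, embed_fn):
    """Encode relationship types into an extended adjacency matrix and
    embed it with a shallow network `embed_fn`."""
    extended = adjacency.astype(np.float32)
    for (i, j), rel in relation_ids.items():
        extended[i, j] = rel  # store a relationship id instead of a plain 1
    return embed_fn(extended.flatten())

# Toy usage with a 3-node graph and a random linear "shallow network".
adj = np.array([[0, 1, 0],
                [0, 0, 0],
                [1, 0, 0]])
rels = {(0, 1): 2, (2, 0): 5}            # e.g. 2 = "on", 5 = "near"
W = np.random.randn(9, 16)               # hypothetical projection weights
topo = topology_vector(adj, rels, lambda x: np.tanh(x @ W))
sem = semantic_vector({10: [0.8, 0.6], 42: [0.9]}, vocab_size=1000)
```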
2. Related work
Image Captioning. The task of image captioning is to generate a sentence describing the visual content of a given image. Early template-based approaches [24–26] generated sentences under syntactic and semantic constraints using objects detected in the images. More recently, researchers have adopted a CNN-RNN framework [8,27,5,1]: Mao et al. [8] proposed a multimodal RNN
(m-RNN) to estimate the probability distribution of the next word
given previous words and the CNN feature of an image at each time
step. Chen et al. [27] learned a bi-directional mapping between images and their descriptions, which allows the image description to be reconstructed from the visual feature. Others [28,20,29–31]
explicitly exploited latent intermediate knowledge. For example, Xu et al. [28] proposed a visual attention model to automatically fixate on salient objects when generating the corresponding words
in the output sequence. Wu et al. [20] incorporated high-level
semantic concepts into the CNN-RNN framework. Further, Hen-
dricks et al. [9] proposed the Deep Compositional Captioner
(DCC) to generate descriptions of novel objects which were not
present in the paired image-sentence dataset.
However, the aforementioned methods ignore the topological structure, which is an essential representation of an image. Our framework generates and exploits the topological properties of the scene graph of an image, which captures the complex elements and the interactions between objects as well as their localizations, to construct a logically coherent image description.
Structural Representations. Graph-structural representations
have attained widespread use in computer graphics to represent compositional scenes. Fisher et al. [32] used graph kernels
to compare 3D scenes, and Chang et al. [33] generated 3D scenes
from natural language descriptions using the scene graphs. Parse
graphs obtained in scene parsing [34,35] were typically the result
of applying a grammar designed for a particular domain (such as
indoor scenes [34]).
Recent work by Lin et al. [36] constructed semantic graphs from
text queries using hand-defined rules to transform parse trees and
used these semantic graphs to retrieve videos in the context of
autonomous driving. However, their system is constrained to the
six object classes from KITTI [37].
Other semi-structured representations of visual scenes explicitly encode certain types of properties, such as attributes [38–40], object co-occurrence [41], or spatial relationships between objects [42–45]. In contrast, we generate the scene graph as a full structural representation of the image. Each node is explicitly grounded in an image region, avoiding the inherent uncertainty of text-based representations. Meanwhile, our system uses a much larger, open-world vocabulary, which makes the generated scene graphs considerably richer.
Real-World Scene Graph Datasets. Datasets have been a driving force behind computer vision algorithms, and several scene graph datasets have been published recently. Sadeghi et al. [42] proposed a dataset for phrasal recognition, in which visual phrases are formed either by interactions between objects or by activities of a single object. However, it contains only 17 visual phrases built from 8 object classes selected from the Pascal VOC 2008 dataset [46].
The scene graph dataset [18] takes {object, attribute} and {object, relationship, object} tuples into consideration. It uses an open vocabulary to label the most meaningful information in each image instead of being constrained to a fixed set of predefined classes. In addition, Lu et al. [47] relabeled the dataset to increase the number of instances per relationship class, which gives it an advantage in visual relationship detection. However, natural language descriptions of the images are not provided in this dataset, which limits the application of such graphs to the image captioning task.
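For illustration, the two annotation types of [18] can be pictured as simple tuples; the entries below are invented examples in that spirit, not actual dataset content.

```python
# Hypothetical annotations in the style of the scene graph dataset [18].
attribute_tuples = [
    ("train", "red"),                  # {object, attribute}
    ("tracks", "rusty"),
]
relationship_tuples = [
    ("train", "on", "tracks"),         # {object, relationship, object}
    ("workers", "near", "train"),
]
```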
The Visual Genome dataset [48] represents each image as a scene graph where all objects, attributes, and relationships are concatenated and canonicalized to their corresponding WordNet [49] ID
Fig. 1. Existing deep captioning methods are unable to consider the topological structure within an image. However, our framework can infer the scene graph of the visual content and incorporate the interactions between objects to generate logical descriptions.