Image Captioning with Scene-graph Based
Semantic Concepts
Lizhao Gao, Bo Wang* and Wenmin Wang
School of ECE, Peking University Shenzhen Graduate School
Shenzhen, China
gaolz@pku.edu.cn, wangbo@pkusz.edu.cn, wangwm@ece.pku.edu.cn
ABSTRACT
Different from existing approaches to image captioning, in this
paper we explore the co-occurrence dependency of high-level
semantic concepts and propose a novel method with a scene-graph
based semantic representation for image captioning. To embed the
scene graph as an intermediate state, we divide the task of image
captioning into two phases, called concept cognition and sentence
construction respectively. We build a vocabulary of semantic
concepts and propose a CNN-RNN-SVM framework to generate
the scene-graph based sequence, which is then transformed into a
bit vector that serves as the input to the RNN in the next phase. We
evaluate our method on the MS COCO dataset. Experimental results
show that our approach obtains competitive or superior results
compared to the state of the art.
CCS Concepts
• Computing methodologies → Artificial intelligence → Computer
vision → Computer vision problems → Object recognition
• Computing methodologies → Artificial intelligence → Computer
vision → Computer vision tasks → Visual content-based indexing
and retrieval
Keywords
Image captioning, LSTM, CNN, Semantic representation, Scene
graph
1. INTRODUCTION
For human beings it is not a difficult task to generate a natural
language description of an image, but it is a challenging problem
in computer vision because it requires translation between two
different forms of information. Not only must image captioning
models be able to solve the vision challenge of determining which
objects are in an image, but they must also be powerful enough to
capture and express their relationships in natural language.
Recent work has significantly improved the quality of caption
generation by using a convolutional neural network (CNN) as an
image encoder and a recurrent neural network (RNN) as a text
decoder. [12] uses long short-term memory (LSTM) units and only
feeds image features into the RNN at the beginning. [7] proposes to
learn to score sentence and image similarity as a function of R-CNN
object detections combined with the outputs of a bidirectional RNN.
[15] adds an attention mechanism to the image captioning task and
performs well. In [14], a high-level semantic representation is
extracted from a shared CNN, which differs from the raw features
used by other methods. However, this method cannot take the
association between semantic concepts into account.
When we were babies, we learned new semantic concepts by
observing the visual world and listening to the natural language
descriptions of our parents, which became the source of our own
descriptions of the visual world [2]. Inspired by this process of
infant learning, we propose a new method for image captioning. We
divide the task into two phases, called concept cognition and
sentence construction respectively. In the first phase, we use a
novel CNN-RNN-SVM framework to generate a high-level semantic
representation, which serves as the input of an RNN that generates
the caption in the second phase. Furthermore, an attention
mechanism is present in the human visual system [15]. We cannot
focus on the whole field of view simultaneously when observing
something. Our attention shifts from one place to another
dynamically and sequentially. Usually, large, brightly colored
objects are noticed first, followed by smaller ones in turn. So, in
the captioning task, we build a scene-graph based representation to
mimic the attention mechanism of human beings. Extensive
experiments demonstrate that our model yields competitive results
compared to the state of the art in image captioning. It
achieves 26.2% BLEU-4, 22.4% METEOR, and 76% CIDEr on the
MS COCO dataset. Compared to other methods, our contributions
are as follows:
(1) We utilize a graph-based semantic representation, the scene
graph, as a bridge between images and their corresponding natural
language descriptions.
(2) We model the co-occurrence dependency of semantic concepts
with the help of recurrent neurons.
(3) Compared with other methods for generating attribute features,
our method is simpler and does not require complicated image
pre-processing.
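To make the two-phase pipeline concrete, the following is a minimal Python (PyTorch-style) sketch of how it might be implemented. The module names, the use of a linear layer as a stand-in for per-concept SVM classifiers, the fixed number of concept steps, and the way the bit vector V initializes the decoder state are our own illustrative assumptions, not details specified in this paper.

# Hypothetical sketch of the two-phase pipeline; names and design choices are illustrative.
import torch
import torch.nn as nn

class ConceptCognition(nn.Module):
    # Phase 1: CNN feature -> RNN over concept steps -> SVM-style scores -> bit vector V.
    def __init__(self, feat_dim, hidden_dim, num_concepts, max_steps=10):
        super().__init__()
        self.rnn = nn.LSTMCell(feat_dim, hidden_dim)
        self.scorer = nn.Linear(hidden_dim, num_concepts)  # stand-in for per-concept SVMs
        self.num_concepts = num_concepts
        self.max_steps = max_steps

    def forward(self, cnn_feat):
        # cnn_feat: (batch, feat_dim) global feature from a pre-trained CNN encoder
        h = cnn_feat.new_zeros(cnn_feat.size(0), self.rnn.hidden_size)
        c = torch.zeros_like(h)
        V = cnn_feat.new_zeros(cnn_feat.size(0), self.num_concepts)  # concept bit vector
        for _ in range(self.max_steps):          # unroll the scene-graph based concept sequence
            h, c = self.rnn(cnn_feat, (h, c))
            scores = self.scorer(h)              # one margin-like score per concept
            V = torch.maximum(V, (scores > 0).float())  # set bits for predicted concepts
        return V

class SentenceConstruction(nn.Module):
    # Phase 2: LSTM language model conditioned on the concept bit vector V.
    def __init__(self, num_concepts, word_vocab, embed_dim, hidden_dim):
        super().__init__()
        self.init_proj = nn.Linear(num_concepts, hidden_dim)
        self.embed = nn.Embedding(word_vocab, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, word_vocab)

    def forward(self, V, captions):
        # captions: (batch, T) ground-truth word indices used for teacher forcing
        h0 = torch.tanh(self.init_proj(V)).unsqueeze(0)  # V initializes the decoder state
        c0 = torch.zeros_like(h0)
        out, _ = self.lstm(self.embed(captions), (h0, c0))
        return self.out(out)                     # per-step word logits

In such a sketch the two modules would be trained separately, one per phase: the concept-cognition network against multi-label concept annotations, and the sentence-construction LSTM with a standard cross-entropy loss over caption words.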
The paper is organized as follows. Section 2 describes the details of
our model. The following section presents the results and analysis of
our experiments on the MS COCO dataset. Finally, Section 4 concludes
this paper.
2. OUR MODEL
Our approach is summarized in Figure 1. We utilize a graph-based
semantic representation, the scene graph, as a bridge between images
and their corresponding captions. To embed the scene graph as an
intermediate state, we divide the task of image captioning into two
phases, called concept cognition and sentence construction. In the
first phase, we build a vocabulary of semantic concepts and propose
a CNN-RNN-SVM framework to generate the scene-graph based sequence,
which is then transformed into a bit vector V. In the second phase,