DenseCap: Fully Convolutional Localization Networks for Dense Captioning
Justin Johnson*   Andrej Karpathy*   Li Fei-Fei
Department of Computer Science, Stanford University
{jcjohns,karpathy,feifeili}@cs.stanford.edu
*Both authors contributed equally to this work.
Abstract
We introduce the dense captioning task, which requires a computer vision system to both localize and describe salient regions in images in natural language. The dense captioning task generalizes object detection when the descriptions consist of a single word, and image captioning when one predicted region covers the full image. To address the localization and description task jointly we propose a Fully Convolutional Localization Network (FCLN) architecture that processes an image with a single, efficient forward pass, requires no external region proposals, and can be trained end-to-end with a single round of optimization. The architecture is composed of a Convolutional Network, a novel dense localization layer, and a Recurrent Neural Network language model that generates the label sequences. We evaluate our network on the Visual Genome dataset, which comprises 94,000 images and 4,100,000 region-grounded captions. We observe both speed and accuracy improvements over baselines based on current state-of-the-art approaches in both generation and retrieval settings.
1. Introduction
Our ability to effortlessly point out and describe all aspects of an image relies on a strong semantic understanding of a visual scene and all of its elements. However, despite numerous potential applications, this ability remains a challenge for our state-of-the-art visual recognition systems.

In the last few years there has been significant progress in image classification [39, 26, 53, 45], where the task is to assign one label to an image. Further work has pushed these advances along two orthogonal directions: First, rapid progress in object detection [40, 14, 46] has identified models that efficiently identify and label multiple salient regions of an image. Second, recent advances in image captioning [3, 32, 21, 49, 51, 8, 4] have expanded the complexity of the label space from a fixed set of categories to sequences of words able to express significantly richer concepts.

However, despite encouraging progress along the label density and label complexity axes, these two directions have remained separate.
Figure 1. We address the Dense Captioning task (bottom right) with a model that jointly generates both dense and rich annotations in a single forward pass. [Figure shows four panels along two axes, label density (whole image vs. image regions) and label complexity (single label vs. sequence): Classification ("Cat"), Captioning ("A cat riding a skateboard"), Detection ("Cat", "Skateboard"), and Dense Captioning ("Orange spotted cat", "Skateboard with red wheels", "Cat riding a skateboard", "Brown hardwood flooring").]
In this work we take a step towards unifying these two inter-connected tasks into one joint framework. First, we introduce the dense captioning task (see Figure 1), which requires a model to predict a set of descriptions across regions of an image. Object detection is hence recovered as a special case when the target labels consist of one word, and image captioning is recovered when all images consist of one region that spans the full image.
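To make the reduction concrete, the short sketch below represents a dense captioning output as (box, description) pairs and shows how the two special cases fall out; the tuple format, variable names, and values are illustrative assumptions, not the paper's data format.

```python
# Illustrative only: a dense captioning output as (box, description)
# pairs; boxes are (x0, y0, x1, y1) pixel coordinates. The format and
# values are hypothetical, not from the paper.
img_w, img_h = 640, 480
dense_captions = [
    ((12, 40, 180, 220), "orange spotted cat"),
    ((60, 150, 300, 260), "skateboard with red wheels"),
    ((10, 30, 310, 270), "cat riding a skateboard"),
]

# Special case 1: every description is a single word, so the task
# reduces to object detection.
detections = [(box, desc) for box, desc in dense_captions
              if len(desc.split()) == 1]

# Special case 2: one region spanning the full image, so the task
# reduces to ordinary image captioning.
image_caption = [((0, 0, img_w, img_h), "a cat riding a skateboard")]
```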
Additionally, we develop a Fully Convolutional Localization Network (FCLN) for the dense captioning task. Our model is inspired by recent work in image captioning [49, 21, 32, 8, 4] in that it is composed of a Convolutional Neural Network and a Recurrent Neural Network language model. However, drawing on work in object detection [38], our second core contribution is to introduce a new dense localization layer. This layer is fully differentiable and can be inserted into any neural network that processes images to enable region-level training and predictions. Internally, the localization layer predicts a set of regions of interest in the image and then uses bilinear interpolation [19, 16] to smoothly crop the activations in each region.
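As a rough illustration of this bilinear cropping step, the NumPy sketch below samples a fixed-size grid of points inside a predicted box and interpolates the four neighbouring activations at each point; the function name, output size, and coordinate conventions are assumptions for illustration, not the paper's Torch implementation.

```python
import numpy as np

def bilinear_crop(features, box, out_h=7, out_w=7):
    """Bilinearly sample a fixed-size grid of activations inside `box`.

    features: array of shape (C, H, W), a convolutional feature map.
    box: (x0, y0, x1, y1) in feature-map coordinates.
    Returns an array of shape (C, out_h, out_w).
    """
    C, H, W = features.shape
    x0, y0, x1, y1 = box
    # Regular out_h x out_w grid of sample points spanning the box.
    ys = np.linspace(y0, y1, out_h)
    xs = np.linspace(x0, x1, out_w)
    out = np.empty((C, out_h, out_w), dtype=features.dtype)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            # Clamp the sample point to the feature map, then find the
            # four integer-coordinate neighbours around it.
            y = min(max(y, 0.0), H - 1.0)
            x = min(max(x, 0.0), W - 1.0)
            yi, xi = int(np.floor(y)), int(np.floor(x))
            yi2, xi2 = min(yi + 1, H - 1), min(xi + 1, W - 1)
            wy, wx = y - yi, x - xi
            # Weighted average of the four neighbouring activations.
            # The weights vary smoothly with (x, y), so the output is
            # differentiable with respect to the box coordinates.
            out[:, i, j] = ((1 - wy) * (1 - wx) * features[:, yi, xi]
                            + (1 - wy) * wx * features[:, yi, xi2]
                            + wy * (1 - wx) * features[:, yi2, xi]
                            + wy * wx * features[:, yi2, xi2])
    return out

# Example: crop a 7x7 window from a 512-channel feature map.
feats = np.random.randn(512, 38, 50).astype(np.float32)
crop = bilinear_crop(feats, box=(10.3, 4.7, 25.9, 20.2))
print(crop.shape)  # (512, 7, 7)
```

Because each output value is a smooth function of the box coordinates, gradients can flow from the downstream language model back to the predicted regions, which is what makes such a layer trainable end-to-end.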
We evaluate the model on the large-scale Visual Genome dataset, which contains 94,000 images and 4,100,000 region captions. Our results show both performance and speed improvements over approaches based on previous state of the art. We make our code and data publicly available to support further progress on the dense captioning task.