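2018 Object Detection Papers and Code from the Three Major Vision Conferences (Machine Vision Conference Resources, CSDN Library)

8 files in total: 5 PDF, 3 ZIP

Uploaded 2018-10-17 15:10:28 · 13.37 MB ZIP · 139 views · 10 points required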


OBJ Dect 2018.zip (8 files)

OBJ Dect 2018

STDN.pdf 1.6MB

SIN

SIN-master.zip 1.16MB

SIN.pdf 2.25MB

REFINE NET

REFINE NET.pdf 566KB

RefineDet-master.zip 6.07MB

YOLOV3

YOLOv3.pdf 2.14MB

tensorflow-yolo-v3-master.zip 14KB

RFBNET.pdf 1.18MB

Structure Inference Net: Object Detection Using Scene-Level Context and Instance-Level Relationships

Yong Liu (1,2), Ruiping Wang (1,2,3), Shiguang Shan (1,2,3), Xilin Chen (1,2,3)

(1) Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China

(2) University of Chinese Academy of Sciences, Beijing, 100049, China

(3) Cooperative Medianet Innovation Center, China

yong.liu@vipl.ict.ac.cn, {wangruiping, sgshan, xlchen}@ict.ac.cn

Abstract

Context is important for accurate visual recognition. In this work we propose an object detection algorithm that not only considers object visual appearance, but also makes use of two kinds of context: scene contextual information and object relationships within a single image. Object detection is therefore regarded as both a cognition problem and a reasoning problem when leveraging this structured information. Specifically, this paper formulates object detection as a problem of graph structure inference, where, given an image, the objects are treated as nodes in a graph and the relationships between objects are modeled as edges in that graph. To this end, we present a so-called Structure Inference Network (SIN), a detector that incorporates into a typical detection framework (e.g. Faster R-CNN) a graphical model that aims to infer object state. Comprehensive experiments on the PASCAL VOC and MS COCO datasets indicate that scene context and object relationships truly improve the performance of object detection, with more desirable and reasonable outputs.

1. Introduction

Object detection is one of the fundamental computer vision problems. Recently, this topic has enjoyed a series of breakthroughs thanks to advances in deep learning, and prevalent object detectors predominantly regard detection as a problem of classifying candidate boxes [16, 15, 33, 24, 7]. While most of them have achieved impressive performance on a number of detection benchmarks, they focus only on local information near an object's region of interest within the image. An image usually contains rich contextual information, including scene context and object relationships [10]. Ignoring this information inevitably places constraints on the accuracy of the objects detected [3].

To illustrate such constraints, consider the practical examples in Fig. 1, detected by Faster R-CNN [33]. In the first case, which shows a river scene, some of the boats are mislabeled as cars, since the detector concentrates only on the objects' visual appearance. If the scene information in this image were taken into account, such a mistake could easily have been avoided. In the second case, though a laptop and a person have been detected as expected, no further objects are found. A mouse and a laptop commonly co-occur within a single image. If relative object positions and co-occurrence patterns were used, more objects within the given image could be detected.

Figure 1. Some Typical Detection Errors of Faster R-CNN. (a) Some boats are mislabeled as cars on PASCAL VOC [12]. (b) The mouse is undetected on MS COCO [26].

Many empirical studies [10, 14, 19, 41, 30, 29, 36] have suggested that recognition algorithms can be improved by proper modeling of context. To handle the problem above, two types of contextual information model have been explored for detection [4]. The first type incorporates context around the object or scene-level context [3, 43, 37], and the second models object-object relationships at the instance level [18, 4, 30]. Since these two types of models capture complementary contextual information, they can be combined to jointly help detection.

We are thus motivated to conjecture that the visual concepts in most natural images form an organism whose key components are the scene, the objects and their relationships, and that different objects in the scene are organized in a structured manner, e.g. boats are on the river and a mouse is near a laptop. Consequently, object detection is regarded not only as a cognition problem, but also as an inference problem based on contextual information together with fine-grained object details. To solve it systematically, a tailored graph is formulated for each individual image. As described in Fig. 2, objects are nodes of the graph, and object relationships are edges of the graph. These objects interact with each other via the graph under the guidance of scene context. More specifically, an object receives messages from the scene and from other objects that are highly correlated with it. In this way, object state is determined not only by its fine-grained appearance details but is also affected by scene context and object relationships. Eventually the state of each object is used to determine its category and refine its location.

Figure 2. Graph Problem, with graph $G = (V, E, s)$. Detection basically aims to answer: what is where. From a structure perspective, it can be formulated as a reasoning problem of a graph involving the mutually complementary information of scene, objects and relationships.

To make the above conjecture computationally feasible, we propose a structure inference network (SIN) to reason about object state in a graph, where a memory cell is the key module that encodes different kinds of messages (e.g. from the scene and from other objects) into object state, and a novel way of using Gated Recurrent Units (GRUs) [5] as the memory cell is presented in this work. Specifically, we fix the object representation as the initial state of the GRU and then input each kind of message to update the object state. Since SIN can accomplish inference as long as its inputs cover the representations of object, scene-level context and instance-level relationship, our structure inference method is not constrained to a specific detection framework.
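As a rough illustration of the memory-cell idea above, the following sketch uses the textbook GRU equations, with the object feature as the initial hidden state and one message vector as the input. It is a minimal sketch under stated assumptions, not the authors' implementation: the class `GRUCell`, the helper `update_object_state`, the dimensions and the random weights are all hypothetical, and how several kinds of messages are combined is not covered here.

```python
import numpy as np


def sigmoid(x):
    """Logistic function used by the GRU gates."""
    return 1.0 / (1.0 + np.exp(-x))


class GRUCell:
    """Textbook GRU cell. Weights are random placeholders, not trained values."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)

        def init(rows, cols):
            return rng.normal(0.0, 0.1, size=(rows, cols))

        # W_* act on the input (the message), U_* act on the hidden state.
        self.W_z, self.U_z = init(hidden_dim, input_dim), init(hidden_dim, hidden_dim)
        self.W_r, self.U_r = init(hidden_dim, input_dim), init(hidden_dim, hidden_dim)
        self.W_h, self.U_h = init(hidden_dim, input_dim), init(hidden_dim, hidden_dim)

    def step(self, x, h):
        z = sigmoid(self.W_z @ x + self.U_z @ h)              # update gate
        r = sigmoid(self.W_r @ x + self.U_r @ h)              # reset gate
        h_tilde = np.tanh(self.W_h @ x + self.U_h @ (r * h))  # candidate state
        return (1.0 - z) * h + z * h_tilde                    # blended new state


def update_object_state(cell, object_feature, message):
    """Assumption: the object feature is the initial hidden state, the message is the input."""
    return cell.step(message, object_feature)


if __name__ == "__main__":
    cell = GRUCell(input_dim=8, hidden_dim=8)
    object_feature = np.linspace(-1.0, 1.0, 8)  # stand-in for an ROI descriptor
    scene_message = np.ones(8) * 0.5            # stand-in for a scene descriptor
    print(update_object_state(cell, object_feature, scene_message).shape)
```

Because the new state is a gated blend of the old state and the candidate state, the object representation is adjusted by the message rather than replaced by it.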

2. Related Work

Object detection. Modern CNN-based object detection methods can be divided into two groups [25, 35]: (i) region-proposal-based methods (two-stage detectors) and (ii) proposal-free methods (one-stage detectors).

With the resurgence of deep learning, two-stage detectors have quickly come to dominate object detection during the past few years. Representative methods include R-CNN [16], Fast R-CNN [15], Faster R-CNN [33] and so on. The first stage produces a number of candidate boxes, and the second stage then classifies these boxes into foreground classes or background. R-CNN [16] extracts CNN features from the candidate regions and applies linear SVMs as the classifier. To obtain higher speed, Fast R-CNN [15] proposes a novel ROI-pooling operation to extract feature vectors for each candidate box from a shared convolutional feature map. Faster R-CNN [33] integrates proposal generation with the second-stage classifier into a single convolutional network. More recently, one-stage detectors like SSD [27] and YOLO [31] have been proposed for real-time detection with satisfactory accuracy. However, these state-of-the-art methods, especially the two-stage detectors, always treat the detection of different objects in an image as isolated tasks. While such methods work well for salient objects most of the time, they struggle with small objects, for which only vague features associated with the object itself are available.

Contextual information. Consequently, it is natural to use richer contextual information. Earlier, a number of approaches explored contextual information to improve object detection [29, 19, 1, 10, 40, 6, 41]. For example, Mottaghi et al. [29] propose a deformable part-based model, which exploits both local context around each candidate detection and global context at the level of the scene. The presence of objects in irrelevant scenes is penalized in [41]. Recently, some works [3, 43, 37] based on deep ConvNets have attempted to incorporate contextual information into object detection. Contextual information outside the region of interest is integrated using a spatial recurrent neural network in ION [3]. GBD-Net [43] proposes a novel gated bi-directional CNN to pass messages between features of different support regions around objects. Shrivastava et al. [37] use segmentation to provide top-down context to guide region proposal generation and object detection. While context around the object or scene-level context has been addressed in such works [3, 43, 37] under the deep learning-based pipeline, they make less progress in exploring object-object relationships. In contrast, a more recent work [4] proposes a new sequential reasoning architecture that mainly exploits object-object relationships to sequentially detect objects in an image, but with only implicit and weak consideration of scene-level context. Different from these existing works, our proposed structure inference network is capable of jointly modeling both scene-level context and object-object relationships and of inferring different object instances within an image from a structural and global perspective.

Structure inference. Several interesting works [28, 34, 23, 39, 21, 9, 2, 22, 42] have been proposed to combine deep networks with graphical models for structured prediction tasks that are solved by structure inference techniques. A generic structured model is designed to leverage diverse label relations including scene, object and attributes to improve image classification performance in [21]. Deng et al. [9] propose structure inference machines for analyzing relations in group activity recognition. Structural-RNN [22] combines the power of high-level spatio-temporal graphs and sequence learning, and evaluates the model on tasks ranging from motion to object interactions. In [42], a graph inference model is proposed to tackle the task of generating a structured scene graph from an image. While our work shares a similar spirit with [42] in formulating the object detection task as a graph structure inference problem, the two works differ essentially in their technical aspects, such as the graph instantiation manners, inference mechanisms, message passing schemes, etc., which highly depend on the specific task domains.

[Figure 3 diagram labels: whole image, RPN, RoI projection, RoI pooling, scene, nodes, edges, Scene GRU, Edge GRU, structure inference, concatenate, softmax (cls), regression (bbox)]

Figure 3. SIN: The Framework of Our Method. First we get a fixed number of ROIs from an input image. Each ROI is pooled into a fixed-size feature map and then mapped to a feature vector by a fully connected layer, forming a node. We extract the whole image feature as the scene in the same way, and then we concatenate the descriptors of every two ROIs into edges. To iteratively update the node state, an elaborately designed structure inference method is triggered, and the final state of each node is used to predict the category and refine the location of the corresponding ROI. The whole framework is trained end-to-end with the original multi-task loss (this study uses Faster R-CNN as the base detection framework).
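The caption of Figure 3 mentions concatenating the descriptors of every two ROIs into edges. A minimal NumPy sketch of that pairing step is given below. The function name `pairwise_edge_descriptors`, the use of ordered (directed) pairs without self-loops, and the receiver-then-sender concatenation order are assumptions made for illustration only.

```python
import numpy as np


def pairwise_edge_descriptors(node_features):
    """Build one descriptor per ordered node pair (j -> i), with j != i.

    node_features: array of shape (N, D), one descriptor per ROI.
    Returns (pairs, descriptors): pairs has shape (E, 2) holding (j, i),
    descriptors has shape (E, 2 * D) holding [f_i, f_j] (assumed order).
    """
    n = node_features.shape[0]
    src, dst = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    mask = src != dst                      # drop self-loops
    j, i = src[mask], dst[mask]
    descriptors = np.concatenate([node_features[i], node_features[j]], axis=1)
    return np.stack([j, i], axis=1), descriptors


if __name__ == "__main__":
    features = np.arange(12, dtype=float).reshape(3, 4)  # 3 toy ROIs, 4-dim each
    pairs, descriptors = pairwise_edge_descriptors(features)
    print(pairs.shape, descriptors.shape)
```

With N ROIs this produces N * (N - 1) directed edges, so the cost grows quadratically in the number of ROIs, which is one reason a fixed, moderate number of ROIs is convenient.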

3. Method

Our goal is to improve detection models by exploring rich contextual information. To this end, unlike existing methods that only make use of visual appearance cues, our model is designed to explicitly take object-object relationships and scene information into consideration. Specifically, a structure inference network is devised to iteratively propagate information among different objects as well as the whole scene. The whole framework of our method is depicted in Fig. 3 and is detailed in the following sections.

3.1. Graphical Modeling

Our structure inference network (SIN) is agnostic to the choice of base object detection framework. In this work we build SIN on Faster R-CNN, an advanced detection method, as a demonstration. We present a graph $G = (V, E, s)$ to model the graphical problem, as shown in Fig. 2. The nodes $v \in V$ represent the region proposals, $s$ is the scene of the image, and $e \in E$ is the edge (relationship) between each pair of object nodes.

Figure 4. An illustration of GRU. The update gate $z$ selects whether the hidden state $h_{t+1}$ is to be updated with a new hidden state $\tilde{h}$. The reset gate $r$ decides whether the previous hidden state $h_t$ is ignored.

Specifically, the Region Proposal Network (RPN [33]) yields thousands of region proposals that might contain objects. We then use Non-Maximum Suppression (NMS [13]) to choose a fixed number of ROIs (Regions of Interest). For each ROI $v_i$, we extract the visual feature $f_i^v$ by an FC layer that follows an ROI pooling layer. For the scene $s$ of the image, since there is no ground-truth scene label, the visual feature of the whole image, $f^s$, is extracted as the scene representation through the same layers as for the nodes. For the directed edge $e_{j \to i}$ from node $v_j$ to $v_i$, we use both the spatial and visual features of $v_i$ and $v_j$ to compute a scalar that represents the influence of $v_j$ on $v_i$, as will be detailed in Sec. 3.3. With such modeling, how do we drive these elements to interact in the graph? This is described in the following.
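The step of using NMS to keep a fixed number of ROIs can be sketched as the common greedy procedure below. This is a minimal sketch under stated assumptions rather than the configuration used in the paper: the function names `box_iou` and `nms_top_k`, the `(x1, y1, x2, y2)` box format, and the default values of `iou_threshold` and `max_rois` are illustrative choices.

```python
import numpy as np


def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def nms_top_k(boxes, scores, iou_threshold=0.7, max_rois=128):
    """Greedy NMS that stops once max_rois boxes are kept (defaults are illustrative)."""
    order = np.argsort(-scores)            # highest score first
    keep = []
    while order.size > 0 and len(keep) < max_rois:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        ious = box_iou(boxes[best], boxes[rest])
        order = rest[ious <= iou_threshold]  # drop boxes overlapping the kept one
    return keep


if __name__ == "__main__":
    boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=float)
    scores = np.array([0.9, 0.8, 0.7])
    print(nms_top_k(boxes, scores, iou_threshold=0.5))  # prints [0, 2]
```

In the toy example the second box overlaps the first with an IoU of about 0.68, so it is suppressed at a threshold of 0.5, while the distant third box is kept.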

3.2. Message Passing

For each node, the key to interaction is encoding the messages passed to it from the scene and from other nodes. Because each node needs to receive multiple incoming messages, it is necessary to design an aggregation function that
