End-To-End_People_Detection_CVPR_2016_paper
End-to-end people detection in crowded scenes
Russell Stewart¹, Mykhaylo Andriluka¹,², and Andrew Y. Ng¹
¹ Stanford University, USA
² Max Planck Institute for Informatics, Germany
Abstract
Current people detectors operate either by scanning an image in a sliding window fashion or by classifying a discrete set of proposals. We propose a model that is based on decoding an image into a set of people detections. Our system takes an image as input and directly outputs a set of distinct detection hypotheses. Because we generate predictions jointly, common post-processing steps such as non-maximum suppression are unnecessary. We use a recurrent LSTM layer for sequence generation and train our model end-to-end with a new loss function that operates on sets of detections. We demonstrate the effectiveness of our approach on the challenging task of detecting people in crowded scenes.¹
1. Introduction
In this paper we propose a new architecture for detecting objects in images. We strive for an end-to-end approach that accepts images as input and directly generates a set of object bounding boxes as output. This task is challenging because it demands both distinguishing objects from the background and correctly estimating the number of distinct objects and their locations. Such an end-to-end approach capable of directly outputting predictions would be advantageous over methods that first generate a set of bounding boxes, evaluate them with a classifier, and then perform some form of merging or non-maximum suppression on an overcomplete set of detections.
¹ The implementation is publicly available at https://github.com/Russell91/ReInspect.
Sequentially generating a set of detections has an important advantage in that multiple detections on the same object can be avoided by remembering the previously generated output. To control this generation process, we use a recurrent neural network with LSTM units. To produce intermediate representations, we use expressive image features from GoogLeNet that are further fine-tuned as part of our system. Our architecture can thus be seen as a “decoding” process that converts an intermediate representation of an image into a set of predicted objects. The LSTM can be seen as a “controller” that propagates information between decoding steps and controls the location of the next output (see Fig. 2 for an overview). Importantly, our trainable end-to-end system allows joint tuning of all components via back-propagation.
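The control flow of such a decoding process can be illustrated with a minimal sketch. It is not the authors' implementation: the weights are random and untrained, the LSTM omits bias terms, and the names `init_params`, `lstm_step`, and `decode`, the feature and hidden dimensions, and the confidence-based stopping rule (`stop_threshold`) are all assumptions made for illustration. The only point it shows is that the hidden and cell state carry information from one decoding step to the next, so each output can depend on what was emitted before.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def init_params(feat_dim, hidden, seed=0):
    """Random, untrained weights standing in for parameters learned end-to-end."""
    rng = random.Random(seed)

    def rand(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W": rand(4 * hidden, feat_dim + hidden),  # input, forget, output, candidate gates
        "W_box": rand(4, hidden),                  # hidden state -> 4 box coordinates
        "W_conf": rand(1, hidden),                 # hidden state -> confidence logit
    }


def lstm_step(x, h, c, W):
    """One bias-free LSTM step on the concatenated input [x, h]."""
    n = len(h)
    z = matvec(W, x + h)
    i = [sigmoid(v) for v in z[:n]]
    f = [sigmoid(v) for v in z[n:2 * n]]
    o = [sigmoid(v) for v in z[2 * n:3 * n]]
    g = [math.tanh(v) for v in z[3 * n:]]
    c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
    h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
    return h_new, c_new


def decode(features, params, max_steps=5, stop_threshold=0.5):
    """Emit (box, confidence) pairs one at a time; the LSTM state links the steps."""
    n = len(params["W_box"][0])
    h, c = [0.0] * n, [0.0] * n
    detections = []
    for _ in range(max_steps):
        h, c = lstm_step(features, h, c, params["W"])
        box = matvec(params["W_box"], h)
        conf = sigmoid(matvec(params["W_conf"], h)[0])
        if conf < stop_threshold:  # assumed stopping rule, not taken from the text
            break
        detections.append((box, conf))
    return detections
```

Calling `decode([0.5] * 8, init_params(feat_dim=8, hidden=6), stop_threshold=0.0)` runs all five steps and returns five (box, confidence) pairs. The values are meaningless without training; in a trained system the image features would come from the convolutional network and every weight would be tuned jointly by back-propagation.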
One of the key limitations of merging and non-maximum suppression utilized in [6, 17] is that these methods typically don’t have access to image information, and instead must perform inference solely based on properties of bounding boxes (e.g. distance and overlap). This usually works for isolated objects, but often fails when object instances overlap. In the case of overlapping instances, image information is necessary to decide where to place boxes and how many of them to output. As a workaround, several approaches proposed specialized solutions that specifically address pre-defined constellations of objects (e.g. pairs of pedestrians) [5, 23]. Here, we propose a generic architecture that does not require a specialized definition of object constellations, is not limited to pairs of objects, and is fully trainable.
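To make the limitation concrete, here is a minimal sketch of a common greedy form of non-maximum suppression. It is not necessarily the exact variant used in the cited work: the function names `iou` and `greedy_nms`, the `(x1, y1, x2, y2)` box format, and the 0.5 overlap threshold are assumptions for illustration. Note that the procedure sees only box coordinates and scores, never the image.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Keep boxes in descending score order, dropping any box that overlaps
    an already kept box by more than iou_threshold. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for k in order:
        if all(iou(boxes[k], boxes[j]) <= iou_threshold for j in keep):
            keep.append(k)
    return keep


# Two distinct people standing close together, plus one isolated person.
boxes = [(0, 0, 10, 20), (3, 0, 13, 20), (30, 0, 40, 20)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # → [0, 2]
```

In this made-up example the first two boxes belong to different people but overlap with an IoU of about 0.54, so the second one is suppressed even though it is a correct detection. Lowering or raising the threshold only trades this error for duplicate detections elsewhere, since geometry alone cannot tell the two cases apart.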
We specifically focus on the task of people detection as an important example of this problem. In crowded scenes such as the one shown in Fig. 1, multiple people often occur in close proximity, making it particularly challenging to distinguish between nearby individuals.
The key contribution of this paper is a trainable, end-to-end approach that jointly predicts the objects in an image. This lies in contrast to existing methods that treat prediction or classification of each bounding box as an independent problem and require post-processing on the set of detections. We demonstrate that our approach is superior to existing architectures on a challenging dataset of crowded scenes with large numbers of people. A technical contribution of this paper is a novel loss function for sets of objects that combines elements of localization and detection. An-