264 International Journal of Computer Vision (2020) 128:261–318
Table 1 continued
No. | Survey title | References | Year | Venue | Content
21 | Tutorial: deep learning for objects and scenes | − | 2017 | CVPR17 | A high level summary of recent work on deep learning for visual recognition of objects and scenes
22 | Tutorial: instance level recognition | − | 2017 | ICCV17 | A short course of recent advances on instance level recognition, including object detection, instance segmentation and human pose prediction
23 | Tutorial: visual recognition and beyond | − | 2018 | CVPR18 | A tutorial on methods and principles behind image classification, object detection, instance segmentation, and semantic segmentation
24 | Deep learning for generic object detection | Ours | 2019 | VISI | A comprehensive survey of deep learning for generic object detection
any comprehensive recent survey. A thorough review and
summary of existing work is essential for further progress in
object detection, particularly for researchers wishing to enter
the field. Since our focus is on generic object detection, the
extensive work on DCNNs for specific object detection, such
as face detection (Li et al. 2015a; Zhang et al. 2016a; Hu et al.
2017), pedestrian detection (Zhang et al. 2016b; Hosang et al.
2015), vehicle detection (Zhou et al. 2016b) and traffic sign
detection (Zhu et al. 2016b) will not be considered.
1.2 Scope
The number of papers on generic object detection based on
deep learning is breathtaking. There are so many, in fact, that
compiling any comprehensive review of the state of the art is
beyond the scope of any paper of reasonable length. As a result,
it is necessary to establish selection criteria, and we have
limited our focus to top journal and conference papers. Because
of these limitations, we sincerely apologize to those authors
whose works are not included in this paper. For
surveys of work on related topics, readers are referred to the
articles in Table 1. This survey focuses on major progress of
the last 5 years, and we restrict our attention to still pictures,
leaving the important subject of video object detection as a
topic for separate consideration in the future.
The main goal of this paper is to offer a comprehensive
survey of deep learning based generic object detection tech-
niques, and to present some degree of taxonomy, a high
level perspective and organization, primarily on the basis
of popular datasets, evaluation metrics, context modeling,
and detection proposal methods. Our intention is that this
categorization helps readers gain an accessible
understanding of the similarities and differences among
a wide variety of strategies. The proposed taxonomy gives
researchers a framework to understand current research and
to identify open challenges for future research.
The remainder of this paper is organized as follows.
Related background and the progress made during the last
2 decades are summarized in Sect. 2. A brief introduction
to deep learning is given in Sect. 3. Popular datasets and
evaluation criteria are summarized in Sect. 4. We describe
the milestone object detection frameworks in Sect. 5. From
Sects. 6 to 9, fundamental sub-problems and the relevant
issues involved in designing object detectors are discussed.
Finally, in Sect. 10, we conclude the paper with an overall
discussion of object detection, state-of-the-art performance,
and future research directions.
2 Generic Object Detection
2.1 The Problem
Generic object detection, also called generic object category
detection, object class detection, or object category detec-
tion (Zhang et al. 2013), is defined as follows. Given an
image, determine whether or not there are instances of objects
from predefined categories (usually many categories, e.g.,
200 categories in the ILSVRC object detection challenge)
and, if present, return the spatial location and extent of
each instance. A greater emphasis is placed on detecting
a broad range of natural categories, as opposed to specific
object category detection where only a narrower predefined
category of interest (e.g., faces, pedestrians, or cars) may
be present. Although thousands of objects occupy the visual
world in which we live, currently the research community is
primarily interested in the localization of highly structured
objects (e.g., cars, faces, bicycles and airplanes) and artic-