Robust Vehicle Detection in Aerial Images Based on Image Spatial Pyramid Detection Model
Xianghui Li¹ and Xinde Li*²
Abstract— Vehicle detection in high-resolution aerial images obtained by unmanned aerial vehicles (UAVs) has wide application in traffic surveillance. Recently, many detectors based on convolutional neural networks (CNNs) have achieved great success in object detection. However, they find it difficult to perform efficiently on aerial images, because the significant variation in target size caused by altitude changes of the UAV platform makes precise localization challenging. To improve detection performance on aerial images, we propose an Image Spatial Pyramid Detection Model (ISPDM), which mainly consists of two stages. In the first stage, we divide the image into several patches and select some of them with an image patch selection process. In the second stage, we utilize YOLOv3 to detect vehicles in the original image along with the selected patches and obtain the final result with an integrated decision-making algorithm. Finally, the superiority of the proposed algorithm is demonstrated through extensive experiments comparing it with other solutions for vehicle detection in high-resolution aerial images.
I. INTRODUCTION
Vehicle detection in aerial images has become increasingly popular in traffic surveillance [1]–[5] because of the fast and flexible deployment of unmanned aerial vehicles (UAVs). Traditional vehicle detection algorithms are mainly based on handcrafted descriptors, including Local Binary Patterns (LBP), Haar features, the Scale-Invariant Feature Transform (SIFT) and the Histogram of Oriented Gradients (HOG) [6]–[10]. The authors of [7] proposed a boosting HOG descriptor to characterize vehicle shape and appearance and utilized a linear Support Vector Machine (SVM) to distinguish vehicles from the background. Moranduzzo and Melgani [10] utilized different descriptors to conduct vehicle detection in UAV imagery and observed that the integration of the original SIFT features with color and morphological features gave the best performance. A combination of multiple features including HOG, LBP and opponent histograms was proposed in [9] to detect cars in aerial images.
However, the complex background makes it difficult for handcrafted features to characterize objects precisely. Recently these features have been outperformed by convolutional neural networks (CNNs) [4]. Region-based detectors in the R-CNN family, such as Faster R-CNN [11]–[13], obtain higher precision but are more time-consuming than SSD [14] and the YOLO series [15]–[17]. Although these methods achieve state-of-the-art performance in object detection, it is difficult for them to achieve the same performance on aerial images because of the significant variation in target size: these detectors struggle with the precise localization of vehicles at different scales.

¹Xianghui Li is with the Key Laboratory of Measurement and Control of CSE, Ministry of Education, School of Automation, Southeast University, Nanjing 210096, China (e-mail: 230149424@seu.edu.cn).
²Xinde Li is the corresponding author and is with the Key Laboratory of Measurement and Control of CSE, Ministry of Education, School of Automation, and also with the School of Cyber Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: xindeli@seu.edu.cn).
In this paper, we propose an Image Spatial Pyramid Detection Model (ISPDM) to detect vehicles in UAV imagery with high detection accuracy. The framework of ISPDM, shown in Figure 1, mainly consists of two stages. To improve the detection of vehicles at multiple scales, in the first stage we propose an image spatial pyramid that selects the image patches likely to contain vehicles and feeds them to the detection model. In the second stage, an integrated decision-making algorithm fuses the detections on the different image patches to obtain the final results.
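The overall two-stage flow can be sketched in a few lines of Python. The helper names are our own, and pooling all boxes into one list is a stand-in for the paper's integrated decision-making algorithm, not the algorithm itself:

```python
def shift_boxes(boxes, dx, dy):
    """Translate patch-local boxes (x1, y1, x2, y2) back into
    full-image coordinates by adding the patch offset."""
    return [(x1 + dx, y1 + dy, x2 + dx, y2 + dy) for (x1, y1, x2, y2) in boxes]

def ispdm_detect(image, selected_patches, detector):
    """Stage-2 sketch: run `detector` on the original image and on each
    selected patch, then pool every box into one candidate list.  ISPDM
    fuses these candidates with an integrated decision-making algorithm;
    here we simply take their union."""
    boxes = list(detector(image))
    for patch, (dx, dy) in selected_patches:
        boxes.extend(shift_boxes(detector(patch), dx, dy))
    return boxes
```

Here `detector` would wrap a YOLOv3 forward pass that returns boxes in the coordinates of the image or patch it was given.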
II. CONSTRUCTION OF IMAGE SPATIAL
PYRAMID
To achieve better performance in detecting vehicles at different scales, we construct an image spatial pyramid with two layers. The first layer consists of the original image, while in the second layer the image is divided into $n$ patches, so that the second layer can be represented as $X_P = \{x_i\}\ (i = 1, \cdots, n)$. Detection on the original image finds the relatively large vehicles, while detection on the image patches localizes the relatively small vehicles.
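As a concrete sketch, the two pyramid layers can be built by slicing the image into an $n_x \times n_y$ grid. The function below is our own illustration: it assumes the image is a NumPy array and takes the grid size as given, letting the last row and column absorb any remainder pixels:

```python
import numpy as np

def build_pyramid(image, nx, ny):
    """Return the two pyramid layers: the original image, and a list of
    (patch, (x_offset, y_offset)) pairs covering it in an nx-by-ny grid."""
    h, w = image.shape[:2]
    pw, ph = w // nx, h // ny  # nominal patch width/height
    patches = []
    for j in range(ny):
        for i in range(nx):
            x0, y0 = i * pw, j * ph
            x1 = w if i == nx - 1 else (i + 1) * pw  # last column keeps remainder
            y1 = h if j == ny - 1 else (j + 1) * ph  # last row keeps remainder
            patches.append((image[y0:y1, x0:x1], (x0, y0)))
    return image, patches
```

The stored offsets let detections on each patch be mapped back into the coordinates of the original image.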
However, some patches contain no objects, and conducting vehicle detection on them would cost extra computation. Therefore, SURF (Speeded-Up Robust Features) [18] is utilized to remove the patches that, with high probability, contain few vehicles. As shown in Figure 2, the original image is first divided into $n$ patches, where $n = n_x \times n_y$ can be calculated by (1):
\[
\begin{cases}
n_x = \dfrac{I_{width}}{\mathrm{floor}(I_{width} \div i_{width})}\\[4pt]
n_y = \dfrac{I_{height}}{\mathrm{floor}(I_{height} \div i_{height})}
\end{cases}
\tag{1}
\]
where $n_x$ and $n_y$ refer to the number of segmented rows and columns respectively, $I_{width}$ and $I_{height}$ refer to the width and height of the input image, $i_{width}$ and $i_{height}$ refer to the width and height of the input layer of the object detection neural network, and $\mathrm{floor}(\cdot)$ rounds down to the nearest integer. Typically, if $n \le 1$, the image is not segmented and there is only one layer.
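The SURF-based patch filtering described above can be sketched as a simple keypoint-count threshold. Both the threshold value and the generic `count_keypoints` callback are illustrative assumptions: in practice the callback would wrap a SURF detector (e.g. `cv2.xfeatures2d.SURF_create` from opencv-contrib) and return the number of keypoints found in a patch:

```python
def select_patches(patches, count_keypoints, min_keypoints=10):
    """Discard patches unlikely to contain vehicles.  `patches` holds
    (patch, offset) pairs; a patch is kept only if its keypoint count
    reaches the (illustrative) threshold `min_keypoints`."""
    return [(patch, offset) for (patch, offset) in patches
            if count_keypoints(patch) >= min_keypoints]
```

Only the surviving patches are then passed to the detector, saving the computation that empty patches would otherwise cost.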
After the image is divided into n patches, the SURF
feature is extracted in the original image. The amount of
2019 IEEE 4th International Conference on Advanced Robotics and Mechatronics (ICARM)
978-1-7281-0064-7/19/$31.00 ©2019 IEEE