You Only Look Twice: Rapid Multi-Scale Object Detection In
Satellite Imagery
Adam Van Etten
CosmiQ Works, In-Q-Tel
avanetten@iqt.org
ABSTRACT
Detection of small objects in large swaths of imagery is one of
the primary problems in satellite imagery analytics. While object
detection in ground-based imagery has benefited from research
into new deep learning approaches, transitioning such technology
to overhead imagery is nontrivial. Among the challenges is the
sheer number of pixels and geographic extent per image: a single
DigitalGlobe satellite image encompasses >64 km² and over 250
million pixels. Another challenge is that objects of interest are
minuscule (often only ∼10 pixels in extent), which complicates
traditional computer vision techniques. To address these issues, we
propose a pipeline (You Only Look Twice, or YOLT) that evaluates
satellite images of arbitrary size at a rate of ≥0.5 km²/s. The
proposed approach can rapidly detect objects of vastly different
scales with relatively little training data over multiple sensors. We
evaluate large test images at native resolution, and yield scores of
F1 > 0.8 for vehicle localization. We further explore resolution and
object size requirements by systematically testing the pipeline at
decreasing resolution, and conclude that objects only ∼5 pixels in
size can still be localized with high confidence. Code is available at
https://github.com/CosmiQ/yolt
KEYWORDS
Computer Vision, Satellite Imagery, Object Detection
1 INTRODUCTION
Computer vision techniques have made great strides in the past few
years since the introduction of convolutional neural networks [5]
in the ImageNet [13] competition. The availability of large, high-
quality labelled datasets such as ImageNet [13], PASCAL VOC [2]
and MS COCO [6] has helped spur a number of impressive ad-
vances in rapid object detection that run in near real-time; three of
the best are: Faster R-CNN [12], SSD [7], and YOLO [10, 11]. Faster
R-CNN typically ingests 1000×600 pixel images, whereas SSD uses
300×300 or 512×512 pixel input images, and YOLO runs on either
416×416 or 544×544 pixel inputs. While the performance of all
these frameworks is impressive, none can come remotely close to
ingesting the ∼16,000×16,000 input sizes typical of satellite imagery.
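To make the mismatch concrete, a back-of-the-envelope calculation (using only the input sizes above and the ∼15 pixel car extent discussed below) shows why naively downsampling a full satellite image to a detector's input size is hopeless: small objects shrink well below a single pixel,

\[
\frac{16{,}000\ \text{px}}{416\ \text{px}} \approx 38\times \text{ per axis}, \qquad \frac{15\ \text{px}}{38} \approx 0.4\ \text{px per car}.
\]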
Of these three frameworks, YOLO has demonstrated the greatest
inference speed and highest score on the PASCAL VOC dataset. The
authors also showed that this framework is highly transferable
to new domains by demonstrating superior performance to other
frameworks (i.e., SSD and Faster R-CNN) on the Picasso Dataset
[3] and the People-Art Dataset [1]. Due to the speed, accuracy,
and flexibility of YOLO, we accordingly leverage this system as the
inspiration for our satellite imagery object detection framework.
The application of deep learning methods to traditional object
detection pipelines is non-trivial for a variety of reasons. The unique
aspects of satellite imagery necessitate algorithmic contributions to
address challenges related to the spatial extent of foreground target
objects, complete rotation invariance, and a large scale search space.
Excluding implementation details, algorithms must adjust for:
Small spatial extent: In satellite imagery objects of interest are often very small and densely clustered, rather than the large and prominent subjects typical in ImageNet data. In the satellite domain, resolution is typically defined as the ground sample distance (GSD), which describes the physical size of one image pixel. Commercially available imagery varies from 30 cm GSD for the sharpest DigitalGlobe imagery, to 3-4 meter GSD for Planet imagery. This means that small objects such as cars will span only ∼15 pixels in extent even at the highest resolution (a worked sketch of this arithmetic follows this list).

Complete rotation invariance: Objects viewed from overhead can have any orientation (e.g. ships can have any heading between 0 and 360 degrees, whereas trees in ImageNet data are reliably vertical); see the augmentation sketch after this list.

Training example frequency: There is a relative dearth of training data (though efforts such as SpaceNet¹ are attempting to ameliorate this issue).

Ultra high resolution: Input images are enormous (often hundreds of megapixels), so simply downsampling to the input size required by most algorithms (a few hundred pixels) is not an option (see Figure 1).
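As a minimal illustration of the GSD arithmetic above (the ∼4.5 m car length is an assumed value, chosen to be consistent with the ∼15 pixel figure; the helper itself is illustrative and not part of the YOLT codebase):

```python
def extent_px(object_size_m: float, gsd_m: float) -> float:
    """Approximate pixel extent of an object imaged at a given ground
    sample distance (GSD), i.e. physical meters per pixel."""
    return object_size_m / gsd_m

# An assumed ~4.5 m car spans ~15 pixels at 0.3 m DigitalGlobe GSD,
# but under 2 pixels at 3 m Planet GSD.
print(extent_px(4.5, 0.3))  # 15.0
print(extent_px(4.5, 3.0))  # 1.5
```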
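One standard way to cope with arbitrary orientations is rotation augmentation of training chips. The sketch below is a minimal illustration assuming SciPy, not the paper's exact pipeline (augmentation details appear in Section 4); note that bounding-box labels must be rotated by the same angle as the chip.

```python
import numpy as np
from scipy.ndimage import rotate

def rotate_chip(chip: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Rotate an HxWxC training chip by a random angle in [0, 360) to
    simulate the arbitrary headings of objects seen from overhead."""
    angle = rng.uniform(0.0, 360.0)
    # reshape=False keeps the chip size fixed; the corresponding
    # bounding-box labels must be rotated by the same angle.
    return rotate(chip, angle, axes=(0, 1), reshape=False, mode="reflect")
```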
The contribution of this work specifically addresses each of these
issues separately, while leveraging the relatively constant distance
from sensor to object, which is well known and typically ∼400
km. This distance, coupled with the nadir-facing sensor, results in
a consistent pixel size for objects.
Section 2 details in further depth the challenges faced by standard
algorithms when applied to satellite imagery. The remainder of
this work describes the proposed contributions as follows. To
address small, dense clusters, Section 3.1 describes a new, finer-grained
network architecture. Sections 3.2 and 3.3 detail our method for
splitting, evaluating, and recombining large test images of arbitrary
size at native resolution (a minimal sketch of this tiling idea follows
this paragraph). With regard to rotation invariance and small labelled
training dataset sizes, Section 4 describes data augmentation and size
requirements. Finally, the performance of the algorithm is discussed
in detail in Section 6.
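As a minimal sketch of the splitting-and-recombining idea (the 416-pixel window, 15% overlap, and `detect_chip` detector interface are illustrative placeholders, not the YOLT implementation):

```python
import numpy as np

def sliding_windows(image: np.ndarray, size: int = 416, overlap: float = 0.15):
    """Yield (x0, y0, chip) tiles covering an arbitrarily large image.
    Adjacent tiles overlap so that objects straddling a tile border
    appear whole in at least one tile."""
    stride = max(1, int(size * (1.0 - overlap)))
    h, w = image.shape[:2]
    ys = list(range(0, max(h - size, 0) + 1, stride))
    xs = list(range(0, max(w - size, 0) + 1, stride))
    if ys[-1] != max(h - size, 0):  # ensure the far edges are covered
        ys.append(max(h - size, 0))
    if xs[-1] != max(w - size, 0):
        xs.append(max(w - size, 0))
    for y0 in ys:
        for x0 in xs:
            yield x0, y0, image[y0:y0 + size, x0:x0 + size]

def detect_large_image(image: np.ndarray, detect_chip):
    """Run a chip-level detector over each tile, then shift detections
    back into global image coordinates; a final non-max suppression
    pass removes duplicates arising from the tile overlap."""
    detections = []
    for x0, y0, chip in sliding_windows(image):
        for xmin, ymin, xmax, ymax, score in detect_chip(chip):
            detections.append((xmin + x0, ymin + y0, xmax + x0, ymax + y0, score))
    return detections  # followed by global non-max suppression
```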
¹ https://aws.amazon.com/public-datasets/spacenet/