PIXOR: Real-time 3D Object Detection from Point Clouds
Bin Yang, Wenjie Luo, Raquel Urtasun
Uber Advanced Technologies Group
University of Toronto
{byang10, wenjie, urtasun}@uber.com
Abstract
We address the problem of real-time 3D object detec-
tion from point clouds in the context of autonomous driv-
ing. Speed is critical as detection is a necessary compo-
nent for safety. Existing approaches, however, are compu-
tationally expensive due to the high dimensionality of point clouds.
We utilize the 3D data more efficiently by representing the
scene from the Bird’s Eye View (BEV), and propose PIXOR,
a proposal-free, single-stage detector that outputs oriented
3D object estimates decoded from pixel-wise neural net-
work predictions. The input representation, network archi-
tecture, and model optimization are specially designed to
balance high accuracy and real-time efficiency. We validate
PIXOR on two datasets: the KITTI BEV object detection
benchmark, and a large-scale 3D vehicle detection bench-
mark. On both datasets we show that the proposed detector
notably surpasses other state-of-the-art methods in terms of
Average Precision (AP), while still running at 10 FPS.
1. Introduction
Over the last few years we have seen a plethora of meth-
ods that exploit Convolutional Neural Networks to produce
accurate 2D object detections, typically from a single image
[12, 11, 28, 4, 27, 23]. However, in robotics applications
such as autonomous driving we are interested in detecting
objects in 3D space, which is fundamental for safe motion
planning.
Recent approaches to 3D object detection exploit differ-
ent data sources. Camera-based approaches utilize either
monocular [1] or stereo images [2]. However, accurate 3D
estimation from 2D images is difficult, particularly at long
range. With the popularity of inexpensive RGB-D sen-
sors such as Microsoft Kinect, Intel RealSense and Apple
PrimeSense, several approaches that utilize depth informa-
tion and fuse it with RGB images have been developed
[32, 33]. They have been shown to achieve significant per-
formance gains over monocular methods. In the context
of autonomous driving, high-end sensors like LIDAR (Light
Detection And Ranging) are more common because higher
accuracy is needed for safety. The major difficulty in deal-
ing with LIDAR data is that the sensor produces unstruc-
tured data in the form of a point cloud containing typically
around 10^5 3D points per 360-degree sweep. This poses a
large computational challenge for modern detectors.
Different forms of point cloud representation have been
explored in the context of 3D object detection. The main
idea is to form a structured representation where standard
convolution operations can be applied. Existing representa-
tions are mainly divided into two types: 3D voxel grids and
2D projections. A 3D voxel grid transforms the point cloud
into a regularly spaced 3D grid, where each voxel cell can
contain a scalar value (e.g., occupancy) or vector data (e.g.,
hand-crafted statistics computed from the points within that
voxel cell). 3D convolution is typically applied to extract
high-order representation from the voxel grid [6]. However,
since point clouds are sparse by nature, the voxel grid is
very sparse, and therefore a large proportion of the compu-
tation is wasted on empty voxels. As a result, typical systems
[6, 37, 20] only run at 1-2 FPS.
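To make the voxelization step concrete, below is a minimal sketch of binning a point cloud into a binary occupancy grid. The ranges, resolution, and function name are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def voxelize_occupancy(points,
                       x_range=(0.0, 70.0),
                       y_range=(-40.0, 40.0),
                       z_range=(-2.5, 1.5),
                       voxel_size=0.1):
    """Bin an (N, 3) LIDAR point cloud into a binary 3D occupancy grid.

    All ranges and the 0.1 m resolution are assumptions for illustration.
    """
    lo = np.array([x_range[0], y_range[0], z_range[0]])
    hi = np.array([x_range[1], y_range[1], z_range[1]])

    # Discard points outside the region of interest.
    mask = np.all((points >= lo) & (points < hi), axis=1)
    pts = points[mask]

    # Metric coordinates -> integer voxel indices.
    idx = np.floor((pts - lo) / voxel_size).astype(np.int64)

    shape = tuple(int(round((h - l) / voxel_size)) for l, h in zip(lo, hi))
    grid = np.zeros(shape, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # mark occupied voxels
    return grid
```

At 0.1 m resolution the example region spans roughly 700 x 800 x 40, about 22 million voxels, so a sweep of around 10^5 points occupies well under 1% of the grid. This is exactly the sparsity that makes dense 3D convolution over such grids wasteful.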
An alternative is to project the point cloud onto a plane,
which is then discretized into a 2D image-based representa-
tion where 2D convolutions are applied. During discretiza-
tion, hand-crafted features (or statistics) are computed as
pixel values of the 2D image [3]. Commonly used projec-
tions are range view (i.e., 360-degree panoramic view) and
bird’s eye view (i.e., top-down view). These 2D projection
based representations are more compact, but they introduce
information loss during projection and discretization. For ex-
ample, the range-view projection distorts object size and
shape. To alleviate the information loss, MV3D [3] pro-
poses to fuse the 2D projections with the camera image to
bring additional information. However, the fused model has
nearly linear computation cost with respect to the number of
input modalities, making real-time application infeasible.
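For illustration, here is a minimal sketch of a BEV discretization with MV3D-style hand-crafted channels (max height, max intensity, normalized point density). The exact channels, ranges, and normalization in [3] differ, so treat everything here as an assumption.

```python
import numpy as np

def bev_feature_maps(points,
                     x_range=(0.0, 70.0),
                     y_range=(-40.0, 40.0),
                     resolution=0.1):
    """Project an (N, 4) LIDAR cloud (x, y, z, intensity) into 2D BEV maps."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    h = int(round((x_range[1] - x_range[0]) / resolution))
    w = int(round((y_range[1] - y_range[0]) / resolution))
    rows = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int64)
    cols = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int64)

    height = np.full((h, w), -np.inf, dtype=np.float32)  # max height per cell
    intensity = np.zeros((h, w), dtype=np.float32)       # assumes reflectance in [0, 1]
    density = np.zeros((h, w), dtype=np.float32)         # point count per cell

    np.maximum.at(height, (rows, cols), pts[:, 2])
    np.maximum.at(intensity, (rows, cols), pts[:, 3])
    np.add.at(density, (rows, cols), 1.0)

    height[np.isinf(height)] = 0.0                       # empty cells -> 0
    # Log-normalized density, an MV3D-style choice assumed here.
    density = np.minimum(1.0, np.log1p(density) / np.log(64.0))
    return np.stack([height, intensity, density], axis=0)  # (3, H, W)
```

The resulting (3, H, W) tensor can be fed to any standard 2D convolutional backbone, which is what makes this representation compact and efficient.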
In this paper, we propose an accurate real-time 3D object
detector, which we call PIXOR (ORiented 3D object de-
tection from PIXel-wise neural network predictions), that
operates on 3D point clouds. PIXOR is a single-stage,
proposal-free detector that outputs oriented 3D object esti-
mates decoded from pixel-wise neural network predictions.
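To give intuition for what decoding pixel-wise predictions means, the sketch below turns a dense objectness map and per-pixel box regressions into oriented BEV boxes. The six-channel parameterization (cos θ, sin θ, position offsets, log sizes) and all names are plausible assumptions for illustration, not PIXOR's exact output definition.

```python
import numpy as np

def decode_pixel_predictions(cls_map, reg_map, score_thresh=0.5, resolution=0.1):
    """Decode dense per-pixel network outputs into oriented BEV boxes.

    cls_map: (H, W) objectness scores.
    reg_map: (6, H, W) with channels (cos t, sin t, dx, dy, log w, log l);
             an assumed parameterization, not necessarily PIXOR's exact one.
    """
    boxes = []
    rows, cols = np.nonzero(cls_map > score_thresh)
    for r, c in zip(rows, cols):
        cos_t, sin_t, dx, dy, log_w, log_l = reg_map[:, r, c]
        theta = np.arctan2(sin_t, cos_t)        # heading angle
        cx = r * resolution + dx                # pixel position + offset (meters)
        cy = c * resolution + dy
        w, l = np.exp(log_w), np.exp(log_l)     # sizes regressed in log space
        boxes.append((cx, cy, w, l, theta, cls_map[r, c]))
    # In practice, non-maximum suppression would prune duplicates in `boxes`.
    return boxes
```

Because every BEV pixel directly emits a candidate box, no separate proposal stage is needed; overlapping candidates are typically pruned with non-maximum suppression.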