PIXOR: Real-time 3D Object Detection from Point Clouds
Bin Yang, Wenjie Luo, Raquel Urtasun
Uber Advanced Technologies Group
University of Toronto
{byang10, wenjie, urtasun}@uber.com
Abstract
We address the problem of real-time 3D object detec-
tion from point clouds in the context of autonomous driv-
ing. Speed is critical as detection is a necessary compo-
nent for safety. Existing approaches, however, are compu-
tationally expensive due to the high dimensionality of point clouds.
We utilize the 3D data more efficiently by representing the
scene from the Bird’s Eye View (BEV), and propose PIXOR,
a proposal-free, single-stage detector that outputs oriented
3D object estimates decoded from pixel-wise neural net-
work predictions. The input representation, network archi-
tecture, and model optimization are specially designed to
balance high accuracy and real-time efficiency. We validate
PIXOR on two datasets: the KITTI BEV object detection
benchmark, and a large-scale 3D vehicle detection bench-
mark. On both datasets we show that the proposed detector
notably surpasses other state-of-the-art methods in terms of
Average Precision (AP), while still running at 10 FPS.
1. Introduction
Over the last few years we have seen a plethora of meth-
ods that exploit Convolutional Neural Networks to produce
accurate 2D object detections, typically from a single image
[12, 11, 28, 4, 27, 23]. However, in robotics applications
such as autonomous driving we are interested in detecting
objects in 3D space, which is fundamental for safe motion
planning.
Recent approaches to 3D object detection exploit differ-
ent data sources. Camera-based approaches utilize either
monocular [1] or stereo images [2]. However, accurate 3D
estimation from 2D images is difficult, particularly at long
range. With the popularity of inexpensive RGB-D sen-
sors such as Microsoft Kinect, Intel RealSense and Apple
PrimeSense, several approaches that utilize depth informa-
tion and fuse it with RGB images have been developed
[32, 33]. They have been shown to achieve significant per-
formance gains over monocular methods. In the context
of autonomous driving, high-end sensors like LIDAR (Light
Detection And Ranging) are more common because higher
accuracy is needed for safety. The major difficulty in deal-
ing with LIDAR data is that the sensor produces unstruc-
tured data in the form of a point cloud containing typically
around 10^5 3D points per 360-degree sweep. This poses a
large computational challenge for modern detectors.
Different forms of point cloud representation have been
explored in the context of 3D object detection. The main
idea is to form a structured representation where standard
convolution operations can be applied. Existing representa-
tions are mainly divided into two types: 3D voxel grids and
2D projections. A 3D voxel grid transforms the point cloud
into a regularly spaced 3D grid, where each voxel cell can
contain a scalar value (e.g., occupancy) or vector data (e.g.,
hand-crafted statistics computed from the points within that
voxel cell). 3D convolution is typically applied to extract
high-order representation from the voxel grid [6]. However,
since point clouds are sparse by nature, the voxel grid is
very sparse, and therefore a large proportion of the compu-
tation is wasted on empty voxels. As a result, typical systems
[6, 37, 20] only run at 1-2 FPS.
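To make the voxelization step concrete, below is a minimal sketch of binning a point cloud into a binary occupancy grid. The ranges, resolution, and function name are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def voxelize_occupancy(points,
                       x_range=(0.0, 70.0),
                       y_range=(-40.0, 40.0),
                       z_range=(-2.5, 1.5),
                       voxel_size=0.1):
    """Bin an (N, 3) LIDAR point cloud into a binary 3D occupancy grid.

    All ranges and the 0.1 m resolution are assumptions for illustration.
    """
    lo = np.array([x_range[0], y_range[0], z_range[0]])
    hi = np.array([x_range[1], y_range[1], z_range[1]])

    # Discard points outside the region of interest.
    mask = np.all((points >= lo) & (points < hi), axis=1)
    pts = points[mask]

    # Metric coordinates -> integer voxel indices.
    idx = np.floor((pts - lo) / voxel_size).astype(np.int64)

    shape = tuple(int(round((h - l) / voxel_size)) for l, h in zip(lo, hi))
    grid = np.zeros(shape, dtype=np.float32)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0  # mark occupied voxels
    return grid
```

At 0.1 m resolution the example region spans roughly 700 x 800 x 40, about 22 million voxels, so a sweep of around 10^5 points occupies well under 1% of the grid. This is exactly the sparsity that makes dense 3D convolution over such grids wasteful.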
An alternative is to project the point cloud onto a plane,
which is then discretized into a 2D image-based representa-
tion where 2D convolutions are applied. During discretiza-
tion, hand-crafted features (or statistics) are computed as
pixel values of the 2D image [3]. Commonly used projec-
tions are range view (i.e., 360-degree panoramic view) and
bird’s eye view (i.e., top-down view). These 2D projection
based representations are more compact, but they introduce
information loss during projection and discretization. For ex-
ample, the range-view projection distorts object size and
shape. To alleviate the information loss, MV3D [3] pro-
poses to fuse the 2D projections with the camera image to
bring additional information. However, the fused model has
nearly linear computation cost with respect to the number of
input modalities, making real-time application infeasible.
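For illustration, here is a minimal sketch of a BEV discretization with MV3D-style hand-crafted channels (max height, max intensity, normalized point density). The exact channels, ranges, and normalization in [3] differ, so treat everything here as an assumption.

```python
import numpy as np

def bev_feature_maps(points,
                     x_range=(0.0, 70.0),
                     y_range=(-40.0, 40.0),
                     resolution=0.1):
    """Project an (N, 4) LIDAR cloud (x, y, z, intensity) into 2D BEV maps."""
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]

    h = int(round((x_range[1] - x_range[0]) / resolution))
    w = int(round((y_range[1] - y_range[0]) / resolution))
    rows = ((pts[:, 0] - x_range[0]) / resolution).astype(np.int64)
    cols = ((pts[:, 1] - y_range[0]) / resolution).astype(np.int64)

    height = np.full((h, w), -np.inf, dtype=np.float32)  # max height per cell
    intensity = np.zeros((h, w), dtype=np.float32)       # assumes reflectance in [0, 1]
    density = np.zeros((h, w), dtype=np.float32)         # point count per cell

    np.maximum.at(height, (rows, cols), pts[:, 2])
    np.maximum.at(intensity, (rows, cols), pts[:, 3])
    np.add.at(density, (rows, cols), 1.0)

    height[np.isinf(height)] = 0.0                       # empty cells -> 0
    # Log-normalized density, an MV3D-style choice assumed here.
    density = np.minimum(1.0, np.log1p(density) / np.log(64.0))
    return np.stack([height, intensity, density], axis=0)  # (3, H, W)
```

The resulting (3, H, W) tensor can be fed to any standard 2D convolutional backbone, which is what makes this representation compact and efficient.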
In this paper, we propose an accurate real-time 3D object
detector, which we call PIXOR (ORiented 3D object de-
tection from PIXel-wise neural network predictions), that
operates on 3D point clouds. PIXOR is a single-stage,
proposal-free detector that outputs oriented 3D object esti-
mates decoded from pixel-wise neural network predictions.
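To give intuition for what decoding pixel-wise predictions means, the sketch below turns a dense objectness map and per-pixel box regressions into oriented BEV boxes. The six-channel parameterization (cos θ, sin θ, position offsets, log sizes) and all names are plausible assumptions for illustration, not PIXOR's exact output definition.

```python
import numpy as np

def decode_pixel_predictions(cls_map, reg_map, score_thresh=0.5, resolution=0.1):
    """Decode dense per-pixel network outputs into oriented BEV boxes.

    cls_map: (H, W) objectness scores.
    reg_map: (6, H, W) with channels (cos t, sin t, dx, dy, log w, log l);
             an assumed parameterization, not necessarily PIXOR's exact one.
    """
    boxes = []
    rows, cols = np.nonzero(cls_map > score_thresh)
    for r, c in zip(rows, cols):
        cos_t, sin_t, dx, dy, log_w, log_l = reg_map[:, r, c]
        theta = np.arctan2(sin_t, cos_t)        # heading angle
        cx = r * resolution + dx                # pixel position + offset (meters)
        cy = c * resolution + dy
        w, l = np.exp(log_w), np.exp(log_l)     # sizes regressed in log space
        boxes.append((cx, cy, w, l, theta, cls_map[r, c]))
    # In practice, non-maximum suppression would prune duplicates in `boxes`.
    return boxes
```

Because every BEV pixel directly emits a candidate box, no separate proposal stage is needed; overlapping candidates are typically pruned with non-maximum suppression.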