Difficulties include non-Lambertian surfaces (e.g., reflectance, transparency), large displacements (e.g., high speed), a large variety of materials (e.g., matte vs. shiny), as well as different lighting conditions (e.g., sunny vs. cloudy).
Our 3D visual odometry / SLAM dataset consists of
22 stereo sequences, with a total length of 39.2 km. To
date, datasets falling into this category are either monocular
and short [43] or consist of low quality imagery [42, 4, 35].
They typically do not provide an evaluation metric, and as
a consequence there is no consensus on which benchmark
should be used to evaluate visual odometry / SLAM ap-
proaches. Thus often only qualitative results are presented,
with the notable exception of laser-based SLAM [28]. We
believe a fair comparison is possible in our benchmark due
to its large scale nature as well as the novel metrics we pro-
pose, which capture different sources of error by evaluating
error statistics over all sub-sequences of a given trajectory
length or driving speed.
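To make the metric concrete, the following sketch computes one such statistic: the average translational error of the relative motion over all sub-sequences of a fixed length. It is a simplified illustration rather than our development-kit code; the pose format (4 x 4 camera-to-world matrices) and the 100 m segment length are assumptions.

import numpy as np

def translation_error(gt_poses, est_poses, seg_len=100.0):
    # Average relative translation error over all sub-sequences of roughly
    # seg_len meters (illustrative sketch, not the official evaluation code).
    # gt_poses / est_poses: lists of 4x4 camera-to-world pose matrices.
    dists = np.zeros(len(gt_poses))
    for i in range(1, len(gt_poses)):
        dists[i] = dists[i - 1] + np.linalg.norm(
            gt_poses[i][:3, 3] - gt_poses[i - 1][:3, 3])
    errors = []
    for start in range(len(gt_poses)):
        ends = np.where(dists >= dists[start] + seg_len)[0]
        if len(ends) == 0:
            break  # trajectory too short for further sub-sequences
        end = ends[0]
        # Relative motion over the sub-sequence for ground truth and estimate.
        rel_gt = np.linalg.inv(gt_poses[start]) @ gt_poses[end]
        rel_est = np.linalg.inv(est_poses[start]) @ est_poses[end]
        # Residual motion between the two, normalized by the segment length.
        residual = np.linalg.inv(rel_gt) @ rel_est
        errors.append(np.linalg.norm(residual[:3, 3]) / seg_len)
    return float(np.mean(errors)) if errors else 0.0

The same loop can be repeated for several segment lengths, or with sub-sequences binned by driving speed, to obtain the error statistics described above.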
Our 3D object benchmark focuses on computer vision
algorithms for object detection and 3D orientation estima-
tion. While existing benchmarks for those tasks do not pro-
vide accurate 3D information [17, 39, 15, 16] or lack realism [33, 31, 34], our dataset provides accurate 3D bounding
boxes for object classes such as cars, vans, trucks, pedes-
trians, cyclists and trams. We obtain this information by
manually labeling objects in 3D point clouds produced by
our Velodyne system, and projecting them back into the im-
age. This results in tracklets with accurate 3D poses, which
can be used to assess the performance of algorithms for 3D
orientation estimation and 3D tracking.
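As an illustration of how such a label is used, the sketch below maps the eight corners of one 3D box, parameterized in Velodyne coordinates, into the image plane. The parameterization (center, size, yaw about the vertical axis) and the matrix names T_cam_velo and P_rect are assumptions made for this example, not a definitive interface.

import numpy as np

def project_box_corners(center, size, yaw, T_cam_velo, P_rect):
    # Project the 8 corners of a 3D bounding box (Velodyne coordinates)
    # into the rectified camera image. Illustrative sketch only.
    l, w, h = size
    # Corner offsets in the object frame, centered at the box center.
    x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    y = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    z = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
    R = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                  [np.sin(yaw),  np.cos(yaw), 0.0],
                  [0.0,          0.0,         1.0]])
    corners = R @ np.vstack([x, y, z]) + np.asarray(center).reshape(3, 1)
    corners_h = np.vstack([corners, np.ones((1, 8))])   # homogeneous, 4x8
    cam = T_cam_velo @ corners_h      # Velodyne -> (rectified) camera frame
    img = P_rect @ cam                # camera frame -> image plane, 3x8
    return img[:2] / img[2]           # 2x8 pixel coordinates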
In our experiments, we evaluate a representative set of
state-of-the-art systems using our benchmarks and novel
metrics. Perhaps not surprisingly, many algorithms that
do well on established datasets such as Middlebury [41, 2]
struggle on our benchmark. We conjecture that this might
be due to their assumptions which are violated in our sce-
narios, as well as overfitting to a small set of training (test)
images.
In addition to the benchmarks, we provide MAT-
LAB/C++ development kits for easy access. We also main-
tain an up-to-date online evaluation server (www.cvlibs.net/datasets/kitti). We hope that
our efforts will help increase the impact that visual recogni-
tion systems have in robotics applications.
2. Challenges and Methodology
Generating large-scale and realistic evaluation bench-
marks for the aforementioned tasks poses a number of chal-
lenges, including the collection of large amounts of data in
real time, the calibration of diverse sensors working at dif-
ferent rates, the generation of ground truth minimizing the
amount of supervision required, the selection of the appropriate sequences and frames for each benchmark, as well as
the development of metrics for each task. In this section we
discuss how we tackle these challenges.
2.1. Sensors and Data Acquisition
We equipped a standard station wagon with two color
and two grayscale PointGrey Flea2 video cameras (10 Hz,
resolution: 1392 × 512 pixels, opening: 90° × 35°), a Velo-
dyne HDL-64E 3D laser scanner (10 Hz, 64 laser beams,
range: 100 m), a GPS/IMU localization unit with RTK cor-
rection signals (open sky localization errors < 5 cm) and a
powerful computer running a real-time database [22].
We mounted all our cameras (i.e., two units, each com-
posed of a color and a grayscale camera) on top of our vehi-
cle. We placed one unit on the left side of the rack, and the
other on the right side. Our camera setup is chosen such
that we obtain a baseline of roughly 54 cm between the
same type of cameras and that the distance between color
and grayscale cameras is minimized (6 cm). We believe
this is a good setup since color images are very useful for
tasks such as segmentation and object detection, but provide
lower contrast and sensitivity compared to their grayscale
counterparts, which is of key importance in stereo matching
and optical flow estimation.
We use a Velodyne HDL-64E unit, as it is one of the few
sensors available that can provide accurate 3D information
from moving platforms. In contrast, structured-light sys-
tems such as the Microsoft Kinect do not work in outdoor
scenarios and have a very limited sensing range. To compensate for egomotion in the 3D laser measurements, we use the position information from our GPS/IMU system.
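Conceptually, this compensation transforms every laser return into a common reference frame using the vehicle pose at its capture time. The following sketch illustrates the idea; the pose_at interpolation function and the argument layout are assumptions made for this example, not part of our recording software.

import numpy as np

def compensate_scan(points, timestamps, pose_at, t_ref):
    # Undistort one laser sweep using GPS/IMU poses (illustrative sketch).
    # points:     Nx3 array in the sensor frame at each point's capture time
    # timestamps: N capture times (one sweep takes about 0.1 s at 10 Hz)
    # pose_at:    callable returning a 4x4 world-from-sensor pose at a given
    #             time, e.g. by interpolating the GPS/IMU stream (assumed)
    # t_ref:      reference time whose frame all points are mapped into
    T_ref_inv = np.linalg.inv(pose_at(t_ref))
    out = np.zeros((len(points), 3))
    for i, (p, t) in enumerate(zip(points, timestamps)):
        p_world = pose_at(t) @ np.append(p, 1.0)   # sensor -> world at time t
        out[i] = (T_ref_inv @ p_world)[:3]         # world -> reference frame
    return out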
2.2. Sensor Calibration
Accurate sensor calibration is key for obtaining reliable
ground truth. Our calibration pipeline proceeds as follows:
First, we calibrate the four video cameras intrinsically and
extrinsically and rectify the input images. We then find the
3D rigid motion parameters which relate the coordinate systems of the laser scanner, the localization unit and the refer-
ence camera. While our Camera-to-Camera and GPS/IMU-
to-Velodyne registration methods are fully automatic, the
Velodyne-to-Camera calibration requires the user to manu-
ally select a small number of correspondences between the
laser and the camera images. This was necessary as existing
techniques for this task are not accurate enough to compute
ground truth estimates.
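For intuition, producing such ground truth amounts to chaining the calibrated transforms: a point measured by the laser scanner can be mapped into the rectified reference camera image as sketched below. The matrix names (T_cam_velo, R_rect, P_rect) are illustrative assumptions rather than a definitive file format.

import numpy as np

def velodyne_to_image(X_velo, T_cam_velo, R_rect, P_rect):
    # Map a 3D point from Velodyne coordinates into the rectified reference
    # camera image by chaining the calibrated transforms (sketch only).
    # T_cam_velo: 4x4 rigid transform, Velodyne -> reference camera
    # R_rect:     4x4 rectifying rotation of the reference camera
    # P_rect:     3x4 projection matrix of the rectified camera
    X = np.append(np.asarray(X_velo, dtype=float), 1.0)   # homogeneous point
    x = P_rect @ (R_rect @ (T_cam_velo @ X))               # chained mapping
    return x[:2] / x[2]                                    # pixel coordinates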
Camera-to-Camera calibration. To automatically cali-
brate the intrinsic and extrinsic parameters of the cameras,
we mounted checkerboard patterns onto the walls of our
garage and detect corners in our calibration images. Based
on gradient information and discrete energy-minimization,
we assign corners to checkerboards, match them between