pect that an auxiliary classifier can improve target domain
labels because it may use cues that have not been utilized by
the original detection model. Such additional cues can come, for example, from extra input data (e.g., motion or optical flow), from different network architectures, or from ensembles of models. We note, however, that the auxiliary image classification model is used only during the retraining phase, so the computational complexity of the final detector is preserved at test time.
The contributions of this paper are summarized as fol-
lows: i) We provide the first (to the best of our knowledge)
formulation of domain adaptation in object detection as ro-
bust learning. ii) We propose a novel robust object detection framework that accounts for noise in the training data, in both the object labels and their locations. We use Faster R-CNN [45] as our base object detector, but our framework is general and could, in principle, be adapted to other detectors (e.g., SSD [31] and YOLO [43]) that minimize a classification loss and regress
bounding boxes. iii) We use an independent classification
refinement module to allow other sources of information
from the target domain (e.g. motion, geometry, background
information) to be integrated seamlessly. iv) We demonstrate that this robust framework achieves state-of-the-art results on several cross-domain detection tasks.
2. Previous Work
Object Detection: The first approaches to object detec-
tion used a sliding window followed by a classifier based
on hand-crafted features [6, 11, 58]. After advances in
deep convolutional neural networks, methods such as R-CNN [19], SPPNet [22], and Fast R-CNN [18] emerged that use CNNs for feature extraction and classification. Slow
sliding window algorithms were replaced with faster region
proposal methods such as selective search [53]. Recent object detection methods further speed up bounding box detection. For example, Faster R-CNN [45] introduced a region proposal network (RPN) that predicts refinements to the locations and sizes of predefined anchor boxes. In SSD [31], classification and bounding box prediction are performed on feature maps at different scales using anchor boxes with different aspect ratios. In YOLO [42], detection is cast as a regression problem on a grid: for each grid cell, the bounding box and the class label of the object centered at that cell are predicted. Newer extensions are found
in [63, 43, 5]. A comprehensive comparison of methods is
reported in [25]. The goal of this paper is to increase the
accuracy of an object detector in a new domain regardless
of speed. Consequently, we base our improvements on
Faster R-CNN, a slower but accurate detector.
Our adoption of Faster R-CNN also allows for direct comparison with the state-of-the-art [2].
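For reference, the anchor refinement in [45] uses the standard box parameterization: for an anchor $(x_a, y_a, w_a, h_a)$ and a box $(x, y, w, h)$, the regression targets are
\[
t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a),
\]
and the network predicts these offsets together with a class score for each anchor.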
Domain Adaptation: Domain adaptation was initially studied for image classification, and the majority of the domain adaptation literature focuses on this problem [10, 9, 29, 21, 20, 12, 48, 32, 33, 14, 13, 17, 1, 37, 30]. Some of the methods developed in this context include cross-domain kernel learning methods such as adaptive multiple kernel learning (A-MKL) [10], domain transfer multiple kernel learning (DTMKL) [9], and geodesic flow kernel (GFK) [20].
There is a wide variety of approaches aimed at obtaining domain-invariant predictors: supervised learning of non-linear transformations between domains using asymmetric metric learning [29], unsupervised learning of intermediate representations [21], alignment of the target and source domain subspaces using eigenvector covariances [12], alignment of second-order statistics to minimize the shift between domains [48], and a covariance matrix alignment approach [59]. The rise of deep learning brought with it steps
towards domain-invariant feature learning. In [32, 33], a reproducing kernel Hilbert space embedding of the hidden network features is learned, and mean-embedding matching is performed between the two domain distributions. In [14, 13], a domain classifier is trained with an adversarial loss so that the learned features are discriminative and domain invariant.
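Schematically, these adversarial approaches seek a saddle point of the form
\[
\min_{\theta_f, \theta_y} \; \max_{\theta_d} \; \mathcal{L}_{cls}(\theta_f, \theta_y) - \lambda \, \mathcal{L}_{dom}(\theta_f, \theta_d),
\]
where $\theta_f$, $\theta_y$, and $\theta_d$ denote the parameters of the feature extractor, label predictor, and domain classifier, respectively, and $\lambda$ balances the two losses (the notation here is ours, not that of [14, 13]).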
There is less work in domain adaptation for object de-
tection. Domain adaptation methods for tasks other than image classification include [15] for fine-grained recognition, [3, 24, 64, ?] for semantic segmentation, [?] for dataset generation, and [?] for finding out-of-distribution data in active learning. For object detection itself, [61] used an
adaptive SVM to reduce the domain shift, [41] performed
subspace alignment on the features extracted from R-CNN,
and [2] used Faster R-CNN as a baseline and took an adversarial approach (similar to [13]) to learn domain-invariant features jointly on the target and source domains. We take a fundamentally different approach by reformulating the problem as learning with noisy labels: we design a robust-to-noise training scheme for object detection that is trained on noisy bounding boxes and labels acquired from the target domain as pseudo-ground-truth.
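For intuition, the pseudo-ground-truth can be viewed as the confident detections produced by a source-trained detector on unlabeled target images. The Python sketch below illustrates only this data-collection step under that assumption; the function name and confidence threshold are ours for illustration, and the robust training itself is described in Section 3.

    def collect_pseudo_ground_truth(detector, target_images, score_threshold=0.8):
        # Run a source-trained detector on unlabeled target images and keep its
        # confident detections as noisy (pseudo) ground-truth boxes and labels.
        pseudo_ground_truth = []
        for image in target_images:
            boxes, labels, scores = detector(image)  # numpy-style arrays assumed
            keep = scores >= score_threshold         # illustrative confidence filter
            pseudo_ground_truth.append(
                {"image": image, "boxes": boxes[keep], "labels": labels[keep]}
            )
        return pseudo_ground_truth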
Noisy Labeling: Previous work on robust learning has fo-
cused on image classification, where the classes are few and disjoint. Early work used instance-independent noise
models, where each class is confused with other classes in-
dependent of the instance content [39, 36, 40, 47, 65, 62].
Recently, the literature has shifted towards instance-specific
label noise prediction [60, 35, 54, 55, 56, 57, 51, 27, 7, 44].
To the best of our knowledge, ours is the first proposal for
an object detection model that is robust to label noise.
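To make the distinction concrete, instance-independent models describe label corruption with a class-confusion matrix $T$, $T_{kj} = p(\tilde{y} = j \mid y = k)$, that does not depend on the image $x$, so that
\[
p(\tilde{y} = j \mid x) = \sum_{k} T_{kj} \, p(y = k \mid x),
\]
whereas the instance-specific methods above let the corruption depend on $x$ itself (notation ours, for illustration).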
3. Method
Following the common formulation for domain adapta-
tion, we represent the training data space as the source do-