PRW that consists of 932 identities, with bounding boxes across 11,816
frames. The dataset comes with annotations
and extensive baselines to evaluate the impacts of detection
and recognition methods on person re-ID accuracy.
In Section 4, we leverage the volume of the PRW dataset
to train state-of-the-art detectors such as R-CNN [15], with
various convolutional neural network (CNN) architectures
such as AlexNet [19], VGGNet [31] and ResidualNet [17].
Several well-known descriptors and distance metrics are also
considered for person re-ID. However, our joint setup al-
lows two further improvements in Section 4.2. First, we
propose a cascaded fine-tuning strategy to make full use of
the detection data provided by PRW, which results in improved
CNN embeddings. Two CNN variants are derived
w.r.t. the fine-tuning strategies. Novel insights can be learned
from the new fine-tuning method. Second, we propose a
Confidence Weighted Similarity (CWS) metric that incor-
porates detection scores. Assigning lower weights to false
positive detections prevents a drop in re-ID accuracy due to
the increase in gallery size with the use of detectors.
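To illustrate the idea behind CWS, the following is a minimal sketch, not the exact formulation used in the paper. It assumes that detection scores are min-max normalized to [0, 1] over the gallery and that the weight is applied multiplicatively to a cosine similarity; the function names and the toy features are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize_scores(scores):
    """Min-max normalize detection scores to [0, 1] (an assumed choice)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [1.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def rank_gallery(query, gallery_feats, det_scores, use_cws=True):
    """Return gallery indices sorted from most to least similar to the query.

    With use_cws=True each similarity is multiplied by the normalized
    detection score, so low-confidence boxes are pushed down the ranking.
    """
    if use_cws:
        weights = normalize_scores(det_scores)
    else:
        weights = [1.0] * len(det_scores)
    sims = [w * cosine_similarity(query, g)
            for g, w in zip(gallery_feats, weights)]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)


# Toy gallery: box 1 resembles the query but has the lowest detection score,
# as a false positive detection might.
query = [1.0, 0.0]
gallery = [[0.9, 0.1], [1.0, 0.0], [0.2, 1.0]]
det_scores = [2.0, -1.0, 0.5]

plain_ranking = rank_gallery(query, gallery, det_scores, use_cws=False)
cws_ranking = rank_gallery(query, gallery, det_scores, use_cws=True)
```

In this toy case the low-confidence box ranks first under plain cosine similarity but last once the detection score is used as a weight.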
Given a dataset like PRW that allows simultaneous eval-
uation of detection and re-ID, it is natural to consider
whether any complementarity exists between the two tasks.
For a particular re-ID method, it is intuitive that a bet-
ter detector should yield better accuracy. But we argue
that the criteria for determining a detector as better are
application-dependent. Previous works in pedestrian de-
tection [10, 28, 43] usually use Average Precision or Log-
Average Miss Rate under IoU > 0.5 for evaluation. However,
through extensive benchmarking on the proposed PRW
dataset, we find in Section 5 that IoU > 0.7 is a more effec-
tive rule in indicating detector influences on re-ID accuracy.
In other words, the localization ability of detectors plays a
critical role in re-ID.
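The effect of the IoU threshold can be seen in a small sketch. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the helper names and the example boxes are hypothetical and only show how one detection can count as correct under IoU > 0.5 yet fail under IoU > 0.7.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(detection, ground_truth, threshold):
    """A detection matches the ground truth if its IoU exceeds the threshold."""
    return iou(detection, ground_truth) > threshold


ground_truth = (0, 0, 50, 100)   # hypothetical annotated pedestrian box
detection = (5, 10, 55, 110)     # slightly shifted detector output

overlap = iou(detection, ground_truth)                          # 4050 / 5950
loose_match = is_true_positive(detection, ground_truth, 0.5)    # accepted
strict_match = is_true_positive(detection, ground_truth, 0.7)   # rejected
```

A shift of only a few pixels is enough to move this box from a match to a miss when the threshold is raised, which is why the stricter criterion rewards detectors with better localization.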
Figure 1 presents the pipeline of the end-to-end re-ID
system discussed in this paper. Starting from raw video
frames, a gallery is created by pedestrian detectors. Given a
query person-of-interest, gallery bounding boxes are ranked
according to their similarity with the query. To summarize,
our main contributions are:
• A novel large-scale dataset, Person Re-identification in
  the Wild (PRW), for simultaneous analysis of person
  detection and re-ID.
• Comprehensive benchmarking of state-of-the-art detection
  and recognition methods on the PRW dataset.
• Novel insights into how detection aids re-ID, along with
  an effective fine-tuning strategy and similarity measure
  to illustrate how they might be utilized.
• Novel insights into the evaluation of pedestrian detectors
  for the specific application of person re-ID.
Figure 2: Annotation interface. All appearing pedestrians
are annotated with a bounding box and ID. ID ranges from 1
to 932, and -2 stands for ambiguous persons.
2. Related Work
An overview of existing re-ID datasets. In recent years,
a number of person re-ID datasets have been released [16,
20, 21, 44, 45, 48]. They vary in the numbers of IDs and boxes
they contain (see Table 1). Despite some differences
among them, a common property is that the pedestrians are
confined within pre-defined bounding boxes that are either
hand-drawn (e.g., VIPeR [16], iLIDS [48], CUHK02 [20])
or obtained using detectors (e.g., CUHK03 [21], Market-
1501 [45] and MARS [44]). PRW is a follow-up to our
previous releases [44, 45] and requires considering the entire
pipeline for person re-ID from scratch.
Pedestrian detection. Recent pedestrian detection works
feature the “proposal+CNN” approach. Pedestrian detection
usually employs weak pedestrian detectors as proposals,
which achieves relatively high recall using very
few proposals [24, 27–29]. Despite the impressive recent
progress in pedestrian detection, it has rarely been considered
with person re-ID as an application. This paper attempts
to determine how detection can help re-ID and to provide
insights into assessing detector performance.
Person re-ID. Recent progress in person re-ID mainly
comes from deep learning. Several works [1, 8, 21, 40, 44]
focus on learning features and metrics through the CNN
framework. Formulating person re-ID as a ranking task, these
works feed image pairs [1, 21, 40] or triplets [8] into a CNN. It is
also shown in [47] that deep learning using the identification
model [35, 44, 50] yields even higher accuracy than the
siamese model. With a sufficient amount of training data per
ID, we thus adopt the identification model to learn a CNN
embedding in the pedestrian subspace. We refer readers to
our recent works [47, 50] for details.
Detection and re-ID. To our knowledge, two previous
works focus on such end-to-end systems. In [42], persons in
photo albums are detected using poselets [4] and recognition
is performed using face and global signatures. However, the
setting in [42] is not typical for person re-ID where pedes-
trians are observed by surveillance cameras and faces are
not clear enough. In a work closer to ours, Xu et al. [39]
jointly model pedestrian commonness and uniqueness, and
calculate the similarity between the query and each sliding
window in a brute-force manner. While [39] works on datasets