1558-1748 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2020.2991741, IEEE Sensors Journal
2 IEEE SENSORS JOURNAL, VOL. XX, NO. XX, XXXX 2020
represent the scene using RF-intensity-based spatial heat-maps, reflection point clouds, or range-Doppler maps, rather than a true-color image representation. Radars are therefore primarily used for target localization and tracking applications. Furthermore, object classification becomes non-trivial with radar data, and the lack of available labeled radar datasets for this task makes it even more challenging.
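As an illustration of one of the representations mentioned above, a range-Doppler map can be formed by taking a 2-D FFT over the fast-time (per-chirp) and slow-time (across-chirp) samples of an FMCW radar frame. The sketch below is a minimal, hedged example; the frame dimensions and the `range_doppler_map` helper are illustrative assumptions, not part of any specific radar toolchain.

```python
import numpy as np

def range_doppler_map(raw: np.ndarray) -> np.ndarray:
    """Return the magnitude range-Doppler map of one radar frame.

    `raw` is a hypothetical (num_chirps, num_samples) array of beat-signal
    samples: slow time along axis 0, fast time along axis 1.
    """
    # Range FFT along fast time (the ADC samples within each chirp).
    range_fft = np.fft.fft(raw, axis=1)
    # Doppler FFT along slow time (across chirps), shifted so that
    # zero velocity sits in the middle of the Doppler axis.
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
    return np.abs(doppler_fft)

# Synthetic frame: 64 chirps x 256 ADC samples.
frame = np.random.randn(64, 256)
rd_map = range_doppler_map(frame)
print(rd_map.shape)  # (64, 256): Doppler bins x range bins
```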
Traditionally, radar systems have been size- and cost-intensive, targeted primarily at commercial and defense applications. However, continuing advances in micro-electronics fabrication and manufacturing techniques, including Radio Frequency Integrated Circuits (RFICs), have significantly reduced the size and cost of electronic sensors, making them more accessible to the public [12]–[14]. mmWave automotive radars are an example of this technology. They are low-power, compact, and extremely practical to deploy. Furthermore, mmWave radars provide a high-resolution point-cloud representation of the scene and have therefore emerged as one of the primary sensors in platforms ranging from small-scale autonomous robots to commercial applications such as autonomous vehicles. Higher operating bandwidths also allow mmWave radars to roughly generate the contour of the human body without extracting facial information, thus preserving user privacy.
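To illustrate the point-cloud representation described above, radar detections reported in spherical coordinates (range, azimuth, elevation) can be converted to Cartesian points. The sketch below is a minimal example under assumed angle conventions (y along the sensor boresight); the `to_point_cloud` helper is hypothetical and for illustration only.

```python
import numpy as np

def to_point_cloud(r: np.ndarray, az: np.ndarray, el: np.ndarray) -> np.ndarray:
    """Convert detections to an (N, 3) Cartesian point cloud.

    r is range in metres; az/el are azimuth/elevation in radians.
    Convention (an assumption): y points along the radar boresight,
    x to the right, z up.
    """
    x = r * np.cos(el) * np.sin(az)
    y = r * np.cos(el) * np.cos(az)
    z = r * np.sin(el)
    return np.stack([x, y, z], axis=-1)

# One detection 5 m away, straight ahead, at sensor height:
pc = to_point_cloud(np.array([5.0]), np.array([0.0]), np.array([0.0]))
print(pc)  # → [[0. 5. 0.]]
```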
In this paper, we propose mm-Pose, a novel real-time approach to estimate and track the human skeleton using mmWave
radars and convolutional neural networks (CNNs). A potential
depiction of its application in traffic monitoring systems and
autonomous vehicles is shown in Fig. 1. To the best of the
authors’ knowledge, this is the first method that uses mmWave
radar reflection signals to estimate the real-world position of
>15 distinct joints of a human body. mm-Pose could also
find applications in (i) privacy-protected automated patient
monitoring systems, and (ii) aiding defense forces in a hostage
situation. Radars carrying this technology on unmanned aerial vehicles (UAVs) could scan a building and map the live skeletal postures of the hostages and adversaries through the walls, which would not otherwise be possible with vision sensors.
The paper is organized as follows. Section II summarizes current skeleton-tracking work in the computer vision (CV) community and its extension to RF sensors. A concise background
theory around the two fundamental blocks of the system,
namely (i) radar signal processing chain and (ii) machine
learning and neural networks is presented in Section III.
The detailed approach, novel data representation and system
architecture are presented in Section IV, followed by the
experimental results and discussion in Section V. Finally, the
study is summarized and concluded in Section VI.
II. LITERATURE REVIEW
Accurately estimating and tracking human posture is critical in several applications, as the estimated pose is key to inferring a subject's specific behavior. Over the last decade, scientists have explored various approaches to estimating human pose. One of the early works, in 2005, was Strike a Pose,
proposed by researchers at Oxford, that would detect humans
in a specific pose by identifying 10 distinct body parts/limbs
using rectangular templates from RGB images/videos [15]. A
k-poselet-based keypoint detection scheme was proposed in 2016 that uses predicted torso keypoint activations to detect
multiple persons using agglomerative clustering [16]. Another
approach was to use region-based CNN (R-CNN) to learn N
masks, to detect each of the N distinct key-points to construct
the skeleton from images, using a ResNet variant architecture
[17]. In 2016, DeeperCut, an improvement over the multi-person pose estimation model DeepCut, was proposed; it used a bottom-up approach with a fine-tuned ResNet architecture that doubled the then state-of-the-art estimation accuracy with a three-order-of-magnitude reduction in run-time [18], [19]. A top-down
approach to pose estimation was proposed by Google, that
first identified regions in the image containing people using
R-CNN, and then used a fully convolutional ResNet architecture with aggregation to obtain the keypoint predictions, yielding a precision of 0.649 on the Common Objects in Context
(COCO) test-dev set [20]. Another extremely popular bottom-
up approach for human pose estimation is OpenPose, proposed
by researchers at Carnegie Mellon University in 2017 [21]. OpenPose uses Part Affinity Fields (PAFs), a non-parametric representation of different body parts, and then associates them with individuals in the scene. This real-time algorithm achieved strong results on the MPII dataset and won the 2016 COCO key-points challenge [22]. Its cross-platform versatility and open-source datasets have also made OpenPose the most popular benchmark for generating highly accurate ground-truth datasets for training.
While the aforementioned approaches paved the way to-
wards human pose and skeleton tracking, they were limited
to 2-D estimation on account of the images/videos being
collected from monocular cameras. While monocular cameras provide high-resolution information about the azimuth and elevation of objects, extracting depth from monocular vision sensors is extremely challenging and non-trivial. To model a 3-D representation of the skeletal joints, the HumanEva dataset was created by researchers at the University of Toronto [23]. The
dataset was created using 7 synchronized video cameras (3 RGB + 4 grayscale) in a circular array, capturing the entire scene in its field-of-view. The human subject performed 5 different motions, with reflective markers placed on specific joint locations to track the motion, and a ViconPeak commercial motion-capture system was used to obtain the 3-D ground-truth pose of the body. Another
approach to extract 3-D skeletal joint information is by using
Microsoft Kinect [24]. The Kinect consists of an RGB and an infrared (IR) camera, which allow it to capture the scene in 3-D space. It uses a per-pixel classification approach to first identify the human body parts, followed by joint estimation by finding the global centroid of the probability mass for each identified part. However, the downsides of vision-based sensors for skeletal tracking are that their performance is severely hindered by poor lighting and occlusion. Moreover, as previously introduced, privacy concerns restrict the use of vision-based sensors in several applications.
Studies have previously made use of micro-Doppler signatures to determine human behavior using RF signals; however, these did not provide spatial information about the subjects' locations