1558-1748 (c) 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JSEN.2020.2991741, IEEE Sensors Journal
2 IEEE SENSORS JOURNAL, VOL. XX, NO. XX, XXXX 2020
represent the scene using RF-intensity-based spatial heat-maps, reflection point clouds, or range-Doppler maps, rather than a true-color image representation. Radars are therefore primarily used for target localization and tracking applications. Furthermore, object classification becomes non-trivial with radar data, and the lack of available labeled radar datasets for this task makes it even more challenging.
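As an illustration of one of the representations mentioned above, a range-Doppler map can be formed by taking a 2-D FFT over the fast-time (per-chirp) and slow-time (across-chirp) samples of an FMCW radar frame. The sketch below is a minimal, hedged example; the frame dimensions and the `range_doppler_map` helper are illustrative assumptions, not part of any specific radar toolchain.

```python
import numpy as np

def range_doppler_map(raw: np.ndarray) -> np.ndarray:
    """Return the magnitude range-Doppler map of one radar frame.

    `raw` is a hypothetical (num_chirps, num_samples) array of beat-signal
    samples: slow time along axis 0, fast time along axis 1.
    """
    # Range FFT along fast time (the ADC samples within each chirp).
    range_fft = np.fft.fft(raw, axis=1)
    # Doppler FFT along slow time (across chirps), shifted so that
    # zero velocity sits in the middle of the Doppler axis.
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
    return np.abs(doppler_fft)

# Synthetic frame: 64 chirps x 256 ADC samples.
frame = np.random.randn(64, 256)
rd_map = range_doppler_map(frame)
print(rd_map.shape)  # (64, 256): Doppler bins x range bins
```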
Traditionally, radar systems have been size- and cost-intensive, targeted primarily at commercial and defense applications. However, continuing advances in micro-electronics fabrication and manufacturing techniques, including Radio Frequency Integrated Circuits (RFICs), have significantly reduced the size and cost of electronic sensors, making them more accessible to the public [12]–[14]. mmWave automotive radars are an example of this technology. They are low-power, compact, and extremely practical to deploy. Furthermore, mmWave radars provide a high-resolution point-cloud representation of the scene and have therefore emerged as one of the primary sensors in platforms ranging from small-scale autonomous robots to commercial applications such as autonomous vehicles. Higher operating bandwidths also allow mmWave radars to roughly generate the contour of the human body without extracting facial information, thus preserving user privacy.
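To illustrate the point-cloud representation described above, radar detections reported in spherical coordinates (range, azimuth, elevation) can be converted to Cartesian points. The sketch below is a minimal example under assumed angle conventions (y along the sensor boresight); the `to_point_cloud` helper is hypothetical and for illustration only.

```python
import numpy as np

def to_point_cloud(r: np.ndarray, az: np.ndarray, el: np.ndarray) -> np.ndarray:
    """Convert detections to an (N, 3) Cartesian point cloud.

    r is range in metres; az/el are azimuth/elevation in radians.
    Convention (an assumption): y points along the radar boresight,
    x to the right, z up.
    """
    x = r * np.cos(el) * np.sin(az)
    y = r * np.cos(el) * np.cos(az)
    z = r * np.sin(el)
    return np.stack([x, y, z], axis=-1)

# One detection 5 m away, straight ahead, at sensor height:
pc = to_point_cloud(np.array([5.0]), np.array([0.0]), np.array([0.0]))
print(pc)  # → [[0. 5. 0.]]
```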
In this paper, we propose mm-Pose, a novel real-time approach to estimate and track the human skeleton using mmWave
radars and convolutional neural networks (CNNs). A potential
depiction of its application in traffic monitoring systems and
autonomous vehicles is shown in Fig. 1. To the best of the
authors’ knowledge, this is the first method that uses mmWave
radar reflection signals to estimate the real-world position of
>15 distinct joints of a human body. mm-Pose could also
find applications in (i) privacy-protected automated patient
monitoring systems, and (ii) aiding defense forces in a hostage
situation. Radars carrying this technology on unmanned aerial vehicles (UAVs) could scan a building and map the live skeletal postures of the hostages and adversaries through the walls, which would not otherwise be possible with vision sensors.
The paper is organized as follows. Section II summarizes current skeleton-tracking work in the computer vision (CV) community and its extension to RF sensors. A concise background
theory around the two fundamental blocks of the system,
namely (i) radar signal processing chain and (ii) machine
learning and neural networks is presented in Section III.
The detailed approach, novel data representation and system
architecture are presented in Section IV, followed by the
experimental results and discussion in Section V. Finally, the
study is summarized and concluded in Section VI.
II. LITERATURE REVIEW
Accurately estimating and tracking human posture is critical in several applications, as the estimated pose is key to inferring a subject's specific behavior. Over the last decade, scientists have explored various approaches to estimating human pose. One of the early works, in 2005, was Strike a Pose,
proposed by researchers at Oxford, that would detect humans
in a specific pose by identifying 10 distinct body parts/limbs
using rectangular templates from RGB images/videos [15]. A
k-poselet-based keypoint detection scheme was proposed in 2016 that uses predicted torso keypoint activations to detect
multiple persons using agglomerative clustering [16]. Another
approach was to use region-based CNN (R-CNN) to learn N
masks, to detect each of the N distinct key-points to construct
the skeleton from images, using a ResNet variant architecture
[17]. In 2016, DeeperCut, an improvement over the multi-person pose estimation model DeepCut, was proposed; it used a bottom-up approach with a fine-tuned ResNet architecture that doubled the then state-of-the-art estimation accuracy with a three-order-of-magnitude reduction in run-time [18], [19]. A top-down
approach to pose estimation was proposed by Google, that
first identified regions in the image containing people using
R-CNN, and then used a fully convolutional ResNet architecture with aggregation to obtain the keypoint predictions, yielding a precision of 0.649 on the Common Objects in Context
(COCO) test-dev set [20]. Another extremely popular bottom-
up approach for human pose estimation is OpenPose, proposed
by researchers at Carnegie Mellon University in 2017 [21]. OpenPose uses Part Affinity Fields (PAFs), a non-parametric representation of different body parts, and then associates them with individuals in the scene. This real-time algorithm achieved strong results on the MPII dataset and won the 2016 COCO key-points challenge [22]. Its cross-platform versatility and open-source datasets have also made OpenPose the most popular benchmark for generating highly accurate ground-truth datasets for training.
While the aforementioned approaches paved the way to-
wards human pose and skeleton tracking, they were limited
to 2-D estimation on account of the images/videos being
collected from monocular cameras. While monocular cameras provide high-resolution information about the azimuth and elevation of objects, extracting depth from monocular vision sensors is extremely challenging and non-trivial. To model a 3-D representation of the skeletal joints, the HumanEva dataset was created by researchers at the University of Toronto [23]. The
dataset was created using 7 synchronized video cameras (3 RGB + 4 grayscale) in a circular array, capturing the entire scene in its field-of-view. The human subject performed 5 different motions, with reflective markers placed on specific joint locations to track the motion, and a ViconPeak commercial motion-capture system was used to obtain the 3-D ground-truth pose of the body. Another
approach to extract 3-D skeletal joint information is by using
Microsoft Kinect [24]. The Kinect consists of an RGB and an infrared (IR) camera, which allow it to capture the scene in 3-D space. It uses a per-pixel classification approach to first identify the human body parts, followed by joint estimation by finding the global centroid of the probability mass for each identified part. However, the downsides of vision-based sensors for skeletal tracking are that their performance is severely hindered by poor lighting and occlusion. Moreover, as previously introduced, privacy concerns restrict the use of vision-based sensors in several applications.
Studies have previously made use of micro-Doppler signatures to determine human behavior using RF signals; however, these did not provide spatial information about the subjects' locations