Deep Learning-Based Human Pose Estimation: A Survey

Ce Zheng*, Wenhan Wu*, Taojiannan Yang, Sijie Zhu, Chen Chen, Member, IEEE, Ruixu Liu, Ju Shen, Senior Member, IEEE, Nasser Kehtarnavaz, Fellow, IEEE, and Mubarak Shah, Fellow, IEEE
Abstract—Human pose estimation aims to locate human body parts and build a human body representation (e.g., a body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications, including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although recently developed deep learning-based solutions have achieved high performance in human pose estimation, challenges remain due to insufficient training data, depth ambiguities, and occlusions. The goal of this survey is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 240 research papers since 2014 are covered. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included, and quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, we discuss the remaining challenges, applications, and promising future research directions. We also provide a regularly updated project page at: https://github.com/zczcwh/DL-HPE

Index Terms—Survey of human pose estimation, 2D and 3D pose estimation, deep learning-based pose estimation, pose estimation datasets, pose estimation metrics
1 INTRODUCTION
Human pose estimation (HPE), which has been extensively studied in the computer vision literature, involves estimating the configuration of human body parts from input data captured by sensors, in particular images and videos. HPE provides geometric and motion information about the human body, which has been applied to a wide range of applications (e.g., human-computer interaction, motion analysis, augmented reality (AR), virtual reality (VR), and healthcare). With the rapid development of deep learning in recent years, deep learning-based solutions have been shown to outperform classical computer vision methods in various tasks, including image classification [1], semantic segmentation [2], and object detection [3]. Significant progress and remarkable performance have already been achieved by employing deep learning techniques in HPE tasks. However, challenges such as occlusion, insufficient training data, and depth ambiguity still need to be overcome. 2D HPE from images and videos with 2D pose annotations is readily achievable, and high performance has been reached for single-person pose estimation using deep
learning techniques. More recently, attention has been paid to highly occluded multi-person HPE in complex scenes.

• * The first two authors contributed equally.
• C. Zheng, W. Wu, T. Yang, S. Zhu, and C. Chen are with the Department of Electrical and Computer Engineering, University of North Carolina at Charlotte, Charlotte, NC 28223. E-mail: {czheng6, wwu25, tyang30, szhu3, chen.chen}@uncc.edu
• R. Liu and J. Shen are with the Department of Computer Science, University of Dayton, Dayton, OH 45469. E-mail: {liur05, jshen1}@udayton.edu
• N. Kehtarnavaz is with the Department of Electrical and Computer Engineering, University of Texas at Dallas, Richardson, TX 75080. E-mail: kehtar@utdallas.edu
• M. Shah is with the Center for Research in Computer Vision, University of Central Florida, Orlando, FL 32816. E-mail: shah@crcv.ucf.edu

In
contrast, for 3D HPE, obtaining accurate 3D pose annotations
is much more difficult than for its 2D counterpart. Motion capture systems can collect 3D pose annotations in controlled lab environments, but they have limited applicability in in-the-wild environments. For 3D HPE from monocular RGB images and videos, the main challenge is depth ambiguity. In multi-view settings, viewpoint association is the key issue that needs to be addressed. Some works have utilized sensors such as depth cameras, inertial measurement units (IMUs), and radio-frequency devices, but these approaches are usually not cost-effective and require special-purpose hardware.
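The depth ambiguity mentioned above can be made concrete with a minimal pinhole-projection sketch (a hypothetical NumPy example; the focal length and joint coordinates are made-up illustrative values, not from any real camera or dataset): scaling a 3D point along its camera ray leaves its 2D projection unchanged, so infinitely many 3D poses are consistent with a single 2D observation.

```python
import numpy as np

# Simple pinhole camera model: u = f*X/Z, v = f*Y/Z.
# Focal length and coordinates below are illustrative values only.
def project(point_3d, focal=1000.0):
    X, Y, Z = point_3d
    return np.array([focal * X / Z, focal * Y / Z])

joint_near = np.array([0.3, 0.5, 2.0])  # a joint 2 m from the camera
joint_far = 2.5 * joint_near            # a different joint on the same ray

# Both 3D positions project to the same 2D pixel, so the 2D
# observation alone cannot recover the joint's depth.
print(project(joint_near))                                   # [150. 250.]
print(np.allclose(project(joint_near), project(joint_far)))  # True
```

This is why monocular 3D HPE methods must rely on priors such as body-proportion constraints or learned pose statistics to resolve depth.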
Given the rapid progress in HPE research, this article
attempts to track recent advances and summarize their
achievements in order to provide a clear picture of current
research on deep learning-based 2D and 3D HPE.
1.1 Previous surveys and our contributions
Table 1 lists the related surveys and reviews previously
reported on HPE. Among them, [4], [5], [6], [7] focus on the general field of vision-based human motion capture methods and their implementations, including pose estimation, tracking, and action recognition; pose estimation is therefore only one of the topics covered in these surveys. Research on 3D human pose estimation before 2012 is reviewed in [8]. Body-part parsing-based methods for single-view and multi-view HPE are reported in [9]. These surveys, published during 2001-2015, mainly focus on conventional methods without deep learning. A survey covering both traditional and deep learning-based HPE methods is presented in [10]; however, only a handful of deep learning-based approaches are included. The survey in [11] covers 3D HPE methods with RGB inputs. The survey in [13] only reviews
arXiv:2012.13392v1 [cs.CV] 24 Dec 2020