fitness of the individuals in the population. Section 5 presents the
results obtained by applying our proposed method to a well-known
dataset. Finally, Section 6 presents the discussion and draws the
conclusions.
2. Related work
This section reviews the state of the art most relevant to this work
on human action recognition with RGB-D devices and on evolutionary
feature subset selection.
2.1. Human action recognition with RGB-D devices
Experimental results show that humans are able to recognise
different activities by seeing only a few points of light attached to
the joints of the human body (Moving Light Display; Johansson,
1973; Polana & Nelson, 1997). Therefore, the position, orientation
and motion of the joints seem to contain enough characteristic
information to recognise activities computationally. Furthermore, good
performance may be achieved using only the spatial distribution of
the joints.
The state of the art in this research field shows that an increasing
number of applications for RGB-D-based human action recognition are
being developed. The datasets needed to perform initial evaluations
and to compare results have been recorded and made publicly
available. These include datasets designed for gesture or action
recognition in natural user interfaces (NUI) or gaming (Li, Zhang, &
Liu, 2010), as well as datasets of more complex activities involving
interactions with objects (Janoch et al., 2011; Ni, Wang, & Moulin,
2011; Sung, Ponce, Selman, & Saxena, 2011; Wang, Liu, Wu, & Yuan,
2012).
There are several methods to extract a structured set of joints
and their connections, i.e. the skeletal information, from depth
maps (Shotton et al., 2011). These methods provide different kinds
of skeleton models. The Microsoft Kinect SDK (Microsoft Corporation,
2013) provides a skeleton model with 20 joints (see Fig. 1),
whereas the OpenNI/NITE skeleton (PrimeSense, Ltd., 2013) tracks
a set of 15 joints.
Approaches to human action recognition differ in which RGB-D data
they use: some employ only the depth data, others only the skeleton
data extracted from the depth, and others fuse both the depth and
the skeleton data.
Li et al. (2010) use a simple but effective projection scheme to
obtain a representative set of 3D points from the depth map. The
dynamics of human motion are modelled based on a set of salient
postures shared among the actions, and these postures are described
using a bag of points. Yang, Zhang, and Tian (2012) propose a
method to recognise human actions from sequences of depth maps.
They project the depth maps onto three orthogonal planes and
accumulate the whole sequence to generate a depth motion map (DMM),
similar to motion history images (Bobick & Davis, 2001). Histograms
of oriented gradients (HOG) are obtained for each DMM, and the
concatenation of the three HOG descriptors serves as the input
feature of a linear SVM classifier. Wang, Liu, Chorowski, Chen, and
Wu (2012) treat an action sequence as a 4D shape and propose random
occupancy pattern features, which are extracted from randomly
sampled 4D sub-volumes of different sizes and at different
locations. These features are robust to noise and less sensitive to
occlusions. Elastic-Net regularisation is employed to select a
sparse subset of the most discriminative features for the
classification. Finally, an SVM classifier is trained for action
classification.
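To make the depth-motion-map construction of Yang, Zhang, and Tian (2012) more concrete, the following is a minimal sketch rather than the authors' implementation: the occupancy-based projection, the bin count and depth range, and the simple accumulation of inter-frame differences are our own simplifying assumptions (the original work additionally thresholds the differences and computes HOG descriptors over the resulting maps).

```python
import numpy as np

def project_depth_frame(depth, num_bins=64, max_depth=4000.0):
    """Project one depth frame (H, W) onto the front, side and top planes.

    The frame is quantised into a binary occupancy volume of shape
    (H, W, num_bins); each orthogonal view is the max over one axis.
    num_bins and max_depth (in mm) are illustrative choices.
    """
    bins = np.clip(depth / max_depth * (num_bins - 1), 0, num_bins - 1).astype(int)
    occ = np.zeros(depth.shape + (num_bins,), dtype=np.uint8)
    ys, xs = np.nonzero(depth > 0)          # ignore missing depth readings
    occ[ys, xs, bins[ys, xs]] = 1
    return {
        "front": occ.max(axis=2),           # (H, W) image plane
        "side": occ.max(axis=1),            # (H, num_bins)
        "top": occ.max(axis=0),             # (W, num_bins)
    }

def depth_motion_maps(depth_seq):
    """Accumulate inter-frame motion on each plane over a sequence (DMM-style)."""
    prev = project_depth_frame(depth_seq[0])
    dmms = {k: np.zeros(v.shape, dtype=np.float32) for k, v in prev.items()}
    for frame in depth_seq[1:]:
        cur = project_depth_frame(frame)
        for k in dmms:
            # Sum the absolute change of each projected pixel over time.
            dmms[k] += np.abs(cur[k].astype(np.float32) - prev[k])
        prev = cur
    return dmms
```

In the cited method, a HOG descriptor would then be computed on each of the three accumulated maps and the three descriptors concatenated as the input of a linear SVM.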
Miranda et al. (2012) describe each pose using a spherical angular
representation of the skeleton joints obtained with Kinect. These
descriptors serve to identify key poses through a multi-class
classifier based on support vector machines. A gesture is
represented as a sequence of key poses and labelled on the fly by a
decision forest, which naturally performs the gesture time warping
and avoids the requirement for an initial or neutral pose.
Xia, Chen, and Aggarwal (2012) use histograms of 3D joint locations
computed from the action depth sequences. These features are
re-projected using LDA and then clustered into several posture
visual words, which represent the prototypical poses of the actions.
The temporal evolution of those visual words is modelled by discrete
hidden Markov models. Azary and Savakis (2012) use sparse
representations of spatio-temporal kinematic joint features and raw
depth features that are invariant to scale and position. They create
overcomplete dictionaries and classify input patterns using both
L1-norm and L2-norm minimisation. Yang and Tian (2012) propose a new
type of feature, the EigenJoints. They employ 3D position
differences of joints to characterise action information, including
posture, motion and offset features. After a normalisation process,
PCA is applied to compute the EigenJoints, and a Naïve-Bayes
nearest-neighbour classifier is then employed for multi-class action
classification. Soh and Demiris (2012) propose the online echo state
Gaussian process (OESGP), a novel online Bayesian method, to
iteratively learn complex temporal dynamics and produce predictive
distributions. They use a generative modelling approach in which
each action class is represented by a separate OESGP model, and
inference is performed with a Bayes filter that iteratively updates
a probability distribution over the model classes. Fothergill,
Mentis, Kohli, and Nowozin (2012) employ joint angles, joint angle
velocities and xyz-velocities of joints as the feature vector at
each frame; gesture recognition is then carried out using random
forests.
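The EigenJoints representation of Yang and Tian (2012) can be sketched as follows. This is only an illustrative simplification: the pairwise within-frame differences, the per-joint motion and offset terms, and the per-sequence PCA are our own assumptions (in the cited work the motion and offset terms are also pairwise, and the PCA projection is learned from training data rather than fitted to a single sequence).

```python
import numpy as np

def eigenjoints_features(joints_seq, n_components=32):
    """Sketch of EigenJoints-style features from 3D joint positions.

    joints_seq: array of shape (T, J, 3) holding J joint positions per frame.
    Returns an array of shape (T, n_components).
    """
    T, J, _ = joints_seq.shape
    # Posture: pairwise differences between joints within the same frame.
    idx_a, idx_b = np.triu_indices(J, k=1)
    posture = joints_seq[:, idx_a] - joints_seq[:, idx_b]        # (T, P, 3)
    # Motion: difference w.r.t. the previous frame; offset: w.r.t. the first frame.
    motion = joints_seq - np.roll(joints_seq, 1, axis=0)
    motion[0] = 0
    offset = joints_seq - joints_seq[0]
    # Concatenate the three groups into one descriptor per frame and normalise.
    feat = np.concatenate(
        [posture.reshape(T, -1), motion.reshape(T, -1), offset.reshape(T, -1)],
        axis=1,
    )
    feat = (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-8)
    # PCA via SVD keeps the leading components (the "EigenJoints").
    _, _, vt = np.linalg.svd(feat, full_matrices=False)
    return feat @ vt[:n_components].T
```

In the cited method, the resulting per-frame descriptors feed a Naïve-Bayes nearest-neighbour classifier for multi-class action recognition.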
Other methods fuse depth and skeletal data. Wang, Liu, Wu,
et al. (2012) use the pairwise relative differences between joint
positions as features. The authors state that 3D joint positions are
insufficient to fully model an action, especially when the action
includes interactions between the subject and other objects.
Therefore, these interactions are characterised by local occupancy
patterns (LOP) at each joint. The LOP feature computes the local
occupancy information based on the 3D point cloud around a
particular joint. A Fourier temporal pyramid is then used to obtain a
Fig. 1. The 20 joints from a skeleton in the MSR-Action3D dataset.