fitness of the individuals in the population. Section 5 presents the
results obtained by applying our proposed method to a well-known
dataset. Finally, Section 6 presents the discussion and draws the
conclusions.
2. Related work
This section reviews the state of the art most relevant to this work
on human action recognition with RGB-D devices and on evolutionary
feature subset selection.
2.1. Human action recognition with RGB-D devices
Experimental results show that humans are able to recognise
different activities by seeing only a few points of light attached to
the joints of the human body (Moving Light Display; Johansson,
1973; Polana & Nelson, 1997). Therefore, the position, orientation
and motion of the joints seem to contain enough characteristic
information to recognise activities computationally. Furthermore, good
performance may be achieved using only the spatial distribution of
the joints.
The state of the art in this research field shows that an increasing
number of applications for RGB-D-based human action recognition are
being developed. The datasets needed to perform initial evaluations
and to compare results have been recorded and made publicly
available. These include datasets designed for gesture or action
recognition in natural user interfaces (NUI) or gaming (Li, Zhang, &
Liu, 2010), as well as datasets of more complex activities involving
interactions with objects (Janoch et al., 2011; Ni, Wang, & Moulin,
2011; Sung, Ponce, Selman, & Saxena, 2011; Wang, Liu, Wu, & Yuan,
2012).
There are several methods to extract a structured set of joints
and their connections, i.e. the skeletal information, from depth
maps (Shotton et al., 2011). These methods provide different kinds
of skeleton models. The Microsoft Kinect SDK (Microsoft Corporation,
2013) provides a skeleton model with 20 joints (see Fig. 1),
whereas the OpenNI/NITE skeleton (PrimeSense, Ltd., 2013) tracks
a set of 15 joints.
Approaches to human action recognition differ in which RGB-D data
they use: some employ only the depth data, others only the skeleton
data extracted from the depth, and others fuse both the depth and
the skeleton data.
Li et al. (2010) use a simple but effective projection scheme to
obtain a representative set of 3D points from the depth map. The
dynamics of human motion are modelled based on a set of salient
postures shared among the actions, and these postures are described
using a bag of points. Yang, Zhang, and Tian (2012) propose a
method to recognise human actions from sequences of depth maps.
They project the depth maps onto three orthogonal planes and
accumulate the whole sequence to generate a depth motion map (DMM),
similar to motion history images (Bobick & Davis, 2001). Histograms
of oriented gradients (HOG) are obtained for each DMM, and the
concatenation of the three HOG descriptors serves as the input
feature of a linear SVM classifier. Wang, Liu, Chorowski, Chen, and
Wu (2012) treat an action sequence as a 4D shape and propose random
occupancy pattern features, which are extracted from randomly
sampled 4D sub-volumes of different sizes and at different
locations. These features are robust to noise and less sensitive to
occlusions. Elastic-Net regularisation is employed to select a
sparse subset of the most discriminative features for the
classification. Finally, an SVM classifier is trained for action
classification.
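To make the depth-motion-map construction of Yang, Zhang, and Tian (2012) more concrete, the following is a minimal sketch rather than the authors' implementation: the occupancy-based projection, the bin count and depth range, and the simple accumulation of inter-frame differences are our own simplifying assumptions (the original work additionally thresholds the differences and computes HOG descriptors over the resulting maps).

```python
import numpy as np

def project_depth_frame(depth, num_bins=64, max_depth=4000.0):
    """Project one depth frame (H, W) onto the front, side and top planes.

    The frame is quantised into a binary occupancy volume of shape
    (H, W, num_bins); each orthogonal view is the max over one axis.
    num_bins and max_depth (in mm) are illustrative choices.
    """
    bins = np.clip(depth / max_depth * (num_bins - 1), 0, num_bins - 1).astype(int)
    occ = np.zeros(depth.shape + (num_bins,), dtype=np.uint8)
    ys, xs = np.nonzero(depth > 0)          # ignore missing depth readings
    occ[ys, xs, bins[ys, xs]] = 1
    return {
        "front": occ.max(axis=2),           # (H, W) image plane
        "side": occ.max(axis=1),            # (H, num_bins)
        "top": occ.max(axis=0),             # (W, num_bins)
    }

def depth_motion_maps(depth_seq):
    """Accumulate inter-frame motion on each plane over a sequence (DMM-style)."""
    prev = project_depth_frame(depth_seq[0])
    dmms = {k: np.zeros(v.shape, dtype=np.float32) for k, v in prev.items()}
    for frame in depth_seq[1:]:
        cur = project_depth_frame(frame)
        for k in dmms:
            # Sum the absolute change of each projected pixel over time.
            dmms[k] += np.abs(cur[k].astype(np.float32) - prev[k])
        prev = cur
    return dmms
```

In the cited method, a HOG descriptor would then be computed on each of the three accumulated maps and the three descriptors concatenated as the input of a linear SVM.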
Miranda et al. (2012) describe each pose using a spherical angular
representation of the skeleton joints obtained with Kinect. These
descriptors serve to identify key poses through a multi-class
classifier based on support vector machines. A gesture is
represented as a sequence of key poses and labelled on the fly by a
decision forest, which naturally performs the gesture time warping
and avoids the requirement for an initial or neutral pose.
Xia, Chen, and Aggarwal (2012) use histograms of 3D joint locations
computed from the action depth sequences. These features are
re-projected using LDA and then clustered into several posture
visual words, which represent the prototypical poses of the actions.
The temporal evolution of those visual words is modelled by discrete
hidden Markov models. Azary and Savakis (2012) use sparse
representations of spatio-temporal kinematic joint features and raw
depth features that are invariant to scale and position. They create
overcomplete dictionaries and classify input patterns using both
L1-norm and L2-norm minimisation. Yang and Tian (2012) propose a new
type of feature, the EigenJoints. They employ 3D position
differences of joints to characterise action information, including
posture, motion and offset features. After a normalisation process,
PCA is applied to compute the EigenJoints, and a Naïve-Bayes
nearest-neighbour classifier is then employed for multi-class action
classification. Soh and Demiris (2012) propose the online echo state
Gaussian process (OESGP), a novel online Bayesian method, to
iteratively learn complex temporal dynamics and produce predictive
distributions. They use a generative modelling approach in which
each action class is represented by a separate OESGP model, and
inference is performed with a Bayes filter that iteratively updates
a probability distribution over the model classes. Fothergill,
Mentis, Kohli, and Nowozin (2012) employ joint angles, joint angle
velocities and xyz-velocities of joints as the feature vector at
each frame; gesture recognition is then carried out using random
forests.
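The EigenJoints representation of Yang and Tian (2012) can be sketched as follows. This is only an illustrative simplification: the pairwise within-frame differences, the per-joint motion and offset terms, and the per-sequence PCA are our own assumptions (in the cited work the motion and offset terms are also pairwise, and the PCA projection is learned from training data rather than fitted to a single sequence).

```python
import numpy as np

def eigenjoints_features(joints_seq, n_components=32):
    """Sketch of EigenJoints-style features from 3D joint positions.

    joints_seq: array of shape (T, J, 3) holding J joint positions per frame.
    Returns an array of shape (T, n_components).
    """
    T, J, _ = joints_seq.shape
    # Posture: pairwise differences between joints within the same frame.
    idx_a, idx_b = np.triu_indices(J, k=1)
    posture = joints_seq[:, idx_a] - joints_seq[:, idx_b]        # (T, P, 3)
    # Motion: difference w.r.t. the previous frame; offset: w.r.t. the first frame.
    motion = joints_seq - np.roll(joints_seq, 1, axis=0)
    motion[0] = 0
    offset = joints_seq - joints_seq[0]
    # Concatenate the three groups into one descriptor per frame and normalise.
    feat = np.concatenate(
        [posture.reshape(T, -1), motion.reshape(T, -1), offset.reshape(T, -1)],
        axis=1,
    )
    feat = (feat - feat.mean(axis=0)) / (feat.std(axis=0) + 1e-8)
    # PCA via SVD keeps the leading components (the "EigenJoints").
    _, _, vt = np.linalg.svd(feat, full_matrices=False)
    return feat @ vt[:n_components].T
```

In the cited method, the resulting per-frame descriptors feed a Naïve-Bayes nearest-neighbour classifier for multi-class action recognition.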
Other methods fuse depth and skeletal data. Wang, Liu, Wu,
et al. (2012) use the pairwise relative differences between joint
positions as features. The authors state that 3D joint positions are
insufficient to fully model an action, especially when the action
includes interactions between the subject and other objects.
Therefore, these interactions are characterised by local occupancy
patterns (LOP) at each joint. The LOP feature computes the local
occupancy information based on the 3D point cloud around a
particular joint. A Fourier temporal pyramid is then used to obtain a
Fig. 1. The 20 joints from a skeleton in the MSR-Action3D dataset.