Convolutional Neural Networks for Multivariate Time Series Classification using both Inter- & Intra- Channel Parallel Convolutions

G. Devineau¹, W. Xi², F. Moutarde¹, J. Yang²

¹ MINES ParisTech, PSL Research University, Center for Robotics, Paris, France
² Shanghai Jiao Tong University, School of Electronic Information and Electrical Engineering, China
{guillaume.devineau, wang.xi, fabien.moutarde}@mines-paristech.fr
Abstract

In this paper, we study a convolutional neural network we recently introduced in [9], intended to recognize 3D hand gestures via multivariate time series classification. The Convolutional Neural Network (CNN) we proposed processes sequences of hand-skeletal joints' positions using parallel convolutions. We justify the model's architecture and investigate its performance on hand gesture sequence classification tasks. Our model only uses hand-skeletal data and no depth images. Experimental results show that our approach achieves state-of-the-art performance on a challenging dataset (the DHG dataset from the SHREC 2017 3D Shape Retrieval Contest). Our model achieves a 91.28% classification accuracy for the 14-gesture-class case and an 84.35% classification accuracy for the 28-gesture-class case.
1 Introduction
Gesture is a natural way for a user to interact with their environment. One preferred way to infer the intent of a gesture is to use a taxonomy of gestures and to classify the unknown gesture into one of the existing categories based on the gesture data, e.g. using a neural network to perform the classification. In this paper we present and study a convolutional neural network architecture relying on intra- and inter-channel parallel processing of sequences of hand-skeletal joints' positions to classify complete hand gestures. Where most existing deep learning approaches to gesture recognition use RGB-D image sequences to classify gestures [41], our neural network only uses hand (3D) skeletal data sequences, which are quicker to process than image sequences. The rest of this paper is structured as follows. We first review common recognition methods in Section II. We then present the DHG dataset we used to evaluate our network in Section III. We detail our approach in Section IV in terms of motivations, architecture and results. Finally, we conclude in Section VI and discuss how our model can be improved and integrated into a real-time interactive system.

Note that the contents of this paper are highly similar to those of [9], especially sections 1, 2 and 3, as well as the figure illustrating the network; however, in this article we focus more on practical tips and on justifying the network architecture, whereas the original paper was more centered on gesture-related aspects. Readers familiar with [9] can skip directly to the subsection Architecture Tuning of Section IV, in which the network architecture is justified more thoroughly.
2 Definition & Related Work
We define a 3D skeletal data sequence $s$ as a vector $s = (p_1 \cdots p_n)^T$ whose components $p_i$ are multivariate time sequences. Each component $p_i = (p_i(t))_{t \in \mathbb{N}}$ represents a multivariate sequence with three components (univariate sequences) $p_i = (x^{(i)}, y^{(i)}, z^{(i)})$ that altogether represent a time sequence of the positions $p_i(t)$ of the $i$-th skeletal joint $j_i$. Every skeletal joint $j_i$ represents a distinct and precise articulation or part of one's hand in the physical world.
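In practice, such a sequence is conveniently stored as a single 3D array indexed by joint, coordinate and time. The sketch below uses numpy with hypothetical sizes (a 22-joint hand skeleton and 100 time steps; neither number is taken from this section):

```python
import numpy as np

# Hypothetical sketch: a skeletal sequence s = (p_1 ... p_n)^T as an array.
# Each joint j_i contributes three univariate series x_i(t), y_i(t), z_i(t).
n_joints = 22   # assumed hand skeleton size
T = 100         # assumed number of time steps

# Shape (n_joints, 3, T): joint index, coordinate (x, y, z), time.
s = np.random.randn(n_joints, 3, T)

# The position p_i(t) of joint i at time t is a point in R^3:
p_i_t = s[4, :, 10]   # joint 5's (x, y, z) at time step 11

# Viewed channel-wise, the sequence is n_joints * 3 univariate series:
channels = s.reshape(n_joints * 3, T)
```

This channel-wise view is what makes the parallel, per-channel convolutions discussed later in the paper straightforward to express.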
In the following subsections, we present a short review of some approaches to gesture recognition. Typical approaches to hand gesture recognition begin with the extraction of spatial and temporal features from raw data. The features are later classified by a Machine Learning algorithm. The feature extraction step can either be explicit, using hand-crafted features known to be useful for classification, or implicit, using (machine-)learned features that describe the data without requiring human labor or expert knowledge. Deep Learning algorithms leverage such learned features to obtain hierarchical representations (features) that often describe the data better than hand-crafted features. As we work on skeletal data only, with a deep-learning perspective, this review pays limited attention to non-deep-learning-based approaches and to depth-based approaches; a survey on the former can be found in [19], while several recent surveys on the latter are listed in Neverova's thesis [21].
2.1 Non-deep-learning methods using hand-crafted features
Various hand-crafted representations of skeletal data can be used for classification. These representations often describe physical attributes and constraints, or easily interpretable properties and correlations of the data, with an emphasis on geometric features and statistical features. Some commonly used features are the positions of the skeletal joints, the orientation of the joints, the distances between joints, the angles between joints, the curvature of the joints' trajectories, the presence of symmetries in the skeleton, and more generally other features that involve a human-interpretable metric calculated from the skeletal data [15, 16, 33]. For instance, in [37], Vemulapalli et al. propose a human skeletal representation within the Lie group SE(3) × ... × SE(3), based on the idea that rigid body rotations and translations in 3D space are members of the Special Euclidean group SE(3). Human actions are then viewed as curves in this manifold. Recognition (classification) is finally performed in the corresponding Lie algebra. In [8], Devanne et al. represent skeletal joints' sequences as trajectories in an n-dimensional space; the trajectories of the joints are then interpreted in a Riemannian manifold. Similarities between the shapes of trajectories in this shape space are then computed with k-Nearest Neighbors (k-NN) to achieve the sequence classification. In [7], two approaches for gesture recognition (on the DHG dataset presented in the next section) are presented. The first one, proposed by Guerry et al., is a deep-learning method presented in the next subsection. The second one, proposed by De Smedt et al., uses three hand-crafted descriptors: Shape of Connected Joints (SoCJ), Histogram of Hand Directions (HoHD) and Histogram of Wrist Rotations (HoWR), as well as Fisher Vectors (FV) for the final representation.

Regardless of the features used, hand-crafted features are always fed into a classifier to perform the gesture recognition. In [5], Cippitelli et al. use a multi-class Support Vector Machine (SVM) for the final classification of activity features based on posture features. Other very frequently used classifiers [40] are Hidden Markov Models (HMM), Conditional Random Fields (CRF), discrete distance-based methods, Naive Bayes, and even simple k-Nearest Neighbors (k-NN) with a Dynamic Time Warping (DTW) discrepancy.
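To make the last pairing concrete, a minimal DTW discrepancy between two univariate sequences can be sketched as follows. This is a textbook dynamic-programming implementation with an absolute-difference cost, not the exact variant used in the cited works:

```python
def dtw(a, b):
    """Dynamic Time Warping discrepancy between two univariate sequences.

    Classic O(len(a) * len(b)) dynamic program; local cost is |a_i - b_j|.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # D[i][j] = minimal cumulative cost aligning a[:i] with b[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j],      # step in a only
                                 D[i][j - 1],      # step in b only
                                 D[i - 1][j - 1])  # step in both
    return D[n][m]

print(dtw([0, 1, 2], [0, 1, 1, 2]))  # 0.0: the sequences align perfectly
```

A k-NN classifier then labels a query sequence by a majority vote among the k training sequences with the smallest DTW discrepancy.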
2.2 Deep-Learning based methods
Deep Learning, also known as Hierarchical Learning, is a subclass of Machine Learning where algorithms $f$ use a cascade of non-linear computational units $f_i$ (layers), e.g. using convolutions, for feature extraction and transformation: $f = f_1 \circ f_2 \circ \cdots \circ f_n$. A traditional Convolutional Neural Network (CNN, or ConvNet) model almost always involves a sequence of convolution and pooling layers, followed by dense layers. Convolution and pooling layers serve as feature extractors, whereas the dense layers, also called a Multi-Layer Perceptron (MLP), can be seen as a classifier.

A strategy to mix deep-learning algorithms and (hand) gesture recognition consists in training convolutional neural networks [18] on RGB-D images. A direct example of hand gesture recognition via image CNNs can be found in the works of Strezoski et al. [32], where CNNs are simply applied to the RGB images of the sequences to classify. Guerry et al. [7] propose a deep-learning approach for hand gesture recognition on the DHG dataset, which is described in Section III of this paper. The Guerry et al. approach consists in concatenating the Red, Green, Blue and Depth channels of each RGB-D image. An already pretrained VGG [29] image classification model is then applied to sequences of 5 concatenated images consecutive in time. In [20], Molchanov et al. introduce a CNN architecture for RGB-D images where the classifier is made of two CNN networks (a high-resolution network and a low-resolution network) whose class-membership outputs are fused with an element-wise multiplication. Neverova et al. carry out a gesture classification task on multi-modal data (RGB-D images, audio streams and skeletal data) in [22, 23]. Each modality is first processed independently with convolution layers, and then merged. To avoid meaningless co-adaptation of modalities, a multi-modal dropout (ModDrop) is introduced. Nevertheless, these approaches use depth information, whereas we only want to use skeletal data. In [38], Wang et al. color-code the joints of a 3D skeleton across time. The colored (3D) trajectories are projected on 2D planes in order to obtain images that serve as inputs of CNNs. Each CNN emits a gesture class-membership probability. Finally, a class score (probability) is obtained by fusing the CNNs' scores.
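The layer cascade defined at the start of this subsection can be illustrated with a toy composition, with layers applied in the order listed, as in a feed-forward pass. The three "layers" below are hypothetical stand-ins, not part of any architecture discussed here:

```python
from functools import reduce

def compose(*layers):
    """Compose layers so that compose(f1, f2, f3)(x) == f3(f2(f1(x)))."""
    return lambda x: reduce(lambda acc, layer: layer(acc), layers, x)

# Toy "layers" on a list of numbers: a moving average standing in for a
# 1D convolution (feature extraction), a ReLU non-linearity, and a sum
# standing in for the final dense classifier.
conv = lambda x: [(x[i] + x[i + 1]) / 2 for i in range(len(x) - 1)]
relu = lambda x: [max(v, 0.0) for v in x]
dense = lambda x: sum(x)

f = compose(conv, relu, dense)
print(f([1.0, -3.0, 5.0]))  # conv -> [-1.0, 1.0]; relu -> [0.0, 1.0]; sum -> 1.0
```

The same structure (feature-extracting layers followed by a dense classifier) is what the convolution-based approaches reviewed above instantiate at scale.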
Recurrent Neural Networks (RNN), e.g. networks that use Long Short-Term Memory (LSTM) [12] or Gated Recurrent Units (GRU) [4], have long been considered the best way to achieve state-of-the-art results when working with neural networks on sequences such as time series. Recently, the emergence of new neural network architectures that use convolutions or attention mechanisms [35, 36] rather than recurrent cells has challenged this assumption, given that RNNs present some significant issues: to name only a few, they are sensitive to the first examples seen, they have complex dynamics that can lead to chaotic behavior [17], and they are intrinsically sequential models, which means that their internal state computations are difficult to parallelize. In [30], Song et al. elegantly combine the use of an LSTM-based neural network for human action recognition from skeleton data with a spatio-temporal attention mechanism. While this approach seems promising, we seek a convolution-only architecture rather than a recurrent one.
Zheng et al. propose a convolution-based architecture that does not involve recurrent cells in [42], although this architecture can easily be extended with recurrent cells [25]. Zheng et al. introduce a general framework (Multi-Channels Deep Convolution Neural Networks, or MC-DCNN) for multivariate sequence classification. In MC-DCNN, multivariate time series are seen as multiple univariate time series; as such, the neural network input consists of several 1D time series sequences. The feature learning step is executed on every univariate sequence individually. The respective learned features are later concatenated and merged using a classic MLP placed at the end of the feature extraction layers to perform classification. The major