Learning Spatiotemporal Features using 3DCNN and Convolutional LSTM for
Gesture Recognition
Liang Zhang, Guangming Zhu, Peiyi Shen, Juan Song
School of Software, Xidian University
{liangzhang, gmzhu, pyshen, songjuan}@xidian.edu.cn
Syed Afaq Shah, Mohammed Bennamoun
University of Western Australia
{afaq.shah, mohammed.bennamoun}@uwa.edu.au
Abstract
Gesture recognition aims at understanding ongoing human gestures. In
this paper, we present a deep architecture to learn spatiotemporal
features for gesture recognition. The deep architecture first learns 2D
spatiotemporal feature maps using 3D convolutional neural networks
(3DCNN) and bidirectional convolutional long short-term memory networks
(ConvLSTM). The learnt 2D feature maps encode global temporal
information and local spatial information simultaneously. Then, a 2DCNN
is used to learn higher-level spatiotemporal features from the 2D
feature maps for the final gesture recognition. The spatiotemporal
correlation information is preserved throughout the feature learning
process, which makes the deep architecture an effective spatiotemporal
feature learner. Experiments on the ChaLearn LAP large-scale isolated
gesture dataset (IsoGD) and the Sheffield Kinect Gesture (SKIG) dataset
demonstrate the superiority of the proposed deep architecture.
1. Introduction
Gestures, as a nonverbal body language, play a very important role in
human daily life. Gesture recognition aims at understanding ongoing
human gestures and is of great significance for human-robot/computer
interaction, sign language recognition and virtual reality [23].
Figure 1. Overview of the proposed deep architecture. 3DCNN and
bidirectional ConvLSTM are utilized to learn the short-term and
long-term spatiotemporal features successively, and then 2DCNN is used
to learn higher-level spatiotemporal features based on the learnt 2D
long-term spatiotemporal feature maps for the final gesture recognition.

Effective and universal gesture recognition from videos is extremely
difficult, partly due to large gesture vocabularies with cultural
differences, varying illumination conditions, out-of-vocabulary motions,
inconsistent and non-standard behaviors among different performers,
etc. [12]. Moreover, gestures have various time durations and involve
different body parts. A small handful of gestures can be represented by
a single posture of hands and arms, but most gestures are composed of a
sequence of hand and arm postures. Therefore, learning effective
spatiotemporal features is crucially important for robust gesture
recognition. According to [32], there are four typical properties for
effective spatiotemporal features of gestures: (i) generic, (ii)
compact, (iii) efficient to compute, and (iv) simple to implement.
Figure 2. Pipeline of the proposed framework. Each data modality is
used to train the proposed deep architecture separately, and the
RGB/Depth/Flow based spatiotemporal features are extracted and then
combined into large multimodal spatiotemporal feature vectors. A linear
SVM classifier is utilized for the final gesture recognition.

Inspired by the deep learning breakthroughs in image recognition
[17, 29, 31], many neural-network-based frameworks have been proposed to
learn spatiotemporal features for human action/gesture recognition.
Two-Stream Convolutional Networks [28] learn spatial and temporal
features separately. Long-term Recurrent Convolutional Networks (LRCN)
[6] learn spatial and temporal features using convolutional neural
networks (CNN) and long short-term memory (LSTM) networks successively.
Tran et al. [32] constructed a deep 3D ConvNet to learn spatiotemporal
features directly and achieved the best performance on different types
of video analysis tasks. Molchanov et al. [24] proposed to first learn
spatiotemporal features on each clip using 3DCNN, and then to fuse the
spatiotemporal features over the whole video using recurrent neural
networks (RNN). 3DCNN is clearly effective at learning spatiotemporal
features for gesture recognition. However, RNN/LSTM based networks are
more suitable for encoding long-term temporal information, especially
from videos of varying length. Although Molchanov et al. [24] proposed
to combine 3DCNN and RNN, the spatiotemporal features passed to the RNN
are fully connected, which causes the spatial correlation information
to be lost in the RNN stage.
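The loss of spatial correlation in a fully connected recurrent stage can be made concrete by comparing the recurrent state of the two designs. The following is a minimal sketch, not the configuration of [24] or of this paper: the 28 x 28 x 64 feature-map size, the 256 hidden units, and the 3 x 3 kernel are illustrative assumptions, and both parameter counts assume the standard four-gate LSTM equations with biases and no peephole connections.

```python
def fc_lstm_params(input_dim, hidden_dim):
    """Weight count of a standard LSTM (4 gates, biases, no peepholes)."""
    return 4 * ((input_dim + hidden_dim) * hidden_dim + hidden_dim)


def conv_lstm_params(in_channels, hidden_channels, kernel):
    """Weight count of a ConvLSTM whose gates are kernel x kernel
    convolutions (4 gates, biases, no peepholes)."""
    return 4 * ((in_channels + hidden_channels) * kernel * kernel
                * hidden_channels + hidden_channels)


# Assumed per-clip 3DCNN output: a 28 x 28 grid with 64 channels.
height, width, channels = 28, 28, 64
hidden = 256  # assumed hidden size / hidden channels

# Option 1: flatten the grid and feed a fully connected LSTM.
flat_dim = height * width * channels
fc_state_shape = (hidden,)                 # no spatial axes remain
fc_params = fc_lstm_params(flat_dim, hidden)

# Option 2: keep the grid and feed a ConvLSTM with 3 x 3 kernels
# (stride 1, 'same' padding, so height and width are unchanged).
conv_state_shape = (height, width, hidden)  # spatial layout survives
conv_params = conv_lstm_params(channels, hidden, kernel=3)

print("FC-LSTM state:", fc_state_shape, "params:", fc_params)
print("ConvLSTM state:", conv_state_shape, "params:", conv_params)
```

Under these assumptions the fully connected state is a plain vector with no height or width axis, whereas the convolutional state remains a spatial grid that a later 2D convolution can still operate on.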
In this paper, we propose to first learn short-term spatiotemporal
features using a shallow 3DCNN, then learn long-term spatiotemporal
features using bidirectional convolutional LSTM (ConvLSTM), and finally
recognize gestures using a 2DCNN based on the learnt 2D spatiotemporal
feature maps. An overview of the proposed deep architecture is
illustrated in Figure 1, and the pipeline of the proposed framework is
shown in Figure 2.
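The three stages can be followed by tracing tensor shapes through them. This is only a sketch under stated assumptions: the clip size, channel counts, and pooling factors are hypothetical, and the way the time axis is collapsed (keeping only the final state of each ConvLSTM direction and concatenating the two along the channel axis) is one plausible reading, not a detail given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoShape:
    frames: int    # temporal length T
    height: int    # spatial height H
    width: int     # spatial width W
    channels: int  # feature channels C


def conv3d_block(x, out_channels, pool_t, pool_s):
    """'Same'-padded 3D convolution followed by pooling (assumed sizes)."""
    return VideoShape(x.frames // pool_t, x.height // pool_s,
                      x.width // pool_s, out_channels)


def bidirectional_convlstm(x, hidden_channels):
    """Forward and backward ConvLSTM passes over the frame axis.

    Assumption: only the final state of each direction is kept and the
    two are concatenated, so time disappears while H and W survive.
    """
    return (x.height, x.width, 2 * hidden_channels)


def conv2d_block(fmap, out_channels, pool_s):
    """'Same'-padded 2D convolution followed by spatial pooling."""
    h, w, _ = fmap
    return (h // pool_s, w // pool_s, out_channels)


clip = VideoShape(frames=32, height=112, width=112, channels=3)
short_term = conv3d_block(clip, out_channels=64, pool_t=2, pool_s=4)
long_term_map = bidirectional_convlstm(short_term, hidden_channels=128)
high_level = conv2d_block(long_term_map, out_channels=256, pool_s=2)

print(short_term)     # still a video: frames, height, width, channels
print(long_term_map)  # a 2D feature map: height, width, channels
print(high_level)     # a smaller 2D feature map for classification
```

The point of the trace is that the output of the recurrent stage has the same layout as an image feature map, which is what allows an ordinary 2DCNN to be applied afterwards.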
In brief, our contributions in this paper include:
• 2D spatiotemporal feature maps are learnt using
3DCNN and bidirectional convolutional LSTM. The
2D feature maps can encode the global temporal infor-
mation and local spatial information. Spatiotemporal
correlation information is kept through the whole fea-
ture map learning process.
• The proposed deep architecture can transform video files into 2D
spatiotemporal feature maps. This transformation makes it
straightforward to apply state-of-the-art 2DCNNs to learn the
higher-level spatiotemporal features for gesture recognition.
• The proposed spatiotemporal features with a linear SVM
classification model outperform or achieve performance on par with
the state-of-the-art methods on two different benchmarks.
• To the best of our knowledge, this is the first work to learn 2D
spatiotemporal feature maps using 3DCNN and bidirectional ConvLSTM,
and then to learn higher-level spatiotemporal features using 2DCNN
for the final gesture recognition.
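The multimodal fusion and linear classification step of the pipeline in Figure 2 can be sketched as follows. This is an illustration only: the two-dimensional feature vectors, the weights, and the helper names `fuse_features`, `linear_scores`, and `predict` are hypothetical, and SVM training is omitted, so the weights stand in for an already trained one-vs-rest linear model.

```python
def fuse_features(per_modality):
    """Concatenate per-modality feature vectors in a fixed order."""
    fused = []
    for name in ("rgb", "depth", "flow"):
        fused.extend(per_modality[name])
    return fused


def linear_scores(weights, biases, x):
    """One-vs-rest linear decision values: score_k = w_k . x + b_k."""
    return [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)]


def predict(weights, biases, x):
    """Return the index of the class with the largest decision value."""
    scores = linear_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy features per modality (real feature vectors would be far longer).
sample = {"rgb": [0.9, 0.1], "depth": [0.8, 0.2], "flow": [0.7, 0.3]}
fused = fuse_features(sample)

# Hypothetical weights for two gesture classes, as if already trained.
weights = [[1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]]
biases = [0.0, 0.0]

label = predict(weights, biases, fused)
print(len(fused), label)
```

Concatenation keeps each modality's features in a fixed slice of the fused vector, so a linear classifier can weight RGB, depth, and flow evidence independently.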
2. Related Work
Learning spatiotemporal features is crucial for effective human
action/gesture recognition. Various deep neural networks have been
proposed recently [15]. However, gesture recognition differs
significantly from action recognition. One obvious difference is that
backgrounds may be an effective clue for action recognition, but can be
a challenging factor for gesture recognition. For example, scene
backgrounds can help recognize human actions, especially the sports in
UCF101 [30], but they may have a negative impact on gesture recognition
performance. In fact, gestures focus more on the movement of hands and
arms. Thus, two-stream ConvNets [28] and their derivatives [36, 13]
obtain state-of-the-art performance on the HMDB51 [18] and UCF101
datasets, but they fail to achieve similar performance on gesture
recognition. Another obvious approach is to learn spatial and temporal
features successively, as in LRCN [6]. However, Pigou et al. [26]
demonstrated that LRCN-style networks are not optimal, while
bidirectional recurrence and temporal convolutions can improve gesture
recognition performance significantly. The huge success of 2DCNN on
image recognition has encouraged researchers to transform video files
into particular 2D image files, so that state-of-the-art 2DCNN networks
can be applied to gesture recognition [39]. However, handcrafted
transformation methods are inherently limited in adaptive learning. In
this paper, we describe a deep architecture that learns adaptively to
transform gesture video files into 2D spatiotemporal feature maps.
Tran et al. [32] constructed a deep 3D ConvNet to learn spatiotemporal
features directly and achieved the best performance on different types
of video analysis tasks. Inspired by [32], 3DCNN-based neural networks
obtained remarkable performance on gesture recognition [11]. In the
2016 ChaLearn LAP Large-scale Isolated/Continuous Gesture Recognition
Challenges [35], 3DCNN demon-