Multimed Tools Appl (2017) 76:13367–13382
DOI 10.1007/s11042-016-3768-5
Stratified pooling based deep convolutional neural
networks for human action recognition
Sheng Yu 1,2,3 · Yun Cheng 2 · Songzhi Su 1,3 · Guorong Cai 4 · Shaozi Li 1,3
Received: 9 December 2015 / Revised: 3 July 2016 / Accepted: 6 July 2016 /
Published online: 15 July 2016
© Springer Science+Business Media New York 2016
Abstract Video-based human action recognition is an active and challenging topic in computer vision. Over the last few years, deep convolutional neural networks (CNNs) have become the most popular method and have achieved state-of-the-art performance on several datasets, such as HMDB-51 and UCF-101. Since each video yields a varying number of frame-level features, how to combine these features into a good video-level feature becomes a challenging task. Therefore, this paper proposes a novel action recognition method named stratified pooling, based on deep convolutional neural networks (SP-CNN). The process is mainly composed of five parts: (i) fine-tuning a pre-trained CNN on the target dataset; (ii) frame-level feature extraction; (iii) principal component analysis (PCA) for feature dimensionality reduction; (iv) stratified pooling of the frame-level features to obtain a video-level feature; and (v) an SVM for multiclass classification. Finally, experimental results on the HMDB-51 and UCF-101 datasets show that the proposed method outperforms the state-of-the-art.
Keywords Human action recognition · Convolutional neural networks (CNN) · Stratified
pooling (SP) · Support vector machines (SVM)
Shaozi Li
szlig@xmu.edu.cn

Sheng Yu
yushengxmu@stu.xmu.edu.cn

1 Cognitive Science Department, Xiamen University, Xiamen 361005, China
2 School of Information, Hunan University of Humanities, Science and Technology, Loudi 417000, China
3 Fujian Key Laboratory of the Brain-like Intelligent Systems, Xiamen 361005, China
4 Computer Engineering College, Jimei University, Xiamen 361005, China
1 Introduction
Human action recognition has become a hot research topic in computer vision. There is a growing number of applications related to human action, especially in the area of video surveillance [12, 32, 35]. Meanwhile, it is still very challenging to recognize human actions due to viewpoint changes, large intra-class variations, varying motion speed, partial occlusion, fast irregular motion, background clutter, etc. [46, 52]. Specifically, compared with object recognition in images, human action recognition is much more difficult because actions always contain abundant spatio-temporal information.
Generally, the main pipeline of action recognition can be divided into three parts: feature extraction, feature coding to generate a video-level feature descriptor, and descriptor classification. To improve recognition performance, many research efforts have concentrated on one of these processes, among which visual feature descriptors play the most important role in action recognition. Overall, there are two main types of video features for action recognition: hand-crafted features such as histograms of oriented gradients (HOG) [24], histograms of optical flow (HOF) [24], and motion boundary histograms (MBH) [44]; and deep-learned features such as stacked convolutional independent subspace analysis [25], two-stream convolutional networks [41], and 3D convolutional networks [16]. Although hand-crafted features have achieved good results on human action recognition, it is worth noting that they often fail to deal with large intra-class variations and small inter-class variations.
Recently, deep convolutional neural networks (CNNs) have been proposed for large-scale image processing [22, 39, 42]. Note that CNNs have obtained state-of-the-art results on object detection, segmentation, and recognition [10, 20, 22, 39]. Inspired by these impressive results, some studies attempt to employ CNNs for action recognition, typically 3D convolutional neural networks [16], stacked convolutional independent subspace analysis [25], deep convolutional neural networks [20], two-stream convolutional networks [41], trajectory-pooled deep convolutional descriptors (TDD) [46], etc.
However, there are several problems in directly applying existing deep CNN models to action recognition. Firstly, the structures of video-based and image-based CNNs are different, so the weights of a video-based CNN must be trained on a video dataset from scratch. Secondly, a CNN usually contains tens of millions of parameters, which rely on a sufficient amount of training videos to prevent over-fitting. Thirdly, the network needs several weeks to train, depending on the architecture. The work in [20] collected a new Sports-1M dataset, which is composed of 1 million videos covering 498 classes of sports. In prior work [41], a fixed number of frames (e.g., 25) is randomly selected, and the frames undergo cropping and flipping to reshape them into a compatible input format for an image-based CNN. Because different videos have different frame rates and durations, it is difficult to decide how many video frames to sample. In [47], Wang et al. argued that sampling frames from the given video may lose important information and suggested using all video frames as the network input. That approach preserves all frame information, but greatly increases the burden of the subsequent steps.
To tackle the above problems, we first uniformly sample the given video with a stride of ten frames, and then use the sampled frames to fine-tune an AlexNet model [22] on the target dataset. Note that fine-tuning an existing CNN model not only effectively avoids over-fitting when dealing with small datasets, but also reduces the training time. Moreover, the sampling method preserves enough video information for action recognition in the case of video frame rates below 30 fps.
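As a concrete illustration of this sampling step, below is a minimal sketch using OpenCV; the paper does not publish code, so the function name `sample_frames` and the choice of OpenCV are our own illustrative assumptions.

```python
import cv2

def sample_frames(video_path, stride=10):
    """Uniformly sample one frame out of every `stride` frames of a video.

    The sampled frames serve both as fine-tuning inputs for the
    image-based CNN and, later, as the frames whose activations are
    pooled into the video-level descriptor.
    """
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of the video stream
            break
        if index % stride == 0:
            frames.append(frame)  # BGR image as a NumPy array
        index += 1
    cap.release()
    return frames
```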
As for the performance evaluation, two-stream convolutional networks [41] and temporal pyramid pooling [47] match the best performance of the improved trajectories [45] among hand-crafted features, and TDD reaches state-of-the-art performance on HMDB-51 and UCF-101. TDD is based on convolutional activation features for human action recognition. However, convolutional-layer activations are high-dimensional generic features that lack discriminative capacity. On this point, a number of studies have shown that fully-connected layer activations perform much better for image classification [22, 42] than convolutional activations. The main reason is that the fully-connected layer descriptor combines the benefits of strong semantics and low dimensionality. From our experimental results, we found that this virtue also carries over to human action recognition. Therefore, in this paper, we select fully-connected layer activations as frame-level features for action recognition.
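To make this frame-level feature concrete, here is a minimal sketch of extracting fully-connected (fc7) activations with torchvision's AlexNet. The paper fine-tunes its own AlexNet on the target dataset first; the torchvision model, the preprocessing constants, and the `frame_feature` helper below are illustrative assumptions, not the authors' code.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained AlexNet; in the paper this network would first be
# fine-tuned on the target action dataset (HMDB-51 or UCF-101).
alexnet = models.alexnet(pretrained=True).eval()

# Keep the classifier only up to the ReLU after the second fully-connected
# layer (fc7), so a forward pass returns the 4096-d fc7 activation.
fc7_extractor = torch.nn.Sequential(
    alexnet.features,
    alexnet.avgpool,
    torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:6],
)

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_feature(frame):
    """Return the 4096-d fc7 activation for one RGB frame (H x W x 3, uint8)."""
    x = preprocess(frame).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        return fc7_extractor(x).squeeze(0).numpy()
```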
Motivated by the above analysis, this paper proposes a novel stratified pooling model that transforms frame-level features into a video-level feature descriptor. Stratified pooling is designed to aggregate an arbitrary number of frame-level features into a fixed-length video-level descriptor, while maintaining a low computational footprint and effectively preventing over-fitting. Firstly, we uniformly sample video frames as the input of the CNN to fine-tune an AlexNet model [22] on the target dataset. Secondly, the learned model is applied to extract activations from different layers of the CNN architecture, including fully-connected layers and convolutional layers. Thirdly, multiple frame-level activations are aggregated to form a final video-level feature descriptor. Finally, we use an SVM with a linear kernel [7] for action classification; a sketch of this pipeline is given below.
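The following sketch shows one plausible reading of the remaining steps: PCA on the frame-level activations, pooling within temporal strata, and a linear SVM. The exact stratified pooling scheme is defined in Section 3 of the paper (not included in this excerpt), so the contiguous stratum splitting, the max-pooling choice, and the `n_strata` and `pca_dim` values here are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def stratified_pool(frame_feats, n_strata=3):
    """Aggregate an arbitrary number of frame-level features into one
    fixed-length video-level descriptor.

    Frames are split into `n_strata` contiguous temporal strata and
    max-pooled within each stratum; the pooled vectors are concatenated.
    Assumes each video contributes at least `n_strata` sampled frames.
    """
    strata = np.array_split(np.asarray(frame_feats), n_strata)
    return np.concatenate([s.max(axis=0) for s in strata])

# Illustrative end-to-end pipeline on precomputed frame-level features:
# `train_videos` / `test_videos` are lists of (n_frames_i, 4096) arrays.
def fit_predict(train_videos, train_labels, test_videos, pca_dim=128):
    pca = PCA(n_components=pca_dim).fit(np.vstack(train_videos))
    X_train = np.stack([stratified_pool(pca.transform(v)) for v in train_videos])
    X_test = np.stack([stratified_pool(pca.transform(v)) for v in test_videos])
    clf = LinearSVC().fit(X_train, train_labels)  # linear-kernel SVM, step (v)
    return clf.predict(X_test)
```

Because np.array_split tolerates strata of unequal length, videos with arbitrary frame counts map to descriptors of the same fixed length, which is exactly the property the pooling step requires.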
The main contributions of this study are summarized as follows: (1) The uniform sampling method preserves enough video information for action recognition. Furthermore, the network does not need a huge amount of labeled video data during training; we thus obtain a significant improvement on the HMDB-51 dataset, which contains only 6766 video clips. (2) We propose a novel stratified pooling method that allows an arbitrary number of frame-level features to be aggregated into a fixed-length video-level feature descriptor. (3) We analyze the impact of the activations of each CNN layer on action recognition, and our method achieves encouraging results on the HMDB-51 and UCF-101 datasets.
The rest of the paper is structured as follows. Section 2 reviews related work. Section 3 describes the proposed stratified pooling of deep convolutional neural network activations for action recognition. In Section 4, we demonstrate the efficiency and practicality of the proposed method on two public datasets. Finally, Section 5 concludes the paper.
2 Related work
Hand-crafted features have achieved great success in recognition tasks. Typical examples of local feature descriptors include the scale-invariant feature transform (SIFT) [31], spatio-temporal interest points (STIP) [23], histograms of optical flow (HOF) [24], histograms of oriented gradients (HOG) [24], histograms of oriented 3D spatio-temporal gradients (HOG3D) [21], speeded-up robust features (SURF) [2], and motion boundary histograms (MBH) [44]. As for face recognition, Jian et al. [18] proposed an efficient method based on singular values and potential-field representation for face-image retrieval, in which the rotation-, shift-, and scale-invariant properties of the singular values are exploited to design …