Multimed Tools Appl (2017) 76:13367–13382
DOI 10.1007/s11042-016-3768-5
Stratified pooling based deep convolutional neural
networks for human action recognition
Sheng Yu 1,2,3 · Yun Cheng 2 · Songzhi Su 1,3 · Guorong Cai 4 · Shaozi Li 1,3
Received: 9 December 2015 / Revised: 3 July 2016 / Accepted: 6 July 2016 /
Published online: 15 July 2016
© Springer Science+Business Media New York 2016
Abstract Video-based human action recognition is an active and challenging topic in computer vision. Over the last few years, deep convolutional neural networks (CNNs) have become the most popular method and have achieved state-of-the-art performance on several datasets, such as HMDB-51 and UCF-101. Since each video yields a varying number of frame-level features, how to combine these features into a good video-level feature becomes a challenging task. Therefore, this paper proposes a novel action recognition method named stratified pooling, based on deep convolutional neural networks (SP-CNN). The process is mainly composed of five parts: (i) fine-tuning a pre-trained CNN on the target dataset; (ii) frame-level feature extraction; (iii) principal component analysis (PCA) for feature dimensionality reduction; (iv) stratified pooling of the frame-level features to obtain a video-level feature; and (v) an SVM for multiclass classification. Finally, experimental results on the HMDB-51 and UCF-101 datasets show that the proposed method outperforms the state-of-the-art.
Keywords Human action recognition · Convolutional neural networks (CNN) · Stratified
pooling (SP) · Support vector machines (SVM)
Shaozi Li
szlig@xmu.edu.cn

Sheng Yu
yushengxmu@stu.xmu.edu.cn

1 Cognitive Science Department, Xiamen University, Xiamen 361005, China
2 School of Information, Hunan University of Humanities, Science and Technology, Loudi 417000, China
3 Fujian Key Laboratory of the Brain-like Intelligent Systems, Xiamen 361005, China
4 Computer Engineering College, Jimei University, Xiamen 361005, China
1 Introduction
Human action recognition has become a hot research topic in computer vision. There is a growing number of applications related to human action, especially in the area of video surveillance [12, 32, 35]. Meanwhile, it is still very challenging to recognize human actions due to viewpoint changes, large intra-class variations, varying motion speed, partial occlusion, fast irregular motion, background clutter, etc. [46, 52]. Specifically, compared with object recognition in images, human action recognition is much more difficult because actions always contain abundant spatio-temporal information.
Generally, the main pipeline of action recognition can be divided into three parts: feature extraction, feature coding to generate a video-level feature descriptor, and descriptor classification. To improve recognition performance, many research efforts have concentrated on one of these processes, among which visual feature descriptors play the most important role in action recognition. Overall, there are two main types of video features for action recognition: hand-crafted features such as histograms of oriented gradients (HOG) [24], histograms of optical flow (HOF) [24], and motion boundary histograms (MBH) [44]; and deep-learned features such as stacked convolutional independent subspace analysis [25], two-stream convolutional networks [41], and 3D convolutional networks [16]. Although hand-crafted features have achieved good results on human action recognition, it is worth noting that they often fail to deal with large intra-class variations and small inter-class variations.
Recently, deep convolutional neural networks (CNNs) have been proposed for large-scale image processing [22, 39, 42]. Note that CNNs have obtained state-of-the-art results on object detection, segmentation, and recognition [10, 20, 22, 39]. Inspired by these impressive results, some studies attempt to employ CNNs for action recognition, typically 3D convolutional neural networks [16], stacked convolutional independent subspace analysis [25], deep convolutional neural networks [20], two-stream convolutional networks [41], trajectory-pooled deep convolutional descriptors (TDD) [46], etc.
However, there are several problems in directly applying existing deep CNN models to action recognition. Firstly, the structures of video-based and image-based CNNs are different, so the weights of a video-based CNN must be trained on a video dataset from scratch. Secondly, a CNN usually contains tens of millions of parameters, which rely on a sufficient amount of training videos to prevent over-fitting. Thirdly, the network needs several weeks to train, depending on the architecture. The work in [20] collected a new Sports-1M dataset, which is composed of 1 million videos covering 498 classes of sports. In prior work [41], a fixed number of frames (e.g., 25) is randomly selected, and the frames undergo cropping and flipping to reshape them into a compatible input format for an image-based CNN. Because different videos have different frame rates and durations, it is difficult to decide how many video frames to sample. In [47], Wang et al. argued that sampling frames from the given video may lose important information and suggested using all video frames as the network input. That approach preserves all frame information, but greatly increases the burden of the subsequent steps.
To tackle the above problems, we first uniformly sample the given video with a stride of ten frames, and then use the sampled frames to fine-tune an AlexNet model [22] on the target dataset. Note that fine-tuning an existing CNN model not only effectively avoids over-fitting when dealing with small datasets, but also reduces the training time. Moreover, the sampling method preserves enough video information for action recognition in the case of video frame rates below 30 fps.
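As a concrete illustration of this sampling step, below is a minimal sketch using OpenCV; the paper does not publish code, so the function name `sample_frames` and the choice of OpenCV are our own illustrative assumptions.

```python
import cv2

def sample_frames(video_path, stride=10):
    """Uniformly sample one frame out of every `stride` frames of a video.

    The sampled frames serve both as fine-tuning inputs for the
    image-based CNN and, later, as the frames whose activations are
    pooled into the video-level descriptor.
    """
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of the video stream
            break
        if index % stride == 0:
            frames.append(frame)  # BGR image as a NumPy array
        index += 1
    cap.release()
    return frames
```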
As for the performance evaluation, two-stream convolutional networks [41] and temporal pyramid pooling [47] match the best performance of the improved trajectories [45] among hand-crafted features, and TDD reaches state-of-the-art performance on HMDB-51 and UCF-101. TDD is based on convolutional activation features for human action recognition. However, convolutional-layer activations are high-dimensional generic features that lack discriminative capacity. On this point, a number of studies have shown that fully-connected layer activations perform much better for image classification [22, 42] than convolutional activations. The main reason is that the fully-connected layer descriptor combines the benefits of strong semantics and low dimensionality. From our experimental results, we found that this virtue also carries over to human action recognition. Therefore, in this paper, we select fully-connected layer activations as frame-level features for action recognition.
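To make this frame-level feature concrete, here is a minimal sketch of extracting fully-connected (fc7) activations with torchvision's AlexNet. The paper fine-tunes its own AlexNet on the target dataset first; the torchvision model, the preprocessing constants, and the `frame_feature` helper below are illustrative assumptions, not the authors' code.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# ImageNet-pretrained AlexNet; in the paper this network would first be
# fine-tuned on the target action dataset (HMDB-51 or UCF-101).
alexnet = models.alexnet(pretrained=True).eval()

# Keep the classifier only up to the ReLU after the second fully-connected
# layer (fc7), so a forward pass returns the 4096-d fc7 activation.
fc7_extractor = torch.nn.Sequential(
    alexnet.features,
    alexnet.avgpool,
    torch.nn.Flatten(),
    *list(alexnet.classifier.children())[:6],
)

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_feature(frame):
    """Return the 4096-d fc7 activation for one RGB frame (H x W x 3, uint8)."""
    x = preprocess(frame).unsqueeze(0)  # add a batch dimension
    with torch.no_grad():
        return fc7_extractor(x).squeeze(0).numpy()
```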
Motivated by the above analysis, this paper proposes a novel stratified pooling model that transforms frame-level features into a video-level feature descriptor. Stratified pooling is designed to aggregate an arbitrary number of frame-level features into a fixed-length video-level descriptor, while maintaining a low computational footprint and effectively preventing over-fitting. Firstly, we uniformly sample video frames as the input of the CNN to fine-tune an AlexNet model [22] on the target dataset. Secondly, the learned model is applied to extract activations from different layers of the CNN architecture, including fully-connected layers and convolutional layers. Thirdly, multiple frame-level activations are aggregated to form a final video-level feature descriptor. Finally, we use an SVM with a linear kernel [7] for action classification; a sketch of this pipeline is given below.
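The following sketch shows one plausible reading of the remaining steps: PCA on the frame-level activations, pooling within temporal strata, and a linear SVM. The exact stratified pooling scheme is defined in Section 3 of the paper (not included in this excerpt), so the contiguous stratum splitting, the max-pooling choice, and the `n_strata` and `pca_dim` values here are assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def stratified_pool(frame_feats, n_strata=3):
    """Aggregate an arbitrary number of frame-level features into one
    fixed-length video-level descriptor.

    Frames are split into `n_strata` contiguous temporal strata and
    max-pooled within each stratum; the pooled vectors are concatenated.
    Assumes each video contributes at least `n_strata` sampled frames.
    """
    strata = np.array_split(np.asarray(frame_feats), n_strata)
    return np.concatenate([s.max(axis=0) for s in strata])

# Illustrative end-to-end pipeline on precomputed frame-level features:
# `train_videos` / `test_videos` are lists of (n_frames_i, 4096) arrays.
def fit_predict(train_videos, train_labels, test_videos, pca_dim=128):
    pca = PCA(n_components=pca_dim).fit(np.vstack(train_videos))
    X_train = np.stack([stratified_pool(pca.transform(v)) for v in train_videos])
    X_test = np.stack([stratified_pool(pca.transform(v)) for v in test_videos])
    clf = LinearSVC().fit(X_train, train_labels)  # linear-kernel SVM, step (v)
    return clf.predict(X_test)
```

Because np.array_split tolerates strata of unequal length, videos with arbitrary frame counts map to descriptors of the same fixed length, which is exactly the property the pooling step requires.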
The main contributions of this study are summarized as follows: (1) The uniform sampling method preserves enough video information for action recognition. Furthermore, the network does not need a huge amount of labeled video data during training; we thus obtain a significant improvement on the HMDB-51 dataset, which contains only 6766 video clips. (2) We propose a novel stratified pooling method that allows an arbitrary number of frame-level features to be aggregated into a fixed-length video-level feature descriptor. (3) We analyze the impact of the activations of each CNN layer on action recognition, and our method achieves encouraging results on the HMDB-51 and UCF-101 datasets.
The rest of the paper is structured as follows. Section 2 reviews related work. Section 3 describes the proposed stratified pooling of deep convolutional neural network activations for action recognition. In Section 4, we demonstrate the efficiency and practicality of the proposed method on two public datasets. Finally, Section 5 concludes the paper.
2 Related work
Hand-crafted features have achieved great success in recognition tasks. Typical examples of local feature descriptors include the scale-invariant feature transform (SIFT) [31], spatio-temporal interest points (STIP) [23], histograms of optical flow (HOF) [24], histograms of oriented gradients (HOG) [24], histograms of oriented 3D spatio-temporal gradients (HOG3D) [21], speeded-up robust features (SURF) [2], and motion boundary histograms (MBH) [44]. As for face recognition, Jian et al. [18] proposed an efficient method based on singular values and potential-field representation for face-image retrieval, in which the rotation-, shift-, and scale-invariant properties of the singular values are exploited to design …