A Hybrid Method for Human Interaction Recognition
using Spatio-Temporal Interest Points
Nijun Li, Xu Cheng, Haiyan Guo, Zhenyang Wu
School of Information Science and Engineering
Southeast University, Nanjing, China
{lnjleo, xcheng, haiyan.guo, zhenyang}@seu.edu.cn
Abstract—This paper proposes an effective hybrid method for
recognizing human interactions, which incorporates the
advantages of both a global feature (Motion Context, MC) and the
Spatio-Temporal (S-T) correlation of local Spatio-Temporal
Interest Points (STIPs). The MC feature, which also derives from
STIPs, is used to train a random forest where Genetic Algorithm
(GA) is applied to the training phase to achieve a good
compromise between reliability and efficiency. In addition, we design
an effective and efficient S-T correlation based match to assist the
MC feature, where MC’s structure and a biological sequence
matching algorithm are employed to calculate the spatial and
temporal correlation score, respectively. Experiments on the UT-
Interaction dataset show that our GA search based random
forest and S-T correlation based match achieve better
performance than several other prevalent machine learning
methods, and that a combination of those two methods
outperforms most of the state-of-the-art works.
Keywords—spatio-temporal interest points (STIPs); motion
context (MC); random forest; genetic algorithm (GA); spatio-
temporal (S-T) correlation
I. INTRODUCTION
Human action recognition, which has attracted
increasing research interest over the past decades, is now of
central importance in many applications related to computer
vision such as video surveillance, video retrieval, human-
computer interactions, robot vision, etc. Early studies in this
area usually experiment on simple datasets which only contain
single-person activities (e.g. Weizmann and KTH), and the
recognition rates on those benchmark datasets are now close to
100% [1]. However, the recognition rates on human
interactions are relatively low due to their richer inner
semantics and contextual information [2].
3D reconstruction and pose estimation based methods were
common in the early years of this century, but now the prevalent
approach is extracting 2D features directly from video
sequences, among which the Spatio-Temporal Interest Points
(STIPs) [3, 4] have been widely adopted in the past decade due to
their simplicity, effectiveness and robustness to cluttered
backgrounds [5]. To exploit STIPs wisely, the next question
to consider is whether to use descriptors to describe
them. Most researchers give an affirmative answer to this
question: they describe STIPs by various histograms (e.g.
HOF, HOG, HOG3D [6], 3D-SIFT [7], etc.), then cluster the
STIP descriptors to form an unstructured (Bag-of-Words,
BoW [7]) or structured (vocabulary tree [9]) codebook, and
finally fit them into a supervised (e.g. SVM [8] or neural
network [10]) or unsupervised (e.g. probabilistic Latent
Semantic Analysis, pLSA [11]) learning framework.
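The describe, cluster, and classify pipeline outlined above can be illustrated with a minimal NumPy sketch; the descriptor dimension (72) and vocabulary size (8) are toy values for illustration, not choices made in this paper:

```python
import numpy as np

def kmeans(descriptors, k, iters=20, seed=0):
    # plain k-means: pick k random descriptors as initial centers
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)].copy()
    for _ in range(iters):
        # assign each descriptor to its nearest center, then re-estimate centers
        dist = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    # quantize each descriptor to its nearest visual word; return normalized counts
    dist = np.linalg.norm(descriptors[:, None] - centers[None], axis=2)
    words = dist.argmin(axis=1)
    hist = np.bincount(words, minlength=len(centers)).astype(float)
    return hist / hist.sum()

# toy data standing in for HOG/HOF-style STIP descriptors of one video
rng = np.random.default_rng(1)
descs = rng.standard_normal((200, 72))
codebook = kmeans(descs, k=8)
hist = bow_histogram(descs, codebook)
```

The resulting per-video histogram is what would then be fed to a classifier such as an SVM.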
Nevertheless, it is also possible to focus only on the Spatio-
Temporal (S-T) relationships of STIPs. Bregonzio et al. [12]
extract multiple features from “clouds” of STIPs and
successfully use Nearest Neighbor (NN) classifier and SVM to
recognize human actions. Another inspiring example is
“Motion Context (MC)” [11] which is derived from “Shape
Context” [13] for object recognition, capturing the distribution
of STIPs.
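As an illustration of this idea, a Shape-Context-style log-polar occupancy histogram over 2D STIP coordinates can be computed as below; the bin layout and normalization are illustrative assumptions, not the exact MC construction of [11]:

```python
import numpy as np

def log_polar_hist(points, center, n_r=3, n_theta=8):
    # count interest points falling into log-radius x angle bins around a center
    pts = np.asarray(points, float) - np.asarray(center, float)
    r = np.hypot(pts[:, 0], pts[:, 1])
    theta = np.mod(np.arctan2(pts[:, 1], pts[:, 0]), 2 * np.pi)
    r_max = r.max() + 1e-9
    # log-spaced ring boundaries: each ring doubles in outer radius
    r_edges = r_max * 2.0 ** np.arange(-n_r, 1)        # e.g. [r/8, r/4, r/2, r]
    r_bin = np.searchsorted(r_edges, r, side='right')  # 0 = innermost disc
    t_bin = (theta / (2 * np.pi) * n_theta).astype(int) % n_theta
    hist = np.zeros((n_r + 1, n_theta))
    for rb, tb in zip(r_bin, t_bin):
        hist[rb, tb] += 1
    return hist / len(pts)

# scattered 2D STIP coordinates around a reference point (toy values)
rng = np.random.default_rng(0)
stips = rng.uniform(-10, 10, (50, 2))
H = log_polar_hist(stips, center=(0.0, 0.0))
```

Such a histogram captures where interest points fall relative to a reference, which is the essence of a context-style descriptor.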
Although the approach combining STIPs with BoW and
SVM is well known for its good performance, it has some
obvious shortcomings: (1) BoW uses unstructured local
features whose informative S-T relationships are totally
ignored; (2) SVM is not necessarily the best choice for
discriminative learning machine due to its binary classification
nature and difficulty in determining the kernel function
parameters. To overcome those shortcomings, Matikainen et
al. [14] describe human actions by pairwise S-T relationships
whereas Zhang et al. [15] put forward a “Bag of S-T Phrases
(BoP)” model, both taking advantage of the S-T constraints of
STIPs and achieving promising results. Apart from SVM, the
decision tree is receiving more and more attention [1, 9] for its
merits of multiclass classification and ability to create a
structured codebook. To better deal with noise and enjoy the
benefits of boosting, some works train a series of decision
trees (also called a “random forest”) [16, 17] instead of a single
tree.
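The bagging principle behind a random forest can be sketched with a deliberately tiny ensemble of depth-one decision stumps trained on bootstrap resamples; this is a stand-in for full decision trees, not the GA-based training proposed in this paper:

```python
import numpy as np

def train_stump(X, y):
    # exhaustive search for the best single-feature threshold split
    best = (0.0, 0, 0.0, False)              # (accuracy, feature, threshold, flip)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            acc = ((X[:, f] >= t).astype(int) == y).mean()
            if acc > best[0]:
                best = (acc, f, t, False)
            if 1 - acc > best[0]:
                best = (1 - acc, f, t, True)  # prediction flipped
    return best[1:]

def train_forest(X, y, n_trees=25, seed=0):
    # bagging: train each stump on a bootstrap resample of the data
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_trees):
        i = rng.integers(0, len(X), len(X))
        forest.append(train_stump(X[i], y[i]))
    return forest

def forest_predict(forest, X):
    # majority vote over the ensemble
    votes = np.array([(X[:, f] >= t).astype(int) ^ int(flip)
                      for f, t, flip in forest])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# two well-separated toy classes in 2D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(2, 0.3, (30, 2))])
y = np.r_[np.zeros(30, int), np.ones(30, int)]
forest = train_forest(X, y)
```

Averaging many weak, resampled learners is what makes the ensemble robust to label noise in the training set.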
This paper aims at exploring and presenting effective and
efficient methods for human interaction recognition, and the
contributions of our work are as follows.
(1) An innovative hybrid framework which incorporates
both global features and S-T correlation of local features is
proposed to recognize human interactions and achieves
promising results.
(2) Genetic Algorithm (GA) search is integrated into the
training of random forest for the first time, which proves to be
a good compromise between reliability and efficiency.
(3) An efficient scheme to calculate S-T correlation score
between two videos is presented, and such score based match
outperforms both BoW and pLSA (using the same codebook).
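To give a concrete picture of sequence-based temporal matching, the sketch below computes a Needleman-Wunsch global alignment score over per-frame symbol sequences; the scoring values, and the choice of this particular biological sequence algorithm, are illustrative assumptions rather than the exact scheme of this paper:

```python
import numpy as np

def alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    # Needleman-Wunsch global alignment via dynamic programming;
    # a and b could be, e.g., per-frame visual-word labels of two videos
    n, m = len(a), len(b)
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = gap * np.arange(n + 1)   # aligning a prefix against all gaps
    dp[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i, j] = max(dp[i - 1, j - 1] + s,   # substitute / match
                           dp[i - 1, j] + gap,     # gap in b
                           dp[i, j - 1] + gap)     # gap in a
    return dp[n, m]

score_same = alignment_score("abcab", "abcab")
score_diff = alignment_score("abc", "abd")
```

Gap tolerance is what makes such a score forgiving of small temporal misalignments between two videos of the same interaction.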
2014 22nd International Conference on Pattern Recognition
1051-4651/14 $31.00 © 2014 IEEE
DOI 10.1109/ICPR.2014.434