Unsupervised Facial Pose Grouping via Gabor
Subspace Affinity and Self-Tuning Spectral
Clustering
Xin Liu
Department of Computer Science and Technology
Huaqiao University, Xiamen, P.R. China
Email: xliu@hqu.edu.cn
Yiu-ming Cheung^{a,b,*}
a) Department of Computer Science, Hong Kong Baptist University, Hong Kong SAR, China
b) United International College, BNU – HKBU, Zhuhai, China
*) Corresponding Author; Email: ymc@comp.hkbu.edu.hk
Abstract—Facial pose grouping plays an important role in
video-based face recognition. In this paper, we present an
unsupervised facial pose grouping approach via Gabor subspace
affinity and self-tuning spectral clustering. First, we utilize a
local normalization method to reduce the impact of uneven
illumination, and then extract discriminative appearance
features via a Gabor wavelet representation. Next, a Gabor
subspace affinity method is presented to compute an affinity
matrix in terms of pairwise similarity, in which facial
frames of the same pose always share larger pairwise
similarities. Finally, we employ the self-tuning spectral clustering
algorithm to partition the affinity matrix, through which the number
of pose groups and the corresponding grouping results are
obtained automatically. Without any label priors, the proposed
approach is able to differentiate well among distinct facial poses
under uneven illumination, and the experimental results
demonstrate its satisfactory performance.
Index Terms—Facial pose grouping; Gabor subspace affinity;
self-tuning spectral clustering
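The self-tuning spectral clustering stage of the pipeline above can be illustrated with a minimal sketch. This is not the authors' implementation: it uses generic feature vectors in place of the Gabor subspace representations, builds the locally scaled affinity of Zelnik-Manor and Perona, and estimates the number of groups from the eigengap of the normalized graph Laplacian (a common heuristic consistent with choosing the group count automatically).

```python
# Minimal sketch (not the authors' code) of self-tuning spectral
# grouping: locally scaled affinity + eigengap-based group counting.
# Gabor subspace features are replaced here by generic vectors in X.
import numpy as np

def self_tuning_affinity(X, k=7):
    """Affinity with local scaling: A_ij = exp(-d_ij^2 / (s_i * s_j)),
    where s_i is the distance from x_i to its k-th nearest neighbor."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Local scale: distance to the k-th nearest neighbor (row 0 of the
    # sorted distances is the self-distance, so index k skips it).
    s = np.sort(D, axis=1)[:, min(k, len(X) - 1)]
    A = np.exp(-(D ** 2) / (np.outer(s, s) + 1e-12))
    np.fill_diagonal(A, 0.0)
    return A

def estimate_num_groups(A, max_k=10):
    """Pick the number of groups from the largest eigengap of the
    symmetrically normalized graph Laplacian."""
    d = A.sum(axis=1)
    d_inv = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - d_inv[:, None] * A * d_inv[None, :]
    w = np.linalg.eigvalsh(L)[:max_k]   # smallest eigenvalues, ascending
    return int(np.argmax(np.diff(w))) + 1
```

On data with two well-separated groups, the cross-group affinities vanish under local scaling, the Laplacian has two near-zero eigenvalues, and the eigengap heuristic recovers two groups without any label priors.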
I. INTRODUCTION
Face recognition has been an extensive research area in the
computer vision community. As one of the most successful
applications of computer vision in real life, it serves a large
number of commercial and law-enforcement purposes, including
surveillance, security, telecommunications, and human-computer
interaction. In the past, work on face recognition mainly
focused on a single probe face image captured in a relatively
controlled environment, and most successful recognition
approaches required an accurate alignment of the corresponding
feature vectors between the sample images to be compared.
Nevertheless, these single-image-based face recognition methods
have a limited application domain due to their sensitivity to
facial pose variations.
Evidently, as a common problem in the face recognition
community, natural face pose variation, e.g., moderate out-
of-plane head motion, provides important information about
emotional response and subtle facial actions, which should
generally be considered in designing a robust face recognition
algorithm. Under these circumstances, an intuitive way is to
obtain multiple views of the same face for recognition
purposes. Along this line, Li et al. [1] first utilized indepen-
dent component analysis (ICA) to learn a group of view-
specific subspace representations and subsequently performed
face recognition on the pre-learned multi-view face
examples, in which the span of the basis components in each
view-group defined a subspace of faces in that view. Later,
Huang et al. [2] first detected the coordinates of the eyes and
then utilized the k-means algorithm to determine the facial
pose. In addition, Mangai et al. [3] utilized Linear Discriminant
Analysis (LDA) to build a low-dimensional subspace for face
images sampled over a wide range of viewpoints, in which
the Mahalanobis distance was utilized to group the patterns for
subsequent recognition.
Recently, there has been increasing interest in video-
based face recognition because commonly available video
sources are able to provide more significant facial informa-
tion [4]. Intrinsically, recognition in video offers a great op-
portunity to integrate information temporally across the video
sequence, which may help to increase recognition rates
significantly. Along this line, Hadid et al. [5] first performed
unsupervised learning to extract the most representative sam-
ples (called “pose exemplars”) from the raw gallery sequences
and subsequently conducted probabilistic voting to recognize
the individuals in the videos. Volker et al. [6] first learned
a set of exemplars (i.e., cluster centers) incorporating
different poses to summarize the gallery video information,
and then utilized these exemplars as centers to model
probabilistic mixture distributions for recognition purposes.
Similarly, Lee et al. [7] first modeled each registered person with
facial pose variations by a group of low-dimensional appear-
ance manifolds (named pose manifolds) in the ambient image
space and then identified the person via these pose manifolds
efficiently. Later, they further represented the face model by a
complex nonlinear appearance manifold approximated by
a collection of linear subspaces [8], in which each subspace
incorporating the nearby poses was obtained by principal
component analysis (PCA) of frames from the training video
sequence. In addition, Chen et al. [9] first partitioned the
video sequence by a k-means clustering algorithm so that
frames with similar pose and illumination were in one
partition. Then, they learned a sub-dictionary with sparseness
2015 IEEE International Conference on Systems, Man, and Cybernetics
978-1-4799-8697-2/15 $31.00 © 2015 IEEE
DOI 10.1109/SMC.2015.475