Large Graph Construction for Scalable Semi-Supervised Learning
Wei Liu wliu@ee.columbia.edu
Junfeng He jh2700@columbia.edu
Shih-Fu Chang sfchang@ee.columbia.edu
Department of Electrical Engineering, Columbia University, New York, NY 10027, USA
Abstract
In this paper, we address the scalability issue
plaguing graph-based semi-supervised learn-
ing via a small number of anchor points which
adequately cover the entire point cloud. Crit-
ically, these anchor points enable nonpara-
metric regression that predicts the label for
each data point as a locally weighted av-
erage of the labels on anchor points. Be-
cause conventional graph construction is inef-
ficient in large scale, we propose to construct
a tractable large graph by coupling anchor-
based label prediction and adjacency matrix
design. Contrary to the Nystr¨om approxi-
mation of adjacency matrices which results
in indefinite graph Laplacians and in turn
leads to potential non-convex optimization
over graphs, the proposed graph construction
approach based on a unique idea called An-
chorGraph provi de s nonnegative adjacency
matrices to guarantee positive semidefinite
graph Laplacians. Our approach scales lin-
early with the data size and in practice usu-
ally produces a large sparse gr aph . Experi-
ments on large datasets demonstrate the sig-
nificant accuracy improvement and scalabil-
ity of the proposed approach.
1. Introduction
In pervasive applications of machine learning, one fre-
quently encounters situations where only a few labeled
data are available and large amounts of data remain
unlabeled. The labeled data often suffer from difficult
and expensive acqu is it ion whereas the unlabeled data
can be cheaply and automatically gathered. Semi-
supervised learning (SSL) (
Chapelle et al., 2006)(Zhu,
Appearing in Proceedings of the 27
th
International Confer-
ence on Machine Learning, Haifa, Isr ael, 2010. Copyright
2010 by the author(s)/owner(s).
2008) has been recommended to cope with the very
situations of limited labeled data and abundant unla-
beled data.
With rapid development of the Internet, now we can
collect massive (up to hundreds of millions) unlabeled
data such as images and videos, and then the need for
large scale SSL arises. Unfortunately, most SSL meth-
ods scale badly with the data size n. For instance, the
classical TSVM (
Joachims, 1999) is computationally
challenging, scaling exponentially with n. Among vari-
ous versions of TSVM, CCCP-TSVM (
Collobert et al.,
2006) has the lowest complexity, but it scales as at
least O(n
2
) so it is still difficult to scale up.
Graph-based SSL (
Zhu et al., 2003)(Zhou et al., 2004)
(Belkin et al., 2006) is appealing recently because it is
easy to implement and gives rise to closed-form solu-
tions. However, graph-based SSL usually has a cubic
time complexity O(n
3
) since the inverse of th e n × n
graph Lapl acian is needed
1
, thus blocking widespread
applicability to real-life problems that encounter grow-
ing amounts of unlabe led data. To temper the cubic
time complexity, recent studies seek to reduce the in-
tensive computation upon the graph Laplacian manip-
ulation. (
Delalleu et al., 2005) proposed a nonpara-
metric inductive function which makes label predic-
tion based on a subset of samples and then truncates
the graph Laplacian with the selected subset and its
connection to the rest samples. Clearly, such a trunca-
tion ignores the topology structure within the major-
ity part of input data and thereby loses considerable
data information. (
Zhu & Lafferty, 2005) fitted a gen-
erative mixture model to the raw data and proposed
the h ar monic mix tu r es to span the label prediction
function, but it did not explain how to construct a
large sparse graph such that the proposed harmonic
mixtures meth od can be scalable. (
Tsang & Kwok,
2007) scaled up the manifold regularization technology
first proposed in (Belkin et al., 2006) through solving
1
It is not easy yet to exactly solve the equivalent large-
scale linear systems.
评论0
最新资源