problems, but also exhibits a better predictive performance than LDA[2].
With the rapid development of information and web technologies, various kinds of complicated data, such as news, blogs, academic literature, and social data, increase continuously day by day. The resulting explosive expansion of large-scale text data makes data analysis by traditional methods quite problematic for the following reasons. Firstly, commonly used training methods such as Gibbs sampling and variational inference (VI) are inefficient and time-consuming. Gibbs sampling is based on numerous iterations and its convergence is hard to predict. VI often involves several hundred iterations, and at each iteration, it must deal with all documents in a corpus. The implementation of these two methods in sLDA becomes especially hard and slow, because the response variable is nonlinear
over the topic assignment, i.e., $z_d$ in (1), and the softmax distribution parameters, i.e., $\mu_c$ in (1), and the normalization factor strongly couples the topic assignments of each document[4].
Secondly, sLDA based on these two training methods runs as a single process, and thus cannot exploit multi-core computers or be extended to a computer cluster. Thirdly, sLDA based on these two training methods lacks online training capacity, which means that new input or streaming data cannot be processed and analyzed incrementally on the basis of the already trained model and estimated parameters.
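For concreteness, in the standard multi-class sLDA formulation the softmax response referred to as (1) takes the following form (our restatement of the usual convention, not a verbatim copy of the paper's (1)):

\[
p(c_d \mid z_d, \mu) = \frac{\exp\!\big(\mu_{c_d}^{\top}\bar{z}_d\big)}{\sum_{l=1}^{C}\exp\!\big(\mu_l^{\top}\bar{z}_d\big)}, \qquad \bar{z}_d = \frac{1}{N_d}\sum_{n=1}^{N_d} z_{d,n}.
\]

Because the denominator sums over all $C$ classes, it does not factorize across the word-level assignments $z_{d,n}$, which is precisely the coupling that complicates both Gibbs sampling and VI.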
As a stochastic version of VI, stochastic variational inference (SVI)[5] is a stochastic optimization technique that can approximate the posterior distribution of a probabilistic model containing latent variables. When estimating variational parameters, SVI needs only a sampled subset of the observed data at each iteration, while VI must cover all the observed data, which makes the SVI training procedure converge faster than that of VI. Moreover, SVI uses a noisy estimate of the natural gradient in the Riemannian space, instead of the traditional gradient in the Euclidean space, to optimize the objective function.
The natural gradient not only provides a better estimate of the closeness between two probability distributions, measured by the symmetrized Kullback-Leibler divergence[6], but also avoids the time-consuming process of computing the Fisher information matrix[6-7] at each iteration, which is unavoidable for methods based on the traditional Euclidean gradient. More importantly, SVI has an online training capacity. This owes mainly to an essential characteristic of SVI: the natural gradients computed on random subsets of the observed data are noisy but unbiased estimates of the true gradient over the total observed data. This capacity can effectively reduce the huge time overhead of the learning procedure while maintaining learning quality and meeting high real-time demands. It therefore makes SVI more competitive for online applications.
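For conditionally conjugate exponential-family models, the noisy natural-gradient step of [5] has a closed form, so each SVI iteration reduces to a convex combination of the previous global variational parameter and an intermediate estimate computed from the sampled subset. We restate the generic update below; the symbols follow the standard SVI notation rather than this paper's:

\[
\lambda^{(t)} = (1-\rho_t)\,\lambda^{(t-1)} + \rho_t\,\hat{\lambda}_t, \qquad \rho_t = (\tau_0 + t)^{-\kappa},
\]

where $\hat{\lambda}_t$ is obtained by treating the sampled documents as if they were replicated to the size of the whole corpus, and a step-size schedule with $\kappa \in (0.5, 1]$ satisfies the Robbins-Monro conditions, guaranteeing convergence to a local optimum of the variational objective.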
Previously, we published a conference paper that briefly presented and preliminarily validated the aforementioned idea[8]. In this work, we extend that work by providing comprehensive descriptions and analysis, surveying the related work, supplementing detailed mathematical proofs, adding a systematic analysis of time complexity and communication complexity, and fully validating the accuracy, scalability, convergence, and online training capability of the proposed approach. We summarize the motivations
and contributions as follows.
1) An online strategy for sLDA is proposed, which uses SVI to approximately infer the posterior distribution of sLDA and makes the training procedure more rapid and efficient.
2) The proposed online strategy is implemented in the MapReduce framework to provide a parallel mechanism for computing parameters and to expand the strategy's capacity for cloud computing and big data processing (a structural sketch follows this list).
3) The accuracy, scalability, and convergence of the proposed strategy implemented with MapReduce are validated on two data sets, and its online training capacity is verified.
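To make the parallelization pattern of contribution 2 concrete, the sketch below shows one SVI minibatch iteration organized in a MapReduce style. It is written for plain (unsupervised) LDA, since the sLDA local step additionally involves the response variable; all names and hyperparameter values are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np
from scipy.special import digamma

def dir_exp(a):
    # E[log x] under Dirichlet(a), computed along the last axis
    return digamma(a) - digamma(a.sum(axis=-1, keepdims=True))

def local_step(ids, cts, lam, alpha=0.1, iters=20):
    """Map task: variational E-step for a single document.
    ids, cts: unique word ids and their counts; lam: K x V global parameter."""
    K = lam.shape[0]
    Elogbeta = dir_exp(lam)[:, ids]              # K x Nd slice for this document
    gamma = np.ones(K)                           # per-document topic proportions
    for _ in range(iters):
        phi = np.exp(dir_exp(gamma)[:, None] + Elogbeta)
        phi /= phi.sum(axis=0, keepdims=True)    # normalize over topics
        gamma = alpha + phi @ cts
    return ids, phi * cts                        # per-document sufficient statistics

def svi_iteration(batch, lam, D, t, eta=0.01, tau0=1.0, kappa=0.7):
    # Map phase: independent local steps, one per document (parallelizable).
    sstats = np.zeros_like(lam)
    for ids, cts in batch:
        w, s = local_step(ids, cts, lam)
        sstats[:, w] += s                        # reduce phase: aggregate statistics
    # Global step: noisy natural-gradient update with a decaying step size.
    lam_hat = eta + (D / len(batch)) * sstats    # scale the batch up to corpus size
    rho = (tau0 + t) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat
```

The map phase is embarrassingly parallel across the documents of the minibatch, the reduce phase only sums sufficient statistics, and the global step applies the convex-combination update given above; this map-reduce-update structure is what the framework executes at scale.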
The rest of this paper is organized as follows. In Section 2, we briefly survey related work. We review the background knowledge on LDA, sLDA, and SVI in Section 3. In Section 4, we introduce the mathematical and inferential details of sLDA based on SVI, present in detail the parallel and online sLDA, and give a systematic analysis of the time and communication complexity of the proposed approach. In Section 5, we conduct a series of experiments to rigorously validate the accuracy, scalability, convergence, and online training efficiency of the proposed approach. Finally, Section 6 highlights the main contributions and draws the conclusions of this study.
2 Related Work
The original LDA[1] was proposed as an unsupervised generative probabilistic model for document modeling and text classification. The model can provide