problems, but also exhibits a better predictive performance than LDA[2].
With the rapid development of information and web technologies, various kinds of complicated data, such as news, blogs, academic literature, and social data, increase continuously day by day. The resulting explosive expansion of large-scale text data makes data analysis by traditional methods quite problematic for the following reasons. Firstly, commonly used training methods such as Gibbs sampling and variational inference (VI) are inefficient and time-consuming. Gibbs sampling is based on numerous iterations and its convergence is hard to predict. VI often involves several hundred iterations, and at each iteration, it must deal with all documents in a corpus. The implementation of these two methods in sLDA becomes especially hard and slow, because the response variable is nonlinear
over the topic assignment, i.e., $z_d$ in (1), and the softmax distribution parameters, i.e., $\mu_c$ in (1), and the normalization factor strongly couples the topic assignments of each document[4].
Secondly, sLDA based on these two training methods runs as a single process, and thus cannot exploit multi-core computers or be extended to a computer cluster. Thirdly, sLDA based on these two training methods lacks online training capacity, which means that new input or streaming data cannot be processed and analyzed incrementally on the basis of the already trained model and estimated parameters.
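For concreteness, in the standard multi-class sLDA formulation the softmax response referred to as (1) takes the following form (our restatement of the usual convention, not a verbatim copy of the paper's (1)):

\[
p(c_d \mid z_d, \mu) = \frac{\exp\!\big(\mu_{c_d}^{\top}\bar{z}_d\big)}{\sum_{l=1}^{C}\exp\!\big(\mu_l^{\top}\bar{z}_d\big)}, \qquad \bar{z}_d = \frac{1}{N_d}\sum_{n=1}^{N_d} z_{d,n}.
\]

Because the denominator sums over all $C$ classes, it does not factorize across the word-level assignments $z_{d,n}$, which is precisely the coupling that complicates both Gibbs sampling and VI.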
As a stochastic version of VI, stochastic variational inference (SVI)[5] is a stochastic optimization technique that can approximate the posterior distribution of a probabilistic model containing latent variables. When estimating variational parameters, SVI needs only a sampled subset of the observed data at each iteration, while VI must cover all the observed data, which makes the SVI training procedure converge faster than that of VI. Moreover, SVI uses a noisy estimate of the natural gradient in the Riemannian space, instead of the traditional gradient in the Euclidean space, to optimize the objective function.
The natural gradient not only provides a better estimate of the closeness between two probability distributions, measured by the symmetrized Kullback-Leibler divergence[6], but also avoids the time-consuming process of computing the Fisher information matrix[6-7] at each iteration, which is unavoidable for methods based on the traditional Euclidean gradient. More importantly, SVI has an online training capacity. This owes mainly to an essential characteristic of SVI: the natural gradients computed on random subsets of the observed data are noisy but unbiased estimates of the true gradient over the total observed data. This capacity can effectively reduce the huge time overhead of the learning procedure while maintaining learning quality and meeting high real-time demands. It therefore makes SVI more competitive for online applications.
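For conditionally conjugate exponential-family models, the noisy natural-gradient step of [5] has a closed form, so each SVI iteration reduces to a convex combination of the previous global variational parameter and an intermediate estimate computed from the sampled subset. We restate the generic update below; the symbols follow the standard SVI notation rather than this paper's:

\[
\lambda^{(t)} = (1-\rho_t)\,\lambda^{(t-1)} + \rho_t\,\hat{\lambda}_t, \qquad \rho_t = (\tau_0 + t)^{-\kappa},
\]

where $\hat{\lambda}_t$ is obtained by treating the sampled documents as if they were replicated to the size of the whole corpus, and a step-size schedule with $\kappa \in (0.5, 1]$ satisfies the Robbins-Monro conditions, guaranteeing convergence to a local optimum of the variational objective.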
Previously, we published a conference paper that briefly presented and preliminarily validated the aforementioned idea[8]. In this work, we extend that work by providing comprehensive descriptions and analysis, surveying the related work, supplementing detailed mathematical proofs, adding a systematic analysis of time complexity and communication complexity, and fully validating the accuracy, scalability, convergence, and online training capability of the proposed approach. We summarize the motivations
and contributions as follows.
1) An online strategy for sLDA is proposed, which uses SVI to approximately infer the posterior distribution of sLDA and makes the training procedure more rapid and efficient.
2) The proposed online strategy is implemented in the MapReduce framework to provide a parallel mechanism for computing parameters and to expand the strategy's capacity for cloud computing and big data processing (a structural sketch follows this list).
3) The accuracy, scalability, and convergence of the proposed strategy implemented with MapReduce are validated on two data sets, and its online training capacity is verified.
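To make the parallelization pattern of contribution 2 concrete, the sketch below shows one SVI minibatch iteration organized in a MapReduce style. It is written for plain (unsupervised) LDA, since the sLDA local step additionally involves the response variable; all names and hyperparameter values are illustrative assumptions rather than the paper's actual implementation.

```python
import numpy as np
from scipy.special import digamma

def dir_exp(a):
    # E[log x] under Dirichlet(a), computed along the last axis
    return digamma(a) - digamma(a.sum(axis=-1, keepdims=True))

def local_step(ids, cts, lam, alpha=0.1, iters=20):
    """Map task: variational E-step for a single document.
    ids, cts: unique word ids and their counts; lam: K x V global parameter."""
    K = lam.shape[0]
    Elogbeta = dir_exp(lam)[:, ids]              # K x Nd slice for this document
    gamma = np.ones(K)                           # per-document topic proportions
    for _ in range(iters):
        phi = np.exp(dir_exp(gamma)[:, None] + Elogbeta)
        phi /= phi.sum(axis=0, keepdims=True)    # normalize over topics
        gamma = alpha + phi @ cts
    return ids, phi * cts                        # per-document sufficient statistics

def svi_iteration(batch, lam, D, t, eta=0.01, tau0=1.0, kappa=0.7):
    # Map phase: independent local steps, one per document (parallelizable).
    sstats = np.zeros_like(lam)
    for ids, cts in batch:
        w, s = local_step(ids, cts, lam)
        sstats[:, w] += s                        # reduce phase: aggregate statistics
    # Global step: noisy natural-gradient update with a decaying step size.
    lam_hat = eta + (D / len(batch)) * sstats    # scale the batch up to corpus size
    rho = (tau0 + t) ** (-kappa)
    return (1.0 - rho) * lam + rho * lam_hat
```

The map phase is embarrassingly parallel across the documents of the minibatch, the reduce phase only sums sufficient statistics, and the global step applies the convex-combination update given above; this map-reduce-update structure is what the framework executes at scale.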
The rest of this paper is organized as follows. In Section 2, we briefly survey related work. We review the background knowledge on LDA, sLDA, and SVI in Section 3. In Section 4, we introduce the mathematical and inferential details of sLDA based on SVI, present in detail the parallel and online sLDA, and give a systematic analysis of the time and communication complexity of the proposed approach. In Section 5, we conduct a series of experiments to rigorously validate the accuracy, scalability, convergence, and online training efficiency of the proposed approach. Finally, Section 6 highlights the main contributions and draws the conclusions of this study.
2 Related Work
The original LDA[1] was proposed as an unsupervised generative probabilistic model for document modeling and text classification. The model can provide