Knowledge-Based Systems 182 (2019) 104812
Topic model with incremental vocabulary based on Belief Propagation✩

Meng Wang∗, Lu Yang, JianFeng Yan, Jianwei Zhang, Jie Zhou, Peng Xia

School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu, China

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.06.020.
∗ Corresponding author. E-mail address: mwang030@stu.suda.edu.cn (M. Wang).
Article info
Article history:
Received 26 October 2018
Received in revised form 24 June 2019
Accepted 24 June 2019
Available online 31 July 2019
Keywords:
Topic model
Belief Propagation
Stick-breaking process
Online algorithm
Abstract
Most LDA algorithms make the same limiting assumption of a fixed vocabulary. When these algorithms process data streams in real time, words that are absent from the vocabulary are discarded; unexpected words that appear in the streams cannot be processed, because the atoms of the Dirichlet distribution are fixed. To address these drawbacks, we propose ivLDA, whose topic–word distribution stems from the Dirichlet process, which has infinitely many atoms, instead of the Dirichlet distribution. ivLDA maintains an incremental vocabulary that enables the topic model to process data streams. In addition, two methods are presented to manage the indices of the words, namely ivLDA-Perp and ivLDA-PMI. ivLDA-Perp is capable of achieving high accuracy, and ivLDA-PMI is able to identify the most valuable words to represent each topic. As the experiments indicate, ivLDA-Perp and ivLDA-PMI achieve superior performance to infvoc-LDA and other state-of-the-art algorithms with fixed vocabularies.
© 2019 Elsevier B.V. All rights reserved.
1. Introduction
Latent Dirichlet Allocation (LDA) [1,2] is known as a matrix factorization algorithm that decomposes the document–word matrix into a document–topic matrix and a topic–word matrix. Batch LDA algorithms [3], represented by Variational Bayes (VB) [1], Gibbs Sampling (GS) [4] and Belief Propagation (BP) [5], all have very high space complexity. As the size of information sources has grown continuously in recent years, these algorithms tend to place strict demands on computer memory. Therefore, for handling a large corpus, applying online LDA algorithms [6,7] is a more efficient choice. Online LDA algorithms eliminate the need to keep the whole document–word matrix in memory: they split the corpus into batches and then estimate the distributions incrementally, and each batch is discarded from memory after the algorithm finishes processing it. Most online algorithms are derived from their batch versions, such as Online Variational Bayes (OVB) [8,9], Online Gibbs Sampling (OGS) [3,10,11] and Online Belief Propagation (OBP) [12]. There are two major differences between them: one is that they are based on different variants of LDA, and the other is that they take different approaches to refreshing the parameters.
In our view, the most valuable product of LDA is the topic–word distribution. Nevertheless, a large proportion of online LDA algorithms make the same limiting assumption of a fixed vocabulary, meaning that the words to be trained are determined when the process is initiated. When algorithms with a fixed vocabulary process data streams in real time, unexpected words that appear in the streams cannot be processed, because the atoms of the Dirichlet distribution are fixed: words excluded from the vocabulary are discarded and new words cannot be added. There are usually two workarounds for LDA algorithms with a fixed vocabulary: (1) building a self-contained vocabulary and ignoring the words excluded from it, which is a common solution but can cause the loss of crucial information; or (2) building a very large vocabulary to ensure that all words in the streams are included, which places a heavy burden on memory when the vocabulary is overly large.
To address the drawbacks mentioned above, we propose the ivLDA algorithm. The topic–word distribution of ivLDA follows the Dirichlet process (DP) [13,14], which has infinitely many atoms, rather than the Dirichlet distribution. ivLDA maintains an incremental vocabulary that permits the addition of new words. Although the vocabulary can be made flexible by using DP, DP also raises some issues. Since DP is an infinite version of the Dirichlet distribution, a truncation method is required. Moreover, the order of atoms in DP is essential: the words with lower indices may have a higher probability than those with higher indices.
For this reason, we design two methods to manage the indices of words, ivLDA-Perp and ivLDA-PMI. ivLDA-Perp can achieve high accuracy, and ivLDA-PMI can identify the most valuable words to represent each topic.
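As a rough illustration of what an incremental vocabulary means in practice (not of ivLDA-Perp or ivLDA-PMI themselves, which are defined in Section 3), the minimal sketch below keeps a growing word-to-index map so that unseen words in a new batch receive fresh indices instead of being dropped; the class and method names are hypothetical.

```python
class IncrementalVocabulary:
    """Growing word-to-index map: known words keep their index,
    unseen words are appended instead of being discarded."""

    def __init__(self):
        self.word2id = {}
        self.id2word = []

    def index(self, word):
        # Assign the next free index to a previously unseen word.
        if word not in self.word2id:
            self.word2id[word] = len(self.id2word)
            self.id2word.append(word)
        return self.word2id[word]

    def encode_batch(self, documents):
        # documents: list of token lists from the current stream batch.
        return [[self.index(w) for w in doc] for doc in documents]


vocab = IncrementalVocabulary()
batch1 = vocab.encode_batch([["topic", "model", "stream"]])
batch2 = vocab.encode_batch([["stream", "vocabulary"]])  # "vocabulary" gets a new index
```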
In recent years, very few LDA algorithms with an infinite vocabulary have been proposed; the most representative is infvoc-LDA [15], proposed by Ke Zhai in 2013, which assumes that the topic–word distribution stems from DP. Our model is noticeably different from theirs:

(1) infvoc-LDA is premised on the structure of SOI [16], whereas ivLDA is strictly based on BP. The foundations of the two algorithms are starkly different.

(2) DP has two parameters, one of which is the base distribution G0. The base distribution of ivLDA is the uniform distribution, which means that every word in topic k receives an identical probability from the stick-breaking construction (a generic sketch of this construction is given below). The base distribution of infvoc-LDA resembles an N-gram model, which means that each word in topic k has a different probability.

The details of the differences between ivLDA and infvoc-LDA are described in Section 3.
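For reference, the stick-breaking construction mentioned in point (2) can be sketched as follows. This is a generic truncated stick-breaking draw, shown only to illustrate why the order of atoms matters; the concentration parameter and truncation level are illustrative assumptions, not settings from either model.

```python
import numpy as np

def stick_breaking_weights(gamma, truncation, rng):
    """Truncated stick-breaking: beta_t ~ Beta(1, gamma),
    pi_t = beta_t * prod_{j<t} (1 - beta_j)."""
    betas = rng.beta(1.0, gamma, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking_weights(gamma=1.0, truncation=20, rng=rng)
# The expected weight decays with the position t, which is why the
# index assigned to a word affects the probability it can receive.
print(weights[:5], weights.sum())
```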
The remainder of this paper is structured as follows: Section 2 reviews the LDA algorithm, the BP algorithm, and other related work. In Section 3, the ivLDA algorithm is derived, and ivLDA-Perp and ivLDA-PMI are described in detail. In Section 4, ivLDA is compared with infvoc-LDA and other state-of-the-art algorithms with fixed vocabularies. Section 5 draws conclusions and outlines future work.
2. Background
Latent Dirichlet Allocation (LDA) [1,2] is a graphical topic model with a three-level structure: corpus level, document level, and word level. It assumes that each document in the corpus can be characterized by a particular set of topics, and the main task of LDA is to assign a topic label to each word in each document. In LDA, each document $d$ in the corpus is represented by a document–topic distribution $\theta_d$. The number of atoms of $\theta_d$ is $K$, where $K$ denotes the number of topics. Each topic $k$ is represented by a topic–word distribution $\phi_k$, whose atoms are the words in the vocabulary.

The generative process of LDA is as follows: sample from a Dirichlet distribution with symmetric parameter $\alpha$ to obtain the distribution $\theta_d$ of topics in document $d$, and sample from a Dirichlet distribution with symmetric parameter $\beta$ to obtain the distribution $\phi_k$ of words in topic $k$. After that, choose a topic $k$ for word $i$ in document $d$ from $\theta_d$, and choose a word $w_i$ from $\phi_k$. Repeat this process until all documents are generated.
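To make the generative story above concrete, here is a minimal sketch in Python/NumPy; the corpus size, vocabulary size, topic count, and hyperparameter values are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): D documents, W words, K topics.
D, W, K = 5, 1000, 10
alpha, beta = 0.1, 0.01                        # symmetric Dirichlet hyperparameters
doc_lengths = rng.poisson(50, size=D)

# Topic-word distributions phi_k ~ Dirichlet(beta), one per topic.
phi = rng.dirichlet(np.full(W, beta), size=K)  # shape (K, W)

corpus = []
for d in range(D):
    # Document-topic distribution theta_d ~ Dirichlet(alpha).
    theta_d = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(doc_lengths[d]):
        k = rng.choice(K, p=theta_d)           # choose a topic for this word
        w = rng.choice(W, p=phi[k])            # choose a word from topic k
        words.append(w)
    corpus.append(words)
```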
The joint probability of LDA [1] is:

$$p(\mathbf{x}, \mathbf{z}, \theta, \phi \mid \alpha, \beta) \propto \prod_{k=1}^{K} p(\phi_k \mid \beta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{d=1}^{D} \prod_{i=1}^{N_d} p\left(w_i \mid z_i^k, \phi_k\right) p\left(z_i^k \mid \theta_d\right) \tag{1}$$
2.1. Belief propagation
Integrating out the multinomial parameters $\theta_d$ and $\phi_k$ in (1), we obtain the joint probability of the collapsed LDA [4]:

$$p(\mathbf{x}, \mathbf{z} \mid \alpha, \beta) \propto \prod_{d}\prod_{k} \Gamma\left(\sum_{w} x_{w,d}\, z_{w,d}^{k} + \alpha\right) \times \prod_{w}\prod_{k} \Gamma\left(\sum_{d} x_{w,d}\, z_{w,d}^{k} + \beta\right) \times \prod_{k} \Gamma\left(\sum_{w,d} x_{w,d}\, z_{w,d}^{k} + W\beta\right)^{-1} \tag{2}$$

where $\Gamma(\cdot)$ is the gamma function, and $\alpha$ and $\beta$ are the hyperparameters.
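For readers who want to check Eq. (2) numerically, the following sketch evaluates its unnormalized logarithm for a hard topic assignment (one label per word–document pair); the function name and the use of SciPy's gammaln are our own illustrative choices.

```python
import numpy as np
from scipy.special import gammaln

def collapsed_log_joint(x, z, K, alpha, beta):
    """Unnormalized log of Eq. (2): x is the (W, D) count matrix,
    z the (W, D) matrix of topic labels in [0, K)."""
    W, D = x.shape
    nkd = np.zeros((K, D))              # sum_w x_{w,d} z^k_{w,d}
    nkw = np.zeros((K, W))              # sum_d x_{w,d} z^k_{w,d}
    for w in range(W):
        for d in range(D):
            nkd[z[w, d], d] += x[w, d]
            nkw[z[w, d], w] += x[w, d]
    nk = nkw.sum(axis=1)                # sum_{w,d} x_{w,d} z^k_{w,d}
    return (gammaln(nkd + alpha).sum()
            + gammaln(nkw + beta).sum()
            - gammaln(nk + W * beta).sum())
```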
To maximize the likelihood of $\mathbf{z}$ in (2), the BP algorithm provides exact solutions. In BP, the conditional joint probability $p\left(z_{w,d}^{k}=1 \mid \mathbf{z}_{-(w,d)}^{k}, \mathbf{x}\right)$, called the message $\mu_{w,d}(k)$ with $0 \le \mu_{w,d}(k) \le 1$, is normalized so that $\sum_{k=1}^{K} \mu_{w,d}(k) = 1$. The message update equation is:

$$\mu_{w,d}(k) \propto \frac{\left[\hat{\theta}_{-w,d}(k) + \alpha\right]\left[\hat{\phi}_{w,-d}(k) + \beta\right]}{\hat{\phi}_{-(w,d)}(k) + W\beta} \tag{3}$$
where the sufficient statistics are:

$$\hat{\theta}_{-w,d}(k) = \sum_{-w} x_{w,d}\, \mu_{w,d}(k) \tag{4}$$

$$\hat{\phi}_{w,-d}(k) = \sum_{-d} x_{w,d}\, \mu_{w,d}(k) \tag{5}$$
where $-w$ and $-d$ denote all words except $w$ and all documents except $d$, respectively. Note that the message $\mu_{w,d}(k)$ depends on the neighboring messages $\mu_{-(w,d)}(k)$. The topic distribution $\theta_d$ of document $d$ and the word distribution $\phi_k$ of topic $k$ can be estimated from the sufficient statistics $\hat{\theta}_d(k)$ and $\hat{\phi}_w(k)$ by Dirichlet normalization:

$$\theta_d(k) = \frac{\hat{\theta}_d(k) + \alpha}{\sum_{k} \hat{\theta}_d(k) + K\alpha} \tag{6}$$

$$\phi_w(k) = \frac{\hat{\phi}_w(k) + \beta}{\sum_{w} \hat{\phi}_w(k) + W\beta} \tag{7}$$
BP iterates Eq. (3) until $\hat{\theta}_d(k)$ and $\hat{\phi}_w(k)$ converge. More details can be found in [5].
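The whole BP loop over Eqs. (3)–(7) can be sketched as follows, assuming a dense word–document count matrix; the variable names, convergence threshold, and dense layout are illustrative choices, and this is a sketch rather than the authors' implementation.

```python
import numpy as np

def bp_lda(x, K, alpha=0.1, beta=0.01, iters=100, tol=1e-4, seed=0):
    """Synchronous BP for LDA on a dense (W, D) count matrix x,
    following the message update (3), sufficient statistics (4)-(5),
    and Dirichlet normalization (6)-(7) above."""
    rng = np.random.default_rng(seed)
    W, D = x.shape
    mu = rng.random((K, W, D))
    mu /= mu.sum(axis=0, keepdims=True)        # normalize messages over topics

    for _ in range(iters):
        xmu = x[None, :, :] * mu               # x_{w,d} * mu_{w,d}(k), shape (K, W, D)
        theta_hat = xmu.sum(axis=1)            # (K, D): summed over words
        phi_hat = xmu.sum(axis=2)              # (K, W): summed over documents
        total = xmu.sum(axis=(1, 2))           # (K,): summed over all (w, d)

        # Exclude the current (w, d) pair, as the -w, -d notation requires.
        theta_ex = theta_hat[:, None, :] - xmu
        phi_ex = phi_hat[:, :, None] - xmu
        denom_ex = total[:, None, None] - xmu

        new_mu = (theta_ex + alpha) * (phi_ex + beta) / (denom_ex + W * beta)
        new_mu /= new_mu.sum(axis=0, keepdims=True)
        if np.max(np.abs(new_mu - mu)) < tol:
            mu = new_mu
            break
        mu = new_mu

    xmu = x[None, :, :] * mu
    theta_hat, phi_hat = xmu.sum(axis=1), xmu.sum(axis=2)
    theta = (theta_hat + alpha) / (theta_hat.sum(axis=0) + K * alpha)             # Eq. (6)
    phi = (phi_hat + beta) / (phi_hat.sum(axis=1, keepdims=True) + W * beta)      # Eq. (7)
    return theta, phi
```

For example, `theta, phi = bp_lda(x, K=10)` would return the (K, D) document–topic and (K, W) topic–word estimates for a toy count matrix `x`.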
As mentioned above, BP keeps all documents in memory and scans them repeatedly until the algorithm converges, so the memory cost becomes very high when the corpus is large. To overcome this drawback of BP, Zeng proposed Online Belief Propagation (OBP) [12,14,17,18]. In OBP, the corpus is split into batches $x_{w,d}^{s}$, $d \in [1, D_s]$, $w \in [1, \infty]$, $s \in [1, \infty]$, where $D_s$ is the number of documents in the current batch and $s$ is the index of the batch. Each batch is freed from memory after its processing is complete. The products of the batch in memory, the message matrix $\mu^{K \times \mathrm{NNZ}_x}$ and the topic parameter matrix $\hat{\theta}^{K \times D_s}$, depend only on the current batch and are freed as well. The global parameter matrix $\phi^{K \times W}$ remains in memory until the algorithm ends. OBP initializes the messages randomly and then initializes the sufficient statistics $\hat{\theta}_d^{s}(k)$ and $\hat{\phi}_w^{s}(k)$:
$$\hat{\theta}_d^{s}(k) = \sum_{w} x_{w,d}^{s}\, \mu_{w,d}^{s}(k) \tag{8}$$

$$\hat{\phi}_w^{s}(k) = \hat{\phi}_w^{s-1}(k) + \sum_{d} x_{w,d}^{s}\, \mu_{w,d}^{s}(k) \tag{9}$$
where $\hat{\phi}_w^{s-1}(k)$ is the sufficient statistic from the previous batch. After the initialization is completed, OBP runs BP on every batch.
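Since the preview ends here, the sketch below only illustrates how Eqs. (8) and (9) could drive the per-batch loop: the topic–word statistic persists across batches, while the messages and the document–topic statistic are rebuilt for each batch and then discarded. Because the remaining pages are not available, the inner message update simply reuses the form of Eq. (3) as an assumption, and all names are illustrative.

```python
import numpy as np

def obp_stream(batches, K, W, alpha=0.1, beta=0.01, sweeps=20, seed=0):
    """Simplified online BP over a stream of (W, D_s) count-matrix batches.
    phi_hat_prev persists across batches (Eq. (9)); messages and theta_hat
    are per-batch quantities (Eq. (8)) and are discarded afterwards."""
    rng = np.random.default_rng(seed)
    phi_hat_prev = np.zeros((K, W))                 # global statistic kept in memory

    for x_s in batches:                             # x_s: counts of batch s
        mu = rng.random((K, W, x_s.shape[1]))
        mu /= mu.sum(axis=0, keepdims=True)         # random message initialization

        for _ in range(sweeps):                     # BP-style sweeps within the batch
            xmu = x_s[None, :, :] * mu
            theta_hat = xmu.sum(axis=1)               # Eq. (8): per-document statistic
            phi_hat = phi_hat_prev + xmu.sum(axis=2)  # Eq. (9): carries earlier batches
            # Message update in the spirit of Eq. (3), using the batch statistics.
            numer = (theta_hat[:, None, :] - xmu + alpha) * (phi_hat[:, :, None] - xmu + beta)
            denom = phi_hat.sum(axis=1)[:, None, None] - xmu + W * beta
            mu = numer / denom
            mu /= mu.sum(axis=0, keepdims=True)

        phi_hat_prev += (x_s[None, :, :] * mu).sum(axis=2)  # fold the batch into Eq. (9)
        # mu, theta_hat and x_s can now be freed; only phi_hat_prev persists.

    return (phi_hat_prev + beta) / (phi_hat_prev.sum(axis=1, keepdims=True) + W * beta)
```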