Knowledge-Based Systems 182 (2019) 104812
Topic model with incremental vocabulary based on Belief Propagation✩

Meng Wang∗, Lu Yang, JianFeng Yan, Jianwei Zhang, Jie Zhou, Peng Xia

School of Computer Science & Technology, Soochow University, Suzhou, Jiangsu, China

✩ No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.06.020.
∗ Corresponding author. E-mail address: mwang030@stu.suda.edu.cn (M. Wang).
Article info
Article history:
Received 26 October 2018
Received in revised form 24 June 2019
Accepted 24 June 2019
Available online 31 July 2019
Keywords:
Topic model
Belief Propagation
Stick-breaking process
Online algorithm
Abstract
Most LDA algorithms make the same limiting assumption of a fixed vocabulary. When these algorithms process data streams in real time, words that are absent from the vocabulary are discarded; unexpected words that appear in the streams cannot be processed, because the atoms of the Dirichlet distribution are fixed. To address these drawbacks, we propose ivLDA, whose topic–word distribution stems from the Dirichlet process, which has infinitely many atoms, instead of the Dirichlet distribution. ivLDA maintains an incremental vocabulary that enables the topic model to process data streams. In addition, two methods are presented to manage the indices of the words, namely ivLDA-Perp and ivLDA-PMI. ivLDA-Perp is capable of achieving high accuracy, and ivLDA-PMI is able to identify the most valuable words to represent each topic. As the experiments indicate, ivLDA-Perp and ivLDA-PMI achieve superior performance to infvoc-LDA and other state-of-the-art algorithms with fixed vocabularies.
© 2019 Elsevier B.V. All rights reserved.
1. Introduction
Latent Dirichlet Allocation (LDA) [1,2] is known as a matrix factorization algorithm that decomposes the document–word matrix into a document–topic matrix and a topic–word matrix. Batch LDA algorithms [3], represented by Variational Bayes (VB) [1], Gibbs Sampling (GS) [4] and Belief Propagation (BP) [5], all have very high space complexity. As the size of information sources has grown continuously in recent years, these algorithms tend to place strict demands on computer memory. Therefore, for handling a large corpus, applying online LDA algorithms [6,7] is a more efficient choice. Online LDA algorithms eliminate the need to keep the whole document–word matrix in memory: they split the corpus into batches and then estimate the distributions incrementally, and each batch is discarded from memory after the algorithm finishes processing it. Most online algorithms are derived from their batch versions, such as Online Variational Bayes (OVB) [8,9], Online Gibbs Sampling (OGS) [3,10,11] and Online Belief Propagation (OBP) [12]. There are two major differences between them: one is that they are based on different variants of LDA, and the other is that they take different approaches to refreshing the parameters.
In our view, the most valuable product of LDA is the topic–word distribution. Nevertheless, a large proportion of online LDA algorithms make the same limiting assumption of a fixed vocabulary, meaning that the words to be trained are determined when the process is initiated. When algorithms with a fixed vocabulary process data streams in real time, unexpected words that appear in the streams cannot be processed, because the atoms of the Dirichlet distribution are fixed: words excluded from the vocabulary are discarded and new words cannot be added. There are usually two workarounds for LDA algorithms with a fixed vocabulary: (1) building a self-contained vocabulary and ignoring the words excluded from it, which is a common solution but can cause the loss of crucial information; or (2) building a very large vocabulary to ensure that all words in the streams are included, which places a heavy burden on memory when the vocabulary is overly large.
To address the drawbacks mentioned above, we propose the ivLDA algorithm. The topic–word distribution of ivLDA follows the Dirichlet process (DP) [13,14], which has infinitely many atoms, rather than the Dirichlet distribution. ivLDA maintains an incremental vocabulary that permits the addition of new words. Although the vocabulary can be made flexible by using DP, DP also raises some issues. Since DP is an infinite version of the Dirichlet distribution, a truncation method is required. Moreover, the order of atoms in DP is essential: the words with lower indices may have a higher probability than those with higher indices.
For this reason, we design two methods to manage the indices of words, ivLDA-Perp and ivLDA-PMI. ivLDA-Perp can achieve high accuracy, and ivLDA-PMI can identify the most valuable words to represent each topic.
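As a rough illustration of what an incremental vocabulary means in practice (not of ivLDA-Perp or ivLDA-PMI themselves, which are defined in Section 3), the minimal sketch below keeps a growing word-to-index map so that unseen words in a new batch receive fresh indices instead of being dropped; the class and method names are hypothetical.

```python
class IncrementalVocabulary:
    """Growing word-to-index map: known words keep their index,
    unseen words are appended instead of being discarded."""

    def __init__(self):
        self.word2id = {}
        self.id2word = []

    def index(self, word):
        # Assign the next free index to a previously unseen word.
        if word not in self.word2id:
            self.word2id[word] = len(self.id2word)
            self.id2word.append(word)
        return self.word2id[word]

    def encode_batch(self, documents):
        # documents: list of token lists from the current stream batch.
        return [[self.index(w) for w in doc] for doc in documents]


vocab = IncrementalVocabulary()
batch1 = vocab.encode_batch([["topic", "model", "stream"]])
batch2 = vocab.encode_batch([["stream", "vocabulary"]])  # "vocabulary" gets a new index
```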
In recent years, very few LDA algorithms with an infinite vocabulary have been proposed; the most representative is infvoc-LDA [15], proposed by Ke Zhai in 2013, which assumes that the topic–word distribution stems from DP. Our model is noticeably different from theirs:

(1) infvoc-LDA is premised on the structure of SOI [16], whereas ivLDA is strictly based on BP. The foundations of the two algorithms are starkly different.

(2) DP has two parameters, one of which is the base distribution G0. The base distribution of ivLDA is the uniform distribution, which means that every word in topic k receives an identical probability from the stick-breaking construction (a generic sketch of this construction is given below). The base distribution of infvoc-LDA resembles an N-gram model, which means that each word in topic k has a different probability.

The details of the differences between ivLDA and infvoc-LDA are described in Section 3.
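For reference, the stick-breaking construction mentioned in point (2) can be sketched as follows. This is a generic truncated stick-breaking draw, shown only to illustrate why the order of atoms matters; the concentration parameter and truncation level are illustrative assumptions, not settings from either model.

```python
import numpy as np

def stick_breaking_weights(gamma, truncation, rng):
    """Truncated stick-breaking: beta_t ~ Beta(1, gamma),
    pi_t = beta_t * prod_{j<t} (1 - beta_j)."""
    betas = rng.beta(1.0, gamma, size=truncation)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return betas * remaining

rng = np.random.default_rng(0)
weights = stick_breaking_weights(gamma=1.0, truncation=20, rng=rng)
# The expected weight decays with the position t, which is why the
# index assigned to a word affects the probability it can receive.
print(weights[:5], weights.sum())
```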
The remainder of this paper is structured as follows: Section 2 reviews the LDA algorithm, the BP algorithm, and other related work. In Section 3, the ivLDA algorithm is derived, and ivLDA-Perp and ivLDA-PMI are described in detail. In Section 4, ivLDA is compared with infvoc-LDA and other state-of-the-art algorithms with fixed vocabularies. Section 5 draws conclusions and outlines future work.
2. Background
Latent Dirichlet Allocation (LDA) [1,2] is a graphical topic model with a three-level structure: corpus level, document level, and word level. It assumes that each document in the corpus can be characterized by a particular set of topics, and the main task of LDA is to assign a topic label to each word in each document. In LDA, each document $d$ in the corpus is represented by a document–topic distribution $\theta_d$. The number of atoms of $\theta_d$ is $K$, where $K$ denotes the number of topics. Each topic $k$ is represented by a topic–word distribution $\phi_k$, whose atoms are the words in the vocabulary.

The generative process of LDA is as follows: sample from a Dirichlet distribution with symmetric parameter $\alpha$ to obtain the distribution $\theta_d$ of topics in document $d$, and sample from a Dirichlet distribution with symmetric parameter $\beta$ to obtain the distribution $\phi_k$ of words in topic $k$. After that, choose a topic $k$ for word $i$ in document $d$ from $\theta_d$, and choose a word $w_i$ from $\phi_k$. Repeat this process until all documents are generated.
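To make the generative story above concrete, here is a minimal sketch in Python/NumPy; the corpus size, vocabulary size, topic count, and hyperparameter values are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (not from the paper): D documents, W words, K topics.
D, W, K = 5, 1000, 10
alpha, beta = 0.1, 0.01                        # symmetric Dirichlet hyperparameters
doc_lengths = rng.poisson(50, size=D)

# Topic-word distributions phi_k ~ Dirichlet(beta), one per topic.
phi = rng.dirichlet(np.full(W, beta), size=K)  # shape (K, W)

corpus = []
for d in range(D):
    # Document-topic distribution theta_d ~ Dirichlet(alpha).
    theta_d = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(doc_lengths[d]):
        k = rng.choice(K, p=theta_d)           # choose a topic for this word
        w = rng.choice(W, p=phi[k])            # choose a word from topic k
        words.append(w)
    corpus.append(words)
```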
The joint probability of LDA [1] is:

$$p(\mathbf{x}, \mathbf{z}, \theta, \phi \mid \alpha, \beta) \propto \prod_{k=1}^{K} p(\phi_k \mid \beta) \prod_{d=1}^{D} p(\theta_d \mid \alpha) \prod_{d=1}^{D} \prod_{i=1}^{N_d} p\left(w_i \mid z_i^k, \phi_k\right) p\left(z_i^k \mid \theta_d\right) \tag{1}$$
2.1. Belief propagation
Integrating out the multinomial parameters $\theta_d$ and $\phi_k$ in (1), we obtain the joint probability of the collapsed LDA [4]:

$$p(\mathbf{x}, \mathbf{z} \mid \alpha, \beta) \propto \prod_{d}\prod_{k} \Gamma\left(\sum_{w} x_{w,d}\, z_{w,d}^{k} + \alpha\right) \times \prod_{w}\prod_{k} \Gamma\left(\sum_{d} x_{w,d}\, z_{w,d}^{k} + \beta\right) \times \prod_{k} \Gamma\left(\sum_{w,d} x_{w,d}\, z_{w,d}^{k} + W\beta\right)^{-1} \tag{2}$$

where $\Gamma(\cdot)$ is the gamma function, and $\alpha$ and $\beta$ are the hyperparameters.
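For readers who want to check Eq. (2) numerically, the following sketch evaluates its unnormalized logarithm for a hard topic assignment (one label per word–document pair); the function name and the use of SciPy's gammaln are our own illustrative choices.

```python
import numpy as np
from scipy.special import gammaln

def collapsed_log_joint(x, z, K, alpha, beta):
    """Unnormalized log of Eq. (2): x is the (W, D) count matrix,
    z the (W, D) matrix of topic labels in [0, K)."""
    W, D = x.shape
    nkd = np.zeros((K, D))              # sum_w x_{w,d} z^k_{w,d}
    nkw = np.zeros((K, W))              # sum_d x_{w,d} z^k_{w,d}
    for w in range(W):
        for d in range(D):
            nkd[z[w, d], d] += x[w, d]
            nkw[z[w, d], w] += x[w, d]
    nk = nkw.sum(axis=1)                # sum_{w,d} x_{w,d} z^k_{w,d}
    return (gammaln(nkd + alpha).sum()
            + gammaln(nkw + beta).sum()
            - gammaln(nk + W * beta).sum())
```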
To maximize the likelihood of $\mathbf{z}$ in (2), the BP algorithm provides exact solutions. In BP, the conditional joint probability $p\left(z_{w,d}^{k}=1 \mid \mathbf{z}_{-(w,d)}^{k}, \mathbf{x}\right)$, called the message $\mu_{w,d}(k)$ with $0 \le \mu_{w,d}(k) \le 1$, is normalized so that $\sum_{k=1}^{K} \mu_{w,d}(k) = 1$. The message update equation is:

$$\mu_{w,d}(k) \propto \frac{\left[\hat{\theta}_{-w,d}(k) + \alpha\right]\left[\hat{\phi}_{w,-d}(k) + \beta\right]}{\hat{\phi}_{-(w,d)}(k) + W\beta} \tag{3}$$
where the sufficient statistics are:

$$\hat{\theta}_{-w,d}(k) = \sum_{-w} x_{w,d}\, \mu_{w,d}(k) \tag{4}$$

$$\hat{\phi}_{w,-d}(k) = \sum_{-d} x_{w,d}\, \mu_{w,d}(k) \tag{5}$$
where $-w$ and $-d$ denote all words except $w$ and all documents except $d$, respectively. Note that the message $\mu_{w,d}(k)$ depends on the neighboring messages $\mu_{-(w,d)}(k)$. The topic distribution $\theta_d$ of document $d$ and the word distribution $\phi_k$ of topic $k$ can be estimated from the sufficient statistics $\hat{\theta}_d(k)$ and $\hat{\phi}_w(k)$ by Dirichlet normalization:

$$\theta_d(k) = \frac{\hat{\theta}_d(k) + \alpha}{\sum_{k} \hat{\theta}_d(k) + K\alpha} \tag{6}$$

$$\phi_w(k) = \frac{\hat{\phi}_w(k) + \beta}{\sum_{w} \hat{\phi}_w(k) + W\beta} \tag{7}$$
BP iterates Eq. (3) until $\hat{\theta}_d(k)$ and $\hat{\phi}_w(k)$ converge. More details can be found in [5].
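The whole BP loop over Eqs. (3)–(7) can be sketched as follows, assuming a dense word–document count matrix; the variable names, convergence threshold, and dense layout are illustrative choices, and this is a sketch rather than the authors' implementation.

```python
import numpy as np

def bp_lda(x, K, alpha=0.1, beta=0.01, iters=100, tol=1e-4, seed=0):
    """Synchronous BP for LDA on a dense (W, D) count matrix x,
    following the message update (3), sufficient statistics (4)-(5),
    and Dirichlet normalization (6)-(7) above."""
    rng = np.random.default_rng(seed)
    W, D = x.shape
    mu = rng.random((K, W, D))
    mu /= mu.sum(axis=0, keepdims=True)        # normalize messages over topics

    for _ in range(iters):
        xmu = x[None, :, :] * mu               # x_{w,d} * mu_{w,d}(k), shape (K, W, D)
        theta_hat = xmu.sum(axis=1)            # (K, D): summed over words
        phi_hat = xmu.sum(axis=2)              # (K, W): summed over documents
        total = xmu.sum(axis=(1, 2))           # (K,): summed over all (w, d)

        # Exclude the current (w, d) pair, as the -w, -d notation requires.
        theta_ex = theta_hat[:, None, :] - xmu
        phi_ex = phi_hat[:, :, None] - xmu
        denom_ex = total[:, None, None] - xmu

        new_mu = (theta_ex + alpha) * (phi_ex + beta) / (denom_ex + W * beta)
        new_mu /= new_mu.sum(axis=0, keepdims=True)
        if np.max(np.abs(new_mu - mu)) < tol:
            mu = new_mu
            break
        mu = new_mu

    xmu = x[None, :, :] * mu
    theta_hat, phi_hat = xmu.sum(axis=1), xmu.sum(axis=2)
    theta = (theta_hat + alpha) / (theta_hat.sum(axis=0) + K * alpha)             # Eq. (6)
    phi = (phi_hat + beta) / (phi_hat.sum(axis=1, keepdims=True) + W * beta)      # Eq. (7)
    return theta, phi
```

For example, `theta, phi = bp_lda(x, K=10)` would return the (K, D) document–topic and (K, W) topic–word estimates for a toy count matrix `x`.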
As mentioned above, BP keeps all documents in memory and scans them repeatedly until the algorithm converges, so the memory cost becomes very high when the corpus is large. To overcome this drawback of BP, Zeng proposed Online Belief Propagation (OBP) [12,14,17,18]. In OBP, the corpus is split into batches $x_{w,d}^{s}$, $d \in [1, D_s]$, $w \in [1, \infty]$, $s \in [1, \infty]$, where $D_s$ is the number of documents in the current batch and $s$ is the index of the batch. Each batch is freed from memory after its processing is complete. The products of the batch in memory, the message matrix $\mu^{K \times \mathrm{NNZ}_x}$ and the topic parameter matrix $\hat{\theta}^{K \times D_s}$, depend only on the current batch and are freed as well. The global parameter matrix $\phi^{K \times W}$ remains in memory until the algorithm ends. OBP initializes the messages randomly and then initializes the sufficient statistics $\hat{\theta}_d^{s}(k)$ and $\hat{\phi}_w^{s}(k)$:
$$\hat{\theta}_d^{s}(k) = \sum_{w} x_{w,d}^{s}\, \mu_{w,d}^{s}(k) \tag{8}$$

$$\hat{\phi}_w^{s}(k) = \hat{\phi}_w^{s-1}(k) + \sum_{d} x_{w,d}^{s}\, \mu_{w,d}^{s}(k) \tag{9}$$
where $\hat{\phi}_w^{s-1}(k)$ is the sufficient statistic from the previous batch. After the initialization is completed, OBP runs BP on every batch.
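Since the preview ends here, the sketch below only illustrates how Eqs. (8) and (9) could drive the per-batch loop: the topic–word statistic persists across batches, while the messages and the document–topic statistic are rebuilt for each batch and then discarded. Because the remaining pages are not available, the inner message update simply reuses the form of Eq. (3) as an assumption, and all names are illustrative.

```python
import numpy as np

def obp_stream(batches, K, W, alpha=0.1, beta=0.01, sweeps=20, seed=0):
    """Simplified online BP over a stream of (W, D_s) count-matrix batches.
    phi_hat_prev persists across batches (Eq. (9)); messages and theta_hat
    are per-batch quantities (Eq. (8)) and are discarded afterwards."""
    rng = np.random.default_rng(seed)
    phi_hat_prev = np.zeros((K, W))                 # global statistic kept in memory

    for x_s in batches:                             # x_s: counts of batch s
        mu = rng.random((K, W, x_s.shape[1]))
        mu /= mu.sum(axis=0, keepdims=True)         # random message initialization

        for _ in range(sweeps):                     # BP-style sweeps within the batch
            xmu = x_s[None, :, :] * mu
            theta_hat = xmu.sum(axis=1)               # Eq. (8): per-document statistic
            phi_hat = phi_hat_prev + xmu.sum(axis=2)  # Eq. (9): carries earlier batches
            # Message update in the spirit of Eq. (3), using the batch statistics.
            numer = (theta_hat[:, None, :] - xmu + alpha) * (phi_hat[:, :, None] - xmu + beta)
            denom = phi_hat.sum(axis=1)[:, None, None] - xmu + W * beta
            mu = numer / denom
            mu /= mu.sum(axis=0, keepdims=True)

        phi_hat_prev += (x_s[None, :, :] * mu).sum(axis=2)  # fold the batch into Eq. (9)
        # mu, theta_hat and x_s can now be freed; only phi_hat_prev persists.

    return (phi_hat_prev + beta) / (phi_hat_prev.sum(axis=1, keepdims=True) + W * beta)
```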