word in the entire corpus (generally on a log scale, and again suitably normalized). The end result
is a term-by-document matrix X whose columns contain the tf-idf values for each of the documents
in the corpus. Thus the tf-idf scheme reduces documents of arbitrary length to fixed-length lists of
numbers.
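As a concrete illustration, a small NumPy sketch of building X is given below; the log-scale idf and column-wise normalization used here are one common choice, not the specific weighting or normalization assumed in the text.

import numpy as np

# Toy corpus: counts[t, d] is the number of occurrences of term t in document d.
counts = np.array([[3., 0., 1.],
                   [0., 2., 0.],
                   [1., 1., 4.]])

n_docs = counts.shape[1]
tf = counts / counts.sum(axis=0, keepdims=True)    # per-document term frequencies
df = (counts > 0).sum(axis=1, keepdims=True)       # number of documents containing each term
idf = np.log(n_docs / df)                          # inverse document frequency, on a log scale
X = tf * idf                                       # term-by-document tf-idf matrix
X = X / np.linalg.norm(X, axis=0, keepdims=True)   # normalize each document column

Each column of X is then the fixed-length representation of one document.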
While the tf-idf reduction has some appealing features—notably in its basic identification of sets
of words that are discriminative for documents in the collection—the approach also provides a rela-
tively small amount of reduction in description length and reveals little in the way of inter- or intra-
document statistical structure. To address these shortcomings, IR researchers have proposed several
other dimensionality reduction techniques, most notably latent semantic indexing (LSI) (Deerwester
et al., 1990). LSI uses a singular value decomposition of the X matrix to identify a linear subspace
in the space of tf-idf features that captures most of the variance in the collection. This approach can
achieve significant compression in large collections. Furthermore, Deerwester et al. argue that the
derived features of LSI, which are linear combinations of the original tf-idf features, can capture
some aspects of basic linguistic notions such as synonymy and polysemy.
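A minimal sketch of this decomposition follows; the rank k and the random placeholder matrix are illustrative assumptions rather than prescriptions from the text.

import numpy as np

# Hypothetical term-by-document tf-idf matrix X (50 terms, 20 documents).
X = np.random.default_rng(0).random((50, 20))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 5                                           # number of latent dimensions (illustrative)
X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]     # best rank-k approximation of X
doc_coords = (np.diag(s[:k]) @ Vt[:k, :]).T     # each row: one document in the k-dim LSI subspace

The leading k left singular vectors span the linear subspace referred to above, and documents (and queries) are compared in the reduced k-dimensional representation.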
To substantiate the claims regarding LSI, and to study its relative strengths and weaknesses, it is
useful to develop a generative probabilistic model of text corpora and to study the ability of LSI to
recover aspects of the generative model from data (Papadimitriou et al., 1998). Given a generative
model of text, however, it is not clear why one should adopt the LSI methodology—one can attempt
to proceed more directly, fitting the model to data using maximum likelihood or Bayesian methods.
A significant step forward in this regard was made by Hofmann (1999), who presented the
probabilistic LSI (pLSI) model, also known as the aspect model, as an alternative to LSI. The pLSI
approach, which we describe in detail in Section 4.3, models each word in a document as a sample
from a mixture model, where the mixture components are multinomial random variables that can be
viewed as representations of “topics.” Thus each word is generated from a single topic, and different
words in a document may be generated from different topics. Each document is represented as
a list of mixing proportions for these mixture components and thereby reduced to a probability
distribution on a fixed set of topics. This distribution is the “reduced description” associated with
the document.
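In symbols, writing z for the latent topic variable (the notation here is chosen for illustration), pLSI assigns each word w in a document d the mixture probability
\[
p(w \mid d) = \sum_{z=1}^{k} p(w \mid z)\, p(z \mid d),
\]
so the vector of mixing proportions p(z | d), for z = 1, ..., k, is the reduced description of document d.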
While Hofmann’s work is a useful step toward probabilistic modeling of text, it is incomplete
in that it provides no probabilistic model at the level of documents. In pLSI, each document is
represented as a list of numbers (the mixing proportions for topics), and there is no generative
probabilistic model for these numbers. This leads to two problems: (1) the number of parameters
in the model grows linearly with the size of the corpus, which causes serious overfitting, and
(2) it is not clear how to assign probability to a document outside of the training set.
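To make point (1) concrete: with k topics, a vocabulary of V words, and M training documents, the pLSI parameters are the k topic multinomials p(w | z) together with the M per-document mixing vectors p(z | d), roughly
\[
kV + kM
\]
values in total, so the parameter count grows linearly with the number of documents M.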
To see how to proceed beyond pLSI, let us consider the fundamental probabilistic assumptions
underlying the class of dimensionality reduction methods that includes LSI and pLSI. All of these
methods are based on the “bag-of-words” assumption—that the order of words in a document can
be neglected. In the language of probability theory, this is an assumption of exchangeability for the
words in a document (Aldous, 1985). Moreover, although less often stated formally, these methods
also assume that documents are exchangeable; the specific ordering of the documents in a corpus
can also be neglected.
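Formally, a finite sequence of random variables w_1, ..., w_N is exchangeable if its joint distribution is invariant to permutation: for every permutation \pi of the integers 1, ..., N,
\[
p(w_1, \dots, w_N) = p(w_{\pi(1)}, \dots, w_{\pi(N)}).
\]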
A classic representation theorem due to de Finetti (1990) establishes that any collection of ex-
changeable random variables has a representation as a mixture distribution—in general an infinite
mixture. Thus, if we wish to consider exchangeable representations for documents and words, we
need to consider mixture models that capture the exchangeability of both words and documents.
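In particular, for an infinitely exchangeable sequence of words, de Finetti's theorem gives a representation of the form
\[
p(w_1, \dots, w_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) d\theta,
\]
where \theta is a random parameter of a distribution over words; conditioned on \theta, the words are independent and identically distributed.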