
### Unsupervised Learning by Probabilistic Latent Semantic Analysis

#### Introduction

In the realm of machine learning and natural language processing, the development of algorithms that can process text and natural language automatically has been one of the greatest challenges. With the advent of the World Wide Web, this challenge has become even more significant due to the vast amount of textual data available online. The need for intelligent systems capable of managing, filtering, and searching through huge repositories of text documents has led to the creation of a new industry. This paper introduces a novel statistical method for factor analysis of binary and count data known as **Probabilistic Latent Semantic Analysis (PLSA)**.

#### Probabilistic Latent Semantic Analysis (PLSA)

**Probabilistic Latent Semantic Analysis** is a technique that builds upon the foundations of Latent Semantic Analysis (LSA) but offers a more principled approach based on statistical inference. Unlike LSA, which uses linear algebra techniques and performs Singular Value Decomposition (SVD) on co-occurrence tables, PLSA employs a generative latent class model to perform probabilistic mixture decomposition.

#### Key Features of PLSA

- **Generative Latent Class Model**: PLSA uses a generative model to decompose the observed data into latent factors. This model allows for a more intuitive understanding of the underlying structure in the data.
- **Temperature-Controlled EM Algorithm**: For fitting the model, a temperature-controlled version of the Expectation-Maximization (EM) algorithm is used. This variant of the EM algorithm has shown excellent performance in practice, especially in terms of convergence speed and stability.
- **Statistical Inference Foundation**: PLSA is firmly grounded in statistical inference, providing a solid theoretical basis for the technique. This makes it more robust and reliable compared to other methods like LSA.

#### Applications of PLSA

- **Information Retrieval**: PLSA can be used to improve document retrieval by identifying latent topics or themes within a collection of documents. This can lead to more accurate and relevant search results.
- **Natural Language Processing**: In NLP, PLSA can help in tasks such as text classification, sentiment analysis, and topic modeling. It can also aid in understanding the semantic relationships between words and phrases.
- **Machine Learning from Text**: PLSA is useful for unsupervised learning tasks involving text data, such as clustering documents or extracting meaningful features from text corpora.
- **Related Areas**: Other applications include automated document indexing, document summarization, and recommendation systems.

#### Comparison with Standard LSA

The paper presents perplexity results for different types of text and linguistic data collections, demonstrating substantial and consistent improvements of the probabilistic method over standard Latent Semantic Analysis. Perplexity is a measure of how well a model predicts a sample. Lower perplexity values indicate better model performance.

#### Conclusion

**Probabilistic Latent Semantic Analysis** represents a significant advancement in the field of unsupervised learning and natural language processing. By employing a generative latent class model and a temperature-controlled EM algorithm, PLSA provides a more principled and statistically sound approach to analyzing binary and count data. Its applications span various domains, including information retrieval, natural language processing, and machine learning from text. The empirical results presented in the paper highlight the effectiveness of PLSA in improving upon traditional methods like LSA, making it a valuable tool for researchers and practitioners working with large text corpora.


Machine Learning, 42, 177–196, 2001

© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

Unsupervised Learning by Probabilistic Latent Semantic Analysis

THOMAS HOFMANN th@cs.brown.edu

Department of Computer Science, Brown University, Providence, RI 02912, USA

Editor: Douglas Fisher

Abstract. This paper presents a novel statistical method for factor analysis of binary and count data which is closely related to a technique known as Latent Semantic Analysis. In contrast to the latter method, which stems from linear algebra and performs a Singular Value Decomposition of co-occurrence tables, the proposed technique uses a generative latent class model to perform a probabilistic mixture decomposition. This results in a more principled approach with a solid foundation in statistical inference. More precisely, we propose to make use of a temperature controlled version of the Expectation Maximization algorithm for model fitting, which has shown excellent performance in practice. Probabilistic Latent Semantic Analysis has many applications, most prominently in information retrieval, natural language processing, machine learning from text, and in related areas. The paper presents perplexity results for different types of text and linguistic data collections and discusses an application in automated document indexing. The experiments indicate substantial and consistent improvements of the probabilistic method over standard Latent Semantic Analysis.

Keywords: unsupervised learning, latent class models, mixture models, dimension reduction, EM algorithm, information retrieval, natural language processing, language modeling

1. Introduction

The development of algorithms that enable computers to automatically process text and natural language has always been one of the great challenges in Artificial Intelligence. In recent years, this research direction has increasingly gained importance, not least due to the advent of the World Wide Web, which has amplified the need for intelligent text and language processing. The demand for computer systems that manage, filter and search through huge repositories of text documents has created a whole new industry, as has the demand for smart and personalized interfaces. Consequently, any substantial progress in this domain will have a strong impact on numerous applications ranging from information retrieval, information filtering, and intelligent agents, to speech recognition, machine translation, and human-machine interaction.

There are two schools of thought: On one side, there is the traditional linguistics school, which assumes that linguistic theory and logic can instruct computers to “learn” a language. On the other side, there is a statistically-oriented community, which believes that machines can learn (about) natural language from training data such as document collections and text corpora. This paper follows the latter approach and presents a novel method for learning the meaning of words in a purely data-driven fashion. The proposed unsupervised learning technique called Probabilistic Latent Semantic Analysis (PLSA) aims at identifying and distinguishing between different contexts of word usage without recourse to a dictionary or thesaurus. This has at least two important implications: Firstly, it allows us to disambiguate polysems, i.e., words with multiple meanings, and essentially every word is polysemous. Secondly, it reveals topical similarities by grouping together words that are part of a common context. As a special case this includes synonyms, i.e., words with identical or almost identical meaning.

As the name PLSA indicates, our approach has been largely inspired and influenced by Latent Semantic Analysis (LSA) (Deerwester et al., 1990), although there are also notable differences. The key idea in LSA is to map high-dimensional count vectors, such as term-frequency (tf) vectors arising in the vector space representation of text documents (Salton & McGill, 1983), to a lower dimensional representation in a so-called latent semantic space. In doing so, LSA aims at finding a data mapping which provides information beyond the lexical level of word occurrences. The ultimate goal is to represent semantic relations between words and/or documents in terms of their proximity in the semantic space. Due to its generality, LSA has proven to be a valuable analysis tool for many different problems in practice and thus has a wide range of possible applications (e.g., Deerwester et al., 1990; Foltz & Dumais, 1992; Landauer & Dumais, 1997; Wolfe et al., 1998; Bellegarda, 1998).

Despite its success, there are a number of shortcomings of LSA. First of all, the methodological foundation remains to a large extent unsatisfactory and incomplete. The original motivation for LSA stems from linear algebra and is based on an $L_2$-optimal approximation of matrices of word counts based on a Singular Value Decomposition (SVD) (Berry, Dumais, & O'Brien, 1995). While SVD by itself is a well-understood and principled method (Golub & Van Loan, 1996), its application to count data in LSA remains somewhat ad hoc. From a statistical point of view, the utilization of an $L_2$-norm approximation principle is reminiscent of a Gaussian noise assumption which is hard to justify in the context of count variables. On a deeper, conceptual level the representation obtained by LSA is unable to handle polysemy. For example, it is easy to show that in LSA the coordinates of a word in the latent space can be written as a linear superposition of the coordinates of the documents that contain the word. The superposition principle, however, is unable to explicitly capture multiple senses of a word, nor does it take into account that every word occurrence is typically intended to refer to only one meaning at a time.
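The link between the $L_2$ criterion and Gaussian noise can be made explicit with a short derivation. This is a standard argument sketched here for illustration, not a formula taken from the paper; the symbols $\tilde{N}$ (a candidate low-rank table with entries $\tilde{n}_{ij}$), $K$ (the target rank) and $\sigma$ (a noise scale) are assumptions of this sketch, and $n(d_i, w_j)$ denotes the count of term $w_j$ in document $d_i$.

```latex
% Rank-K least-squares fit of the count table (entrywise L_2, i.e. Frobenius, norm)
\hat{N} = \operatorname*{arg\,min}_{\tilde{N}\,:\,\operatorname{rank}(\tilde{N}) \le K}
          \sum_{i,j} \bigl( n(d_i, w_j) - \tilde{n}_{ij} \bigr)^2

% Additive Gaussian noise model: n(d_i, w_j) = \tilde{n}_{ij} + \epsilon_{ij},
% with \epsilon_{ij} \sim \mathcal{N}(0, \sigma^2) drawn independently
-\log p(N \mid \tilde{N})
  = \frac{1}{2\sigma^2} \sum_{i,j} \bigl( n(d_i, w_j) - \tilde{n}_{ij} \bigr)^2
    + \text{const}
```

Minimizing the squared error is therefore the same as maximizing a Gaussian likelihood, which permits negative and non-integer values and so fits count data poorly.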

Probabilistic Latent Semantic Analysis (PLSA) stems from a statistical view of LSA. In contrast to standard LSA, PLSA defines a proper generative data model. This has several advantages: On the most general level it implies that standard techniques from statistics can be applied for model fitting, model selection and complexity control. For example, one can assess the quality of a PLSA model by measuring its predictive performance, e.g., with the help of cross-validation. More specifically, PLSA associates a latent context variable with each word occurrence, which explicitly accounts for polysemy. A more technical discussion of the differences between LSA and PLSA can be found in Section 3.3.
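To make the idea of a generative model with one latent context variable per word occurrence concrete, the following is a minimal sketch of EM fitting for a latent class model. The model equations are not stated in this part of the paper, so the sketch assumes the commonly used parameterization $P(d, w) = \sum_z P(z) P(d|z) P(w|z)$ and one common form of the tempered E-step, in which the likelihood factor is raised to a power `beta` (with `beta=1.0` giving plain EM). The function names, the toy count table and all default values are hypothetical.

```python
import numpy as np


def plsa_em(counts, n_topics, n_iter=50, beta=1.0, seed=0):
    """Fit P(z), P(d|z), P(w|z) to a document-by-term count table with EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = counts.shape
    # Random positive initialization, normalized to valid distributions.
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_d_z = rng.random((n_topics, n_docs))
    p_d_z /= p_d_z.sum(axis=1, keepdims=True)
    p_w_z = rng.random((n_topics, n_terms))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    tiny = np.finfo(float).tiny  # guards against 0/0 after underflow
    for _ in range(n_iter):
        # E-step: posterior over the latent variable for every (d, w) pair.
        # Shape (topics, docs, terms); beta < 1 flattens the posterior.
        joint = p_z[:, None, None] * (p_d_z[:, :, None] * p_w_z[:, None, :]) ** beta
        post = joint / np.maximum(joint.sum(axis=0, keepdims=True), tiny)
        # M-step: re-estimate parameters from posterior-weighted counts.
        weighted = counts[None, :, :] * post
        p_w_z = weighted.sum(axis=1)
        p_d_z = weighted.sum(axis=2)
        p_z = weighted.sum(axis=(1, 2))
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_d_z /= p_d_z.sum(axis=1, keepdims=True)
        p_z /= p_z.sum()
    return p_z, p_d_z, p_w_z


def log_likelihood(counts, p_z, p_d_z, p_w_z):
    """Log-likelihood of the observed counts under the fitted mixture."""
    p_dw = np.einsum("z,zd,zw->dw", p_z, p_d_z, p_w_z)
    mask = counts > 0
    return float((counts[mask] * np.log(p_dw[mask])).sum())


# Toy table: documents 0-1 and 2-3 mostly use different halves of the vocabulary.
counts = np.array(
    [
        [4.0, 3.0, 1.0, 0.0],
        [3.0, 4.0, 0.0, 1.0],
        [1.0, 0.0, 5.0, 2.0],
        [0.0, 1.0, 2.0, 5.0],
    ]
)
p_z, p_d_z, p_w_z = plsa_em(counts, n_topics=2)
print(log_likelihood(counts, p_z, p_d_z, p_w_z))
```

The log-likelihood printed here is an in-sample quantity; the predictive assessment mentioned above would instead evaluate it on held-out data, for example within a cross-validation loop.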


2. Latent semantic analysis

2.1. Count data and co-occurrence tables

LSA can be applied to any type of count data over a discrete dyadic domain, so-called two-mode data (Hofmann, Puzicha & Jordan, 1999). Yet, since the most prominent application of LSA is in the analysis and retrieval of text documents, we focus on this setting. Suppose therefore that we are given a collection of text documents $D = \{d_1, \ldots, d_N\}$ with terms from a vocabulary $W = \{w_1, \ldots, w_M\}$. By ignoring the sequential order in which words occur in a document, one may summarize the data in a rectangular $N \times M$ co-occurrence table of counts $N = (n(d_i, w_j))_{ij}$, where $n(d_i, w_j)$ denotes the number of times the term $w_j$ occurred in document $d_i$. In this particular case, $N$ is also called the term-document matrix and the rows/columns of $N$ are referred to as document/term vectors, respectively. The key assumption is that the simplified ‘bag-of-words’ or vector-space representation (Salton & McGill, 1983) of documents will in many cases preserve most of the relevant information, e.g., for tasks like text retrieval based on keywords.
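The construction of the co-occurrence table can be sketched in a few lines. The example below is illustrative only: the function name `cooccurrence_table`, the whitespace tokenization and the three toy documents are assumptions, not part of the paper.

```python
from collections import Counter

import numpy as np


def cooccurrence_table(documents):
    """Build the documents-by-terms table of counts n(d_i, w_j)."""
    # Vocabulary W = {w_1, ..., w_M}, sorted so the column order is reproducible.
    vocabulary = sorted({term for doc in documents for term in doc})
    column = {term: j for j, term in enumerate(vocabulary)}
    counts = np.zeros((len(documents), len(vocabulary)), dtype=int)
    for i, doc in enumerate(documents):
        # Word order is discarded: only per-document term counts survive.
        for term, n in Counter(doc).items():
            counts[i, column[term]] = n
    return counts, vocabulary


docs = [
    "the car drives on the road".split(),
    "the automobile is parked".split(),
    "a bank of the river".split(),
]
counts, vocabulary = cooccurrence_table(docs)

# Fraction of non-zero entries; on realistic collections this is very small.
density = np.count_nonzero(counts) / counts.size
print(counts.shape, round(density, 3))
```

Each row of `counts` is a document vector and each column a term vector, matching the row/column convention described above.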

The co-occurrence table representation immediately reveals the problem of data sparseness (Katz, 1987), also known as the zero-frequency problem (Witten & Bell, 1991). A typical term-document matrix derived from short articles, text summaries or abstracts may only have a small fraction of non-zero entries (typically well below 1%), which reflects the fact that only very few of the words in the vocabulary are actually used in any single document. This has consequences, for example, in applications that are based on matching queries with documents or evaluating similarities between documents by comparing common terms. The likelihood to find many common terms even in closely related articles may be small, just because they might not use exactly the same terms.

For example, most of the matching functions utilized in this context are based on similarity functions that rely on inner products between pairs of document vectors. The encountered problems are two-fold: On one hand, one has to account for synonyms in order not to underestimate the true similarity of documents. On the other hand, one has to deal with polysems to avoid overestimating the true similarity between documents by counting common terms that are used in different meanings. Both problems may lead to inappropriate lexical matching scores which may not reflect the ‘true’ similarity hidden in the semantics of words.
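Both failure modes of inner-product matching show up even on a tiny example. The sketch below uses cosine similarity (a normalized inner product) as a stand-in for the matching functions mentioned above; the six-term vocabulary and the toy documents are hypothetical.

```python
import numpy as np

# Hypothetical six-term vocabulary, chosen only to make the two effects visible.
vocabulary = ["automobile", "bank", "car", "loan", "river", "water"]


def tf_vector(tokens):
    """Term-frequency vector of a tokenized document over the vocabulary."""
    return np.array([tokens.count(term) for term in vocabulary], dtype=float)


def cosine(a, b):
    """Normalized inner product between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Synonyms: same topic, different words, so the inner product is zero.
synonym_score = cosine(tf_vector(["car"]), tf_vector(["automobile"]))

# Polysemy: different topics that happen to share the ambiguous term "bank".
polysemy_score = cosine(tf_vector(["bank", "loan"]), tf_vector(["bank", "river"]))

print(synonym_score, polysemy_score)
```

The two synonymous documents receive a score of zero, while the financial and the riverside document receive a score of one half purely because they share a surface form.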

2.2. Latent semantic analysis by singular value decomposition

As mentioned in the introduction, the key idea of LSA is to map documents, and by symmetry terms, to a vector space of reduced dimensionality, the latent semantic space, which in a typical application in document indexing is chosen to have on the order of 100 to 300 dimensions (Deerwester et al., 1990; Dumais, 1995). The mapping of the given document/term vectors to their latent space representatives is restricted to be linear and is based on a decomposition of the co-occurrence matrix by SVD. One thus starts with the standard SVD given by

$$N = U \Sigma V^{\top}, \tag{1}$$
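As an illustration of Eq. (1) and of the reduction to a low-dimensional latent space, the sketch below computes the SVD with NumPy and keeps the $k$ largest singular values. Keeping the top $k$ components is the usual way to obtain the reduced space and is assumed here; scaling the coordinates by the singular values is one common convention among several. The function name `lsa_project` and the toy matrix are hypothetical.

```python
import numpy as np


def lsa_project(counts, k):
    """Rank-k LSA approximation of a count table via truncated SVD."""
    # Thin SVD: counts = U @ diag(s) @ Vt, singular values sorted descending.
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    approx = (U_k * s_k) @ Vt_k  # rank-k approximation, optimal in the L2 sense
    doc_coords = U_k * s_k       # one k-dimensional row per document
    term_coords = Vt_k.T * s_k   # one k-dimensional row per term
    return approx, doc_coords, term_coords, s


counts = np.array(
    [
        [2.0, 1.0, 0.0, 0.0],
        [1.0, 2.0, 0.0, 0.0],
        [0.0, 0.0, 3.0, 1.0],
        [0.0, 1.0, 1.0, 2.0],
    ]
)
approx, doc_coords, term_coords, s = lsa_project(counts, k=2)

# Frobenius norm of the reconstruction error left by the discarded components.
residual = np.linalg.norm(counts - approx)
print(doc_coords.shape, term_coords.shape, residual)
```

The residual equals the root of the sum of the squared discarded singular values, which is the sense in which the truncated SVD is the best rank-$k$ approximation.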
