Language Modeling using PLSA-Based Topic HMM
Atsushi SAKO¹, Tetsuya TAKIGUCHI², Yasuo ARIKI²
¹Department of Informatics and Electronics
²Department of Computer and System Engineering
Kobe University, 1-1 Rokkodai, Nada, Kobe, 657-8501, JAPAN
sakoats@me.cs.scitec.kobe-u.ac.jp, {takigu, ariki}@kobe-u.ac.jp
Abstract
In this paper, we propose a PLSA-based language model for sports live speech. The model is implemented with the unigram rescaling technique, which combines a topic model and an n-gram. In the conventional method, unigram rescaling is performed with a topic distribution estimated from a history of recognized transcription. This method can improve performance; however, it cannot express topic transition. Incorporating the concept of topic transition is expected to improve recognition performance further. The proposed method therefore employs a "Topic HMM", instead of a history, to estimate the topic distribution. The Topic HMM is a Discrete Ergodic HMM that expresses typical topic distributions and topic transition probabilities. Word accuracy results indicate an improvement over a tri-gram model and the conventional PLSA-based method using a recognized history.
Index Terms: language modeling, text model, PLSA, HMM,
speech recognition
1. Introduction
Recently, large quantities of multimedia content have become available through digital TV broadcasts and the WWW. In order to retrieve exactly what we want to know from this content, automatic extraction of meta-information, or structuring, is strongly required. Sophisticated automatic speech recognition (ASR) plays an important role in extracting this kind of information because accurate transcription is indispensable. The purpose of this study is to improve speech recognition accuracy for automatically transcribing sports live speech, especially baseball commentary speech, in order to produce closed captions and to structure sports games for highlight scene retrieval.
As the sports live speech, we used radio speech rather than TV speech because it contains much more information. However, radio speech is rather fast and noisy. Furthermore, it is disfluent due to rephrasing, repetition, mistakes and grammatical deviations caused by the spontaneous speaking style. To solve these problems, we previously proposed adaptation techniques for the acoustic model and language model [1] and a situation-based language model [2].
In order to further improve speech recognition accuracy, we focus on topic-based language models in this paper. Several topic-based language models have been studied: the stochastic switching (SS) language model [3], the Latent Semantic Analysis (LSA) based language model [4] and the PLSA-based language model using the unigram rescaling technique [5]. The SS N-gram requires a large corpus; however, it is difficult to create a large corpus for sports tasks. PLSA is a probabilistic counterpart of LSA and is more compatible with an N-gram than LSA. Thus, in this paper, we focus especially on PLSA-based models.
The conventional PLSA-based model estimates a topic distribution using a "history" of recognized transcription. However, it cannot express topic transition. By considering topic transition, recognition accuracy can be improved because a proper language model can be used for each topic. Consequently, we propose a new language model based on PLSA. The model expresses typical distributions of topics and transition probabilities between topics. We implemented this model as a Discrete Ergodic HMM, which has a discrete distribution in each state and transition probabilities between states. We call this HMM the "Topic HMM". Unigram probabilities are obtained from the distribution of a state through the algorithm described in Section 2. Moreover, tri-gram probabilities are obtained by the unigram rescaling technique: for each state of the Topic HMM, a tri-gram is computed as a topic-dependent language model, as sketched below. The experimental results show that the Topic HMM improves word accuracy.
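As a minimal sketch of the unigram rescaling idea (the function names and the plain, exponent-free form are our own illustration, not the exact formulation of [5]), the topic-adapted tri-gram is the baseline tri-gram scaled by how much the topic model boosts each word relative to the background unigram:

```python
def rescaled_trigram(p_ngram, p_plsa, p_unigram, history, vocab):
    """Unigram rescaling (sketch): P'(w|h) ∝ P(w|h) * P_plsa(w) / P(w).

    p_ngram(w, history) -- baseline tri-gram probability P(w | h)
    p_plsa(w)           -- topic unigram P(w | d) = sum_z P(w|z) P(z|d)
    p_unigram(w)        -- background corpus unigram P(w)
    """
    scores = {w: p_ngram(w, history) * p_plsa(w) / p_unigram(w)
              for w in vocab}
    norm = sum(scores.values())  # renormalize over the vocabulary
    return {w: s / norm for w, s in scores.items()}
```

Some formulations raise the ratio P_plsa(w) / P(w) to a tuning exponent before renormalizing; we omit that here for brevity.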
2. PLSA-Based Language Modeling
Probabilistic Latent Semantic Analysis (PLSA) [6] is a topic decomposition method for documents in a corpus. It is used to analyze the topic distribution of each document and the unigram distribution within each topic. The model is estimated from the co-occurrence probability of words and documents. Let d denote a document from a text corpus, w denote a word, and z denote a latent variable that represents a topic. Under the assumption that a document and a word are conditionally independent given the latent variable, the probability of generating a word from a document is
P(w|d) = \sum_{z \in Z} P(w|z) \, P(z|d). \qquad (1)
The P(w|z) parameter is a unigram probability conditioned on a latent variable. The P(z|d) parameter is a topic probability over each document. Note that, to distinguish a latent topic of PLSA from a topic of the Topic HMM, we call a topic of PLSA "a topic" and a topic of the Topic HMM "a state". Thus, the Topic HMM treats states as actual topics, each of which is composed of latent topics. Additionally, the topic distribution is defined as a vector of topic probabilities; namely, the topic distribution for document d is (P(z_1|d), \ldots, P(z_K|d))^T, where K is the number of latent topics.
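To make the model concrete, the following sketch (our own illustration; the array shapes, sizes and random initialization are assumptions, not the paper's setup) computes Eq. (1) and the topic-distribution vector with numpy:

```python
import numpy as np

# Illustrative shapes (not from the paper):
#   P_w_given_z : (V, K) array, column k holds the unigram P(w | z_k)
#   P_z_given_d : (K, D) array, column j holds the topic mixture P(z | d_j)
rng = np.random.default_rng(0)
V, K, D = 1000, 20, 50
P_w_given_z = rng.random((V, K)); P_w_given_z /= P_w_given_z.sum(axis=0)
P_z_given_d = rng.random((K, D)); P_z_given_d /= P_z_given_d.sum(axis=0)

# Eq. (1): P(w | d) = sum_z P(w | z) P(z | d), for all (w, d) pairs at once
P_w_given_d = P_w_given_z @ P_z_given_d  # (V, D)

# Topic-distribution vector (P(z_1|d), ..., P(z_K|d))^T for document 0
topic_vector = P_z_given_d[:, 0]
```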
Each parameter is estimated by the Expectation Maximiza-
tion (EM) algorithm. The E-step is
P(z|d, w) = \frac{P(w|z) \, P(z|d)}{\sum_{z' \in Z} P(w|z') \, P(z'|d)}, \qquad (2)
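A vectorized form of the E-step in Eq. (2), continuing the array layout assumed in the previous sketch (again our own illustration, with the standard PLSA M-step omitted), might look like:

```python
import numpy as np

def plsa_e_step(P_w_given_z, P_z_given_d):
    """E-step of Eq. (2) for all (z, d, w) triples at once.

    P_w_given_z : (V, K) array holding P(w | z)
    P_z_given_d : (K, D) array holding P(z | d)
    Returns a (K, D, V) array whose [k, j, i] entry is P(z_k | d_j, w_i).
    """
    # Numerator P(w|z) P(z|d), broadcast to shape (K, D, V)
    num = P_w_given_z.T[:, None, :] * P_z_given_d[:, :, None]
    # Denominator: marginalize the numerator over the latent topics z'
    den = num.sum(axis=0, keepdims=True)
    return num / den
```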