Language Modeling using PLSA-Based Topic HMM
Atsushi SAKO¹, Tetsuya TAKIGUCHI², Yasuo ARIKI²
¹Department of Informatics and Electronics
²Department of Computer and System Engineering
Kobe University, 1-1 Rokkodai, Nada, Kobe, 657-8501, JAPAN
sakoats@me.cs.scitec.kobe-u.ac.jp, {takigu, ariki}@kobe-u.ac.jp
Abstract
In this paper, we propose a PLSA-based language model for sports live speech. The model is implemented with the unigram rescaling technique, which combines a topic model and an n-gram. In the conventional method, unigram rescaling is performed with a topic distribution estimated from a history of recognized transcription. This method can improve performance; however, it cannot express topic transition. Incorporating the concept of topic transition is expected to improve recognition performance further. The proposed method therefore employs a "Topic HMM", instead of a history, to estimate the topic distribution. The Topic HMM is a Discrete Ergodic HMM that expresses typical topic distributions and topic transition probabilities. Word accuracy results indicate an improvement over a tri-gram model and the conventional PLSA-based method using a recognized history.
Index Terms: language modeling, text model, PLSA, HMM,
speech recognition
1. Introduction
Recently, large quantities of multimedia content have become available through digital TV broadcasts and the WWW. In order to retrieve exactly what we want to know from this content, automatic extraction of meta-information, or structuring, is strongly required. Sophisticated automatic speech recognition (ASR) plays an important role in extracting this kind of information because accurate transcription is indispensable. The purpose of this study is to improve speech recognition accuracy for automatically transcribing sports live speech, especially baseball commentary speech, in order to produce closed captions and to structure sports games for highlight scene retrieval.
As the sports live speech, we used radio speech rather than TV speech because it contains much more information. However, radio speech is rather fast and noisy. Furthermore, it is disfluent due to rephrasing, repetition, mistakes and grammatical deviations caused by the spontaneous speaking style. To solve these problems, we previously proposed adaptation techniques for the acoustic model and language model [1] and a situation-based language model [2].
In order to further improve speech recognition accuracy, we focus on topic-based language models in this paper. Several topic-based language models have been studied: the stochastic switching (SS) language model [3], the Latent Semantic Analysis (LSA) based language model [4] and the PLSA-based language model using the unigram rescaling technique [5]. The SS N-gram requires a large corpus; however, it is difficult to create a large corpus for sports tasks. PLSA is a probabilistic counterpart of LSA and is more compatible with an N-gram than LSA. Thus, in this paper, we focus especially on PLSA-based models.
The conventional PLSA-based model estimates a topic distribution using a "history" of recognized transcription. However, it cannot express topic transition. By considering topic transition, recognition accuracy can be improved because a proper language model can be used for each topic. Consequently, we propose a new language model based on PLSA. The model expresses typical distributions of topics and transition probabilities between topics. We implemented this model as a Discrete Ergodic HMM, which has a discrete distribution in each state and transition probabilities between states. We call this HMM the "Topic HMM". Unigram probabilities are obtained from the distribution of a state through the algorithm described in Section 2. Moreover, tri-gram probabilities are obtained by the unigram rescaling technique: for each state of the Topic HMM, a tri-gram is computed as a topic-dependent language model, as sketched below. The experimental results show that the Topic HMM improves word accuracy.
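As a minimal sketch of the unigram rescaling idea (the function names and the plain, exponent-free form are our own illustration, not the exact formulation of [5]), the topic-adapted tri-gram is the baseline tri-gram scaled by how much the topic model boosts each word relative to the background unigram:

```python
def rescaled_trigram(p_ngram, p_plsa, p_unigram, history, vocab):
    """Unigram rescaling (sketch): P'(w|h) ∝ P(w|h) * P_plsa(w) / P(w).

    p_ngram(w, history) -- baseline tri-gram probability P(w | h)
    p_plsa(w)           -- topic unigram P(w | d) = sum_z P(w|z) P(z|d)
    p_unigram(w)        -- background corpus unigram P(w)
    """
    scores = {w: p_ngram(w, history) * p_plsa(w) / p_unigram(w)
              for w in vocab}
    norm = sum(scores.values())  # renormalize over the vocabulary
    return {w: s / norm for w, s in scores.items()}
```

Some formulations raise the ratio P_plsa(w) / P(w) to a tuning exponent before renormalizing; we omit that here for brevity.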
2. PLSA-Based Language Modeling
Probabilistic Latent Semantic Analysis (PLSA) [6] is a topic decomposition method for documents in a corpus. It is used to analyze the topic distribution of each document and the unigram distribution within each topic. The model is estimated from the co-occurrence probability of words and documents. Let d denote a document from a text corpus, w denote a word, and z denote a latent variable that represents a topic. Under the assumption that a document and a word are conditionally independent given the latent variable, the probability of generating a word from a document is
P(w|d) = \sum_{z \in Z} P(w|z) \, P(z|d). \qquad (1)
The P(w|z) parameter is a unigram probability conditioned on a latent variable. The P(z|d) parameter is a topic probability over each document. Note that, to distinguish a latent topic of PLSA from a topic of the Topic HMM, we call a topic of PLSA "a topic" and a topic of the Topic HMM "a state". Thus, the Topic HMM treats states as actual topics, each of which is composed of latent topics. Additionally, the topic distribution is defined as a vector of topic probabilities; namely, the topic distribution for document d is (P(z_1|d), \ldots, P(z_K|d))^T, where K is the number of latent topics.
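To make the model concrete, the following sketch (our own illustration; the array shapes, sizes and random initialization are assumptions, not the paper's setup) computes Eq. (1) and the topic-distribution vector with numpy:

```python
import numpy as np

# Illustrative shapes (not from the paper):
#   P_w_given_z : (V, K) array, column k holds the unigram P(w | z_k)
#   P_z_given_d : (K, D) array, column j holds the topic mixture P(z | d_j)
rng = np.random.default_rng(0)
V, K, D = 1000, 20, 50
P_w_given_z = rng.random((V, K)); P_w_given_z /= P_w_given_z.sum(axis=0)
P_z_given_d = rng.random((K, D)); P_z_given_d /= P_z_given_d.sum(axis=0)

# Eq. (1): P(w | d) = sum_z P(w | z) P(z | d), for all (w, d) pairs at once
P_w_given_d = P_w_given_z @ P_z_given_d  # (V, D)

# Topic-distribution vector (P(z_1|d), ..., P(z_K|d))^T for document 0
topic_vector = P_z_given_d[:, 0]
```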
Each parameter is estimated by the Expectation Maximiza-
tion (EM) algorithm. The E-step is
P(z|d, w) = \frac{P(w|z) \, P(z|d)}{\sum_{z' \in Z} P(w|z') \, P(z'|d)}, \qquad (2)
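A vectorized form of the E-step in Eq. (2), continuing the array layout assumed in the previous sketch (again our own illustration, with the standard PLSA M-step omitted), might look like:

```python
import numpy as np

def plsa_e_step(P_w_given_z, P_z_given_d):
    """E-step of Eq. (2) for all (z, d, w) triples at once.

    P_w_given_z : (V, K) array holding P(w | z)
    P_z_given_d : (K, D) array holding P(z | d)
    Returns a (K, D, V) array whose [k, j, i] entry is P(z_k | d_j, w_i).
    """
    # Numerator P(w|z) P(z|d), broadcast to shape (K, D, V)
    num = P_w_given_z.T[:, None, :] * P_z_given_d[:, :, None]
    # Denominator: marginalize the numerator over the latent topics z'
    den = num.sum(axis=0, keepdims=True)
    return num / den
```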