Graphical Model Architectures for Speech Recognition

[Jeff A. Bilmes and Chris Bartels]

[A powerful representation paradigm for both standard and novel speech recognition architectures]

IEEE Signal Processing Magazine, September 2005
A graph is a two-dimensional visual formalism that can be used to describe many different phenomena. Graphs are used in a wide variety of fields, including computer science, data and control flow, entity relationships and social networks, Petri and neural networks, software/hardware visualization, and parallel computation. The popularity of graphs is in large part due to their ability to represent complex situations in an intuitive and visually appealing way.
Statistical graphical models are a family of graphical abstractions of statistical models where important
aspects (e.g., factorization) of such models are represented using graphs. In recent years, and due to a wide
range of research, it has become apparent that graphical models offer a mathematically formal but widely
flexible means for solving many of the problems in speech and language processing. Graphs are able to rep-
resent events at the very high level (such as relationships between linguistic classes), at the very low level
(such as correlations between spectral features or acoustic landmarks), and at all levels in between (such as
lexical pronunciation). A fundamental advantage of graphical models is rapidity. With a graphical model, it
is possible to quickly express a novel, complicated idea in an intuitive, concise, and mathematically precise
way, and to speedily and visually communicate that idea between colleagues. Moreover, with the right soft-
ware, it is possible to rapidly prototype that idea on a standard desktop workstation.
This article discusses the foundations of the use of graphical models for speech recognition [16], [25],
[26], [42], [47], giving detailed accounts of some of the more successful cases. Our discussion will, in
particular, employ dynamic Bayesian networks (DBNs) and a
DBN extension using the Graphical Model Toolkit’s (GMTK’s)
basic template, a dynamic graphical model representation that
is more suitable for speech and language systems. While this
article will concentrate on speech recognition, it should be
noted that many of the ideas presented here are also applicable
to natural language processing and general time-series analysis.
This article assumes some familiarity with basic speech-to-text concepts [16], [25], [26], [42], [46], [47] and the basics of graphical models [28], [29], [32], [41], including notions such as hidden and observed variables, evidence, and factorization and conditional independence. Moreover, we will use the MATLAB-like notation $1{:}N$ to denote the set of integers $\{1, 2, \ldots, N\}$. A set of $N$ random variables (RVs) is denoted as $X_{1:N}$. Given any subset $S \subseteq 1{:}N$, where $S = \{S_1, S_2, \ldots, S_{|S|}\}$, the corresponding subset of random variables is denoted as $X_S = \{X_{S_1}, X_{S_2}, \ldots, X_{S_{|S|}}\}$. Lastly, we use upper-case letters (such as $X$ and $Q$) to refer to random variables, and lower-case letters (such as $x$ and $q$) to refer to random variable values.
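As a quick illustration of this indexing notation, here is a tiny sketch of our own (the values are hypothetical, and the indices are shifted to NumPy's 0-based convention):

```python
# A tiny sketch (hypothetical values) mimicking the notation above:
# X_{1:N} is a collection of N random variables and X_S selects the
# subset indexed by S. NumPy uses 0-based indices, so S is shifted.
import numpy as np

X = np.array([4.2, 0.7, 3.1, 5.5, 1.9])  # stands in for X_{1:5}
S = [0, 2, 4]                             # corresponds to S = {1, 3, 5}
print(X[S])                               # X_S = {X_1, X_3, X_5}
```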
DYNAMIC GRAPHICAL MODELS
Graphical models [27], [29], [32], [41] are a set of formalisms,
each of which describes families of probability distributions.
There are many different types of graphical models [27], [31],
[32], [41], each having its own semantics [10] that govern how
the graph specifies a set of factorization constraints on multi-
variate probability distributions. Of course, factorization and
conditional independence go hand in hand; thus, factorization
constraints typically (but not always) involve conditional inde-
pendence properties. A Bayesian network (BN) is one type of
graphical model where the graphs are directed and acyclic. In
a BN, the probability distribution over a set of variables $X_{1:N}$ factorizes with respect to a directed acyclic graph (DAG) as $p(x_{1:N}) = \prod_i p(x_i \mid x_{\pi_i})$, where $\pi_i \subset 1{:}N$ is the set of indices of $X_i$'s immediate parents according to the BN's DAG. This factorization is called the directed factorization property [32].
There are many additional (and provably equivalent)
characterizations of BNs, including the notion of d-separation
[32], [41], but this one suffices for our discussion. It should be
clear that because of the strong relationship between factoriza-
tion and conditional independence, the above factorization
implies that a BN expresses a large number of conditional
independence statements to the extent that it has missing
edges in the graph. Moreover, it should be clear that it is the common factorization properties of the family of probability distributions that make for efficient probabilistic inference [28], [29], [32], [41].
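To make the directed factorization property concrete, here is a minimal Python sketch (ours, not the article's; the three-variable chain and its tables are hypothetical) that evaluates a BN joint probability as the product of per-node CPFs:

```python
# A minimal sketch (not from the article) of the directed factorization
# property: the joint probability of a BN is the product of each
# variable's conditional probability given its parents. All names and
# tables here are hypothetical.
import numpy as np

# DAG over three binary variables: X1 -> X2 -> X3 (parents by index).
parents = {1: [], 2: [1], 3: [2]}

# Tabular CPFs: cpf[i][parent-value tuple] is a distribution over X_i.
cpf = {
    1: {(): np.array([0.6, 0.4])},
    2: {(0,): np.array([0.9, 0.1]), (1,): np.array([0.2, 0.8])},
    3: {(0,): np.array([0.7, 0.3]), (1,): np.array([0.5, 0.5])},
}

def joint(x):
    """p(x_1, ..., x_N) = prod_i p(x_i | x_{pi_i})."""
    p = 1.0
    for i, pa in parents.items():
        pa_vals = tuple(x[j] for j in pa)
        p *= cpf[i][pa_vals][x[i]]
    return p

# x maps variable index -> value; e.g., p(X1=1, X2=1, X3=0).
print(joint({1: 1, 2: 1, 3: 0}))  # 0.4 * 0.8 * 0.5 = 0.16
```

Because the chain $X_1 \to X_2 \to X_3$ has a missing edge from $X_1$ to $X_3$, this model implicitly asserts that $X_3$ is conditionally independent of $X_1$ given $X_2$, illustrating how missing edges encode conditional independence statements.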
Speech is inherently a temporal process, and any graphi-
cal model for speech must take this into account.
Accordingly, dynamic graphical models [5] are graphs that
represent the temporal evolution of the statistical properties
of a speech signal, ideally in such a way as to improve auto-
matic speech recognition (ASR) accuracy. For speech recog-
nition, DBNs [10], [15], [37], [49]–[51] have been most
successfully used. DBNs are simply BNs with a repeated
“template” structure over time. Other than this regularity,
however, DBNs have exactly the same semantics as BNs.
Specifically, a DBN of length $T$ is a DAG $G = (V, E) = \left(\bigcup_{t=1}^{T} V_t,\ E_T \cup \bigcup_{t=1}^{T-1} \left(E_t \cup E_t^{\rightarrow}\right)\right)$ with node set $V$ and edge set $E$ comprising pairs of nodes. If $uv \in E$ for $u, v \in V$, then $uv$ is an edge of $G$. The sets $V_t$ are the nodes at time slice $t$, $E_t$ are the intraslice edges between nodes in $V_t$, and $E_t^{\rightarrow}$ are the interslice edges between nodes in $V_t$ and $V_{t+1}$. A DBN, however, does not typically have this much flexibility. That is, a DBN is specified using a “rolled up” template giving nodes that are repeated in each slice, the intraslice edges among those nodes, and the interslice edges between nodes of adjacent slices. In other words, $V_t$ and $V_{t+\tau}$ have the same set of random variables that are different only in that the time indexes of the variables differ by $\tau$. The same is true for $E_t$ and $E_{t+\tau}$, as well as for $E_t^{\rightarrow}$ and $E_{t+\tau}^{\rightarrow}$. The DBN template is then unrolled to any desired length $T$ to yield the DBN $G$. As in any BN, the collection of edges pointing into a node corresponds to a conditional probability function (CPF). In a DBN, the CPF of a node is shared (or tied) with the CPF of all other nodes that have come from the same underlying node in the DBN template. If $V_t \in V_t$ with parents $V_{\pi_t}$, then $p(V_t = v \mid V_{\pi_t} = v_\pi) = p(V_\tau = v \mid V_{\pi_\tau} = v_\pi)$ for all $t, \tau$ and for all scalar values $v$ and vector values $v_\pi$. Therefore, it is possible to represent a DBN of unbounded length but with only a finite description length and a finite number of parameters.
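As an illustration of unrolling and parameter tying, the following sketch (under our own assumptions, not GMTK code; the single hidden chain and all cardinalities are made up) builds an unrolled model of length $T$ in which every slice reuses the same CPF arrays:

```python
# A minimal sketch (assumptions, not GMTK code) of unrolling a DBN
# template: one hidden chain Q with a tied transition CPF and a tied
# observation CPF, unrolled to length T. Parameter tying means every
# slice reuses the SAME arrays, so the description length is finite
# regardless of T.
import numpy as np

rng = np.random.default_rng(0)
num_states, num_obs, T = 3, 4, 10

pi = np.full(num_states, 1.0 / num_states)                # p(q_1)
A = rng.dirichlet(np.ones(num_states), size=num_states)   # p(q_t | q_{t-1}), tied
B = rng.dirichlet(np.ones(num_obs), size=num_states)      # p(o_t | q_t), tied

def joint_log_prob(q, o):
    """log p(q_{1:T}, o_{1:T}) under the unrolled DBN."""
    lp = np.log(pi[q[0]]) + np.log(B[q[0], o[0]])
    for t in range(1, len(q)):
        lp += np.log(A[q[t - 1], q[t]])  # same tied A in every slice
        lp += np.log(B[q[t], o[t]])      # same tied B in every slice
    return lp

q = rng.integers(num_states, size=T)
o = rng.integers(num_obs, size=T)
print(joint_log_prob(q, o))
```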
It is well known that the hidden Markov model (HMM) is
one type of DBN [44]. Even given its success and flexibility,
however, the HMM is only one small model within the enor-
mous family of statistical techniques represented by DBNs. Like
an HMM, a DBN makes a temporal Markov assumption, mean-
ing that the future is independent of the past given the present.
In fact, it is true that many (but not all, see section titled
“Architectures over Observed Variables”) DBNs can be “flat-
tened” into a corresponding HMM, but staying within the DBN
framework has several advantages. First, in DBN form, there
can be exploitable computational advantages since the DBN
explicitly represents factorization properties and factorization is
the key to tractable probabilistic inference [29]. These factor-
izations, however, are lost when the model is flattened. Second,
the factorization specified by a DBN implies that there are con-
straints that the model must obey. For example, consider
Figure 1, which shows a two-Markov-chain DBN with chains $(Q^1_t, Q^2_t)$. A flattened HMM would have one chain $R_t \equiv (Q^1_t, Q^2_t)$ with transition probabilities set

$$p(R_t = r_t \mid R_{t-1} = r_{t-1}) = p\left(Q^1_t = q^1_t,\ Q^2_t = q^2_t \mid Q^1_{t-1} = q^1_{t-1},\ Q^2_{t-1} = q^2_{t-1}\right),$$

where $r_t \equiv (q^1_t, q^2_t)$ is the joint HMM state space. Such flattening, however, ignores the factorization constraint expressed by the graph, which is

$$\begin{aligned}
p\left(Q^1_t = q^1_t,\ Q^2_t = q^2_t \mid Q^1_{t-1} = q^1_{t-1},\ Q^2_{t-1} = q^2_{t-1}\right)
&= p\left(Q^1_t = q^1_t \mid Q^1_{t-1} = q^1_{t-1}\right) \\
&\quad \times p\left(Q^2_t = q^2_t \mid Q^1_{t-1} = q^1_{t-1},\ Q^2_{t-1} = q^2_{t-1}\right).
\end{aligned}$$

[FIG1] A simple two-stream Markov chain.

In other words, not all possible $p(r_t \mid r_{t-1})$ CPFs are allowed given the graph, due to its conditional independence property. In the above, other factors in addition to $Q^1_{t-1}$ would influence the distribution of $Q^1_t$ if no assumptions were made. Of course, the
if no assumptions were made. Of course, the
HMM can represent a distribution designed under these con-
straints. When training parameters, however, we must find the
optimal solution within the parameter space subject to these
constraints. Moreover, it is during training (when the amount of
training data might be limited) that one wants to reduce the
amount of parameter freedom (via a set of constraints on the
model) as much as possible. Since a DBN naturally expresses
factorization, it is an ideal candidate to train model parameters
in this case. A third advantage of DBNs is that they convey struc-
tural information about the underlying problem. Such structure
might represent anything from the result of a data-mining process [3] on the training data to dependencies over high-level
knowledge sources, or both. In either case, information about a
domain is visually and intuitively portrayed.
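To see the constraint and the parameter savings numerically, here is a hedged sketch (hypothetical cardinalities, not from the article): the flattened transition matrix for Figure 1 is built from the two factored CPFs, so it satisfies the graph's factorization by construction, and it exposes many fewer free parameters than an unconstrained joint transition matrix:

```python
# A hedged sketch (hypothetical numbers, not from the article): flattening
# the two-chain DBN of Figure 1 into a single HMM chain R_t = (Q1_t, Q2_t).
# The flattened transition matrix is built FROM the two factored CPFs, so
# it automatically satisfies the graph's constraint; a free transition
# matrix over the joint space would not, and has many more free parameters.
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 4  # cardinalities of Q1 and Q2

# p(q1_t | q1_{t-1}) and p(q2_t | q1_{t-1}, q2_{t-1}); rows sum to 1.
A1 = rng.dirichlet(np.ones(n1), size=n1)         # shape (n1, n1)
A2 = rng.dirichlet(np.ones(n2), size=(n1, n2))   # shape (n1, n2, n2)

# Flattened joint transition over R = (Q1, Q2), shape (n1*n2, n1*n2).
R = np.zeros((n1 * n2, n1 * n2))
for q1p in range(n1):
    for q2p in range(n2):
        for q1 in range(n1):
            for q2 in range(n2):
                R[q1p * n2 + q2p, q1 * n2 + q2] = A1[q1p, q1] * A2[q1p, q2p, q2]

assert np.allclose(R.sum(axis=1), 1.0)  # still a valid stochastic matrix

free_factored = n1 * (n1 - 1) + n1 * n2 * (n2 - 1)  # 6 + 36 = 42
free_flat = (n1 * n2) * (n1 * n2 - 1)               # 12 * 11 = 132
print(free_factored, free_flat)
```

With $|Q^1| = 3$ and $|Q^2| = 4$, the factored model has $3 \cdot 2 + 12 \cdot 3 = 42$ free transition parameters, versus $12 \cdot 11 = 132$ for the unconstrained flattened chain.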
Loosely speaking, DBN probabilistic inference (a generalization
of the Baum-Welch procedure for HMMs [42]) has a computational
cost upper bound (i.e., it is possible to show that this is the worst
case) equal to very roughly the joint state space (the number of
combined variable assignments that can occur with nonzero proba-
bility) of all the variables in two time slices of the graph [5], [37],
[49]–[51] multiplied by the total number of time slices $T$.
Therefore, one must take care when adding variables to a DBN that
the cost does not become prohibitive. While this article does not get
into the specifics of DBN inference, it should be known that this
cost often strongly depends on the DBN triangulation method used
[1], [5]. In other words, adding variables will often, but not neces-
sarily always, cause a significant increase in computational cost.
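As a rough numerical illustration of this bound (the cardinalities and slice count below are invented), one can estimate the cost as the joint state space of two adjacent slices times the number of slices:

```python
# A back-of-the-envelope sketch (hypothetical cardinalities) of the
# inference cost bound described above: roughly the joint state space of
# the variables in two adjacent slices, times the number of slices T.
import math

cards_per_slice = [3, 4, 2]   # hidden-variable cardinalities in one slice
T = 1000                      # number of time slices

two_slice_states = math.prod(cards_per_slice) ** 2  # 24^2 = 576
print(two_slice_states * T)   # ~576,000 basic operations, very roughly
```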
THE GMTK DYNAMIC TEMPLATE
Before exploring various ASR constructs using graphical models, we define GMTK's [5], [7] extension of a DBN template. This
extension facilitates the expression of graphical models for
speech recognition and natural language processing.
A GMTK template extends a standard DBN template in five
distinct ways. First, it allows for not only forward but also back-
ward directed time links. This allows for a richer model specifi-
cation that enables, for example, representations of reverse-time
effects such as coarticulation in human speech (see Figure 2).
[FIG2] A multiframe GMTK template (top) with a two-frame prologue $P$, a three-frame chunk $C$, and a two-frame epilogue $E$, unrolled one time (bottom).
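To make the prologue/chunk/epilogue mechanics concrete, here is a small sketch (illustrative only; this is not GMTK's structure-file syntax) in which the prologue and epilogue appear once and the chunk repeats to fill the requested number of frames, mirroring Figure 2:

```python
# A minimal sketch (illustrative only; NOT GMTK's structure-file syntax)
# of unrolling a prologue/chunk/epilogue template as in Figure 2: the
# prologue and epilogue frames appear once, and the chunk frames are
# repeated until the unrolled model has the desired number of frames.
def unroll(prologue, chunk, epilogue, total_frames):
    """Return the per-frame template labels for an unrolled model."""
    n_repeat = total_frames - len(prologue) - len(epilogue)
    if n_repeat < 0 or n_repeat % len(chunk) != 0:
        raise ValueError("total_frames incompatible with template sizes")
    frames = list(prologue)
    frames += chunk * (n_repeat // len(chunk))
    frames += list(epilogue)
    return frames

# Two-frame prologue, three-frame chunk, two-frame epilogue, as in Figure 2.
P, C, E = ["P0", "P1"], ["C0", "C1", "C2"], ["E0", "E1"]
print(unroll(P, C, E, 7))   # chunk used once (the template itself)
print(unroll(P, C, E, 10))  # chunk unrolled one extra time
```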