Graphical Model Architectures for Speech Recognition

[Jeff A. Bilmes and Chris Bartels]

[A powerful representation paradigm for both standard and novel speech recognition architectures]

IEEE Signal Processing Magazine, September 2005
A graph is a two-dimensional visual formalism that can be used to describe many different phenomena. Graphs are used in a wide variety of fields, including computer science, data and control flow, entity relationships and social networks, Petri and neural networks, software/hardware visualization, and parallel computation. The popularity of graphs is in large part due to their ability to represent complex situations in an intuitive and visually appealing way.
Statistical graphical models are a family of graphical abstractions of statistical models where important
aspects (e.g., factorization) of such models are represented using graphs. In recent years, and due to a wide
range of research, it has become apparent that graphical models offer a mathematically formal but widely
flexible means for solving many of the problems in speech and language processing. Graphs are able to rep-
resent events at the very high level (such as relationships between linguistic classes), at the very low level
(such as correlations between spectral features or acoustic landmarks), and at all levels in between (such as
lexical pronunciation). A fundamental advantage of graphical models is rapidity. With a graphical model, it
is possible to quickly express a novel, complicated idea in an intuitive, concise, and mathematically precise
way, and to speedily and visually communicate that idea between colleagues. Moreover, with the right soft-
ware, it is possible to rapidly prototype that idea on a standard desktop workstation.
This article discusses the foundations of the use of graphical models for speech recognition [16], [25],
[26], [42], [47], giving detailed accounts of some of the more successful cases. Our discussion will, in
particular, employ dynamic Bayesian networks (DBNs) and a
DBN extension using the Graphical Model Toolkit’s (GMTK’s)
basic template, a dynamic graphical model representation that
is more suitable for speech and language systems. While this
article will concentrate on speech recognition, it should be
noted that many of the ideas presented here are also applicable
to natural language processing and general time-series analysis.
This article assumes some familiarity with basic speech-to-text concepts [16], [25], [26], [42], [46], [47] and the basics of graphical models [28], [29], [32], [41], including notions such as hidden and observed variables, evidence, and factorization and conditional independence. Moreover, we will use the MATLAB-like notation $1{:}N$ to denote the set of integers $\{1, 2, \ldots, N\}$. A set of $N$ random variables (RVs) is denoted as $X_{1:N}$. Given any subset $S \subseteq 1{:}N$, where $S = \{S_1, S_2, \ldots, S_{|S|}\}$, the corresponding subset of random variables is denoted as $X_S = \{X_{S_1}, X_{S_2}, \ldots, X_{S_{|S|}}\}$. Lastly, we use upper-case letters (such as $X$ and $Q$) to refer to random variables, and lower-case letters (such as $x$ and $q$) to refer to random variable values.
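As a quick illustration of this indexing notation, here is a tiny sketch of our own (the values are hypothetical, and the indices are shifted to NumPy's 0-based convention):

```python
# A tiny sketch (hypothetical values) mimicking the notation above:
# X_{1:N} is a collection of N random variables and X_S selects the
# subset indexed by S. NumPy uses 0-based indices, so S is shifted.
import numpy as np

X = np.array([4.2, 0.7, 3.1, 5.5, 1.9])  # stands in for X_{1:5}
S = [0, 2, 4]                             # corresponds to S = {1, 3, 5}
print(X[S])                               # X_S = {X_1, X_3, X_5}
```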
DYNAMIC GRAPHICAL MODELS
Graphical models [27], [29], [32], [41] are a set of formalisms,
each of which describes families of probability distributions.
There are many different types of graphical models [27], [31],
[32], [41], each having its own semantics [10] that govern how
the graph specifies a set of factorization constraints on multi-
variate probability distributions. Of course, factorization and
conditional independence go hand in hand; thus, factorization
constraints typically (but not always) involve conditional inde-
pendence properties. A Bayesian network (BN) is one type of
graphical model where the graphs are directed and acyclic. In
a BN, the probability distribution over a set of variables $X_{1:N}$ factorizes with respect to a directed acyclic graph (DAG) as $p(x_{1:N}) = \prod_i p(x_i \mid x_{\pi_i})$, where $\pi_i \subset 1{:}N$ is the set of indices of $X_i$'s immediate parents according to the BN's DAG. This factorization is called the directed factorization property [32].
There are many additional (and provably equivalent)
characterizations of BNs, including the notion of d-separation
[32], [41], but this one suffices for our discussion. It should be
clear that because of the strong relationship between factoriza-
tion and conditional independence, the above factorization
implies that a BN expresses a large number of conditional
independence statements to the extent that it has missing
edges in the graph. Moreover, it should be clear that it is the common factorization properties of the family of probability distributions that make for efficient probabilistic inference [28], [29], [32], [41].
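To make the directed factorization property concrete, here is a minimal Python sketch (ours, not the article's; the three-variable chain and its tables are hypothetical) that evaluates a BN joint probability as the product of per-node CPFs:

```python
# A minimal sketch (not from the article) of the directed factorization
# property: the joint probability of a BN is the product of each
# variable's conditional probability given its parents. All names and
# tables here are hypothetical.
import numpy as np

# DAG over three binary variables: X1 -> X2 -> X3 (parents by index).
parents = {1: [], 2: [1], 3: [2]}

# Tabular CPFs: cpf[i][parent-value tuple] is a distribution over X_i.
cpf = {
    1: {(): np.array([0.6, 0.4])},
    2: {(0,): np.array([0.9, 0.1]), (1,): np.array([0.2, 0.8])},
    3: {(0,): np.array([0.7, 0.3]), (1,): np.array([0.5, 0.5])},
}

def joint(x):
    """p(x_1, ..., x_N) = prod_i p(x_i | x_{pi_i})."""
    p = 1.0
    for i, pa in parents.items():
        pa_vals = tuple(x[j] for j in pa)
        p *= cpf[i][pa_vals][x[i]]
    return p

# x maps variable index -> value; e.g., p(X1=1, X2=1, X3=0).
print(joint({1: 1, 2: 1, 3: 0}))  # 0.4 * 0.8 * 0.5 = 0.16
```

Because the chain $X_1 \to X_2 \to X_3$ has a missing edge from $X_1$ to $X_3$, this model implicitly asserts that $X_3$ is conditionally independent of $X_1$ given $X_2$, illustrating how missing edges encode conditional independence statements.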
Speech is inherently a temporal process, and any graphi-
cal model for speech must take this into account.
Accordingly, dynamic graphical models [5] are graphs that
represent the temporal evolution of the statistical properties
of a speech signal, ideally in such a way as to improve auto-
matic speech recognition (ASR) accuracy. For speech recog-
nition, DBNs [10], [15], [37], [49]–[51] have been most
successfully used. DBNs are simply BNs with a repeated
“template” structure over time. Other than this regularity,
however, DBNs have exactly the same semantics as BNs.
Specifically, a DBN of length $T$ is a DAG $G = (V, E) = \left(\bigcup_{t=1}^{T} V_t,\ E_T \cup \bigcup_{t=1}^{T-1} \left(E_t \cup E_t^{\rightarrow}\right)\right)$ with node set $V$ and edge set $E$ comprising pairs of nodes. If $uv \in E$ for $u, v \in V$, then $uv$ is an edge of $G$. The sets $V_t$ are the nodes at time slice $t$, $E_t$ are the intraslice edges between nodes in $V_t$, and $E_t^{\rightarrow}$ are the interslice edges between nodes in $V_t$ and $V_{t+1}$. A DBN, however, does not typically have this much flexibility. That is, a DBN is specified using a “rolled up” template giving nodes that are repeated in each slice, the intraslice edges among those nodes, and the interslice edges between nodes of adjacent slices. In other words, $V_t$ and $V_{t+\tau}$ have the same set of random variables that are different only in that the time indexes of the variables differ by $\tau$. The same is true for $E_t$ and $E_{t+\tau}$, as well as for $E_t^{\rightarrow}$ and $E_{t+\tau}^{\rightarrow}$. The DBN template is then unrolled to any desired length $T$ to yield the DBN $G$. As in any BN, the collection of edges pointing into a node corresponds to a conditional probability function (CPF). In a DBN, the CPF of a node is shared (or tied) with the CPF of all other nodes that have come from the same underlying node in the DBN template. If $V_t \in V_t$ with parents $V_{\pi_t}$, then $p(V_t = v \mid V_{\pi_t} = v_\pi) = p(V_\tau = v \mid V_{\pi_\tau} = v_\pi)$ for all $t, \tau$ and for all scalar values $v$ and vector values $v_\pi$. Therefore, it is possible to represent a DBN of unbounded length but with only a finite description length and a finite number of parameters.
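As an illustration of unrolling and parameter tying, the following sketch (under our own assumptions, not GMTK code; the single hidden chain and all cardinalities are made up) builds an unrolled model of length $T$ in which every slice reuses the same CPF arrays:

```python
# A minimal sketch (assumptions, not GMTK code) of unrolling a DBN
# template: one hidden chain Q with a tied transition CPF and a tied
# observation CPF, unrolled to length T. Parameter tying means every
# slice reuses the SAME arrays, so the description length is finite
# regardless of T.
import numpy as np

rng = np.random.default_rng(0)
num_states, num_obs, T = 3, 4, 10

pi = np.full(num_states, 1.0 / num_states)                # p(q_1)
A = rng.dirichlet(np.ones(num_states), size=num_states)   # p(q_t | q_{t-1}), tied
B = rng.dirichlet(np.ones(num_obs), size=num_states)      # p(o_t | q_t), tied

def joint_log_prob(q, o):
    """log p(q_{1:T}, o_{1:T}) under the unrolled DBN."""
    lp = np.log(pi[q[0]]) + np.log(B[q[0], o[0]])
    for t in range(1, len(q)):
        lp += np.log(A[q[t - 1], q[t]])  # same tied A in every slice
        lp += np.log(B[q[t], o[t]])      # same tied B in every slice
    return lp

q = rng.integers(num_states, size=T)
o = rng.integers(num_obs, size=T)
print(joint_log_prob(q, o))
```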
It is well known that the hidden Markov model (HMM) is
one type of DBN [44]. Even given its success and flexibility,
however, the HMM is only one small model within the enor-
mous family of statistical techniques represented by DBNs. Like
an HMM, a DBN makes a temporal Markov assumption, mean-
ing that the future is independent of the past given the present.
In fact, it is true that many (but not all, see section titled
“Architectures over Observed Variables”) DBNs can be “flat-
tened” into a corresponding HMM, but staying within the DBN
framework has several advantages. First, in DBN form, there
can be exploitable computational advantages since the DBN
explicitly represents factorization properties and factorization is
the key to tractable probabilistic inference [29]. These factor-
izations, however, are lost when the model is flattened. Second,
the factorization specified by a DBN implies that there are con-
straints that the model must obey. For example, consider
Figure 1, which shows a two-Markov-chain DBN with chains $(Q^1_t, Q^2_t)$. A flattened HMM would have one chain $R_t \equiv (Q^1_t, Q^2_t)$ with transition probabilities set

$$p(R_t = r_t \mid R_{t-1} = r_{t-1}) = p\left(Q^1_t = q^1_t,\ Q^2_t = q^2_t \mid Q^1_{t-1} = q^1_{t-1},\ Q^2_{t-1} = q^2_{t-1}\right),$$

where $r_t \equiv (q^1_t, q^2_t)$ is the joint HMM state space. Such flattening, however, ignores the factorization constraint expressed by the graph, which is

$$\begin{aligned}
p\left(Q^1_t = q^1_t,\ Q^2_t = q^2_t \mid Q^1_{t-1} = q^1_{t-1},\ Q^2_{t-1} = q^2_{t-1}\right)
&= p\left(Q^1_t = q^1_t \mid Q^1_{t-1} = q^1_{t-1}\right) \\
&\quad \times p\left(Q^2_t = q^2_t \mid Q^1_{t-1} = q^1_{t-1},\ Q^2_{t-1} = q^2_{t-1}\right).
\end{aligned}$$

[FIG1] A simple two-stream Markov chain.

In other words, not all possible $p(r_t \mid r_{t-1})$ CPFs are allowed given the graph, due to its conditional independence property. In the above, other factors in addition to $Q^1_{t-1}$ would influence the distribution of $Q^1_t$ if no assumptions were made. Of course, the
if no assumptions were made. Of course, the
HMM can represent a distribution designed under these con-
straints. When training parameters, however, we must find the
optimal solution within the parameter space subject to these
constraints. Moreover, it is during training (when the amount of
training data might be limited) that one wants to reduce the
amount of parameter freedom (via a set of constraints on the
model) as much as possible. Since a DBN naturally expresses
factorization, it is an ideal candidate to train model parameters
in this case. A third advantage of DBNs is that they convey struc-
tural information about the underlying problem. Such structure
might represent anything from the result of a data-mining process [3] on the training data to dependencies over high-level
knowledge sources, or both. In either case, information about a
domain is visually and intuitively portrayed.
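To see the constraint and the parameter savings numerically, here is a hedged sketch (hypothetical cardinalities, not from the article): the flattened transition matrix for Figure 1 is built from the two factored CPFs, so it satisfies the graph's factorization by construction, and it exposes many fewer free parameters than an unconstrained joint transition matrix:

```python
# A hedged sketch (hypothetical numbers, not from the article): flattening
# the two-chain DBN of Figure 1 into a single HMM chain R_t = (Q1_t, Q2_t).
# The flattened transition matrix is built FROM the two factored CPFs, so
# it automatically satisfies the graph's constraint; a free transition
# matrix over the joint space would not, and has many more free parameters.
import numpy as np

rng = np.random.default_rng(1)
n1, n2 = 3, 4  # cardinalities of Q1 and Q2

# p(q1_t | q1_{t-1}) and p(q2_t | q1_{t-1}, q2_{t-1}); rows sum to 1.
A1 = rng.dirichlet(np.ones(n1), size=n1)         # shape (n1, n1)
A2 = rng.dirichlet(np.ones(n2), size=(n1, n2))   # shape (n1, n2, n2)

# Flattened joint transition over R = (Q1, Q2), shape (n1*n2, n1*n2).
R = np.zeros((n1 * n2, n1 * n2))
for q1p in range(n1):
    for q2p in range(n2):
        for q1 in range(n1):
            for q2 in range(n2):
                R[q1p * n2 + q2p, q1 * n2 + q2] = A1[q1p, q1] * A2[q1p, q2p, q2]

assert np.allclose(R.sum(axis=1), 1.0)  # still a valid stochastic matrix

free_factored = n1 * (n1 - 1) + n1 * n2 * (n2 - 1)  # 6 + 36 = 42
free_flat = (n1 * n2) * (n1 * n2 - 1)               # 12 * 11 = 132
print(free_factored, free_flat)
```

With $|Q^1| = 3$ and $|Q^2| = 4$, the factored model has $3 \cdot 2 + 12 \cdot 3 = 42$ free transition parameters, versus $12 \cdot 11 = 132$ for the unconstrained flattened chain.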
Loosely speaking, DBN probabilistic inference (a generalization
of the Baum-Welch procedure for HMMs [42]) has a computational
cost upper bound (i.e., it is possible to show that this is the worst
case) equal to very roughly the joint state space (the number of
combined variable assignments that can occur with nonzero proba-
bility) of all the variables in two time slices of the graph [5], [37],
[49]–[51] multiplied by the total number of time slices $T$.
Therefore, one must take care when adding variables to a DBN that
the cost does not become prohibitive. While this article does not get
into the specifics of DBN inference, it should be known that this
cost often strongly depends on the DBN triangulation method used
[1], [5]. In other words, adding variables will often, but not neces-
sarily always, cause a significant increase in computational cost.
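As a rough numerical illustration of this bound (the cardinalities and slice count below are invented), one can estimate the cost as the joint state space of two adjacent slices times the number of slices:

```python
# A back-of-the-envelope sketch (hypothetical cardinalities) of the
# inference cost bound described above: roughly the joint state space of
# the variables in two adjacent slices, times the number of slices T.
import math

cards_per_slice = [3, 4, 2]   # hidden-variable cardinalities in one slice
T = 1000                      # number of time slices

two_slice_states = math.prod(cards_per_slice) ** 2  # 24^2 = 576
print(two_slice_states * T)   # ~576,000 basic operations, very roughly
```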
THE GMTK DYNAMIC TEMPLATE
Before exploring various ASR constructs using graphical models, we define GMTK's [5], [7] extension of a DBN template. This
extension facilitates the expression of graphical models for
speech recognition and natural language processing.
A GMTK template extends a standard DBN template in five
distinct ways. First, it allows for not only forward but also back-
ward directed time links. This allows for a richer model specifi-
cation that enables, for example, representations of reverse-time
effects such as coarticulation in human speech (see Figure 2).
[FIG2] A multiframe GMTK template (top) with a two-frame prologue $P$, a three-frame chunk $C$, and a two-frame epilogue $E$, unrolled one time (bottom).
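To make the prologue/chunk/epilogue mechanics concrete, here is a small sketch (illustrative only; this is not GMTK's structure-file syntax) in which the prologue and epilogue appear once and the chunk repeats to fill the requested number of frames, mirroring Figure 2:

```python
# A minimal sketch (illustrative only; NOT GMTK's structure-file syntax)
# of unrolling a prologue/chunk/epilogue template as in Figure 2: the
# prologue and epilogue frames appear once, and the chunk frames are
# repeated until the unrolled model has the desired number of frames.
def unroll(prologue, chunk, epilogue, total_frames):
    """Return the per-frame template labels for an unrolled model."""
    n_repeat = total_frames - len(prologue) - len(epilogue)
    if n_repeat < 0 or n_repeat % len(chunk) != 0:
        raise ValueError("total_frames incompatible with template sizes")
    frames = list(prologue)
    frames += chunk * (n_repeat // len(chunk))
    frames += list(epilogue)
    return frames

# Two-frame prologue, three-frame chunk, two-frame epilogue, as in Figure 2.
P, C, E = ["P0", "P1"], ["C0", "C1", "C2"], ["E0", "E1"]
print(unroll(P, C, E, 7))   # chunk used once (the template itself)
print(unroll(P, C, E, 10))  # chunk unrolled one extra time
```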