SMLI TR2004-0811 © 2004 SUN MICROSYSTEMS INC.
Sphinx-4: A Flexible Open Source Framework for Speech Recognition
Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, Joe Woelfel
Abstract
Sphinx-4 is a flexible, modular and pluggable framework to help foster new innovations in the core research of hidden Markov
model (HMM) recognition systems. The design of Sphinx-4 is based on patterns that have emerged from the design of past systems
as well as new requirements based on areas that researchers currently want to explore. To exercise this framework, and to provide
researchers with a "research-ready" system, Sphinx-4 also includes several implementations of both simple and state-of-the-art
techniques. The framework and the implementations are all freely available via open source.
I. INTRODUCTION
When researchers approach core speech recognition research, they often need to develop an entire system from scratch,
even if they only want to explore one facet of the field. Open source
speech recognition systems are available, such as HTK [1], ISIP [2], AVCSR [3] and earlier versions of the Sphinx systems
[4]–[6]. The available systems are typically optimized for a single approach to speech system design. As a result, these systems
intrinsically create barriers to future research that departs from the original purpose of the system. In addition, some of these
systems are encumbered by licensing agreements that make entry into the research arena difficult for non-academic institutions.
To facilitate innovation in speech recognition research, we formed a distributed, cross-discipline team to create Sphinx-4
[7]: an open source platform that incorporates state-of-the-art methodologies and also addresses the needs of emerging research
areas. Given our technical goals as well as our diversity (e.g., we used different operating systems on different machines),
we wrote Sphinx-4 in the Java™ programming language, making it available to a large variety of development platforms.
First and foremost, Sphinx-4 is a modular and pluggable framework that incorporates design patterns from existing systems,
with sufficient flexibility to support emerging areas of research interest. The framework is modular in that it comprises separable
components dedicated to specific tasks, and it is pluggable in that modules can be easily replaced at runtime. To exercise the
framework, and to provide researchers with a working system, Sphinx-4 also includes a variety of modules that implement
state-of-the-art speech recognition techniques.
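As a rough illustration of what "modular and pluggable" can mean in practice, the following Java sketch selects a module implementation by name at runtime. It is not the Sphinx-4 API: the names FrontEndModule, PassThrough, PreEmphasis, ModuleRegistry and PluggableDemo are hypothetical and were invented for this example, as was the choice of a pre-emphasis filter as the replaceable component.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical interface: one separable component dedicated to a single task.
interface FrontEndModule {
    double[] process(double[] samples);
}

// Trivial implementation that returns a copy of its input.
class PassThrough implements FrontEndModule {
    public double[] process(double[] samples) {
        return samples.clone();
    }
}

// Alternative implementation: y[i] = x[i] - factor * x[i-1], with x[-1] = 0.
class PreEmphasis implements FrontEndModule {
    private final double factor;

    PreEmphasis(double factor) {
        this.factor = factor;
    }

    public double[] process(double[] samples) {
        double[] out = new double[samples.length];
        for (int i = 0; i < samples.length; i++) {
            double previous = (i == 0) ? 0.0 : samples[i - 1];
            out[i] = samples[i] - factor * previous;
        }
        return out;
    }
}

// Hypothetical registry: maps a configuration name to a module factory,
// so the caller never names a concrete class.
class ModuleRegistry {
    private final Map<String, Supplier<FrontEndModule>> factories = new HashMap<>();

    void register(String name, Supplier<FrontEndModule> factory) {
        factories.put(name, factory);
    }

    FrontEndModule create(String name) {
        Supplier<FrontEndModule> factory = factories.get(name);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown module: " + name);
        }
        return factory.get();
    }
}

class PluggableDemo {
    // Builds the registry and runs whichever module the name selects.
    static double[] run(String moduleName, double[] samples) {
        ModuleRegistry registry = new ModuleRegistry();
        registry.register("passThrough", PassThrough::new);
        registry.register("preEmphasis", () -> new PreEmphasis(0.5));
        return registry.create(moduleName).process(samples);
    }

    public static void main(String[] args) {
        // The module name comes from the command line, i.e. it is chosen at runtime.
        String name = (args.length > 0) ? args[0] : "preEmphasis";
        System.out.println(Arrays.toString(run(name, new double[] {1.0, 2.0, 3.0})));
    }
}
```

Under these assumptions, swapping one module for another requires only a different name in the configuration; the calling code depends on the interface alone.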
The remainder of this document describes the Sphinx-4 framework and implementation, and also includes a discussion of
our experiences with Sphinx-4 to date.
II. SELECTED HISTORICAL SPEECH RECOGNITION SYSTEMS
The traditional approach to speech recognition system design has been to create an entire system optimized around a particular
methodology. As evidenced by past research systems such as Dragon [8], Harpy [9], Sphinx and others, this approach has
proved to be quite valuable in that the resulting systems have provided foundational methods for speech recognition research.
In the same light, however, each of these systems was largely dedicated to exploring a single specific groundbreaking area
of speech recognition. For example, Baker introduced hidden Markov models (HMMs) with his Dragon system [8], [10],
and earlier versions of Sphinx explored variants of HMMs such as discrete HMMs [4], semicontinuous HMMs [5], and
continuous HMMs [11]. Other systems explored specialized search strategies such as using lex tree searches for large N-Gram
models [12].
Because they were focused on such fundamental core theories, the creators of these systems tended to hardwire their
implementations to a high degree. For example, the predecessor Sphinx systems restrict the order of the HMMs to a constant
value and also fix the unit context to a single left and right context. Sphinx-3 eliminated support for context-free grammars
(CFGs) due to the specialization on large N-Gram models. Furthermore, the decoding strategy of these systems tended to
be deeply entangled with the rest of the system. As a result of these constraints, the systems were difficult to modify for
experiments in other areas.
Design patterns for these systems emerged over time, however, as exemplified by Jelinek’s source-channel model [13] and
Huang’s basic system architecture [14]. In developing Sphinx-4, one of our primary goals was to develop a framework that
supported these design patterns, yet also allowed for experimentation in emerging areas of research.
W. Walker, P. Lamere, and P. Kwok are with Sun Microsystems.
E. Gouvea and R. Singh are with Carnegie Mellon University.
B. Raj, P. Wolf, and J. Woelfel are with Mitsubishi Electric Research Labs.