Reduced-Rank Spectra and Minimum-Entropy Priors as Consistent and
Reliable Cues for Generalized Sound Recognition
Michael A. Casey
MERL, Cambridge Research Laboratory
casey@merl.com
Abstract
We propose a generalized sound recognition system that uses
reduced-dimension log-spectral features and a minimum-entropy hidden Markov model classifier. The proposed
system addresses the major challenges of generalized sound
recognition—namely, selecting robust acoustic features and
finding models that perform well across diverse sound types.
To test the generality of the methods, we sought sound classes
consisting of time-localized events, sequences, textures and
mixed scenes. In other words, no assumptions on signal
composition were imposed on the corpus.
Comparison between the proposed system and conventional
maximum likelihood training showed that minimum-entropy
models yielded superior performance in a 20-class recognition
experiment. The experiment tested discrimination between
speech, non-speech utterances, environmental sounds, general
sound effects, animal sounds, musical instruments and
commercial music recordings.
1. Introduction
1.1. Generalized Sound Recognition
There are many uses for generalized sound recognition in
audio applications. For example, robust speech/non-speech
classifiers may be used to enhance the performance of
automatic speech recognition systems, and classifiers that
recognize ambient acoustic sources may provide signal-to-
noise ratio estimates for missing-feature methods.
Additionally, audio and video recordings may be indexed and searched using such classifiers, with the model state variables used for fast query-by-example retrieval from large general audio databases.
1.2. Previous Work
With each type of classifier comes the task of finding robust
features that yield classifications with high accuracy on novel
data sets. Previous work on non-speech audio classification
has addressed recognition of audio sources using ad hoc collections of features that are tested and fine-tuned for a specific classification task.
Such audio classification systems generally employ front-end processing to encode salient acoustic information, such as fundamental frequency, attack time and spectral centroid.
These features are often subjected to further analysis to find an optimal set for a given task, such as speech/music discrimination, musical instrument identification and sound effects recognition [1][2][3]. Whilst each of these systems performs satisfactorily in its own right, none generalizes beyond its intended application because of the prior assumptions made about the structure and composition of the input signals. For example, fundamental frequency assumes periodicity and, along with the spectral centroid, also assumes that the observable signal was produced by a single source.
Here, we are concerned with general methods that can be applied uniformly and accurately across diverse source classification tasks; this is the goal of generalized sound recognition (GSR). An acceptable criterion for GSR performance is greater than 90% recognition accuracy for a multi-way classifier tested on novel data.
2. Maximally Informative Features
Machine learning systems are dependent upon the choice of
representation of the input data. A common starting point for
audio analysis is frequency-domain conversion using basis
functions. The complex exponentials used by the Fourier
transform form such a basis and yield complete
representations of spectral magnitude information. The
advantage of this complete spectral basis approach is that no
assumptions are made on signal composition.
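To make this concrete, the following fragment (an illustrative sketch added here, not part of the original system) computes a complete log-magnitude spectral representation in Python with NumPy; the window length and hop size are arbitrary example values:

    import numpy as np

    def log_spectrogram(x, n_fft=512, hop=256):
        # Complete spectral basis: complex exponentials via the FFT;
        # no assumptions are made about the composition of x.
        window = np.hanning(n_fft)
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop : i * hop + n_fft] * window
                           for i in range(n_frames)])
        magnitudes = np.abs(np.fft.rfft(frames, axis=1))
        return 20.0 * np.log10(magnitudes + 1e-10)  # dB-scaled log spectra

Each row of the result is one frame of the complete spectral representation discussed above.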
However, this representation has many dimensions and exhibits a high degree of correlation between them. Much of the data is therefore redundant, which increases the effort that must be expended on parameter inference for statistical models. In many cases the redundancy also causes numerical instability during training and adversely affects model performance during recognition.
To understand why such representations are problematic, consider that samples from a higher-dimensional population are more sparsely distributed across each dimension. This sparsity encourages over-fitting of the available data points, thus decreasing the reliability of density estimates. In contrast, a low-dimensional representation of the same population yields a more densely sampled distribution from which parameters can be inferred more accurately.
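This effect can be illustrated with a small numerical experiment (our addition, included for illustration): holding the number of samples fixed, the mean distance from each point to its nearest neighbour grows rapidly with dimension, so local density estimates rest on increasingly sparse evidence.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 1000  # fixed sample budget
    for d in (2, 8, 64, 512):
        x = rng.standard_normal((n, d))
        sq = (x ** 2).sum(axis=1)
        # Squared pairwise distances via the Gram matrix.
        d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
        np.fill_diagonal(d2, np.inf)  # exclude self-distances
        nn = np.sqrt(np.maximum(d2, 0.0).min(axis=1))
        print(f"d={d:4d}  mean nearest-neighbour distance {nn.mean():.2f}")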
2.1.1. Independent Subspace Analysis
To address the problems of dimensionality and redundancy,
whilst keeping the benefits of complete spectral
representations, we use projection to low-dimensional
subspaces via reduced-rank spectral basis functions. It is
assumed that much of the information in the data occupies a
subspace, or manifold, that is embedded in the larger spectral
data space. A number of methods exist for finding maximally informative subspaces of multivariate data, such as locally linear embedding, non-linear principal components analysis, projection pursuit and independent component analysis. It has been shown that these algorithms form a family of closely related methods.
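The following is a minimal sketch of the reduced-rank projection described above (our illustration, using an SVD for the rank reduction and scikit-learn's FastICA in place of whichever independence criterion a given system adopts; the rank k = 10 is an arbitrary example):

    import numpy as np
    from sklearn.decomposition import FastICA

    def independent_subspace_features(log_spectra, k=10):
        # log_spectra: (n_frames, n_bins) matrix of log-spectral frames.
        mu = log_spectra.mean(axis=0)
        centered = log_spectra - mu
        # Reduced-rank spectral basis: the top-k right singular vectors.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        reduced = centered @ vt[:k].T  # (n_frames, k) subspace projection
        # Rotate the retained subspace toward statistically
        # independent axes.
        ica = FastICA(whiten="unit-variance", random_state=0)
        return ica.fit_transform(reduced), vt[:k], mu

Here the rows of vt[:k] play the role of the reduced-rank spectral basis functions, and the ICA rotation orients the retained subspace toward statistically independent axes.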