User Guided Audio Selection from Complex Sound Mixtures
Paris Smaragdis
Adobe Systems Inc.
ABSTRACT
In this paper we present a novel interface for selecting sounds in audio mixtures. Traditional interfaces in audio editors provide a graphical representation of sounds which is either a waveform, or some variation of a time/frequency transform. Although with these representations a user might be able to visually identify elements of sounds in a mixture, they do not facilitate object-specific editing (e.g. selecting only the voice of a singer in a song). This interface uses audio guidance from a user in order to select a target sound within a mixture. The user is asked to vocalize (or otherwise sonically represent) the desired target sound, and an automatic process identifies and isolates the elements of the mixture that best relate to the user’s input. This way of pointing to specific parts of an audio stream allows a user to perform audio selections which would have been infeasible otherwise.
ACM Classification: H.5.5 [Multimedia Information Sys-
tems]: Sound and Music Computing, H.5.2 [User Interfaces]:
Voice I/O
General terms: Algorithms, Human Factors
Keywords: Audio interfaces
INTRODUCTION
With the advent of user-friendly software to manipulate media, consumers today are presented with a wide variety of tools with which to edit content they create, or content they wish to experiment with. This trend has been partially fueled by the incorporation of interfaces which allow users to intuitively interact with media. A user who wishes to remove somebody from a photograph can approximately trace the outline of that person and then automatically delete them. A user who wants to change the color of the sky in a video can scribble over the sky to identify that area and then specify the new color for it. Especially in the imaging world, editing of photographs and videos is by now a commonplace operation; one of practical significance but also of artistic expression.
The same does not hold for audio processing. Editing and
manipulating complex audio signals presents a unique challenge to users, one that we do not often encounter in other
forms of media. Whereas it is relatively simple for a user to
point towards specific objects in images and videos, doing
so in an audio track is not straightforward. Commonplace
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
UIST’09, October 4–7, 2009, Victoria, British Columbia, Canada.
Copyright 2009 ACM 978-1-60558-745-5/09/10...$10.00.
[Waveform plot: Amplitude vs. Time (sec), 0–5 sec]
Figure 1: A waveform representation of a piece of music audio. An experienced audio engineer would identify this sound as a piece of music, and would obtain a sense of where the drum beats are by observing the sharp onsets. However, the presence of the singer’s voice, or of any other instrument, cannot be detected visually. Casual users often use this representation to find the start and end times of an audio track, but cannot deduce much more information.
recordings, such as music or home video soundtracks, are almost always composed of superimposed sounds that occur simultaneously. Invariably, though, a user is most interested in editing only one of these sounds (e.g. the sneeze during the piano concerto, or just the guitar in a music recording). The fact that all these sounds are intertwined inside one waveform presents a significant challenge for a user interface, since there is no clear way to select a specific sound.
This problem has been partially addressed by two distinct fields: audio visualization and sound source separation. In terms of audio visualization, makers of audio processing software have spent significant resources visualizing audio in forms that help a user understand and manipulate it. The most widespread (and least informative) representation for audio is the trace of the actual air pressure across time, which is often referred to as the waveform (figure 1). This provides a highly accurate visualization of sound, but unfortunately conveys only a small amount of information. An experienced user might be able to deduce some basic information using this representation, but in the case of most sound mixtures there is very little information to be found.
In order to present users with more intuitive representations of audio, especially ones that can assist complex editing, audio software is now increasingly relying on time-frequency visualizations (often referred to as frequency or spectral representations). Time-frequency decompositions are a family of numerical transforms that allow us to display any time series (like sound) in terms of its time-varying frequency energy content [1]. The most common of these representations is the spectrogram, which one can readily find in many modern audio processing editors. More exotic time-frequency transforms, such as wavelets, warped spectrograms and sinu-
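As a concrete illustration, a basic spectrogram can be computed as the magnitude of a short-time Fourier transform. The sketch below is not the paper's implementation; the sample rate, test tones, frame size, and hop length are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative sketch: compute a spectrogram as the magnitude of a
# short-time Fourier transform (STFT). All parameters here (sample
# rate, tone frequencies, frame/hop sizes) are arbitrary choices.
fs = 8000                                  # sample rate in Hz
t = np.arange(fs) / fs                     # one second of time stamps
x = np.sin(2 * np.pi * 500 * t) + 0.8 * np.sin(2 * np.pi * 1000 * t)

frame, hop = 256, 128                      # analysis window and hop length
window = np.hanning(frame)                 # taper to reduce spectral leakage
starts = range(0, len(x) - frame + 1, hop)
frames = np.stack([x[s:s + frame] * window for s in starts])
S = np.abs(np.fft.rfft(frames, axis=1)).T  # frequency bins x time frames
freqs = np.fft.rfftfreq(frame, d=1 / fs)   # bin center frequencies in Hz

# Averaging power over time reveals the two tones as the strongest bins.
mean_power = (S ** 2).mean(axis=1)
dominant = np.sort(freqs[np.argsort(mean_power)[-2:]])
```

Here `S` plays the role of the spectrogram image an editor would display, with frequency on the vertical axis and time on the horizontal; for this two-tone test signal, `dominant` recovers the 500 Hz and 1000 Hz components.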