1382 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 19, NO. 5, JULY 2011

Single-Channel and Multi-Channel Sinusoidal Audio Coding Using Compressed Sensing

Anthony Griffin, Toni Hirvonen, Christos Tzagkarakis, Athanasios Mouchtaris, Member, IEEE, and Panagiotis Tsakalides, Member, IEEE

Abstract: Compressed sensing (CS) samples signals at a much lower rate than the Nyquist rate if they are sparse in some basis. In this paper, the CS methodology is applied to sinusoidally modeled audio signals. As this model is sparse by definition in the frequency domain (being equal to the sum of a small number of sinusoids), we investigate whether CS can be used to encode audio signals at low bitrates. In contrast to encoding the sinusoidal parameters (amplitude, frequency, phase) as current state-of-the-art methods do, we propose encoding a few randomly selected samples of the time-domain description of the sinusoidal component (per signal segment). The potential of applying compressed sensing both to single-channel and multi-channel audio coding is examined. The listening test results are encouraging, indicating that the proposed approach can achieve performance comparable to that of state-of-the-art methods. Given that CS can lead to novel coding systems where the sampling and compression operations are combined into one low-complexity step, the proposed methodology can be considered an important step towards applying the CS framework to audio coding applications.

Index Terms: Audio coding, compressed sensing (CS), signal reconstruction, signal sampling, sinusoidal model.

I. INTRODUCTION

THE growing demand for audio content far outpaces the corresponding growth in users' storage space or bandwidth. Thus, there is a constant incentive to further improve the compression of audio signals. This can be accomplished either by applying compression algorithms to the actual samples of a digital audio signal, or by first applying a signal model and then encoding the model parameters as a second step. In this paper, we propose a novel method for encoding the parameters of the sinusoidal model.

Manuscript received December 24, 2009; revised May 05, 2010; accepted October 17, 2010. Date of publication November 09, 2010; date of current version May 13, 2011. This work was supported in part by the Marie Curie TOK-DEV "ASPIRE" grant and in part by the PEOPLE-IAPP "AVID-MODE" grant within the 6th and 7th European Community Framework Programs, respectively. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Patrick Naylor.

A. Griffin, C. Tzagkarakis, A. Mouchtaris, and P. Tsakalides are with the Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH-ICS), and Department of Computer Science, University of Crete, Heraklion, Crete GR-70013, Greece (e-mail: agriffin@ics.forth.gr; tzagarak@ics.forth.gr; mouchtar@ics.forth.gr; tsakalid@ics.forth.gr).

T. Hirvonen was with the Institute of Computer Science, Foundation for Research and Technology-Hellas (FORTH-ICS), Heraklion, Crete GR-70013, Greece. He is now with Dolby Laboratories, Stockholm SE-113 30, Sweden (e-mail: toni.hirvonen@dolby.com).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASL.2010.2090656

The sinusoidal model represents an audio signal using a small number of time-varying sinusoids [1]. The remainder error signal (often termed the residual signal) can also be modeled to further improve the resulting subjective quality of the sinusoidal model [2]. The sinusoidal model allows for a compact representation of the original signal and for efficient encoding and quantization. Extending the sinusoidal model to multi-channel audio applications has also been proposed (e.g., [3]).

Various methods for quantization of the sinusoidal model parameters (amplitude, phase, and frequency) have been proposed in the literature. Initial methods in this area suggested quantizing the parameters independently of each other [4]–[8]. The frequency locations of the sinusoids were quantized based on research into the just noticeable differences in frequency (JNDF), while the amplitudes were quantized based either on the just noticeable differences in amplitude (JNDA) or the estimated frequency masking thresholds. In these initial quantizers, phases were uniformly quantized, or were not quantized at all for low-bitrate applications. More recent quantizers operate by jointly encoding all the sinusoidal parameters based on high-rate theory and can be expressed analytically [9]–[12]. The bitrates achieved by these methods can be further reduced using differential coding, e.g., [13]. It must be noted that all the aforementioned methods encode the sinusoidal parameters independently for each short-time segment of the audio signal. Extensions of these methods, where the sinusoidal parameters can be jointly quantized across neighboring segments, have recently been proposed, e.g., [14].

In this paper, we propose using the emerging compressed sensing (CS) [15], [16] methodology to encode and compress the sinusoidally modeled audio signals. Compressed sensing seeks to represent a signal using a number of linear, non-adaptive measurements. Usually, the number of measurements is much lower than the number of samples needed if the signal is sampled at the Nyquist rate. CS requires that the signal is sparse in some basis (in the sense that it is a linear combination of a small number of basis functions) in order to correctly reconstruct the original signal. Clearly, the sinusoidally modeled part of an audio signal is a sparse signal, and it is thus natural to wonder how CS might be used to encode such a signal. We present such an investigation of how CS can be applied to encoding the time-domain signal of the model instead of the sinusoidal model parameters as state-of-the-art methods propose, extending our recent work in [17], [18]. We extend our previous work by providing more results for the single-channel audio coding case, and we also propose a system that applies CS to the case of sinusoidally modeled multi-channel audio. At the same time, the paper proposes a psychoacoustic modeling analysis for the selection of sinusoidal components in a multi-channel audio recording, which provides a very compact description of multi-channel audio and is very efficient for low-bitrate applications.

This is, to our knowledge, the first attempt to exploit the sparse representation of the sinusoidal model for audio signals using compressed sensing, and many interesting and important issues are raised in this context. The most important problems encountered in this work are summarized in this paragraph. The encoding operation is based on randomly sampling the time-domain sinusoidal signal, which is obtained after applying the sinusoidal model to a monophonic or multi-channel audio signal. The random samples can be further encoded (here scalar quantization is suggested, but other methods could be used to improve performance). An issue that arises is that, as the encoding is performed in the time domain rather than the Fourier domain, the quantization error is not localized in frequency, and it is therefore more complicated to predict the audio quality of the reconstructed signal; this was addressed by suggesting a spectral whitening procedure for the sinusoidal amplitudes. Another issue is that the frequencies estimated by the sinusoidal model should correspond to single bins of the discrete Fourier transform, or else the sparsity requirement cannot be satisfied. In practice, this translates into encoding the sinusoidal parameters selected from a peak-picking procedure (with the possible inclusion of a psychoacoustic model), without further refinement of the estimated frequencies. This important problem can be addressed (as explained in detail later) by employing zero-padding in the Fourier analysis (i.e., improving the frequency resolution by reducing the bin spacing), and also by employing interpolation techniques in the decoder (since sparsity is not needed after the CS decoding). The improved frequency resolution resulted in a need to increase the number of CS measurements, and consequently the bitrate, and this problem was alleviated by employing a process termed "frequency mapping." Another important problem addressed in this paper is the fact that CS theory allows for signal reconstruction with high probability but not with certainty; three different ways of overcoming this problem (termed "operating modes") are suggested in this paper. In summary, several practical problems were raised during our research; by providing a complete end-to-end design of a CS-based sinusoidal coding system, this paper both clarifies several limitations of applying CS to audio coding and presents ways to overcome them, and in this sense we believe that this paper will be of interest to researchers working on applying the CS theory to signal coding.
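The encoder side of the idea described above (keep a few randomly chosen time-domain samples of the sinusoidal signal, then scalar-quantize them) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the frame length, the number of measurements, the 4-bit uniform quantizer, and the use of a seeded generator to fix the sample positions are all hypothetical choices and are not taken from the paper. The CS reconstruction step is not shown.

```python
import math
import random

def random_sample_indices(frame_length, num_measurements, seed):
    """Pick which time-domain samples of the frame are kept (sorted, no repeats)."""
    rng = random.Random(seed)  # assumption: a seed fixes the sampling pattern
    return sorted(rng.sample(range(frame_length), num_measurements))

def quantize(values, num_bits, peak):
    """Uniform scalar quantizer over [-peak, peak]; returns integer codes and step size."""
    levels = 2 ** num_bits
    step = 2.0 * peak / levels
    codes = [min(max(int(math.floor((v + peak) / step)), 0), levels - 1) for v in values]
    return codes, step

def dequantize(codes, step, peak):
    """Map each code back to the midpoint of its quantization cell."""
    return [(c + 0.5) * step - peak for c in codes]

# Hypothetical frame: two sinusoids that fall exactly on DFT bins of a 256-sample frame.
N, M = 256, 40
frame = [math.cos(2 * math.pi * 12 * n / N) + 0.5 * math.cos(2 * math.pi * 30 * n / N + 1.0)
         for n in range(N)]

indices = random_sample_indices(N, M, seed=1234)      # random sampling of the frame
measurements = [frame[i] for i in indices]            # M << N time-domain samples
codes, step = quantize(measurements, num_bits=4, peak=1.5)
decoded = dequantize(codes, step, peak=1.5)           # what a decoder would start from
```

Only the `codes` need to be transmitted in this sketch; the quantization error of each retained sample is bounded by half a quantizer step, but, as noted above, that error is spread over all frequencies once the signal is reconstructed.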

The paper deals only with encoding the sinusoidal part of the model (i.e., there is no treatment for the residual signal). It is noted that other than the proposed method, the authors are only familiar with the work of [19] for applying the CS methodology to audio coding in general. While our focus in this paper is on exploiting the sinusoidal model in this context, in [19] the goal was to exploit the excitation/filter model using CS.

The importance of applying CS theory to audio coding lies mainly in the applicability of CS to sensor network applications. Sensor-based local encoding of audio signals could enable a variety of audio-related applications, such as environmental monitoring, recording audio in large outdoor venues, and so forth. This paper provides an important step towards applying CS to audio coding, at least in low-bitrate audio applications where the sinusoidal part of an audio signal provides sufficient quality. It is shown here for multi-channel audio signals that, except for one primary (reference) audio channel, a simple low-complexity system can be used to encode the sinusoidal model for all remaining channels of the multi-channel recording. This is an important result given that research in CS is still at an early stage, and its practical value in coding applications is still unclear.

The remainder of the paper is organized as follows. In Section II, background information about the sinusoidal model is given, and a novel psychoacoustic model for sinusoidal modeling for multi-channel audio signals is proposed. Background information about the CS methodology is presented in Section III. In Section IV, a detailed discussion about the practical implementation of the method is provided related to issues such as alleviating the effects of quantization (Section IV-A); bitrate improvements (Section IV-B); quantization and entropy coding (Section IV-C); CS reconstruction algorithms (Section IV-D); achieved bitrates (Section IV-E); operating modes (Section IV-F); and complexity (Section IV-G). The discussion of Section IV is then extended to the multi-channel case in Section V. In Section VI, results from listening tests demonstrate the audio quality achieved with the proposed coding scheme for the single-channel (Section VI-A) and the multi-channel case (Section VI-B), while in Section VII concluding remarks are made.

II. SINUSOIDAL MODEL

The sinusoidal model was initially used in the analysis/synthesis of speech [1]. A short-time segment of an audio signal $s(n)$ is represented as the sum of a small number $K$ of sinusoids with time-varying amplitudes and frequencies. This can be written as

$$ s(n) = \sum_{k=1}^{K} \alpha_k \cos(2\pi f_k n + \theta_k), \qquad n = 0, \ldots, N-1 \tag{1} $$

where $\alpha_k$, $f_k$, and $\theta_k$ are the amplitude, frequency, and phase, respectively. To estimate the parameters of the model, one needs to segment the signal into a number of short-time frames and compute a short-time frequency representation for each frame. Subsequently, the prominent spectral peaks are identified using a peak detection algorithm (possibly enhanced by perceptual-based criteria). Interpolation methods can be used to increase the accuracy of the algorithm [2]. Each peak in the $l$th frame is represented as a triad of the form $\{\alpha_{l,k}, f_{l,k}, \theta_{l,k}\}$ (amplitude, frequency, phase), corresponding to the $k$th sinewave. A peak continuation algorithm is usually employed in order to assign each peak to a frequency trajectory by matching the peaks of the previous frame to the current frame, using linear amplitude interpolation and cubic phase interpolation.
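As a rough illustration of the analysis step described above, the sketch below synthesizes one frame as a sum of sinusoids and recovers (amplitude, frequency, phase) triads by picking the largest local maxima of the magnitude spectrum. It is a simplified stand-in, not the paper's analysis: it assumes a rectangular window, frequencies that fall exactly on DFT bins, no perceptual criteria, no interpolation, and no peak continuation across frames. A naive DFT is used only to keep the example dependency-free, and all function names are hypothetical.

```python
import cmath
import math

def synthesize(params, n_samples):
    """Sum of sinusoids; params holds (amplitude, frequency in cycles/sample, phase)."""
    return [
        sum(a * math.cos(2 * math.pi * f * n + th) for a, f, th in params)
        for n in range(n_samples)
    ]

def dft(x):
    """Naive O(N^2) DFT, adequate for short illustrative frames."""
    n_samples = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_samples) for n in range(n_samples))
        for k in range(n_samples)
    ]

def pick_peaks(x, num_peaks):
    """Return (amplitude, frequency, phase) triads for the largest spectral peaks."""
    n_samples = len(x)
    spectrum = dft(x)
    mags = [abs(c) for c in spectrum[: n_samples // 2 + 1]]
    # Local maxima of the magnitude spectrum over the positive-frequency bins.
    candidates = [
        k for k in range(1, n_samples // 2)
        if mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
    ]
    candidates.sort(key=lambda k: mags[k], reverse=True)
    # For an on-bin real sinusoid, |X[k]| = amplitude * N / 2 and arg X[k] = phase.
    return [
        (2 * mags[k] / n_samples, k / n_samples, cmath.phase(spectrum[k]))
        for k in sorted(candidates[:num_peaks])
    ]

# Hypothetical frame with two on-bin sinusoids (bins 4 and 10 of a 64-sample frame).
frame = synthesize([(1.0, 4 / 64, 0.3), (0.5, 10 / 64, -1.0)], 64)
peaks = pick_peaks(frame, 2)
```

With these inputs the recovered triads match the synthesis parameters up to floating-point error; for frequencies between bins the estimates degrade, which is where the interpolation methods mentioned above come in.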

A more accurate representation of audio signals is achieved when a stochastic component is included in the model. This model is usually called the sinusoids plus noise model, or deterministic plus stochastic decomposition. In this model, the sinusoidal part corresponds to the "deterministic" part of the signal due to the structured nature of this model. The remaining signal is the sinusoidal noise component, also referred to here as residual or sinusoidal error signal, which is the "stochastic" part of the audio signal, since it is very difficult to accurately model, but at the same time essential for high-quality audio synthesis. Accurately modeling the stochastic component has been examined both for the single-channel case, e.g., [2], [20], [21] and the multi-channel audio case [3]. Practically, after the sinusoidal parameters are estimated, the noise component is computed by subtracting the sinusoidal component from the original signal. Note that in this paper we are only interested in encoding the sinusoidal part.

A. Single-Channel Sinusoidal Selection

To perform single-channel sinusoidal analysis, we employed state-of-the-art psychoacoustic analysis based on [22]. In the $k$th iteration, the algorithm picks a perceptually optimal sinusoidal component frequency, amplitude, and phase. This choice minimizes the perceptual distortion measure

$$ D_k = \sum_{\omega} A_k(\omega)\,|R_k(\omega)|^2 \tag{2} $$

where $R_k(\omega)$ is the Fourier transform of the residual signal (original frame minus the currently selected sinusoids) after the $k$th iteration, and $A_k(\omega)$ is a frequency weighting function set as the inverse of the current masking threshold energy.
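The greedy selection described above can be sketched as follows, under several simplifying assumptions that are not from the paper: the weighting function is a fixed, hand-set list of per-bin weights (a real system would derive it from a masking model and update it every iteration), the frame uses a rectangular window, and candidate frequencies are restricted to DFT bins. In that setting, removing an on-bin sinusoid zeroes its bin and the mirror bin, so picking the bin with the largest weighted residual energy is the choice that lowers the weighted distortion the most. Function names are hypothetical.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) DFT, adequate for short illustrative frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def weighted_distortion(residual_spectrum, weights):
    """Sum over bins of weight * |residual|^2."""
    return sum(a * abs(r) ** 2 for a, r in zip(weights, residual_spectrum))

def select_sinusoids(frame, weights, num_sinusoids):
    """Greedily pick bins by weighted residual energy; track distortion per iteration."""
    n = len(frame)
    residual = dft(frame)
    chosen = []
    history = [weighted_distortion(residual, weights)]
    for _ in range(num_sinusoids):
        k = max(range(1, n // 2), key=lambda b: weights[b] * abs(residual[b]) ** 2)
        chosen.append((k, 2 * abs(residual[k]) / n, cmath.phase(residual[k])))
        residual[k] = residual[n - k] = 0.0  # subtract the selected on-bin sinusoid
        history.append(weighted_distortion(residual, weights))
    return chosen, history

N = 64
frame = [1.0 * math.cos(2 * math.pi * 4 * t / N)
         + 0.8 * math.cos(2 * math.pi * 10 * t / N + 0.5) for t in range(N)]
weights = [1.0] * N
weights[4] = weights[N - 4] = 0.1   # hypothetical: bin 4 is heavily masked
chosen, history = select_sinusoids(frame, weights, 2)
```

Here the weaker component at bin 10 is selected before the stronger one at bin 4, because the low weight at bin 4 marks it as perceptually less important.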

One issue with CS encoding is that no further refinement of the sinusoid frequencies can be performed in the encoder, because frequencies which do not correspond to exact frequency bins would result in loss of the sparsity in the frequency domain. This is an important problem, because it implies that we must restrict the sinusoidal frequency estimation to the selection of frequency bins (e.g., following a peak-picking procedure), without the possibility of further refinement of the estimated frequencies in the encoder. This can be alleviated by zero-padding the signal frame, in other words improving the frequency resolution during the parameter estimation by reducing the bin spacing. We have found, though, that for CS-based encoding this can be performed to a limited degree, as zero-padding will increase the number of measurements that must be encoded as explained in Section IV (and consequently the bitrate). Fortunately, this problem can be partly addressed by employing the "frequency mapping" procedure, described in Section IV. Furthermore, since the sparsity restriction need not hold after the signal is decoded, frequency re-estimation can be performed in the decoder, such as interpolation among frames.
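The loss of sparsity for off-bin frequencies is easy to check numerically. The sketch below counts the DFT coefficients that are not negligible for a sinusoid exactly on a bin and for one halfway between two bins, and also shows how the bin spacing shrinks as the transform length grows. The frame length, the relative threshold, and the sampling rate and FFT sizes in the last lines are hypothetical values chosen for illustration.

```python
import cmath
import math

def dft(x):
    """Naive O(N^2) DFT, adequate for short illustrative frames."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def count_significant(x, rel_threshold=1e-6):
    """Number of DFT bins whose magnitude exceeds a fraction of the largest one."""
    mags = [abs(c) for c in dft(x)]
    peak = max(mags)
    return sum(m > rel_threshold * peak for m in mags)

def bin_spacing(sample_rate, fft_length):
    """Frequency distance between adjacent DFT bins, in Hz."""
    return sample_rate / fft_length

N = 64
on_bin = [math.cos(2 * math.pi * 8.0 * t / N) for t in range(N)]   # exactly bin 8
off_bin = [math.cos(2 * math.pi * 8.5 * t / N) for t in range(N)]  # between bins 8 and 9

sparsity_on = count_significant(on_bin)    # only the bin and its mirror are non-zero
sparsity_off = count_significant(off_bin)  # energy leaks into every bin

coarse = bin_spacing(44100, 2048)  # hypothetical analysis without zero-padding
fine = bin_spacing(44100, 4096)    # zero-padding to twice the length halves the spacing
```

The on-bin sinusoid occupies two coefficients, while the off-bin one is no longer sparse at all, which is why the encoder restricts frequencies to the bin grid and relies on a finer grid rather than on frequency refinement.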

B. Multi-Channel Sinusoidal Selection

To perform multi-channel sinusoidal analysis, we have extended the sinusoidal modeling method presented in [23] (which employs a matching pursuit algorithm to determine the model parameters of each frame) to include the psychoacoustic analysis of [22]. For the multi-channel case, in each iteration, the algorithm picks a sinusoidal component frequency that is optimal for all channels, as well as channel-specific amplitudes and phases. This choice minimizes the perceptual distortion measure

$$ D_k = \sum_{c} \sum_{\omega} A_{k,c}(\omega)\,|R_{k,c}(\omega)|^2 \tag{3} $$

where $R_{k,c}(\omega)$ is the Fourier transform of the residual signal of the $c$th channel after the $k$th iteration, and $A_{k,c}(\omega)$ is a frequency weighting function set as the inverse of the current masking threshold energy. The contributions of each channel are simply summed to obtain the final measure.
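One iteration of the multi-channel choice described above can be sketched as follows: the weighted residual energy of every candidate bin is summed over channels, the bin with the largest sum becomes the common frequency, and each channel keeps its own amplitude and phase at that bin. This is a minimal sketch under assumptions not taken from the paper: candidate frequencies are DFT bins, the per-channel weights are given as plain lists, and the residual spectra are built by hand rather than from audio. All names are hypothetical.

```python
import cmath

def pick_common_component(residual_spectra, weights):
    """Pick one bin for all channels; return it with per-channel (amplitude, phase)."""
    n_bins = len(residual_spectra[0])

    def summed_weighted_energy(k):
        # Contributions of all channels are simply added.
        return sum(a[k] * abs(r[k]) ** 2 for r, a in zip(residual_spectra, weights))

    best = max(range(1, n_bins // 2), key=summed_weighted_energy)
    per_channel = [(2 * abs(r[best]) / n_bins, cmath.phase(r[best]))
                   for r in residual_spectra]
    return best, per_channel

def spectrum(n_bins, components):
    """Build a conjugate-symmetric spectrum from a {bin: complex value} mapping."""
    spec = [0j] * n_bins
    for k, value in components.items():
        spec[k] = value
        spec[n_bins - k] = value.conjugate()
    return spec

N = 16
left = spectrum(N, {3: 8 + 0j, 5: 6j})        # strongest at bin 3
right = spectrum(N, {3: 1 + 0j, 5: -6 + 0j})  # strongest at bin 5
unit = [1.0] * N                              # hypothetical flat weighting

best, per_channel = pick_common_component([left, right], [unit, unit])
```

Bin 5 wins even though the left channel alone would have chosen bin 3, since the summed measure (36 + 36) exceeds that of bin 3 (64 + 1); both channels then share the frequency but keep different phases.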

An important question is what masking model is suitable for multi-channel audio where the different channels have different binaural attributes in the reproduction. In transform coding, a common problem is caused by binaural masking level difference (BMLD); sometimes quantization noise that is masked in monaural reproduction is detectable because of binaural release, and using separate masking analysis for different channels is not suitable for loudspeaker rendering. However, this effect in parametric coding is not so well established.

We performed preliminary experiments using: 1) separate masking analysis, i.e., individual $A_{k,c}(\omega)$ based on the masker of channel $c$ for each signal separately [see (3)]; 2) the masker of the sum signal of all channel signals to obtain $A_{k,c}(\omega)$ for all $c$; and 3) power summation of the other signals' attenuated maskers to the masker of channel $c$ according to

$$ A_{k,c}(\omega) = \frac{1}{M_{k,c}(\omega) + \sum_{i \neq c} w_i\, M_{k,i}(\omega)} \tag{4} $$

In the above equation, $M_{k,c}(\omega)$ indicates the masker energy, $w_i$ the estimated attenuation (panning) factor that was varied heuristically, and $i$ iterates through all channel signals excluding $c$. In this paper, we chose to use the first method, i.e., separate masking analysis for each channel $c$, for the reason that we did not find notable differences in BMLD noise unmasking, and that the sound quality seemed to be marginally better with headphone reproduction. For loudspeaker reproduction, the second or third method may be more suitable.
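The difference between the first and third options can be made concrete with a small numeric sketch. It assumes that the weighting is the reciprocal of the masker energy per frequency bin, and that in the power-summation variant the other channels' masker energies are scaled by a single attenuation factor and added to the channel's own masker before taking the reciprocal. Using one scalar factor for all channels, the function names, and the masker values are hypothetical simplifications. The second option is omitted because it needs a masking model applied to the sum signal.

```python
def separate_weights(maskers):
    """Option 1: each channel's weighting is the inverse of its own masker energy."""
    return [[1.0 / m for m in channel] for channel in maskers]

def power_sum_weights(maskers, attenuation):
    """Option 3: add the attenuated masker energies of the other channels first."""
    num_bins = len(maskers[0])
    weights = []
    for c, own in enumerate(maskers):
        combined = [
            own[b] + attenuation * sum(other[b] for i, other in enumerate(maskers) if i != c)
            for b in range(num_bins)
        ]
        weights.append([1.0 / m for m in combined])
    return weights

# Hypothetical masker energies for two channels and two frequency bins.
maskers = [[1.0, 4.0],
           [2.0, 2.0]]

w_sep = separate_weights(maskers)                    # [[1.0, 0.25], [0.5, 0.5]]
w_pow = power_sum_weights(maskers, attenuation=0.5)  # [[0.5, 0.2], [0.4, 0.25]]
```

Power summation always yields weights no larger than separate analysis, i.e., it assumes more masking, because energy from the other channels is counted as additional masker.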

The use of this psychoacoustic multi-channel sinusoidal model resulted in sparser modeled signals, increasing the effectiveness of our compressed sensing encoding.

III. COMPRESSED SENSING

Compressed sensing [15], [16] (also known as compressive sensing or compressive sampling) is an emerging field which has grown up in response to the increasing amount of data that needs to be sensed, processed and stored. A great majority of this data is compressed as soon as it has been sensed at the Nyquist rate. The idea behind compressed sensing is to go directly from the full-rate, analog signal to the compact representation by using measurements in the sparse basis. Thus, the CS theory is based on the assumption that the signal of interest is sparse in some basis as it can be accurately and efficiently represented in that basis. This is not possible unless the sparse basis is known in advance, which is generally not the case. Thus compressed sensing uses random measurements in a basis that is
