
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 8, AUGUST 2009 1095

Motion Estimation Optimization for H.264/AVC Using Source Image Edge Features

Zhenyu Liu, Member, IEEE, Junwei Zhou, Satoshi Goto, Fellow, IEEE, and Takeshi Ikenaga, Member, IEEE

Abstract— The H.264/AVC coding standard employs variable block size motion-compensated prediction with multiple reference frames to achieve a pronounced improvement in compression efficiency. Accordingly, the computation of motion estimation increases in proportion to the product of the number of reference frames and the number of intermodes. The mathematical analysis in this paper shows that motion-compensated prediction errors are mainly determined by the detailed textures in the source image. An image block rich in textures contains numerous high-frequency signals, which make variable block size and multiple reference frame techniques essential. On the basis of rate-distortion theory, this paper treats the spatial homogeneity of an image block as a relative concept with respect to the current quantization step. For a homogeneous block, futile reference frames and intermodes can be eliminated efficiently. It is further revealed that the sum of absolute differences value of an image block is mainly determined by the sum of its edge gradient amplitudes and the current quantization step. Consequently, an image content-based early termination algorithm is proposed, and it outperforms the original method adopted by the JVT reference software. Moreover, a dynamic search range algorithm based on the edge gradient amplitude of the source image block is analyzed. One eminent advantage of the proposed edge-based algorithms is their suitability for the macroblock-pipelining architecture; another desirable feature is their orthogonality to fast block-matching algorithms. Experimental results show that when these algorithms are integrated with hybrid unsymmetrical-cross multi-hexagon-grid search, an average of 31.4–60.0% of motion estimation time can be saved, while the average BDPSNR loss is 0.0497 dB for all tested sequences.

Index Terms— Edge gradient, fast mode decision, H.264/AVC, motion estimation (ME), multiple reference frame (MRF), variable block size (VBS).

I. INTRODUCTION

High-performance video coding algorithms strive to reduce temporal and spatial redundancy. For this purpose, the latest international video coding standard, H.264/AVC, adopts state-of-the-art techniques, including quarter-pixel accuracy, variable block size (VBS), and multiple reference frame (MRF) techniques, to improve the coding gain.

Manuscript received January 1, 2008; revised June 15, 2008, August 31, 2008, and December 15, 2008. First version published May 12, 2009; current version published August 14, 2009. This work was supported by CREST JST. This paper was recommended by Associate Editor G. Wen.

Z. Liu is with RIIT of Tsinghua University, Beijing, 100084 China (e-mail: liuzhenyu73@tsinghua.edu.cn).

J. Zhou is with the Sun Microsystems Incorporation, Santa Clara, CA 95054 USA (e-mail: junwei.zhou@sun.com).

S. Goto and T. Ikenaga are with the Graduate School of IPS, Waseda University, Tokyo, 808-0135 Japan (e-mail: goto@waseda.jp; ikenaga@waseda.jp).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSVT.2009.2022796

According to the analysis in [1], 89.2% of the computation power is consumed by the motion estimation (ME) part. Reducing the redundant computation of ME in H.264/AVC has therefore become a fundamental research topic. In general, ME computation can be saved through the following approaches: 1) reducing the search positions through efficient search patterns (fast block matching), for example, 1-D full search, four-step search, and diamond search; 2) eliminating the searches of redundant search modes and reference frames; 3) terminating block matching early by defining thresholds; and 4) dynamically adjusting the search range. The first category of algorithms, i.e., fast block matching, has been discussed thoroughly in many papers [2]–[4] and widely adopted in software- and hardware-oriented implementations [5], [6]. In this paper, we focus on the approaches in the other three categories. By analyzing the edge features of the source image block, we discard trivial inter-search modes and reference frames, define content-based thresholds for early termination, and dynamically reduce the search range. The proposed approaches are compatible with traditional fast block-matching methods. In addition, they are friendly to the macroblock (MB)-pipelining architecture, which is widely adopted in hardwired encoder designs [7], [8].

VBS and MRF algorithms are the major sources of the massive computation in H.264/AVC encoding. The required computation is directly proportional to the product of the number of reference frames and the number of intermodes. Traditional fast block-matching algorithms cannot efficiently reduce the computational complexity introduced by VBS and MRF techniques. On the other hand, it has been shown that the performance of VBS and MRF algorithms depends mainly on the nature of the video sequence [1]. This means that a great deal of computation is performed without achieving any coding improvement. The experiments in this paper further reveal that significant superfluous computation exists in H.264/AVC ME processing at low bit rates.
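The following sketch makes the "product" claim concrete by counting SAD evaluations per macroblock under a simplified full-search cost model. The partition counts are the seven H.264/AVC inter block sizes. The function name, the one-SAD-per-partition accounting, and the ±16 search range are our assumptions. Practical encoders, especially hardware ones, often reuse 4×4 SADs to build the larger partitions.

```python
from typing import Iterable

# Partitions per 16x16 macroblock for each H.264/AVC inter block size.
# Sub-MB sizes (8x4, 4x8, 4x4) are counted over all four 8x8 sub-MBs.
PARTITIONS_PER_MB = {
    "16x16": 1, "16x8": 2, "8x16": 2, "8x8": 4,
    "8x4": 8, "4x8": 8, "4x4": 16,
}


def full_search_sad_evaluations(num_ref_frames: int, modes: Iterable[str], search_range: int) -> int:
    """Hypothetical cost model: one SAD per partition, per candidate position, per reference frame."""
    positions = (2 * search_range + 1) ** 2          # integer positions in a +/-search_range window
    partitions = sum(PARTITIONS_PER_MB[m] for m in modes)
    return num_ref_frames * partitions * positions


all_modes = list(PARTITIONS_PER_MB)
print(full_search_sad_evaluations(5, all_modes, 16))   # 5 refs x 41 partitions x 1089 = 223245
print(full_search_sad_evaluations(1, ["16x16"], 16))   # single ref, 16x16 only = 1089
```

In this model, doubling the number of reference frames doubles the cost, and enabling every intermode multiplies it by 41. This is why pruning reference frames and intermodes targets a different source of computation than a faster search pattern does.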

Many algorithms have been proposed to discard the redundant computation in MRFs [1], [9]–[11]. Huang et al. [1] developed four criteria for terminating the motion search on MRFs early. Other works [9]–[11] reduced the search areas by exploiting the strong correlations of motion vectors (MVs) in consecutive pictures. However, the restrictions of the MB-pipelining architecture either degrade their performance or introduce considerable hardware overhead, as we shall see in Section VI. Although some reasons have been adduced in [1] and [9] for the superior prediction performance of the MRF technique, the analysis of [12], [13] reveals that

Authorized licensed use limited to: China Three Gorges University IEL Trial. Downloaded on November 3, 2009 at 22:27 from IEEE Xplore. Restrictions apply.


the most critical issue is the aliasing problem [14] arising from high-frequency signals (or detailed textures in the spatial domain) of the source image. Consequently, the efficiency of the MRF technique depends on the homogeneity of the image block. Using rate-distortion theory, we define homogeneity as a relative concept that depends on the power spectral density of the current quantization noise. For a homogeneous image block, discarding the MRF technique introduces only negligible coding quality loss, as shown in Section VII-A. As the quantization parameter (QP) increases, more image blocks are classified as homogeneous by the proposed criteria, so the computation saving of our methods improves with increasing QP. Moreover, the proposed homogeneity-based reference frame reduction algorithm is well suited to the MB-pipelining architecture, as analyzed in Section VI.

The homogeneity-based fast intermode decision algorithm was first proposed by Wu et al. [15] to discard redundant intermodes in H.264/AVC. Specifically, the homogeneity of an N × N block, where N = 16 or N = 8, is determined by evaluating the sum of the edge magnitudes at all pixels in the block. If the sum is less than a preset constant threshold, the block is designated as homogeneous and is not split further for other intermode searches. The thresholds for the homogeneity decision in [15] are constants. However, experiments show that as QP increases, the rate cost of side information such as MVs and sub-MB types becomes expensive, and the ratio of 8 × 8 sub-MB modes consistently declines. Hence, this paper also adopts the relative homogeneity concept in the intermode reduction algorithm to improve the computation saving at low bit rates.
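Below is a minimal sketch of the constant-threshold test attributed to [15] above. It assumes a Sobel operator with edge amplitude |Gx| + |Gy|, and its threshold of 5000 is arbitrary. The operator choice, the threshold, and the function names are our illustrative assumptions, not values taken from [15] or from this paper.

```python
import numpy as np


def sobel_amplitude(frame):
    """|Gx| + |Gy| Sobel edge amplitude at each interior pixel; border pixels are left at 0."""
    p = np.asarray(frame, dtype=float)
    gx = np.zeros_like(p)
    gy = np.zeros_like(p)
    # Horizontal gradient: right column minus left column of the 3x3 neighbourhood.
    gx[1:-1, 1:-1] = ((p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:])
                      - (p[:-2, :-2] + 2 * p[1:-1, :-2] + p[2:, :-2]))
    # Vertical gradient: bottom row minus top row of the 3x3 neighbourhood.
    gy[1:-1, 1:-1] = ((p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:])
                      - (p[:-2, :-2] + 2 * p[:-2, 1:-1] + p[:-2, 2:]))
    return np.abs(gx) + np.abs(gy)


def is_homogeneous(frame, top, left, n, threshold):
    """Hypothetical constant-threshold homogeneity test for the n x n block at (top, left)."""
    amp = sobel_amplitude(frame)
    return bool(amp[top:top + n, left:left + n].sum() < threshold)


# A 32x32 frame with a vertical step edge inside its top-left 16x16 block.
frame = np.zeros((32, 32))
frame[:, 8:] = 100.0
print(is_homogeneous(frame, 0, 0, 16, threshold=5000.0))    # -> False (edge block)
print(is_homogeneous(frame, 16, 16, 16, threshold=5000.0))  # -> True (flat block)
```

Because the threshold is a constant, the same block gets the same verdict at every QP. The relative homogeneity concept adopted in this paper is meant to remove exactly that limitation.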

Early termination schemes were used in [1], [16], [17]; they rely either on all-zero block detection or on constant thresholds derived from experiments. The investigations in [16] show that at high bit rates, where the quantization interval is small, the all-zero block detection algorithm provides hardly any computation saving, especially for texture-rich video sequences. In this paper, the spatial investigation reveals that the sum of absolute differences (SAD) value of an image block is strongly correlated with the sum of the edge amplitudes of all pixels in the block and with the current quantization interval ($Q_{step}$). This conclusion explains the strong SAD correlation in pictures illustrated in [5]. Consequently, content-based early termination thresholds for integer motion estimation (IME) are developed. Experiments show that at high and moderate bit rates, the proposed adaptive thresholds outperform the original method [17] adopted by JM reference software version 11.0 (JM11.0) in both coding quality and computation saving, especially for sequences rich in detailed textures.

The mathematical analysis in [18] suggested that the relative motion between the object and the camera sensor acts as a low-pass filter that smooths the edges of the sampled image; this effect is known as motion blur [19]. Therefore, under typical video recording conditions [20] (a 30-frames/s frame rate, a 1/60-s exposure time, and no synthetic videos), we can estimate the motion speed of an image block from its edge gradient. Specifically, a block containing a large edge gradient amplitude can reasonably be assumed to undergo slow movement, so its search range can be reduced accordingly. This algorithm is combined with the original dynamic search range (DSR) algorithm in JM11.0 [21] to further improve its computation saving.
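The motion-blur argument can be illustrated with a box-filter model. This is our simplifying assumption, not necessarily the model used in [18]. During a 1/60-s exposure at 30 frames/s, an edge moving v pixels per frame travels v/2 pixels while the shutter is open, so a sampled step edge turns into a ramp whose steepest gradient falls roughly as the step height divided by the blur length. The function name, the point-sampling sensor, and the step height of 100 are hypothetical.

```python
import numpy as np


def max_blurred_edge_gradient(speed_px_per_frame, frame_rate=30.0, exposure=1 / 60,
                              height=100.0, n_sub=1000):
    """Steepest sampled gradient of a step edge after box-filter motion blur.

    Assumed model: the edge moves uniformly while the shutter is open, so each
    sensor value is the time average of the step over the exposure.
    """
    blur_len = speed_px_per_frame * exposure * frame_rate   # pixels travelled during exposure
    x = np.arange(-10.0, 30.0)                               # point-sampled sensor positions
    offsets = np.linspace(0.0, blur_len, n_sub)              # edge displacement during exposure
    # Rows: step edge located at 0.5 + offset; columns: sensor positions.
    steps = height * (x[None, :] >= 0.5 + offsets[:, None])
    blurred = steps.mean(axis=0)                             # temporal integration = box blur
    return float(np.abs(np.diff(blurred)).max())


for v in (0, 8, 20):
    print(v, round(max_blurred_edge_gradient(v), 2))
```

Under this model, a stationary edge keeps its full one-pixel step, while an edge moving 8 pixels per frame is spread over about 4 pixels. Inverting the relation is the intuition behind shrinking the search range for blocks with steep edges.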

The rest of this paper is organized as follows. In Section II, the impact of the spatial edge gradient on prediction error is analyzed. Section III then proposes the edge-based fast reference frame and intermode decision algorithms. The content-based early termination approach and the edge gradient-based search range decision algorithm are described in Sections IV and V, respectively. The whole process flow of the proposed fast algorithms is depicted in Section VI. The proposed algorithms are integrated with unsymmetrical-cross multi-hexagon-grid search (UMHexagonS) [5] to demonstrate their compatibility with traditional fast block-matching algorithms. Section VII presents detailed experimental results that verify the performance of the proposed schemes. Finally, conclusions are drawn in Section VIII.

II. MATHEMATICAL ANALYSIS OF EDGE GRADIENT IMPACT TO PREDICTION ERROR

Using the hybrid coder model [12], Girod showed that when $S_{ss}(\Lambda) > \Theta$, the power spectral density of the prediction errors is

$$S_{ee}(\Lambda) = S_{ss}(\Lambda)\left[1 - |P(\Lambda)|^2\,\frac{S_{ss}(\Lambda)}{S_{ss}(\Lambda) + \Theta}\right] \qquad (1)$$

where $\Lambda = (\omega_x, \omega_y)$; $S_{ee}$ and $S_{ss}$ denote the power spectral densities of the prediction errors and the source signals, respectively; $P(\Lambda)$ is the 2-D Fourier transform of the probability density function (pdf) of the displacement estimation error; and $\Theta$ can be interpreted as the power spectral density of the white noise introduced by quantization. From (1), the prediction error power ($S_{ee}$) is strongly correlated with the source signals $S_{ss}$. When $S_{ss} \gg \Theta$, (1) simplifies to $S_{ee}(\Lambda) = S_{ss}(\Lambda)\left[1 - |P(\Lambda)|^2\right]$. In this case, the power of the prediction error depends entirely on the image content and the pdf of the displacement estimation error.
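Equation (1) is easy to evaluate numerically. For illustration only, the sketch below assumes a 1-D displacement error uniformly distributed on [−1/2, 1/2] pixel, whose Fourier transform is a sinc (a separable 2-D uniform pdf would give the product of two such terms). The function names are hypothetical. At ω = 0 the prediction error vanishes only as Θ → 0, and at frequencies where |P| = 0 the whole source spectrum leaks into the prediction error.

```python
import numpy as np


def prediction_error_psd(s_ss, p, theta):
    """Eq. (1): S_ee = S_ss * (1 - |P|^2 * S_ss / (S_ss + Theta)), valid where S_ss > Theta."""
    s_ss = np.asarray(s_ss, dtype=float)
    return s_ss * (1.0 - np.abs(p) ** 2 * s_ss / (s_ss + theta))


def uniform_displacement_cf(omega, width=1.0):
    """Fourier transform of a pdf uniform on [-width/2, width/2] (assumed displacement-error pdf).

    sin(omega*width/2) / (omega*width/2) == np.sinc(omega*width / (2*pi)),
    because np.sinc(x) is defined as sin(pi*x) / (pi*x).
    """
    return np.sinc(np.asarray(omega, dtype=float) * width / (2 * np.pi))


s_ss = 10.0
omegas = np.array([0.0, 0.5, 1.0, 2.0, 2 * np.pi])
print(prediction_error_psd(s_ss, uniform_displacement_cf(omegas), theta=1.0))
```

The printed values rise from $S_{ss}\Theta/(S_{ss}+\Theta)$ at ω = 0 toward $S_{ss}$ as ω approaches the first zero of the sinc at 2π. In other words, high-frequency content is predicted worst, which is the aliasing effect discussed in the text.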

The spatial analysis offers a more direct insight into the prediction error power than the spectral analysis in [12]. In this section, the impact of the edge intensity of the source image on the prediction errors is investigated in the spatial domain. To simplify the mathematical description, the analysis is first restricted to 1-D spatial signals, as shown in Fig. 1, and the quantization noise is temporarily ignored. $s_t(x)$ and $s_{t-1}(x)$ denote the spatially continuous signals at time instances $t$ and $t-1$. $s_t(x)$ is a version of $s_{t-1}(x)$ displaced by the distance $d_x$, which can be expressed as $s_t(x) = s_{t-1}(x - d_x)$. These continuous image signals are sampled by the sensor array before digital processing, with spatial sampling interval $u_x$. The displacement estimation error is $\Delta_x = d_x - \mathrm{round}(d_x/u_x) \cdot u_x$.

From Fig. 1, the prediction error $e(i \cdot u_x)$ of pixel $i$ can be approximated as

$$e(i \cdot u_x) \approx \Delta_x \cdot s_t'(i \cdot u_x) \qquad (2)$$



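(Figure: camera sensor samples $s_t(i \cdot u_x)$ and $s_{t-1}(i \cdot u_x)$ of the continuous signals $s_t(x)$ and $s_{t-1}(x)$ at sensor positions $i$ to $i+3$, with the prediction error $e(i \cdot u_x)$ marked.)

Fig. 1. Analysis of 1-D prediction error caused by edge gradient and displacement estimation error.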

where $s_t'(i \cdot u_x)$ is the edge gradient of $s_t(x)$ at the $i$th camera sensor, and the displacement estimation error $\Delta_x$ is a random variable with zero mean and $\Delta_x \in [-u_x/2, u_x/2]$. When $\Delta_x = \pm u_x/2$, $|e(i \cdot u_x)|$ reaches its maximum value $(u_x \cdot |s_t'(i \cdot u_x)|)/2$, and when $\Delta_x = 0$, $|e(i \cdot u_x)|$ vanishes.

This conclusion agrees with the aliasing investigation in the spectral domain reported in [14]. Equation (2) also explains why MRFs are needed during prediction. Suppose the displacement error $\Delta_{x,t-1}$ between the current image $s_t(x)$ and the first previous image $s_{t-1}(x)$ is larger than that of the $k$th previous image $s_{t-k}(x)$, i.e., $|\Delta_{x,t-1}| > |\Delta_{x,t-k}|$. Then $s_{t-k}(x)$ is preferred as the prediction signal, because its prediction error arising from the aliasing problem is smaller.
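Approximation (2) can be checked numerically. The sketch below assumes a smooth test signal $s(x) = 50\sin(0.1x)$, unit sampling interval, and the sign convention error = prediction − current, under which $e \approx \Delta_x s_t'$ holds with the sign shown in (2). The function and signal names are hypothetical.

```python
import numpy as np


def integer_pel_prediction_error(signal, d, u=1.0, n=64):
    """1-D prediction error when s_t(x) = s_{t-1}(x - d) is predicted from s_{t-1}
    shifted by the integer-pel estimate round(d/u)*u.

    Sign convention (an assumption): error = prediction - current.
    """
    x = np.arange(n) * u                # sensor positions i*u
    r = np.round(d / u) * u             # integer-pel estimate (np.round rounds halves to even)
    delta = d - r                       # displacement estimation error, in [-u/2, u/2]
    current = signal(x - d)             # s_t(i*u)
    prediction = signal(x - r)          # s_{t-1}(i*u - r)
    return prediction - current, delta


def s(x):
    return 50.0 * np.sin(0.1 * x)       # smooth test signal (hypothetical)


def ds(x):
    return 5.0 * np.cos(0.1 * x)        # its derivative


e, delta = integer_pel_prediction_error(s, d=3.5)
x = np.arange(64.0)
# Largest deviation from Delta_x * s_t'(i): only the O(Delta^2) Taylor remainder.
print(delta, np.max(np.abs(e - delta * ds(x - 3.5))))
```

With d = 3.5 the error sits at the half-pixel worst case, |Δx| = 0.5. With d = 3.0 the prediction is exact and e vanishes, as stated above.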

To simplify the notation in the following discussion, the spatial sampling intervals in the x- and y-directions are assumed to be $u_x = u_y = 1$. From (2), the 2-D prediction error at one pixel follows directly:

$$e(i,j) \approx \Delta_x(i,j) \cdot \frac{\partial s_t(i,j)}{\partial x} + \Delta_y(i,j) \cdot \frac{\partial s_t(i,j)}{\partial y}. \qquad (3)$$

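If $\Delta_x(i,j)$ and $\Delta_y(i,j)$ are assumed to be independent, with $E(\Delta_x) = E(\Delta_y) = 0$ and $E(\Delta_x^2) = E(\Delta_y^2) = \sigma_\Delta^2$, then the variance of $e(i,j)$, i.e., $\sigma_e^2(i,j)$, is

$$\sigma_e^2(i,j) = \sigma_\Delta^2\left[\left(\frac{\partial s_t(i,j)}{\partial x}\right)^2 + \left(\frac{\partial s_t(i,j)}{\partial y}\right)^2\right]. \qquad (4)$$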

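Using the per-pixel prediction error variance (4), the prediction error power of an image block can be deduced as

$$\sum_{i,j} \sigma_e^2(i,j) = \sigma_\Delta^2 \sum_{i,j}\left[\left(\frac{\partial s_t(i,j)}{\partial x}\right)^2 + \left(\frac{\partial s_t(i,j)}{\partial y}\right)^2\right] \qquad (5)$$

where $(i,j) \in$ block.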

Like the spectral analysis represented by (1), (5) indicates that the prediction error power is determined by the image features and the displacement estimation error. In addition, the spatial analysis shows that the power of the block prediction error is proportional to the sum of the squared edge gradient amplitudes. This conclusion plays an important role in the early termination threshold definition described in Section IV.
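Equation (5) can be sanity-checked by Monte Carlo simulation of the linearized error (3). The sketch assumes displacement errors uniform on [−1/2, 1/2], so that $\sigma_\Delta^2 = 1/12$. It uses np.gradient as the gradient operator and draws one $(\Delta_x, \Delta_y)$ pair per trial that is shared by the whole block, as with a single motion vector. All of these choices and the function names are our assumptions. Equation (5) only needs zero means and independence, because the cross term $E(\Delta_x \Delta_y)$ vanishes.

```python
import numpy as np


def predicted_block_error_power(block, sigma2):
    """Right-hand side of (5): sigma_Delta^2 * sum over the block of (ds/dx)^2 + (ds/dy)^2."""
    # np.gradient: central differences inside, one-sided at the borders;
    # returns the axis-0 (y, rows) derivative first, then axis-1 (x, columns).
    gy, gx = np.gradient(np.asarray(block, dtype=float))
    return float(sigma2 * np.sum(gx ** 2 + gy ** 2))


def simulated_block_error_power(block, trials=20000, seed=0):
    """Monte Carlo estimate of E[sum of e(i,j)^2] using the linearized error (3)."""
    rng = np.random.default_rng(seed)
    gy, gx = np.gradient(np.asarray(block, dtype=float))
    dx = rng.uniform(-0.5, 0.5, size=(trials, 1, 1))   # one Delta_x per trial, shared by the block
    dy = rng.uniform(-0.5, 0.5, size=(trials, 1, 1))   # independent Delta_y
    e = dx * gx + dy * gy                               # (3) at every pixel, for every trial
    return float(np.mean(np.sum(e ** 2, axis=(1, 2))))


block = np.random.default_rng(1).integers(0, 256, size=(8, 8))
print(predicted_block_error_power(block, sigma2=1 / 12), simulated_block_error_power(block))
```

The two numbers agree to within Monte Carlo noise, and a flat block gives zero predicted error power. This is the sense in which texture-poor blocks gain little from finer displacement or extra reference frames.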

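(Figure: block diagram of the hybrid coder, showing the input $S_t(u,v)$, the prediction error $E(u,v)$, the optimum forward channel consisting of $G(u,v)$ and the additive noise $N(u,v)$, the filter $F(u,v)$, and the prediction signal $\hat{S}_{t-1}(u,v)$.)

Fig. 2. Model of hybrid coder with the optimum forward channel, $G(u,v) = \max[0, 1 - \Theta/S_{ee}(u,v)]$, and the power spectral density of $N(u,v)$ is $S_{nn}(u,v) = \max[0, \Theta(1 - \Theta/S_{ee}(u,v))]$.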
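The optimum forward channel in the Fig. 2 caption is the classical reverse water-filling realization for a Gaussian source. The sketch below uses hypothetical function names and the standard Gaussian rate formula, which is an assumed model. It checks that with this $G$ and noise spectrum the per-frequency distortion equals $\min(S_{ee}, \Theta)$, and that components with $S_{ee} \le \Theta$ receive no rate. It only illustrates why homogeneity can be judged relative to the quantization noise level; the detection rule of Section III-A is not reproduced in this excerpt.

```python
import numpy as np


def optimum_forward_channel(s_ee, theta):
    """Fig. 2 channel: G = max(0, 1 - Theta/S_ee); noise PSD = max(0, Theta*(1 - Theta/S_ee))."""
    s_ee = np.asarray(s_ee, dtype=float)
    g = np.maximum(0.0, 1.0 - theta / s_ee)
    s_nn = np.maximum(0.0, theta * (1.0 - theta / s_ee))
    return g, s_nn


def distortion_psd(s_ee, theta):
    """PSD of E - (G*E + N), with N independent of E: (1 - G)^2 * S_ee + S_nn."""
    g, s_nn = optimum_forward_channel(s_ee, theta)
    return (1.0 - g) ** 2 * np.asarray(s_ee, dtype=float) + s_nn


def rate_bits(s_ee, theta):
    """Gaussian rate per spectral component (assumed model): max(0, 0.5*log2(S_ee/Theta))."""
    return np.maximum(0.0, 0.5 * np.log2(np.asarray(s_ee, dtype=float) / theta))


spectrum = np.array([0.5, 1.0, 4.0, 100.0])
print(distortion_psd(spectrum, theta=2.0))   # componentwise min(S_ee, Theta)
print(rate_bits(spectrum, theta=2.0))        # zero wherever S_ee <= Theta
```

Raising Θ (a coarser quantizer) pushes more spectral components below the water level. A block whose whole spectrum lies below Θ is then coded at zero rate, so it behaves as "homogeneous" at that QP even if it would not at a finer one.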

Equation (3) yields two important conclusions.
1) Because of the displacement error terms $|\Delta_x|$ and $|\Delta_y|$, the impact of aliasing vanishes at full-pixel displacements and is at its maximum at half-pixel displacements.
2) Because of the edge gradient terms $(\partial s_t(i,j)/\partial x, \partial s_t(i,j)/\partial y)$, aliasing is caused by high-frequency signals in the source image.

In practice, a picture that is rich in sharp edges must contain numerous high-frequency signals. In [22], the gradient $(\partial s(x,y)/\partial x, \partial s(x,y)/\partial y)$ of a 2-D spatial signal $s(x,y)$ is defined as the local spatial frequency, which describes the local frequency behavior of a region. Spatial edge gradient analysis is superior to spectral analysis because it efficiently reveals the local frequency nature of the image with trivial computational overhead. Therefore, as we shall see in Section III, when an image block contains numerous textures, the power of its prediction errors increases, which calls for advanced coding tools such as VBS and MRF techniques. Otherwise, the redundant computation can be discarded with negligible coding quality degradation. This is the essence of our homogeneity-based fast algorithms.
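The link between edge gradient and local frequency can be checked on a sinusoid. For $s(x) = A\sin(2\pi f x)$, the mean of $|s'(x)|$ over whole periods is $4Af$, so the gradient amplitude grows linearly with frequency. The sketch below measures the mean absolute first difference of sampled sinusoids. The function name and the amplitude of 50 are arbitrary choices for illustration.

```python
import numpy as np


def mean_gradient_amplitude(freq, amplitude=50.0, n=256):
    """Mean |first difference| of the sampled sinusoid amplitude*sin(2*pi*freq*x).

    For low freq (cycles per pixel) this approaches the continuous value
    4*amplitude*freq, the mean of |s'(x)| over whole periods.
    """
    x = np.arange(n, dtype=float)
    s = amplitude * np.sin(2 * np.pi * freq * x)
    return float(np.mean(np.abs(np.diff(s))))


for f in (1 / 64, 1 / 16, 1 / 8):
    print(f, mean_gradient_amplitude(f), 4 * 50.0 * f)
```

The measured gradient amplitude closely tracks $4Af$ at low frequencies and keeps increasing up to the Nyquist limit. A simple gradient sum is therefore a cheap proxy for how much high-frequency content a block contains.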

III. HOMOGENEITY-BASED REFERENCE FRAME AND INTERMODE REDUCTION

Section III-A develops the relative homogeneity concept using rate-distortion theory. Section III-B then describes how the relative homogeneous block detection algorithm is used to eliminate futile reference frames and intermodes efficiently.

A. Relative Homogeneous Block Detection Algorithm

The hybrid coder model with the optimum forward channel shown in Fig. 2 provides a convenient basis for developing the relative homogeneity concept. Capital letters, for example $S_t(u,v)$, represent the discrete 2-D Fourier transforms of the corresponding spatial signals. Let $S_t(u,v)$ denote the small $N \times N$ image block to be encoded by the hybrid coder, and let $\hat{S}_{t-1}(u,v)$ be the prediction signal generated from the previously decoded image signals by the low-pass filter $F(u,v)$. The optimum forward channel consists of a nonideal band-limiting filter $G(u,v)$ and an additive noise $N(u,v)$. With rate-distortion theory [23], the distortion $D$ and the
