PRE-PUBLICATION DRAFT, TO APPEAR IN IEEE TRANS. ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, DEC. 2012
cient levels are coded using a three-dimensional run-level-last
VLC, with tables optimized for lower bit rates. The first ver-
sion of H.263 contains four annexes (annexes D through G)
that specify additional coding options, among which annexes
D and F are frequently used for improving coding efficiency.
The usage of annex D allows motion vectors to point outside
the reference picture, a key feature that is not permitted in
H.262/MPEG-2 Video. Annex F introduces a coding mode for
P pictures, the inter 8×8 mode, in which four motion vectors
are transmitted for a MB, each for an 8×8 sub-block. It further
specifies the usage of overlapped block motion compensation.
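The three-dimensional run-level-last coding mentioned above can be illustrated with a small sketch. The toy Python function below only forms the (run, level, last) symbols from a scanned coefficient list; the actual mapping of each symbol to a variable-length code word via the standardized tables is omitted:

```python
def run_level_last_tokens(coeffs):
    """Form (run, level, last) symbols from a zig-zag scanned coefficient
    list, sketching the symbol formation behind the three-dimensional
    run-level-last VLC of H.263 (the VLC table lookup itself is omitted)."""
    positions = [i for i, c in enumerate(coeffs) if c != 0]
    tokens = []
    prev = -1
    for idx, pos in enumerate(positions):
        run = pos - prev - 1                  # zeros preceding this coefficient
        last = 1 if idx == len(positions) - 1 else 0
        tokens.append((run, coeffs[pos], last))
        prev = pos
    return tokens

# Example: a scanned block with interior and trailing zeros
print(run_level_last_tokens([7, 0, 0, -2, 1, 0, 0, 0]))
# -> [(0, 7, 0), (2, -2, 0), (0, 1, 1)]
```

The `last` flag makes an explicit end-of-block symbol unnecessary, since the final nonzero coefficient marks itself.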
The second and third versions of H.263, which are often
called H.263+ and H.263++, respectively, add several optional
coding features in the form of annexes. Annex I improves the
intra coding by supporting a prediction of intra AC coeffi-
cients, defining alternative scan patterns for horizontally and
vertically predicted blocks, and adding a specialized quantiza-
tion and VLC for intra coefficients. Annex J specifies a
deblocking filter that is applied inside the motion compensa-
tion loop. Annex O adds scalability support, which includes a
specification of B pictures roughly similar to those in
H.262/MPEG-2 Video. Some limitations of version 1 in terms
of quantization are removed by annex T, which also improves
the chroma fidelity by specifying a smaller quantization step
size for chroma coefficients than for luma coefficients. An-
nex U introduces the concept of multiple reference pictures.
With this feature, motion-compensated prediction is not re-
stricted to use just the last decoded I/P picture (or, for coded B
pictures using annex O, the last two I/P pictures) as a refer-
ence picture. Instead, multiple decoded reference pictures are
inserted into a picture buffer and can be used for inter predic-
tion. For each motion vector, a reference picture index is
transmitted, which indicates the employed reference picture
for the corresponding block. The other annexes in H.263+ and
H.263++ mainly provide additional functionalities such as the
specification of features for improved error resilience.
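The per-block reference selection enabled by annex U can be sketched as follows. This toy Python/NumPy function tries each picture in the reference buffer and returns the winning reference index together with the motion vector; the exhaustive candidate list and SAD cost are stand-ins for a real encoder's motion search:

```python
import numpy as np

def best_reference(block, ref_buffer, candidates):
    """For one block, pick the reference picture index and motion vector
    that minimize SAD over a small candidate set -- a toy sketch of the
    multi-reference selection enabled by H.263 annex U. `candidates` is a
    list of (dy, dx) displacements into each reference picture."""
    h, w = block.shape
    best = None
    for ref_idx, ref in enumerate(ref_buffer):
        for dy, dx in candidates:
            patch = ref[dy:dy + h, dx:dx + w]
            if patch.shape != block.shape:    # candidate outside the picture
                continue
            sad = np.abs(block.astype(int) - patch.astype(int)).sum()
            if best is None or sad < best[0]:
                best = (sad, ref_idx, (dy, dx))
    # Transmitted per motion vector: the reference index and the vector
    return best[1], best[2]
```

In the standard, the reference index is coded alongside each motion vector, exactly as the return value here suggests.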
The H.263 profiles that provide the best coding efficiency
are the Conversational High Compression (CHC) profile and
the High Latency Profile (HLP). The CHC profile includes
most of the optional features (annexes D, F, I, J, T, and U) that
provide enhanced coding efficiency for low-delay applica-
tions. The High Latency Profile adds the support of B pictures
(as defined in annex O) to the coding efficiency tools of the
CHC profile and is targeted for applications that allow a high-
er coding delay.
C. ISO/IEC 14496-2 (MPEG-4 Visual)
MPEG-4 Visual [9], a.k.a. Part 2 of the MPEG-4 suite, is
backward compatible with H.263 in the sense that each conform-
ing MPEG-4 decoder must be capable of decoding H.263
Baseline bitstreams (i.e., bitstreams that use no optional H.263
annex features). As with annex F of H.263, the inter
prediction in MPEG-4 can be done with 16×16 or 8×8 blocks.
While the first version of MPEG-4 only supports motion com-
pensation with half-sample precision motion vectors and bi-
linear interpolation (similar to H.262/MPEG-2 Video and
H.263), version 2 added support for quarter-sample precision
motion vectors. The luma prediction signal at half-sample
locations is generated using an 8-tap interpolation filter. For
generating the quarter-sample positions, bi-linear interpolation
of the integer- and half-sample positions is used. The chroma
prediction signal is generated by bi-linear interpolation. Mo-
tion vectors are differentially coded using a component-wise
median prediction and are allowed to point outside the refer-
ence picture. MPEG-4 Visual supports B pictures (in some
profiles), but it does not support the feature of multiple refer-
ence pictures (except on a slice basis for loss resilience pur-
poses) and it does not specify a deblocking filter inside the
motion compensation loop.
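The component-wise median prediction of motion vectors can be sketched in a few lines of Python. This simplified function takes the motion vectors of three neighboring blocks (left, above, above-right) and omits the boundary and availability rules the standards define:

```python
def median_mv_predictor(mv_left, mv_above, mv_above_right):
    """Component-wise median prediction of a motion vector from three
    neighboring blocks, as used (in simplified form) by MPEG-4 Visual
    and later by H.264/MPEG-4 AVC. Boundary rules are omitted."""
    def median3(a, b, c):
        # Median of three values without sorting
        return a + b + c - min(a, b, c) - max(a, b, c)
    return tuple(median3(l, u, ur)
                 for l, u, ur in zip(mv_left, mv_above, mv_above_right))

# The encoder codes only the difference between the actual motion
# vector and this predictor.
pred = median_mv_predictor((4, -2), (6, 0), (2, 2))
# pred == (4, 0): medians of (4, 6, 2) and (-2, 0, 2)
```

Taking the median per component makes the predictor robust to a single outlier among the three neighbors.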
The transform coding in MPEG-4 Visual is conceptually similar
to that of H.262/MPEG-2 Video and H.263. However, two different
quantization methods are supported. The first quantization
method, which is sometimes referred to as MPEG-style quan-
tization, supports quantization weighting matrices similarly to
H.262/MPEG-2 Video. With the second quantization method,
which is called H.263-style quantization, the same quantiza-
tion step size is used for all transform coefficients with the
exception of the DC coefficient in intra blocks. The transform
coefficient levels are coded using a three-dimensional run-
level-last code as in H.263. As in annex I of H.263,
MPEG-4 Visual also supports the prediction of AC coeffi-
cients in intra blocks as well as alternative scan patterns for
horizontally and vertically predicted intra blocks and the usage
of a separate VLC table for intra coefficients.
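The difference between the two quantization methods can be sketched as follows. Both Python functions are simplifications: the uniform AC step size of 2·QP is approximate, the normalization of the weighting matrix by 16 is illustrative rather than the normative formula, and intra-DC handling and dead-zone behavior are omitted:

```python
import numpy as np

def quantize_h263_style(coeffs, qp):
    """'H.263-style' quantization: the same step size (roughly 2*QP)
    for all coefficients. Intra-DC and dead-zone details are omitted."""
    return np.round(coeffs / (2 * qp)).astype(int)

def quantize_mpeg_style(coeffs, qp, weight_matrix):
    """'MPEG-style' quantization: the step size varies per coefficient
    via a weighting matrix, as in H.262/MPEG-2 Video. The division of
    the matrix by 16 is an illustrative normalization, not the
    normative formula."""
    step = 2 * qp * weight_matrix / 16.0
    return np.round(coeffs / step).astype(int)
```

A weighting matrix with larger entries for high-frequency positions quantizes those coefficients more coarsely, matching the lower perceptual sensitivity to high-frequency distortion.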
For the comparisons in this paper, we used the Advanced
Simple Profile (ASP) of MPEG-4 Visual, which includes all
relevant coding tools. We generally enabled quarter-sample
precision motion vectors. MPEG-4 ASP additionally includes
global motion compensation. Due to its limited benefit in
practice and the difficulty of estimating global motion fields
that actually improve coding efficiency, this feature is rarely
supported in encoder implementations and is also not used in
our comparison.
D. ITU-T Rec. H.264 | ISO/IEC 14496-10 (MPEG-4 AVC)
H.264/MPEG-4 AVC [10][12] is the second video coding
standard that was jointly developed by ITU-T VCEG and
ISO/IEC MPEG. It still uses the concept of 16×16 macro-
blocks, but contains many additional features. One of the most
obvious differences from older standards is its increased flexi-
bility for inter coding. For the purpose of motion-compensated
prediction, a macroblock can be partitioned into square and
rectangular block shapes with sizes ranging from 4×4 to
16×16 luma samples. H.264/MPEG-4 AVC also supports
multiple reference pictures. Similarly to annex U of H.263,
motion vectors are associated with a reference picture index
for specifying the employed reference picture. The motion
vectors are transmitted using quarter-sample precision relative
to the luma sampling grid. Luma prediction values at half-
sample locations are generated using a 6-tap interpolation
filter and prediction values at quarter-sample locations are
obtained by averaging two values at integer- and half-sample
positions. Weighted prediction can be applied using a scaling
and offset of the prediction signal. For the chroma compo-
nents, a bi-linear interpolation is applied. In general, motion
vectors are predicted by the component-wise median of the