1. Introduction
Three-dimensional (3D) video can provide viewers with a high-quality and immersive multimedia
experience, and it has drawn increasing attention from both industry and academia [1]. Two
typical 3D applications have appeared in the form of three-dimensional television (3DTV) [2] and
free-viewpoint television (FTV) [3]. In 3DTV applications, multiple views from different viewing angles
can be rendered to give depth perception of the scene, while in FTV applications, viewers can
interactively select arbitrary viewpoints within a certain range.
The basic format of 3D video is a multiview representation, usually captured simultaneously
by multiple cameras at slightly displaced positions [4]. However, as the number of views grows,
the huge amount of multiview video data poses great challenges for 3D applications, such as
data compression and transmission. To solve this problem, the multiview video plus depth (MVD)
format has emerged as an efficient data representation for 3D systems. Compared with a pure multiview
video format without depth information, the main advantage of the MVD format is that desired virtual
views at arbitrary viewpoint positions can be conveniently synthesized via the depth-image-based
rendering (DIBR) technique [5].
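To make the role of the depth map concrete, one common MVD convention (an assumption here, not a
formulation taken from [5]) maps each 8-bit depth value v back to metric depth Z between the near
and far clipping planes, and, for rectified cameras with focal length f and baseline b, shifts each
pixel by the resulting disparity d:

    Z = \left[ \frac{v}{255}\left( \frac{1}{Z_{\mathrm{near}}} - \frac{1}{Z_{\mathrm{far}}} \right)
        + \frac{1}{Z_{\mathrm{far}}} \right]^{-1}, \qquad d = \frac{f\,b}{Z}.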
Depth images represent the distance between the camera and the objects in the scene. They are
often treated as grey-scale image sequences, similar to the luminance component of texture video.
However, unlike texture video, the depth image has its own special characteristics. First, the depth
image signal is much sparser than texture video under certain transform bases, such as the Discrete
Cosine Transform (DCT) or the Discrete Wavelet Transform (DWT). It contains no texture but sharp
object boundaries, since the grey levels are nearly constant within an object and change abruptly
across object boundaries. Furthermore, the depth image is not used directly for display; rather, it
plays an important role in virtual view synthesis. Distortion of the depth data, especially around
object boundaries, seriously degrades the quality of the rendered virtual views [6]. Therefore, how
to exploit the characteristics of depth images for efficient compression is an essential problem
in 3D systems.
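The sparsity claim is easy to check numerically. The following sketch is ours, not the authors':
the block size, threshold, and synthetic blocks are illustrative assumptions. It counts the
significant 2-D DCT coefficients of a piecewise-constant, depth-like block and of a noisy,
texture-like block:

    import numpy as np
    from scipy.fft import dctn

    rng = np.random.default_rng(0)
    B = 8  # block size

    # Depth-like block: piecewise constant with one sharp object boundary.
    depth = np.full((B, B), 50.0)
    depth[:, B // 2:] = 200.0

    # Texture-like block: a crude stand-in with fine random detail.
    texture = 128.0 + 40.0 * rng.standard_normal((B, B))

    def significant(block, tau=1.0):
        """Count 2-D DCT coefficients whose magnitude exceeds tau."""
        return int((np.abs(dctn(block, norm="ortho")) > tau).sum())

    print("depth block:  ", significant(depth), "of", B * B)
    print("texture block:", significant(texture), "of", B * B)

Only a handful of coefficients in the first DCT row survive for the step-edge block, while the
texture-like block spreads its energy over nearly all 64 coefficients.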
In view of the sparsity of depth images, we attempt to apply compressive sensing (CS) [7] to
represent depth information efficiently. CS is a new method for capturing and representing
compressible signals at a rate significantly below the conventional Shannon/Nyquist rate. According
to the Shannon/Nyquist sampling theorem, a signal must be sampled at least twice as fast as its
bandwidth to avoid losing information. Owing to its much lower sampling rate, CS avoids the heavy
burden of data storage and processing placed on a conventional encoder.
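In the standard CS formulation (notation assumed here, since the symbols are not introduced in this
section), a length-N signal x with sparse transform coefficients s is acquired through M \ll N
linear measurements and recovered by l1 minimization:

    s = \Psi^{T} x, \qquad y = \Phi s, \qquad \Phi \in \mathbb{R}^{M \times N}, \; M \ll N,
    \hat{s} = \arg\min_{s} \| s \|_{1} \quad \text{subject to} \quad y = \Phi s.

Recovering \hat{s} from only M measurements is what allows CS to sample below the Nyquist rate.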
In recent years, CS has been applied to image compression; the basic framework is shown in Figure 1.
At the encoder, the input image is processed block by block. For each block, a sparse transform,
such as the DCT or DWT, is used to produce coefficients with sparse characteristics. Compressive
sensing is then employed to encode the transform coefficients, generating the same number of
measurements for each block. At the decoder, a convex optimization method, such as the log-barrier
or method of multipliers [8], can be adopted for CS recovery. Finally, the corresponding inverse
transform is used to reconstruct the image.
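The following is a minimal end-to-end sketch of this framework, ours rather than the authors'
implementation: the block size, measurement count, and Gaussian measurement matrix are assumptions,
and cvxpy's generic basis-pursuit solver stands in for the log-barrier/multiplier recovery of [8].

    import numpy as np
    import cvxpy as cp
    from scipy.fft import dctn, idctn

    rng = np.random.default_rng(0)
    B = 8            # block size
    N = B * B        # transform coefficients per block
    M = 32           # CS measurements per block (M < N)

    # One random Gaussian measurement matrix, shared by all blocks.
    Phi = rng.standard_normal((M, N)) / np.sqrt(M)

    def encode_block(block):
        """Sparse transform (2-D DCT) followed by CS measurement."""
        s = dctn(block, norm="ortho").ravel()
        return Phi @ s

    def decode_block(y):
        """Basis pursuit: l1-minimal coefficients consistent with y."""
        s = cp.Variable(N)
        cp.Problem(cp.Minimize(cp.norm1(s)), [Phi @ s == y]).solve()
        return idctn(s.value.reshape(B, B), norm="ortho")

    # Depth-like test block: constant regions with a sharp boundary.
    block = np.full((B, B), 50.0)
    block[:, B // 2:] = 200.0
    recon = decode_block(encode_block(block))
    print("max abs error:", np.abs(recon - block).max())

Because such a block has only a few non-negligible DCT coefficients, M = 32 measurements should
recover it almost exactly; a texture block of the same size would generally need far more.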
Block compressed sensing of natural images has been proposed in which the same measurement matrix
is used for every block; it is claimed that this suffices to capture the complicated geometric
structures of natural images [9]. A new image/video coding approach is proposed, which can