Snapshot HDR Video Construction Using a Coded Mask A PREPRINT
limited dynamic range under 16 f-stops and are extremely expensive. To overcome this limitation, many computational
imaging techniques have been developed via co-designing the sensor architecture and post-processing algorithms for
HDR image acquisition Reinhard et al. [2010]. These methods can be categorized into three distinct approaches. The
most common way is to capture a sequence of low dynamic range (LDR) images with different exposures and fuse
them into an HDR image Debevec and Malik [1997], Mann et al. [1995]. Modern cameras and mobile devices can
easily afford successive image capture, making this method capable of producing decent HDR images for static scenes.
However, when either the scene is dynamic or the camera shakes during capture, the resulting images can suffer from
ghosting artifacts. The second approach is to utilize multiple sensors to simultaneously capture differently exposed LDR
images by, for example, splitting the light to multiple sensors with a beam-splitter McGuire et al. [2007], Tocci et al.
[2011], Kronander et al. [2013]. This sophisticated approach is expensive and needs additional rigorous calibration.
The third approach is to capture a single LDR image with a per-pixel or per-scanline coded exposure. Reconstruction
algorithms are applied later to create HDR images Nayar and Mitsunaga [2000], Nayar and Branzoi [2003], Serrano
et al. [2016]. This type of computational camera can be achieved by using per-pixel coded exposures in the sensor
architecture Kensei et al. [2014] or by mounting an optical mask onto an off-the-shelf camera sensor.
In Alghamdi et al. [2019], for easy implementation of a grayscale mask, we chose to place a random binary optical
mask at a short distance (typically 1-2 mm) in front of the sensor. Note that we did not optimize the distance, but
simply mounted our mask on the cover glass that is usually present in front of the sensor. Light propagation from the
mask to the sensor results in a blurred version of the binary mask. The actual statistics depend on both the mask and the
propagation distance. Figure 1 illustrates the effect of distance on the resulting optical mask. For HDR reconstruction
we introduced an algorithm built upon an inception network that decodes reliable HDR images from the raw noisy
coded Bayer data. We demonstrate both in simulation and using a prototype that the combination of hardware encoding
and software decoding leads to a simple, yet efficient, HDR image acquisition system.
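The optical encoding described above can be emulated in a few lines. The following sketch, with purely illustrative sizes and a simple box filter standing in for the actual optical blur (the real blur depends on the mask-to-sensor distance and diffraction), shows how a binary mask becomes a grayscale modulation pattern and how it codes the exposure per pixel:

```python
import numpy as np

rng = np.random.default_rng(0)

def box_blur(img, k=5):
    """Box filter used here as a crude stand-in for the optical blur
    caused by light propagating from the mask to the sensor."""
    pad = k // 2
    p = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    for dy in range(k):
        for dx in range(k):
            out += p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

# Hypothetical sizes and values, for illustration only.
H, W = 64, 64
binary_mask = rng.integers(0, 2, size=(H, W)).astype(np.float64)

# A larger kernel plays the role of a larger propagation distance.
optical_mask = box_blur(binary_mask, k=5)

# Per-pixel coded exposure: HDR radiance modulated by the grayscale
# mask, then clipped to the sensor's limited dynamic range.
scene = rng.uniform(0.0, 4.0, size=(H, W))
coded_ldr = np.clip(scene * optical_mask, 0.0, 1.0)
```

Because neighboring pixels receive different attenuation factors, a bright region saturates at some pixels but stays within range at others, which is the information the decoder exploits.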
In Alghamdi et al. [2021] we present a transfer learning framework for solving the HDR reconstruction part. Our
motivation comes from the fact that available HDR image datasets are small compared to the typical requirement for
training deep neural networks. In Alghamdi et al. [2019] we solved this issue by pre-training on a large simulated HDR
dataset. This pre-training is expensive in both memory and time; experimenting with different network structures would
require weeks of pre-training. In Alghamdi et al. [2021] we instead incorporate architectures pre-trained on a different
large-scale task and transfer them to our HDR reconstruction. This new approach reduces our processing time substantially.
Specifically, we propose an encoder-decoder framework that learns an initial estimate of the HDR image, as well
as useful image features. We then refine this estimate through residual learning Ronneberger et al. [2015]. Our final
network can be trained end-to-end. For the encoder, we use a VGG16 Simonyan and Zisserman [2014] network
pre-trained on ImageNet. With a few epochs of training on a small dataset, the network learns to reconstruct high-quality
results.
3D-CNNs have successfully been applied to high-level vision tasks for videos, such as action recognition and event
classification Ji et al. [2012], Tran et al. [2015]. The spatio-temporal feature extraction capability of 3D-CNNs was
demonstrated in Ji et al. [2012], Tran et al. [2015]. In Tran et al. [2015], the authors argued that 3D-CNNs provide an
adequate video descriptor, and a homogeneous architecture with small 3×3×3 convolution kernels in all layers is among
the best-performing architectures for 3D-CNNs. Moreover, the capabilities of 3D-CNNs in video enhancement, inpainting,
and super-resolution have been proven Lv et al. [2018], Kappeler et al. [2016], Wang et al. [2017], Wan [2019]. This
article will use a 3D-CNN to jointly perform demosaicking, denoising, and HDR video reconstruction from the coded
LDR video. To the best of our knowledge, there is no published work on the construction of HDR video from coded LDR
images that utilizes temporal information in the reconstruction process.
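The spatio-temporal mixing provided by a 3×3×3 kernel can be made concrete with a naive (loop-based, valid-mode) 3D convolution; the toy sizes below are arbitrary:

```python
import numpy as np

def conv3d(video, kernel):
    """Valid-mode 3D correlation over a (T, H, W) volume.

    Every output voxel mixes 3 consecutive frames and a 3x3 spatial
    neighborhood, which is what lets a 3D-CNN exploit temporal
    redundancy between adjacent coded frames.
    """
    t, h, w = kernel.shape
    T, H, W = video.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i + t, j:j + h, k:k + w] * kernel)
    return out

video = np.random.rand(8, 16, 16)          # toy LDR video volume (T, H, W)
kernel = np.full((3, 3, 3), 1.0 / 27.0)    # 3x3x3 averaging kernel
features = conv3d(video, kernel)
print(features.shape)                       # (6, 14, 14)
```

In practice, a deep-learning framework's batched 3D convolution would replace these loops; the sketch only illustrates the receptive field.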
3 Methods
3.1 Imaging Model
In our HDR system, we propose placing an optical mask into the optical path in close proximity to the image sensor.
The propagation of light from the mask to the sensor leads to a grayscale modulation pattern on the captured image. In
a color camera, a Bayer Color Filter Array (CFA) samples the radiance into three color channels. The camera sensor
then converts the photons impinging on the image plane over a specific exposure time into electrons, and quantizes the
voltage values into digital numbers (DNs). Basically, the process of capturing coded LDR video can be mathematically
expressed as follows:
y_k = g(f(BΦx_k ∆t)), k = 1, 2, 3, … (1)
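An illustrative numpy sketch of the forward model in Eq. (1) follows. The concrete choices here are assumptions for demonstration only: an RGGB Bayer pattern for B, a uniform random grayscale mask for Φ, clipping for the sensor response f, and 8-bit rounding for the quantizer g:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W = 8, 8

def bayer_sample(rgb):
    """B: RGGB Bayer CFA -- keep one color channel per pixel."""
    mosaic = np.zeros((H, W))
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B
    return mosaic

phi = rng.uniform(0.1, 1.0, size=(H, W))            # Φ: grayscale optical mask
f = lambda e: np.clip(e, 0.0, 1.0)                  # f: sensor saturation
g = lambda v: np.round(v * 255).astype(np.uint8)    # g: 8-bit quantization (DNs)

x_k = rng.uniform(0.0, 8.0, size=(H, W, 3))         # HDR radiance of frame k
dt = 0.25                                           # exposure time ∆t

# y_k = g(f(B Φ x_k ∆t)): mask modulation, Bayer sampling,
# saturation, then quantization to digital numbers.
y_k = g(f(bayer_sample(phi[..., None] * x_k * dt)))
```

Each frame k of the coded LDR video is produced this way; the reconstruction network must invert all four operators jointly.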