Information Fusion 36 (2017) 191–207
http://dx.doi.org/10.1016/j.inffus.2016.12.001
journal homepage: www.elsevier.com/locate/inffus
Multi-focus image fusion with a deep convolutional neural network
Yu Liu^a, Xun Chen^a,*, Hu Peng^a, Zengfu Wang^b
^a Department of Biomedical Engineering, Hefei University of Technology, Hefei 230009, China
^b Department of Automation, University of Science and Technology of China, Hefei 230026, China
* Corresponding author. E-mail addresses: yuliu@hfut.edu.cn, liuyu1@mail.ustc.edu.cn (Y. Liu), xun.chen@hfut.edu.cn (X. Chen).
Article info
Article history:
Received 21 March 2016
Revised 29 November 2016
Accepted 4 December 2016
Available online 5 December 2016
Keywords:
Image fusion
Multi-focus image fusion
Deep learning
Convolutional neural networks
Activity level measurement
Fusion rule
Abstract
As is well known, activity level measurement and fusion rule are two crucial factors in image fusion. For
most existing fusion methods, either in spatial domain or in a transform domain like wavelet, the activity
level measurement is essentially implemented by designing local filters to extract high-frequency details,
and the calculated clarity information of different source images are then compared using some elabo-
rately designed rules to obtain a clarity/focus map. Consequently, the focus map contains the integrated
clarity information, which is of great significance to various image fusion issues, such as multi-focus im-
age fusion, multi-modal image fusion, etc. However, these two tasks are usually difficult to accomplish well
enough to achieve a satisfactory fusion performance. In this study, we address this problem with a deep learning
approach, aiming to learn a direct mapping between source images and focus map. To this end, a deep
convolutional neural network (CNN) trained by high-quality image patches and their blurred versions is
adopted to encode the mapping. The main novelty of this idea is that the activity level measurement
and fusion rule can be jointly generated through learning a CNN model, which overcomes the difficulty
faced by the existing fusion methods. Based on the above idea, a new multi-focus image fusion method is
primarily proposed in this paper. Experimental results demonstrate that the proposed method can obtain
state-of-the-art fusion performance in terms of both visual quality and objective assessment. The compu-
tational speed of the proposed method using parallel computing is fast enough for practical usage. The
potential of the learned CNN model for some other-type image fusion issues is also briefly exhibited in
the experiments.
©2016 Elsevier B.V. All rights reserved.
1. Introduction
In the field of digital photography, it is often difficult for an
imaging device like a digital single-lens reflex camera to take an
image in which all the objects are captured in focus. Typically, un-
der a certain focal setting of optical lens, only the objects within
the depth-of-field (DOF) have sharp appearance in the photograph
while other objects are likely to be blurred. A popular technique
to obtain an all-in-focus image is fusing multiple images of the
same scene taken with different focal settings, which is known
as multi-focus image fusion. At the same time, multi-focus im-
age fusion is also an important subfield of image fusion. With
or without modification, many algorithms for merging multi-focus
images can also be employed for other image fusion tasks such
as visible-infrared image fusion and multi-modal medical image
fusion (and vice versa). From this point of view, the meaning of
studying multi-focus image fusion is twofold, which makes it an
active topic in the image processing community. In recent years, vari-
ous image fusion methods have been proposed, and these methods
can be roughly classified into two categories [1] : transform domain
methods and spatial domain methods.
The most classic transform domain fusion methods are based
on multi-scale transform (MST) theories, which have been ap-
plied in image fusion for more than thirty years since the Lapla-
cian pyramid (LP)-based fusion method [2] was proposed. Since
then, a large number of multi-scale transform based image fusion
methods have appeared in this field. Some representative examples
include the morphological pyramid (MP)-based method [3] , the
discrete wavelet transform (DWT)-based method [4], the dual-tree
complex wavelet transform (DTCWT)-based method [5] , and the
non-subsampled contourlet transform (NSCT)-based method [6] .
These MST-based methods share a universal three-step framework,
namely, decomposition, fusion and reconstruction [7] . The basic as-
sumption of MST-based methods is that the activity level of source
images can be measured by the decomposed coefficients in a se-
lected transform domain. Apart from the selection of MST domain,
the rules designed for merging decomposed coefficients also play a
very important role in MST-based methods, and many studies have
also been conducted in this direction [8–11]. In recent years, a new
kind of transform domain fusion methods [12–16] has emerged as
an attractive branch in this field. Different from the above intro-
duced MST-based methods, these methods transform images into
a single-scale feature domain with some advanced signal repre-
sentation theories such as independent component analysis (ICA)
and sparse representation (SR). This category of methods usually
employs the sliding window technique to pursue an approximate
shift-invariant fusion process. The key issue of these methods is to
explore an effective feature domain for the calculation of activity
level. For instance, as one of the most representative approaches
belonging to this category, the SR-based method [13] transforms
the source image patches into sparse domain and applies the L1-
norm of sparse coefficients as the activity level measurement.
The spatial domain methods in the early stage usually adopt
a block-based fusion strategy, in which the source images are de-
composed into blocks and each pair of blocks is fused with a de-
signed activity level measurement like spatial frequency and sum-
modified-Laplacian [17] . Clearly, the block size has a great im-
pact on the quality of fusion results. Since the earliest block-based
methods [18,19] using manually fixed size appeared, many im-
proved versions have been proposed on this topic, such as the
adaptive block based method [20] using differential evolution al-
gorithm to obtain a fixed optimal block size, and some recently
introduced quad-tree based methods [21,22] in which the images
can be adaptively divided into blocks with different sizes accord-
ing to image content. Another type of spatial domain methods
[23,24] is based on image segmentation, sharing a similar idea with
block-based methods, but the fusion quality of these methods
relies heavily on the segmentation accuracy. In the past few years,
some novel pixel-based spatial domain methods [25–31] based on
gradient information have been proposed, which can currently ob-
tain state-of-the-art results in multi-focus image fusion. To further
improve the fusion quality, these methods usually apply relatively
complex fusion schemes (which can be regarded as rules in a broad sense)
to their calculation results of activity level measurement.
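As a concrete illustration of the classic focus measures mentioned above, the following is a minimal NumPy sketch of the spatial frequency measure (root-mean-square row and column intensity differences). The formulation is the commonly used one; the block handling is purely illustrative and is not taken from any specific method in [17–22].

```python
import numpy as np

def spatial_frequency(block):
    """Spatial frequency of a grayscale block: a classic block-based
    activity/focus measure (a larger value indicates a sharper block)."""
    block = block.astype(np.float64)
    # Row frequency: RMS of horizontal intensity differences
    rf = np.sqrt(np.mean(np.diff(block, axis=1) ** 2))
    # Column frequency: RMS of vertical intensity differences
    cf = np.sqrt(np.mean(np.diff(block, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)

# A block-based fusion rule then keeps, at each block position, the source
# block with the larger spatial frequency.
```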
It is well known that for either transform domain or spatial do-
main image fusion methods, activity level measurement and fu-
sion rule are two crucial factors. In most existing image fusion
methods, these two issues are considered separately and designed
manually [32] . To make further improvements, many recently pro-
posed methods tend to become increasingly complicated in these
two aspects. In the MST-based methods, new transform domains in
[33,34] and new fusion rules in [9–11] were introduced. In the SR-
based methods, there were new sparse models and more complex
fusion rules in [35–37] . In the block-based methods, new focus
measures were proposed in [21,22] . In the pixel-based methods,
new activity level measurements were introduced in [27,29] and
the fusion schemes employed in [26,28–30] are very intricate. The
above introduced works were all published within the last five
years. It is worthwhile to clarify that we do not mean that these
elaborately designed activity level measurements and fusion rules are
not important contributions; the problem is that manual design
is by no means an easy task. Moreover, from a certain point of
view, it is almost impossible to come up with an ideal design that
takes all the necessary factors into account.
In this paper, we address this problem with a deep learning
approach, aiming to learn a direct mapping between source im-
ages and focus map. The focus map here indicates a pixel-level
map which contains the clarity information after comparing the
activity level measure of source images. To achieve this target, a
deep convolutional neural network (CNN) [38] trained by high-
quality image patches and their blurred versions is adopted to en-
code the mapping. The main novelty of this idea is that the ac-
tivity level measurement and fusion rule can be jointly generated
through learning a CNN model, which overcomes the above diffi-
culty faced by existing fusion methods. Based on this idea, we pro-
pose a new multi-focus image fusion method in spatial domain.
We demonstrate that the focus map obtained from the convolu-
tional network is so reliable that very simple consistency verification
techniques can lead to high-quality fusion results. The computa-
tional speed of the proposed method using parallel computing is
fast enough for practical usage. At last, we briefly exhibit the po-
tential of the learned CNN model for some other-type image fusion
issues, such as visible-infrared image fusion, medical image fusion
and multi-exposure image fusion.
To the best of our knowledge, this is the first time that the con-
volutional neural network is applied to an image fusion task. The
most similar work was proposed by Li et al. [19] , in which they
pointed out that the multi-focus image fusion can be viewed as
a classification problem and presented a fusion method based on
artificial neural networks. However, there exist significant differ-
ences between the method in [19] and our method. The method in
[19] first calculates three commonly used focus measures (feature
extraction) and then feeds them to a three-layer (input-hidden-
output) network, so the network just acts as a classifier for the
fusion rule design. As a result, the source images must be fused
patch by patch in [19] . In this work, the CNN model is simultane-
ously used for activity level measure (feature extraction) and fu-
sion rule design (classification). The original image content is the
input of the CNN model. Thus, the network in this study should be
deeper than the “shallow” network used in [19] . Considering that
the GPU parallel computation is becoming more and more popu-
lar, the computational speed of CNN-based fusion is not a concern
nowadays. In addition, owing to the convolutional characteristic of
CNNs [39] , the source images in our method can be fed to the net-
work as a whole to further improve the computational efficiency.
The rest of this paper is organized as follows. In Section 2 , we
give a brief introduction to CNN and explain its feasibility as well
as advantage for image fusion problem. In Section 3 , the proposed
CNN-based multi-focus fusion method is presented in detail. The
experimental results and discussions are provided in Section 4 . Fi-
nally, Section 5 concludes the paper.
2. CNN model for image fusion
2.1. CNN model
CNN is a typical deep learning model, which attempts to learn
a hierarchical feature representation mechanism for signal/image
data with different levels of abstraction [40] . More concretely, CNN
is a trainable multi-stage feed-forward artificial neural network
and each stage contains a certain number of feature maps corre-
sponding to a level of abstraction for features. Each unit or coef-
ficient in a feature map is called a neuron . The operations such
as linear convolution, non-linear activation and spatial pooling ap-
plied to neurons are used to connect the feature maps at different
stages.
Local receptive fields, shared weights and sub-sampling are three
basic architectural ideas of CNNs [38] . The first one indicates a
neuron at a certain stage is only connected with a few spatially
neighboring neurons at its previous stage, which is in accord with
the mechanism of mammal visual cortex. As a result, local con-
volutional operation is performed on the input neurons in CNNs,
unlike the fully-connected mechanism used in a conventional multilayer
perceptron. The second idea means that the weights of a convolutional
kernel are spatially invariant within the feature maps at a certain
stage. By combining these two ideas, the number of weights to be
trained is greatly reduced. Mathematically, let $x_i$ and $y_j$ denote the $i$-th input feature map and the $j$-th output feature map of a convolutional layer, respectively. The 3D convolution and non-linear ReLU activation [41] applied in CNNs are jointly expressed as

$y_j = \max\big(0,\; b_j + \sum_i k_{ij} * x_i\big), \qquad (1)$
where $k_{ij}$ is the convolutional kernel between $x_i$ and $y_j$, and $b_j$ is the bias. The symbol $*$ denotes the convolution operation. When there are $M$ input maps and $N$ output maps, this layer contains $N$ 3D kernels of size $d \times d \times M$ ($d \times d$ is the size of the local receptive field), and each kernel has its own bias. The last idea, sub-sampling, is also known as pooling, which reduces the data dimension. Max-pooling and average-pooling are popular operations in CNNs. As an example, the max-pooling operation is formulated as

$y^{i}_{r,c} = \max_{0 \le m,n < s} \{\, x^{i}_{r \cdot s + m,\ c \cdot s + n} \,\}, \qquad (2)$

where $y^{i}_{r,c}$ is the neuron located at $(r, c)$ in the $i$-th output map of a max-pooling layer, and it is assigned the maximal value over a local region of size $s \times s$ in the $i$-th input map $x^{i}$. By combining
the above three ideas, convolutional networks could obtain some
important invariances on translation and scale to a certain degree.
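To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch of one convolutional layer with ReLU activation and one max-pooling layer. It follows the formulas literally: true convolution (rather than the cross-correlation used by most deep learning frameworks), 'valid' boundaries, and explicit loops for readability rather than speed.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_relu(x, k, b):
    """Eq. (1): y_j = max(0, b_j + sum_i k_ij * x_i).
    x: (M, H, W) input maps, k: (N, M, d, d) kernels, b: (N,) biases."""
    N, M, d, _ = k.shape
    out_h, out_w = x.shape[1] - d + 1, x.shape[2] - d + 1   # 'valid' output size
    y = np.zeros((N, out_h, out_w))
    for j in range(N):
        acc = np.full((out_h, out_w), b[j], dtype=np.float64)
        for i in range(M):
            acc += convolve2d(x[i], k[j, i], mode='valid')   # k_ij * x_i
        y[j] = np.maximum(0.0, acc)                          # ReLU
    return y

def max_pool(x, s=2):
    """Eq. (2): each output neuron is the maximum over an s x s region, stride s."""
    M, H, W = x.shape
    x = x[:, :H - H % s, :W - W % s]                         # crop to a multiple of s
    return x.reshape(M, H // s, s, W // s, s).max(axis=(2, 4))
```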
In [42] , Krizhevsky et al. proposed a CNN model for image clas-
sification and achieved a landmark success. In the past three years,
CNNs have been successfully introduced into various fields in com-
puter vision from high-level tasks to low-level tasks, such as face
detection [43] , face recognition [44] , semantic segmentation [45] ,
super-resolution [46] , patch similarity comparison [47] , etc. These
CNN-based methods usually outperform conventional methods in
their respective fields, owing to the fast development of modern
powerful GPUs, the great progress on effective training techniques,
and the easy access to a large amount of image data. This study
also benefits from these factors.
2.2. CNNs for image fusion
2.2.1. Feasibility
As mentioned above, the generation of focus map in image fu-
sion can be viewed as a classification problem [19] . Specifically, the
activity level measurement corresponds to feature extraction, while
the role of fusion rule is similar to that of a classifier used in gen-
eral classification tasks. Thus, it is theoretically feasible to employ
CNNs for image fusion. The CNN architecture for visual classifica-
tion is an end-to-end framework [38], in which the input is an
image while the output is a label vector that indicates the proba-
bility for each category. Between these two ends, the network con-
sists of several convolutional layers (a non-linear layer like ReLU
always follows a convolutional layer, so we don’t explicitly men-
tion it later), max-pooling layers and fully-connected layers. The
convolutional and max-pooling layers are generally viewed as fea-
ture extraction part in the system, while the fully-connected layers
existing at the output end are regarded as the classification part.
We further explain this point from the view of implementa-
tion. For most existing fusion methods, either in spatial domain
or transform domain, the activity level measurement is essentially
implemented by designing local filters to extract high-frequency
details. On one hand, for most transform domain fusion methods,
the images or image patches are represented using a set of pre-
designed bases such as wavelet or trained dictionary atoms. From
the view of image processing, this is generally equivalent to con-
volving them with those bases [46] . For example, the implementa-
tion of discrete wavelet transform is exactly based on filtering. On
the other hand, for spatial domain fusion methods, the situation is
even clearer that so many activity level measurements are based
on high-pass spatial filtering. Furthermore, the fusion rule, which
is usually interpreted as the weight assignment strategy for differ-
ent source images based on the calculated activity level measures,
can be transformed into a filtering-based form as well. Consider-
ing that the basic operation in a CNN model is convolution (the
full connection operation can be viewed as a convolution whose kernel
size equals the spatial size of the input data [45]), it is
practically feasible to apply CNNs to image fusion.
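As an illustration of this filtering view, the sketch below computes a generic pixel-level activity map by high-pass filtering with a Laplacian kernel and aggregating the absolute response over a local window. It is only a representative example of the conventional approach described above, not a particular published method; the kernel and window size are arbitrary choices.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def activity_map(img, win=7):
    """Generic spatial-domain activity level: high-pass filter the image,
    then aggregate the absolute response over a local window."""
    lap = np.array([[0,  1, 0],
                    [1, -4, 1],
                    [0,  1, 0]], dtype=np.float64)        # Laplacian high-pass kernel
    high_freq = convolve(img.astype(np.float64), lap)     # local high-pass filtering
    return uniform_filter(np.abs(high_freq), size=win)    # local clarity/energy

# A naive fusion rule then picks, per pixel, the source whose activity is larger:
# decision = activity_map(img_a) >= activity_map(img_b)
```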
2.2.2. Superiority
Similar to the situation in visual object classification applica-
tions, the advantages of CNN-based fusion method over existing
methods are twofold. First, it overcomes the difficulty of manu-
ally designing complicated activity level measurements and fusion
rules. The main task is replaced by the design of network architec-
ture. With the emergence of some easy-to-use CNN platforms such
as Caffe [48] and MatConvNet [49] , the implementation of network
design becomes convenient to researchers. Second, and more im-
portantly, the activity level measurement and fusion rule can be
jointly generated via learning a CNN model. The learned result can
be viewed as an “optimal” solution to some extent, and therefore is
likely to be more effective than manually designed ones. Thus, the
CNN-based method has a great potential to produce fusion results
in higher quality than conventional methods.
3. The proposed method
3.1. Overview
In this section, the proposed CNN-based multi-focus image fu-
sion method is presented in detail. The schematic diagram of our
algorithm is shown in Fig. 1 . In this study, we mainly consider the
situation that there are only two pre-registered source images. To
deal with more than two multi-focus images, one can fuse them
one by one in series. It can be seen from Fig. 1 that our method
consists of four steps: focus detection, initial segmentation, consis-
tency verification and fusion . In the first step, the two source im-
ages are fed to a pre-trained CNN model to output a score map,
which contains the focus information of source images. Particu-
larly, each coefficient in the score map indicates the focus prop-
erty of a pair of corresponding patches from two source images.
Then, a focus map with the same size as the source images is obtained
from the score map by averaging the overlapping patches. In the
second step, the focus map is segmented into a binary map with a
threshold of 0.5. In the third step, we refine the binary segmented
map with two popular consistency verification strategies, namely,
small region removal and guided image filtering [50] , to generate
the final decision map. In the last step, the fused image is obtained
with the final decision map using the pixel-wise weighted-average
strategy.
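The four steps can be summarized by the following high-level sketch. It assumes a pre-trained scoring function cnn_score_map and a guided-filter implementation guided_filter are supplied by the caller (both hypothetical names); small-region removal uses skimage.morphology.remove_small_objects, and the area threshold, filter radius and the choice of guidance image are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np
from skimage.morphology import remove_small_objects

def fuse_multifocus(img_a, img_b, cnn_score_map, guided_filter,
                    area_thresh=0.01, radius=8, eps=0.1):
    # Step 1: focus detection -- the CNN outputs a score map; averaging the
    # overlapping patch scores yields a per-pixel focus map in [0, 1].
    focus = cnn_score_map(img_a, img_b)          # same size as the sources

    # Step 2: initial segmentation with a fixed threshold of 0.5.
    binary = focus > 0.5

    # Step 3: consistency verification -- remove small isolated regions in the
    # map and in its complement, then smooth edges with a guided filter
    # (guidance image and parameters are assumptions for illustration).
    min_area = int(area_thresh * binary.size)
    binary = remove_small_objects(binary, min_size=min_area)
    binary = ~remove_small_objects(~binary, min_size=min_area)
    decision = guided_filter(guide=img_a, src=binary.astype(np.float64),
                             radius=radius, eps=eps)

    # Step 4: pixel-wise weighted-average fusion with the final decision map.
    return decision * img_a + (1.0 - decision) * img_b
```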
3.2. Network design
In this work, multi-focus image fusion is viewed as a two-class
classification problem. For a pair of image patches $\{p_A, p_B\}$ of the same scene, our goal is to learn a CNN whose output is a scalar ranging from 0 to 1. Specifically, the output value should be close to 1 when $p_A$ is focused while $p_B$ is defocused, and the value should be close to 0 when $p_A$ is defocused while $p_B$ is focused. In other words, the output value indicates the focus property of the patch pair. To this end, we employ a large number of patch pairs as training examples. Each training example is a patch pair of the same scene. One training example $\{p_1, p_2\}$ is defined as a positive example when $p_1$ is clearer than $p_2$, and its label is set to 1. On the contrary, the example is defined as a negative example when $p_2$ is clearer than $p_1$, and the label is set to 0.
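A minimal sketch of how such labelled patch pairs could be constructed from a sharp patch and Gaussian-blurred versions of it, in line with the training strategy described above (high-quality patches and their blurred versions); the specific blur levels are assumed values for illustration only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_training_pairs(sharp_patch, sigmas=(2.0, 3.0, 4.0)):
    """Build positive/negative examples from one clear 16x16 patch.
    Positive: (clear, blurred) labelled 1; negative: (blurred, clear) labelled 0."""
    examples = []
    for sigma in sigmas:                                   # several blur levels (assumed)
        blurred = gaussian_filter(sharp_patch.astype(np.float64), sigma)
        examples.append(((sharp_patch, blurred), 1))       # p1 clearer than p2
        examples.append(((blurred, sharp_patch), 0))       # p2 clearer than p1
    return examples
```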
In practical usage, the source images have arbitrary spatial size.
One possible way is to apply sliding-window technique to divide
the images into overlapping patches, and then input each pair of
patches into the network to obtain a score. However, considering
Fig. 1. Schematic diagram of the proposed CNN-based multi-focus image fusion algorithm. Data courtesy of M. Nejati [30] .
that there are a large number of repeated convolutional calcula-
tions since the patches heavily overlap, this patch-based
manner is very time-consuming. Another approach is to input the
source images into the network as a whole without dividing them
into patches, as was applied in [39,43,45] , aiming to directly gener-
ate a dense prediction map. Since the fully-connected layers have
fixed dimensions on input and output data, to make it possible,
the fully-connected layers should first be converted into convo-
lutional layers by reshaping parameters [39,43,45] (as mentioned
above, the full connection operation can be viewed as convolution
with the kernel size that equals to the spatial size of input data
[45] , so the offline reshaping process is straightforward). After the
conversion, the network only consists of convolutional and max-
pooling layers, so it can process source images of arbitrary size as
a whole to generate dense predictions [39] . As a result, the output
of the network now is a score map, and each coefficient within it
indicates the focus property of a pair of patches in source images.
The patch size equals that of the training examples. When the
kernel stride of each convolutional layer is one pixel, the stride of
adjacent patches in source images will be just determined by the
number of max-pooling layers in the network. To be more specific,
the stride is $2^k$ when there are k max-pooling layers in total, each with a kernel stride of two pixels [39,43,45].
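The following PyTorch sketch shows the reshaping step for a single fully-connected layer; PyTorch is used only for illustration (the text does not name the framework), and the layer dimensions are assumptions supplied by the caller.

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, h: int, w: int) -> nn.Conv2d:
    """Reshape a fully-connected layer into an equivalent convolutional layer
    whose kernel covers the whole spatial extent (h x w) of its expected input
    (requires fc.in_features == in_channels * h * w). After this conversion the
    network accepts inputs of arbitrary size and produces a dense score map
    instead of a single vector."""
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=(h, w))
    with torch.no_grad():
        conv.weight.copy_(fc.weight.view(fc.out_features, in_channels, h, w))
        conv.bias.copy_(fc.bias)
    return conv

# With k max-pooling layers of stride 2 in each branch, neighbouring coefficients
# of the resulting score map correspond to source-image patches that are 2**k
# pixels apart (2 pixels for the single-pooling network used in this paper).
```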
In [47] , three types of CNN models are presented for patch
similarity comparison: siamese, pseudo-siamese and 2-channel . The
siamese network and pseudo-siamese network both have two
branches with the same architectures, and each branch takes one
image patch as input. The difference between these two networks
is that the two branches in the former share the same weights,
while those in the latter do not. Thus, the pseudo-siamese net-
work is more flexible than the siamese one. In the 2-channel
network, the two patches are concatenated as a 2-channel im-
age to be fed to the network. The 2-channel network just has
one trunk without branches. Clearly, any solution of a siamese
or pseudo-siamese network can be reshaped into the 2-channel
form, so the 2-channel network provides even more flexibil-
ity [47]. All the above three types of networks can be adopted in
the proposed CNN-based image fusion method. In this work, we
choose the siamese one as our CNN model mainly for the follow-
ing two considerations. First, the siamese network has a more
natural interpretation in image fusion tasks. The two branches with
shared weights indicate that the feature extraction or activity level
measurement is exactly the same for the two source images, which
is a generally accepted practice in most image fusion methods.
Second, a siamese network is usually easier to
train than the other two types of networks. As mentioned
above, the siamese network can be viewed as a special case of
the pseudo-siamese one and 2-channel one, so its solution space
is much smaller than those of the other two types, leading to an
easier convergence.
Another important issue in network design is the selection of
input patch size. When the patch size is set to 32 × 32, the clas-
sification accuracy of the network is usually higher since more image
content is used. However, there are several drawbacks which
cannot be ignored using this setting. As is well known, the max-
pooling layers are of great significance to the performance of a
convolutional network. When the patch size is 32 × 32, the num-
ber of max-pooling layers is not easy to determine. More specif-
ically, when there are two or even more max-pooling layers in a
branch, which means that the stride of patches is at least four
pixels, the fusion results tend to suffer from block artifacts. On
the other hand, when there is only one max-pooling layer in a
branch, the CNN model size is usually very large since the number
of weights in fully-connected layers significantly increases. Further-
more, for multi-focus image fusion, the setting of 32 × 32 is often
not very accurate because a 32 × 32 patch is more likely to contain
both focused and defocused regions, which will lead to undesirable
results around the boundary regions in the fused image. When the
patch size is set to 8 × 8, the patches used to train the CNN model
are too small, so the classification accuracy cannot be guaranteed.
Based on the above considerations as well as experimental tests,
we set the patch size to 16 × 16 in this study.
Fig. 2 shows the CNN model used in the proposed fusion algo-
rithm. It can be seen that each branch in the network has three
convolutional layers and one max-pooling layer. The kernel size
and stride of each convolutional layer are set to 3 × 3 and 1,
respectively. The kernel size and stride of the max-pooling layer
are set to 2 × 2 and 2, respectively. The 256 feature maps ob-
tained by each branch are concatenated and then fully-connected
with a 256-dimensional feature vector. The output of the network
is a 2-dimensional vector that is fully-connected with the 256-
dimensional vector. Actually, the 2-dimensional vector is fed to a
2-way softmax layer (not shown in Fig. 2 ) which produces a proba-
bility distribution over two classes. In the test/fusion process, after
converting the two fully-connected layers into convolutional ones,
the network can be fed with two source images of arbitrary size
as a whole to generate a dense score map [39,43,45] . When the
source images are of size H × W , the size of the output score map
is $(\lceil H/2 \rceil - 8 + 1) \times (\lceil W/2 \rceil - 8 + 1)$, where $\lceil \cdot \rceil$ denotes the ceiling operation. Fig. 3 shows the correspondence between the source
images and the obtained score map. Each coefficient in the score
map is the output score of a pair of 16 × 16 source image patches
passed forward through the network. In addition, the
stride of the adjacent patches in source images is two pixels be-
cause there is one max-pooling layer in each branch of the net-
work.
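For reference, here is a PyTorch sketch of a network consistent with the description above and with the score-map size formula: three 3 × 3 convolutions with stride 1 and 'same' padding plus one 2 × 2 max-pooling with stride 2 per branch, 256 feature maps at the end of each branch, and two fully-connected layers followed by a 2-way softmax. PyTorch, the channel widths of the first two convolutional layers (64 and 128) and the placement of the pooling layer at the end of the branch are assumptions for illustration, not details stated in the text.

```python
import torch
import torch.nn as nn

class SiameseFocusCNN(nn.Module):
    """Sketch of the Fig. 2 model: two weight-sharing branches, three 3x3
    convolutions (stride 1) and one 2x2 max-pooling (stride 2) per branch,
    followed by two fully-connected layers and a 2-way softmax."""

    def __init__(self):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),       # 16x16 patch -> 256 maps of 8x8
        )
        self.fc1 = nn.Linear(2 * 256 * 8 * 8, 256)       # concatenated branch features
        self.fc2 = nn.Linear(256, 2)                     # 2-way output

    def forward(self, patch_a, patch_b):
        fa = self.branch(patch_a).flatten(1)             # shared weights: same branch
        fb = self.branch(patch_b).flatten(1)
        feat = torch.cat([fa, fb], dim=1)
        return torch.softmax(self.fc2(torch.relu(self.fc1(feat))), dim=1)

# For whole-image inference, fc1/fc2 are converted to 8x8 and 1x1 convolutions
# (see the reshaping sketch above), giving a score map of size
# (ceil(H/2) - 8 + 1) x (ceil(W/2) - 8 + 1) for H x W source images.
```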