Information Fusion 36 (2017) 191–207
http://dx.doi.org/10.1016/j.inffus.2016.12.001
journal homepage: www.elsevier.com/locate/inffus
Multi-focus image fusion with a deep convolutional neural network
Yu Liu^a, Xun Chen^a,*, Hu Peng^a, Zengfu Wang^b
^a Department of Biomedical Engineering, Hefei University of Technology, Hefei 230009, China
^b Department of Automation, University of Science and Technology of China, Hefei 230026, China
* Corresponding author. E-mail addresses: yuliu@hfut.edu.cn, liuyu1@mail.ustc.edu.cn (Y. Liu), xun.chen@hfut.edu.cn (X. Chen).
Article info
Article history:
Received 21 March 2016
Revised 29 November 2016
Accepted 4 December 2016
Available online 5 December 2016
Keywords:
Image fusion
Multi-focus image fusion
Deep learning
Convolutional neural networks
Activity level measurement
Fusion rule
Abstract
As is well known, activity level measurement and fusion rule are two crucial factors in image fusion. For
most existing fusion methods, either in spatial domain or in a transform domain like wavelet, the activity
level measurement is essentially implemented by designing local filters to extract high-frequency details,
and the calculated clarity information of different source images are then compared using some elabo-
rately designed rules to obtain a clarity/focus map. Consequently, the focus map contains the integrated
clarity information, which is of great significance to various image fusion issues, such as multi-focus im-
age fusion, multi-modal image fusion, etc. However, these two tasks are usually difficult to accomplish well
enough to achieve a satisfactory fusion performance. In this study, we address this problem with a deep learning
approach, aiming to learn a direct mapping between source images and focus map. To this end, a deep
convolutional neural network (CNN) trained by high-quality image patches and their blurred versions is
adopted to encode the mapping. The main novelty of this idea is that the activity level measurement
and fusion rule can be jointly generated through learning a CNN model, which overcomes the difficulty
faced by the existing fusion methods. Based on the above idea, a new multi-focus image fusion method is
primarily proposed in this paper. Experimental results demonstrate that the proposed method can obtain
state-of-the-art fusion performance in terms of both visual quality and objective assessment. The compu-
tational speed of the proposed method using parallel computing is fast enough for practical usage. The
potential of the learned CNN model for some other-type image fusion issues is also briefly exhibited in
the experiments.
©2016 Elsevier B.V. All rights reserved.
1. Introduction
In the field of digital photography, it is often difficult for an
imaging device like a digital single-lens reflex camera to take an
image in which all the objects are captured in focus. Typically, un-
der a certain focal setting of optical lens, only the objects within
the depth-of-field (DOF) have sharp appearance in the photograph
while other objects are likely to be blurred. A popular technique
to obtain an all-in-focus image is fusing multiple images of the
same scene taken with different focal settings, which is known
as multi-focus image fusion. At the same time, multi-focus im-
age fusion is also an important subfield of image fusion. With
or without modification, many algorithms for merging multi-focus
images can also be employed for other image fusion tasks such
as visible-infrared image fusion and multi-modal medical image
fusion (and vice versa). From this point of view, the meaning of
studying multi-focus image fusion is twofold, which makes it an
active topic in the image processing community. In recent years, vari-
ous image fusion methods have been proposed, and these methods
can be roughly classified into two categories [1] : transform domain
methods and spatial domain methods.
The most classic transform domain fusion methods are based
on multi-scale transform (MST) theories, which have been ap-
plied in image fusion for more than thirty years since the Lapla-
cian pyramid (LP)-based fusion method [2] was proposed. Since
then, a large number of multi-scale transform based image fusion
methods have appeared in this field. Some representative examples
include the morphological pyramid (MP)-based method [3] , the
discrete wavelet transform (DWT)-based method [4], the dual-tree
complex wavelet transform (DTCWT)-based method [5] , and the
non-subsampled contourlet transform (NSCT)-based method [6] .
These MST-based methods share a universal three-step framework,
namely, decomposition, fusion and reconstruction [7] . The basic as-
sumption of MST-based methods is that the activity level of source
images can be measured by the decomposed coefficients in a se-
lected transform domain. Apart from the selection of MST domain,
the rules designed for merging decomposed coefficients also play a
very important role in MST-based methods, and many studies have
also been conducted in this direction [8–11]. In recent years, a new
kind of transform domain fusion methods [12–16] has emerged as
an attractive branch in this field. Different from the above intro-
duced MST-based methods, these methods transform images into
a single-scale feature domain with some advanced signal repre-
sentation theories such as independent component analysis (ICA)
and sparse representation (SR). This category of methods usually
employs the sliding window technique to pursue an approximate
shift-invariant fusion process. The key issue of these methods is to
explore an effective feature domain for the calculation of activity
level. For instance, as one of the most representative approaches
belonging to this category, the SR-based method [13] transforms
the source image patches into sparse domain and applies the L1-
norm of sparse coefficients as the activity level measurement.
The spatial domain methods in the early stage usually adopt
a block-based fusion strategy, in which the source images are de-
composed into blocks and each pair of blocks is fused with a de-
signed activity level measurement like spatial frequency and sum-
modified-Laplacian [17] . Clearly, the block size has a great im-
pact on the quality of fusion results. Since the earliest block-based
methods [18,19] using manually fixed size appeared, many im-
proved versions have been proposed on this topic, such as the
adaptive block based method [20] using differential evolution al-
gorithm to obtain a fixed optimal block size, and some recently
introduced quad-tree based methods [21,22] in which the images
can be adaptively divided into blocks with different sizes accord-
ing to image content. Another type of spatial domain methods
[23,24] is based on image segmentation, sharing a similar idea with
block-based methods, but the fusion quality of these methods
relies heavily on the segmentation accuracy. In the past few years,
some novel pixel-based spatial domain methods [25–31] based on
gradient information have been proposed, which can currently ob-
tain state-of-the-art results in multi-focus image fusion. To further
improve the fusion quality, these methods usually apply relatively
complex fusion schemes (which can be regarded as rules in a broad sense)
to their calculation results of activity level measurement.
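As a concrete illustration of the classic focus measures mentioned above, the following is a minimal NumPy sketch of the spatial frequency measure (root-mean-square row and column intensity differences). The formulation is the commonly used one; the block handling is purely illustrative and is not taken from any specific method in [17–22].

```python
import numpy as np

def spatial_frequency(block):
    """Spatial frequency of a grayscale block: a classic block-based
    activity/focus measure (a larger value indicates a sharper block)."""
    block = block.astype(np.float64)
    # Row frequency: RMS of horizontal intensity differences
    rf = np.sqrt(np.mean(np.diff(block, axis=1) ** 2))
    # Column frequency: RMS of vertical intensity differences
    cf = np.sqrt(np.mean(np.diff(block, axis=0) ** 2))
    return np.sqrt(rf ** 2 + cf ** 2)

# A block-based fusion rule then keeps, at each block position, the source
# block with the larger spatial frequency.
```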
It is well known that for either transform domain or spatial do-
main image fusion methods, activity level measurement and fu-
sion rule are two crucial factors. In most existing image fusion
methods, these two issues are considered separately and designed
manually [32] . To make further improvements, many recently pro-
posed methods tend to become increasingly complicated in these
two aspects. In the MST-based methods, new transform domains in
[33,34] and new fusion rules in [9–11] were introduced. In the SR-
based methods, there were new sparse models and more complex
fusion rules in [35–37] . In the block-based methods, new focus
measures were proposed in [21,22] . In the pixel-based methods,
new activity level measurements were introduced in [27,29] and
the fusion schemes employed in [26,28–30] are very intricate. The
above introduced works were all published within the last five
years. It is worthwhile to clarify that we do not mean that these
elaborately designed activity level measurements and fusion rules are
not important contributions; the problem is that manual design
is by no means an easy task. Moreover, from a certain point of
view, it is almost impossible to come up with an ideal design that
takes all the necessary factors into account.
In this paper, we address this problem with a deep learning
approach, aiming to learn a direct mapping between source im-
ages and focus map. The focus map here indicates a pixel-level
map which contains the clarity information after comparing the
activity level measure of source images. To achieve this target, a
deep convolutional neural network (CNN) [38] trained by high-
quality image patches and their blurred versions is adopted to en-
code the mapping. The main novelty of this idea is that the ac-
tivity level measurement and fusion rule can be jointly generated
through learning a CNN model, which overcomes the above diffi-
culty faced by existing fusion methods. Based on this idea, we pro-
pose a new multi-focus image fusion method in spatial domain.
We demonstrate that the focus map obtained from the convolu-
tional network is so reliable that very simple consistency verification
techniques can lead to high-quality fusion results. The computa-
tional speed of the proposed method using parallel computing is
fast enough for practical usage. At last, we briefly exhibit the po-
tential of the learned CNN model for some other-type image fusion
issues, such as visible-infrared image fusion, medical image fusion
and multi-exposure image fusion.
To the best of our knowledge, this is the first time that the con-
volutional neural network is applied to an image fusion task. The
most similar work was proposed by Li et al. [19] , in which they
pointed out that the multi-focus image fusion can be viewed as
a classification problem and presented a fusion method based on
artificial neural networks. However, there exist significant differ-
ences between the method in [19] and our method. The method in
[19] first calculates three commonly used focus measures (feature
extraction) and then feeds them to a three-layer (input-hidden-
output) network, so the network just acts as a classifier for the
fusion rule design. As a result, the source images must be fused
patch by patch in [19] . In this work, the CNN model is simultane-
ously used for activity level measure (feature extraction) and fu-
sion rule design (classification). The original image content is the
input of the CNN model. Thus, the network in this study should be
deeper than the “shallow” network used in [19] . Considering that
the GPU parallel computation is becoming more and more popu-
lar, the computational speed of CNN-based fusion is not a concern
nowadays. In addition, owing to the convolutional characteristic of
CNNs [39] , the source images in our method can be fed to the net-
work as a whole to further improve the computational efficiency.
The rest of this paper is organized as follows. In Section 2 , we
give a brief introduction to CNN and explain its feasibility as well
as advantage for image fusion problem. In Section 3 , the proposed
CNN-based multi-focus fusion method is presented in detail. The
experimental results and discussions are provided in Section 4 . Fi-
nally, Section 5 concludes the paper.
2. CNN model for image fusion
2.1. CNN model
CNN is a typical deep learning model, which attempts to learn
a hierarchical feature representation mechanism for signal/image
data with different levels of abstraction [40] . More concretely, CNN
is a trainable multi-stage feed-forward artificial neural network
and each stage contains a certain number of feature maps corre-
sponding to a level of abstraction for features. Each unit or coef-
ficient in a feature map is called a neuron . The operations such
as linear convolution, non-linear activation and spatial pooling ap-
plied to neurons are used to connect the feature maps at different
stages.
Local receptive fields, shared weights and sub-sampling are three
basic architectural ideas of CNNs [38] . The first one indicates a
neuron at a certain stage is only connected with a few spatially
neighboring neurons at its previous stage, which is in accord with
the mechanism of mammal visual cortex. As a result, local con-
volutional operation is performed on the input neurons in CNNs,
unlike the fully-connected mechanism used in a conventional multilayer
perceptron. The second idea means that the weights of a convolutional
kernel are spatially invariant within the feature maps at a certain
stage. By combining these two ideas, the number of weights to be
trained is greatly reduced. Mathematically, let $x_i$ and $y_j$ denote the $i$-th input feature map and the $j$-th output feature map of a convolutional layer, respectively. The 3D convolution and non-linear ReLU activation [41] applied in CNNs are jointly expressed as

$y_j = \max\big(0,\; b_j + \sum_i k_{ij} * x_i\big), \qquad (1)$
where $k_{ij}$ is the convolutional kernel between $x_i$ and $y_j$, and $b_j$ is the bias. The symbol $*$ denotes the convolution operation. When there are $M$ input maps and $N$ output maps, this layer contains $N$ 3D kernels of size $d \times d \times M$ ($d \times d$ is the size of the local receptive field), and each kernel has its own bias. The last idea, sub-sampling, is also known as pooling, which reduces the data dimension. Max-pooling and average-pooling are popular operations in CNNs. As an example, the max-pooling operation is formulated as

$y^{i}_{r,c} = \max_{0 \le m,n < s} \{\, x^{i}_{r \cdot s + m,\ c \cdot s + n} \,\}, \qquad (2)$

where $y^{i}_{r,c}$ is the neuron located at $(r, c)$ in the $i$-th output map of a max-pooling layer, and it is assigned the maximal value over a local region of size $s \times s$ in the $i$-th input map $x^{i}$. By combining
the above three ideas, convolutional networks could obtain some
important invariances on translation and scale to a certain degree.
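To make Eqs. (1) and (2) concrete, here is a minimal NumPy sketch of one convolutional layer with ReLU activation and one max-pooling layer. It follows the formulas literally: true convolution (rather than the cross-correlation used by most deep learning frameworks), 'valid' boundaries, and explicit loops for readability rather than speed.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_relu(x, k, b):
    """Eq. (1): y_j = max(0, b_j + sum_i k_ij * x_i).
    x: (M, H, W) input maps, k: (N, M, d, d) kernels, b: (N,) biases."""
    N, M, d, _ = k.shape
    out_h, out_w = x.shape[1] - d + 1, x.shape[2] - d + 1   # 'valid' output size
    y = np.zeros((N, out_h, out_w))
    for j in range(N):
        acc = np.full((out_h, out_w), b[j], dtype=np.float64)
        for i in range(M):
            acc += convolve2d(x[i], k[j, i], mode='valid')   # k_ij * x_i
        y[j] = np.maximum(0.0, acc)                          # ReLU
    return y

def max_pool(x, s=2):
    """Eq. (2): each output neuron is the maximum over an s x s region, stride s."""
    M, H, W = x.shape
    x = x[:, :H - H % s, :W - W % s]                         # crop to a multiple of s
    return x.reshape(M, H // s, s, W // s, s).max(axis=(2, 4))
```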
In [42] , Krizhevsky et al. proposed a CNN model for image clas-
sification and achieved a landmark success. In the past three years,
CNNs have been successfully introduced into various fields in com-
puter vision from high-level tasks to low-level tasks, such as face
detection [43] , face recognition [44] , semantic segmentation [45] ,
super-resolution [46] , patch similarity comparison [47] , etc. These
CNN-based methods usually outperform conventional methods in
their respective fields, owing to the fast development of modern
powerful GPUs, the great progress on effective training techniques,
and the easy access to a large amount of image data. This study
also benefits from these factors.
2.2. CNNs for image fusion
2.2.1. Feasibility
As mentioned above, the generation of focus map in image fu-
sion can be viewed as a classification problem [19] . Specifically, the
activity level measurement corresponds to feature extraction, while
the role of fusion rule is similar to that of a classifier used in gen-
eral classification tasks. Thus, it is theoretically feasible to employ
CNNs for image fusion. The CNN architecture for visual classifica-
tion is an end-to-end framework [38], in which the input is an
image while the output is a label vector that indicates the proba-
bility for each category. Between these two ends, the network con-
sists of several convolutional layers (a non-linear layer like ReLU
always follows a convolutional layer, so we don’t explicitly men-
tion it later), max-pooling layers and fully-connected layers. The
convolutional and max-pooling layers are generally viewed as fea-
ture extraction part in the system, while the fully-connected layers
existing at the output end are regarded as the classification part.
We further explain this point from the view of implementa-
tion. For most existing fusion methods, either in spatial domain
or transform domain, the activity level measurement is essentially
implemented by designing local filters to extract high-frequency
details. On one hand, for most transform domain fusion methods,
the images or image patches are represented using a set of pre-
designed bases such as wavelet or trained dictionary atoms. From
the view of image processing, this is generally equivalent to con-
volving them with those bases [46] . For example, the implementa-
tion of discrete wavelet transform is exactly based on filtering. On
the other hand, for spatial domain fusion methods, the situation is
even clearer that so many activity level measurements are based
on high-pass spatial filtering. Furthermore, the fusion rule, which
is usually interpreted as the weight assignment strategy for differ-
ent source images based on the calculated activity level measures,
can be transformed into a filtering-based form as well. Consider-
ing that the basic operation in a CNN model is convolution (the
full connection operation can be viewed as a convolution whose kernel
size equals the spatial size of the input data [45]), it is
practically feasible to apply CNNs to image fusion.
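As an illustration of this filtering view, the sketch below computes a generic pixel-level activity map by high-pass filtering with a Laplacian kernel and aggregating the absolute response over a local window. It is only a representative example of the conventional approach described above, not a particular published method; the kernel and window size are arbitrary choices.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

def activity_map(img, win=7):
    """Generic spatial-domain activity level: high-pass filter the image,
    then aggregate the absolute response over a local window."""
    lap = np.array([[0,  1, 0],
                    [1, -4, 1],
                    [0,  1, 0]], dtype=np.float64)        # Laplacian high-pass kernel
    high_freq = convolve(img.astype(np.float64), lap)     # local high-pass filtering
    return uniform_filter(np.abs(high_freq), size=win)    # local clarity/energy

# A naive fusion rule then picks, per pixel, the source whose activity is larger:
# decision = activity_map(img_a) >= activity_map(img_b)
```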
2.2.2. Superiority
Similar to the situation in visual object classification applica-
tions, the advantages of CNN-based fusion method over existing
methods are twofold. First, it overcomes the difficulty of manu-
ally designing complicated activity level measurements and fusion
rules. The main task is replaced by the design of network architec-
ture. With the emergence of some easy-to-use CNN platforms such
as Caffe [48] and MatConvNet [49] , the implementation of network
design becomes convenient to researchers. Second, and more im-
portantly, the activity level measurement and fusion rule can be
jointly generated via learning a CNN model. The learned result can
be viewed as an “optimal” solution to some extent, and therefore is
likely to be more effective than manually designed ones. Thus, the
CNN-based method has a great potential to produce fusion results
in higher quality than conventional methods.
3. The proposed method
3.1. Overview
In this section, the proposed CNN-based multi-focus image fu-
sion method is presented in detail. The schematic diagram of our
algorithm is shown in Fig. 1 . In this study, we mainly consider the
situation that there are only two pre-registered source images. To
deal with more than two multi-focus images, one can fuse them
one by one in series. It can be seen from Fig. 1 that our method
consists of four steps: focus detection, initial segmentation, consis-
tency verification and fusion . In the first step, the two source im-
ages are fed to a pre-trained CNN model to output a score map,
which contains the focus information of source images. Particu-
larly, each coefficient in the score map indicates the focus prop-
erty of a pair of corresponding patches from two source images.
Then, a focus map with the same size as the source images is obtained
from the score map by averaging the overlapping patches. In the
second step, the focus map is segmented into a binary map with a
threshold of 0.5. In the third step, we refine the binary segmented
map with two popular consistency verification strategies, namely,
small region removal and guided image filtering [50] , to generate
the final decision map. In the last step, the fused image is obtained
with the final decision map using the pixel-wise weighted-average
strategy.
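The four steps can be summarized by the following high-level sketch. It assumes a pre-trained scoring function cnn_score_map and a guided-filter implementation guided_filter are supplied by the caller (both hypothetical names); small-region removal uses skimage.morphology.remove_small_objects, and the area threshold, filter radius and the choice of guidance image are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np
from skimage.morphology import remove_small_objects

def fuse_multifocus(img_a, img_b, cnn_score_map, guided_filter,
                    area_thresh=0.01, radius=8, eps=0.1):
    # Step 1: focus detection -- the CNN outputs a score map; averaging the
    # overlapping patch scores yields a per-pixel focus map in [0, 1].
    focus = cnn_score_map(img_a, img_b)          # same size as the sources

    # Step 2: initial segmentation with a fixed threshold of 0.5.
    binary = focus > 0.5

    # Step 3: consistency verification -- remove small isolated regions in the
    # map and in its complement, then smooth edges with a guided filter
    # (guidance image and parameters are assumptions for illustration).
    min_area = int(area_thresh * binary.size)
    binary = remove_small_objects(binary, min_size=min_area)
    binary = ~remove_small_objects(~binary, min_size=min_area)
    decision = guided_filter(guide=img_a, src=binary.astype(np.float64),
                             radius=radius, eps=eps)

    # Step 4: pixel-wise weighted-average fusion with the final decision map.
    return decision * img_a + (1.0 - decision) * img_b
```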
3.2. Network design
In this work, multi-focus image fusion is viewed as a two-class
classification problem. For a pair of image patches $\{p_A, p_B\}$ of the same scene, our goal is to learn a CNN whose output is a scalar ranging from 0 to 1. Specifically, the output value should be close to 1 when $p_A$ is focused while $p_B$ is defocused, and the value should be close to 0 when $p_A$ is defocused while $p_B$ is focused. In other words, the output value indicates the focus property of the patch pair. To this end, we employ a large number of patch pairs as training examples. Each training example is a patch pair of the same scene. One training example $\{p_1, p_2\}$ is defined as a positive example when $p_1$ is clearer than $p_2$, and its label is set to 1. On the contrary, the example is defined as a negative example when $p_2$ is clearer than $p_1$, and the label is set to 0.
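A minimal sketch of how such labelled patch pairs could be constructed from a sharp patch and Gaussian-blurred versions of it, in line with the training strategy described above (high-quality patches and their blurred versions); the specific blur levels are assumed values for illustration only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_training_pairs(sharp_patch, sigmas=(2.0, 3.0, 4.0)):
    """Build positive/negative examples from one clear 16x16 patch.
    Positive: (clear, blurred) labelled 1; negative: (blurred, clear) labelled 0."""
    examples = []
    for sigma in sigmas:                                   # several blur levels (assumed)
        blurred = gaussian_filter(sharp_patch.astype(np.float64), sigma)
        examples.append(((sharp_patch, blurred), 1))       # p1 clearer than p2
        examples.append(((blurred, sharp_patch), 0))       # p2 clearer than p1
    return examples
```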
In practical usage, the source images have arbitrary spatial size.
One possible way is to apply sliding-window technique to divide
the images into overlapping patches, and then input each pair of
patches into the network to obtain a score. However, considering
Fig. 1. Schematic diagram of the proposed CNN-based multi-focus image fusion algorithm. Data courtesy of M. Nejati [30] .
that there are a large number of repeated convolutional calcula-
tions since the patches heavily overlap, this patch-based
manner is very time-consuming. Another approach is to input the
source images into the network as a whole without dividing them
into patches, as was applied in [39,43,45] , aiming to directly gener-
ate a dense prediction map. Since the fully-connected layers have
fixed dimensions on input and output data, to make it possible,
the fully-connected layers should first be converted into convo-
lutional layers by reshaping parameters [39,43,45] (as mentioned
above, the full connection operation can be viewed as convolution
with the kernel size that equals to the spatial size of input data
[45] , so the offline reshaping process is straightforward). After the
conversion, the network only consists of convolutional and max-
pooling layers, so it can process source images of arbitrary size as
a whole to generate dense predictions [39] . As a result, the output
of the network now is a score map, and each coefficient within it
indicates the focus property of a pair of patches in source images.
The patch size equals that of the training examples. When the
kernel stride of each convolutional layer is one pixel, the stride of
adjacent patches in source images will be just determined by the
number of max-pooling layers in the network. To be more specific,
the stride is $2^k$ when there are k max-pooling layers in total, each with a kernel stride of two pixels [39,43,45].
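The following PyTorch sketch shows the reshaping step for a single fully-connected layer; PyTorch is used only for illustration (the text does not name the framework), and the layer dimensions are assumptions supplied by the caller.

```python
import torch
import torch.nn as nn

def fc_to_conv(fc: nn.Linear, in_channels: int, h: int, w: int) -> nn.Conv2d:
    """Reshape a fully-connected layer into an equivalent convolutional layer
    whose kernel covers the whole spatial extent (h x w) of its expected input
    (requires fc.in_features == in_channels * h * w). After this conversion the
    network accepts inputs of arbitrary size and produces a dense score map
    instead of a single vector."""
    conv = nn.Conv2d(in_channels, fc.out_features, kernel_size=(h, w))
    with torch.no_grad():
        conv.weight.copy_(fc.weight.view(fc.out_features, in_channels, h, w))
        conv.bias.copy_(fc.bias)
    return conv

# With k max-pooling layers of stride 2 in each branch, neighbouring coefficients
# of the resulting score map correspond to source-image patches that are 2**k
# pixels apart (2 pixels for the single-pooling network used in this paper).
```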
In [47] , three types of CNN models are presented for patch
similarity comparison: siamese, pseudo-siamese and 2-channel . The
siamese network and pseudo-siamese network both have two
branches with the same architectures, and each branch takes one
image patch as input. The difference between these two networks
is that the two branches in the former share the same weights,
while those in the latter do not. Thus, the pseudo-siamese net-
work is more flexible than the siamese one. In the 2-channel
network, the two patches are concatenated as a 2-channel im-
age to be fed to the network. The 2-channel network just has
one trunk without branches. Clearly, any solution of a siamese
or pseudo-siamese network can be reshaped into the 2-channel
form, so the 2-channel network provides even more flexibil-
ity [47]. All the above three types of networks can be adopted in
the proposed CNN-based image fusion method. In this work, we
choose the siamese one as our CNN model mainly for the follow-
ing two considerations. First, the siamese network has a more
natural interpretation in image fusion tasks. The two branches with
shared weights indicate that the feature extraction or activity level
measurement is exactly the same for the two source images, which
is a generally accepted practice in most image fusion methods.
Second, a siamese network is usually easier to
train than the other two types of networks. As mentioned
above, the siamese network can be viewed as a special case of
the pseudo-siamese one and 2-channel one, so its solution space
is much smaller than those of the other two types, leading to an
easier convergence.
Another important issue in network design is the selection of
input patch size. When the patch size is set to 32 × 32, the clas-
sification accuracy of the network is usually higher since more image
content is used. However, there are several drawbacks which
cannot be ignored using this setting. As is well known, the max-
pooling layers are of great significance to the performance of a
convolutional network. When the patch size is 32 × 32, the num-
ber of max-pooling layers is not easy to determine. More specif-
ically, when there are two or even more max-pooling layers in a
branch, which means that the stride of patches is at least four
pixels, the fusion results tend to suffer from block artifacts. On
the other hand, when there is only one max-pooling layer in a
branch, the CNN model size is usually very large since the number
of weights in fully-connected layers significantly increases. Further-
more, for multi-focus image fusion, the setting of 32 × 32 is often
not very accurate because a 32 × 32 patch is more likely to contain
both focused and defocused regions, which will lead to undesirable
results around the boundary regions in the fused image. When the
patch size is set to 8 × 8, the patches used to train the CNN model
are too small, so the classification accuracy cannot be guaranteed.
Based on the above considerations as well as experimental tests,
we set the patch size to 16 × 16 in this study.
Fig. 2 shows the CNN model used in the proposed fusion algo-
rithm. It can be seen that each branch in the network has three
convolutional layers and one max-pooling layer. The kernel size
and stride of each convolutional layer are set to 3 × 3 and 1,
respectively. The kernel size and stride of the max-pooling layer
are set to 2 × 2 and 2, respectively. The 256 feature maps ob-
tained by each branch are concatenated and then fully-connected
with a 256-dimensional feature vector. The output of the network
is a 2-dimensional vector that is fully-connected with the 256-
dimensional vector. Actually, the 2-dimensional vector is fed to a
2-way softmax layer (not shown in Fig. 2 ) which produces a proba-
bility distribution over two classes. In the test/fusion process, after
converting the two fully-connected layers into convolutional ones,
the network can be fed with two source images of arbitrary size
as a whole to generate a dense score map [39,43,45] . When the
source images are of size H × W , the size of the output score map
is $(\lceil H/2 \rceil - 8 + 1) \times (\lceil W/2 \rceil - 8 + 1)$, where $\lceil \cdot \rceil$ denotes the ceiling operation. Fig. 3 shows the correspondence between the source
images and the obtained score map. Each coefficient in the score
map is the output score of a pair of 16 × 16 source image patches
passed forward through the network. In addition, the
stride of the adjacent patches in source images is two pixels be-
cause there is one max-pooling layer in each branch of the net-
work.
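For reference, here is a PyTorch sketch of a network consistent with the description above and with the score-map size formula: three 3 × 3 convolutions with stride 1 and 'same' padding plus one 2 × 2 max-pooling with stride 2 per branch, 256 feature maps at the end of each branch, and two fully-connected layers followed by a 2-way softmax. PyTorch, the channel widths of the first two convolutional layers (64 and 128) and the placement of the pooling layer at the end of the branch are assumptions for illustration, not details stated in the text.

```python
import torch
import torch.nn as nn

class SiameseFocusCNN(nn.Module):
    """Sketch of the Fig. 2 model: two weight-sharing branches, three 3x3
    convolutions (stride 1) and one 2x2 max-pooling (stride 2) per branch,
    followed by two fully-connected layers and a 2-way softmax."""

    def __init__(self):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),       # 16x16 patch -> 256 maps of 8x8
        )
        self.fc1 = nn.Linear(2 * 256 * 8 * 8, 256)       # concatenated branch features
        self.fc2 = nn.Linear(256, 2)                     # 2-way output

    def forward(self, patch_a, patch_b):
        fa = self.branch(patch_a).flatten(1)             # shared weights: same branch
        fb = self.branch(patch_b).flatten(1)
        feat = torch.cat([fa, fb], dim=1)
        return torch.softmax(self.fc2(torch.relu(self.fc1(feat))), dim=1)

# For whole-image inference, fc1/fc2 are converted to 8x8 and 1x1 convolutions
# (see the reshaping sketch above), giving a score map of size
# (ceil(H/2) - 8 + 1) x (ceil(W/2) - 8 + 1) for H x W source images.
```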