An Efficient Network Model For Visible and Infrared Image Fusion
ABSTRACT Visible and infrared image fusion (VIF) aims to reconstruct an informative and panoramic
image for subsequent image processing or human vision. Owing to its widespread application in military
and civil fields, VIF technology has developed considerably in recent decades. However, the assignment
of weights and the selection of fusion rules seriously restrict the performance of most existing fusion
algorithms. In response to this issue, an innovative and efficient VIF model based on a convolutional neural
network (CNN) is proposed in this paper. Firstly, multi-layer convolution kernels are applied to the two
source images in a multi-scale manner to extract salient image features. Secondly, the extracted feature maps
are concatenated along the channel dimension. Finally, the fused feature maps are reconstructed to obtain the
fusion image. The main innovation of this paper is to adequately preserve meaningful details and adaptively
integrate feature information, driven by the source image content, within a CNN learning model. In addition,
in order to adequately train the network model, we generate a large-scale, high-resolution training dataset
based on the COCO dataset. Experimental results indicate that the proposed method not only achieves
consistently outstanding visual quality and objective metrics compared with existing fusion methods, but
also has advantages in runtime efficiency over other neural network algorithms.
INDEX TERMS Convolutional neural network, multi-feature extraction, optimized network, visible and
infrared image fusion.
or upgrade mass filters to improve the information extraction ability, such as the Laplacian pyramid [14], discrete wavelet transform (DWT) [15], curvelet transform [16], dual-tree complex wavelet transform (DTCWT) [17] and non-subsampled contourlet transform (NSCT) [18]. Although the fusion performance of these algorithms is observably improved with optimized filter banks, the fixed manual filters, which suit only specific structural features, still miss some useful characteristic information of the source images. In addition, fusion rule selection is indispensable for assigning appropriate weights to the transform coefficients, which can easily lead to feature information loss or distortion and make the fusion image suffer from low contrast or blurring [19]. In contrast, spatial-domain algorithms directly integrate the gray-level information of the source images under a certain criterion. These methods can easily cause image distortion and reduce the quality of the fusion image.

In recent years, deep learning has achieved rapid development in the field of image processing, such as image classification, image super-resolution, object identification, and so on [20], [21], [22], [23]. Owing to its capability and flexibility, deep learning with convolutional neural networks (CNN) also yields brilliant results in the image fusion task [24]. Li et al. [25] first introduced CNN into the image fusion task and took the lead in applying deep learning to combine visible images with infrared images [26]. They first decomposed the source images into base parts and detail content. The base parts were then fused by weighted averaging, and the details were merged by a deep neural network. Finally, the two components were accumulated directly to obtain a comprehensive fusion image. Subsequently, multifarious CNN-based fusion models were proposed for special purposes. Li et al. [27] extracted deep features with ResNet and normalized them with zero-phase component analysis (ZCA) to acquire the initial weight coefficients; the weight map for integrating image feature information was then refined by a SoftMax operation. They later proposed, separately, a nest connection architecture with spatial/channel attention models and an end-to-end residual fusion network for infrared and visible images [28], [29], which immensely boost the fusion performance by improving the network structure. Zhang et al. [30] proposed a general image fusion framework based on the convolutional neural network. Inspired by transform-domain image fusion algorithms, they introduced the concept of multi-scale convolution kernels and achieved comparable or even better image fusion results. However, elementwise fusion rules are utilized to fuse the convolutional features of multiple inputs, which inevitably leads to the loss of some feature information. Tang et al. [31] designed a pixel-wise convolutional neural network for multi-focus image fusion, but the decision map for information integration was obtained by comparing the values of two score matrixes; this kind of method is seemingly effective only for multi-focus image fusion. Ren et al. [32] proposed a novel infrared and visible image fusion method based on an improved DenseNet, Max-Relevance and Min-Redundancy, and zero-phase component analysis, where the fusion strategy is optimized by elaborating activity level maps based on related feature processing. Si et al. [33] proposed a dual fusion path generative adversarial network for infrared and visible image fusion, and implemented a dual self-attention feature refinement module (DSAM) on the two fusion paths to refine their feature maps; this kind of targeted design distinctly improves the contrast of the fusion image. Although these methods can obtain good fusion effects to a certain extent, the complexity of pre-processing and the instability of fusion rules limit their practical application. Based on the above fusion cases, most current CNN-based fusion algorithms are mainly built on the idea of image classification or segmentation to achieve information fusion. However, the characteristics of visible-infrared image pairs and multi-focus image pairs are obviously different, so VIF cannot simply be regarded as an image classification task. Besides, because of the lack of ground truth for visible and infrared images, the trained network model is limited in retaining enough useful information from the source images. In addition, many of the existing CNN-based methods still need to design a special feature weight allocation method and fusion rule similar to the transform-based methods, which is clearly not an easy task for sufficient feature fusion, because a single manual weight coefficient and feature map integration method is not always effective for complex infrared and visible feature information and inevitably overshadows the image fusion performance. Furthermore, a mass of details may be lost randomly and be difficult to preserve in the final fusion image due to the pooling process.

Responding to the above problems, an efficient visible and infrared image fusion network model with an ingenious network structure design is proposed in this paper. The proposed network structure is shown in Figure 1, and the main contributions and innovations of the paper can be summarized as follows:

1) A multi-scale convolutional fusion model with an improved residual block is proposed, which can explicitly integrate deep features without manual weight selection or fusion rule design and has remarkable adaptivity.

2) Compared with popular training datasets derived from low-resolution images or unrealistic ground-truth fusion images, this paper utilizes high-resolution multi-focus images with ground-truth images as the training dataset for infrared and visible image fusion, which raises the upper bound of the network performance and helps the loss function constrain the network to focus on the informative regions of the source images effectively, facilitating the retention of useful information.

3) An effective image reconstruction structure combining affluent skip connections with multi-scale convolutional layers is designed to supplement the image details lost in the pooling process and improve the utilization rate of the image convolution features.

Therefore, the proposed model is a fully convolutional neural network that is trained in an end-to-end manner without preprocessing, and the completely trained deep CNN properly integrates the learned feature extraction, fusion and reconstruction components together to produce reasonable fusion results for visible and infrared images.

The rest of this paper is organized as follows. Section II describes the related work on deep learning, including the image training dataset and the loss function. Section III details the proposed fusion method. Section IV presents the experimental results with the reference algorithms and an analysis from both the subjective and objective evaluation perspectives. Section V draws the conclusions.

II. RELATED WORK
As is well known, the selection of the training dataset and the loss function is directly related to the accuracy of a neural network model. The impact and the selection mode of the training dataset and the loss function are introduced in this section.

A. TRAINING DATASET
Because the quality of the training dataset often directly determines the upper bound of the model performance [30], the more training samples and types there are, the higher the image quality of the training results. For this reason, Tang et al. [31] chose Cifar-10 as the training dataset, which contains 60,000 image blocks of size 32 × 32. Lai et al. [21] selected about 45,000 detail-rich images from ILSVRC 2015 as the original dataset, and these images were then uniformly divided into image blocks of size 128 × 128 for training. In fact, most of the datasets commonly used for neural network training nowadays are composed of small image blocks (32 × 32, 64 × 64). Although small image blocks are beneficial for reducing model training time, the model performance is also greatly restricted because their resolution is low. Therefore, for the sake of promoting the performance of the neural model, the image block size is set to 256 × 256 in this paper.

Furthermore, because of the lack of ground-truth fusion images, searching for an appropriate training dataset for VIF is challenging. In order to supervise the image fusion models preferably, Zhang et al. [30] used multi-focused images as the training dataset and obtained desirable fusion images. Similarly, Fang et al. [34] selected multiple modal images as training datasets and also achieved satisfactory results in the field of VIF. The different out-of-focus patterns and the sample richness of multi-focused images can improve the stability of the network structure [30]. Therefore, a training dataset for multi-mode image fusion can be constructed by felicitously processing natural images that have ground-truth fusion images, such as multi-focused images. As described above, reasonably segmented multi-focus images, which are more easily generated and have ground-truth fusion images, are chosen as the training dataset in this paper. The multi-focused images are produced by employing 18,800 images derived from the COCO dataset [35].
FIGURE 2. Several image training datasets. The source image is on the left. The right side is the homologous binary image. I1 and
I2 are multi-focused images, and Is is the ground truth fusion image.
The specific steps for generating the multi-focus images are as follows:

Step 1: The completely blurred image I_g is generated by randomly blurring the source image I_s with a Gaussian filter, which can be expressed as:

I_g = G ∗ I_s    (1)

where '∗' denotes the convolution operation and G denotes the Gaussian kernel, whose random kernel radius ranges from 0 to 30 pixels according to (2):

G(x, y) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{x^2 + y^2}{2\sigma^2}}    (2)

where σ is the standard deviation of the Gaussian filter.

Step 2: The edge information of the source image I_s is acquired by the Otsu algorithm [36]. The algorithm obtains the best threshold of the image by the inter-class variance method and distinguishes the background of the image from the target edge information. The edge information is then expanded into region blocks by morphological dilation, and the focus map I_f is obtained. The Otsu criterion for the optimal threshold value is denoted as follows:

T = w_0 w_1 (u_0 − u_1)^2    (3)

where T is the optimal image threshold, w_0 represents the proportion of target points in the image, u_0 represents the average gray value of the target points, w_1 represents the proportion of background points in the image, and u_1 represents the average gray value of the background points.

Step 3: A pair of multi-focused images is generated based on the source image I_s, the blurred image I_g and the focus map I_f. The focused images I_1 and I_2 are fixed according to (4):

I_1 = I_s • I_f + I_g • (1 − I_f)
I_2 = I_s • (1 − I_f) + I_g • I_f    (4)

where 1 represents a matrix whose size is consistent with the source image and whose values are all 1, and '•' denotes the elementwise (dot) product between matrices.

Because the focus area is more random in this paper, the generated focused images are more natural compared to those synthesized from partial data as a whole. Figure 2 shows several sets of multi-focus images and their ground-truth fusion images. The training dataset acquired by the above method possesses two advantages over other manners: (1) higher image resolution; (2) more diverse blurring styles.
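To make the three steps concrete, the following Python sketch shows one possible way to generate a training pair from a single source image with OpenCV and NumPy. It is only an illustration of the procedure described above: the kernel radius range follows the text, while the dilation kernel size and the I/O handling are assumptions rather than the authors' released code.

```python
import cv2
import numpy as np

def make_multifocus_pair(src_path, max_radius=30):
    """Generate a multi-focus training pair (I1, I2) and its ground truth Is."""
    Is = cv2.imread(src_path)                       # ground-truth source image Is
    gray = cv2.cvtColor(Is, cv2.COLOR_BGR2GRAY)

    # Step 1: fully blurred image Ig with a random-radius Gaussian kernel (Eq. 1-2)
    radius = np.random.randint(1, max_radius + 1)
    ksize = 2 * radius + 1                          # odd kernel size
    Ig = cv2.GaussianBlur(Is, (ksize, ksize), 0)    # sigma derived from ksize by OpenCV

    # Step 2: Otsu threshold (Eq. 3) + morphological dilation -> focus map If
    _, edges = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    kernel = np.ones((15, 15), np.uint8)            # dilation size is an assumption
    If = cv2.dilate(edges, kernel) / 255.0          # binary focus map in [0, 1]
    If = If[..., None]                              # broadcast over color channels

    # Step 3: complementary focused images (Eq. 4)
    I1 = Is * If + Ig * (1.0 - If)
    I2 = Is * (1.0 - If) + Ig * If
    return I1.astype(np.uint8), I2.astype(np.uint8), Is
```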
B. LOSS FUNCTION
The aim of image fusion is to reasonably combine the salient feature information of the source images into an informative and comprehensive image. However, the difference between the output data and the real data is repeatedly evaluated by the loss function during network training, which is likely to lead to unexpected information loss in regression when an illogical loss function is employed for a certain type of image. Therefore, before applying deep learning to visible and infrared images, it is necessary to ascertain an appropriate loss function to optimize the parameters of the neural network model so that it grabs more abundant textural features from the source images.

Mean squared error (MSE) is universally used as the loss function to pull the model predictions toward the ground-truth output in various neural network algorithms. However, it causes a common problem for visible and infrared images: the fusion results contain fewer details or are too smooth [25]. In view of the fact that infrared and visible images acquired from the same scene contain a lot of similar structural information, we choose the structural similarity (SSIM) loss [37] to optimize the parameters of the network:

SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}    (5)

where x is the real image, y is the predicted image, µ_x and µ_y are the means, σ_x^2 and σ_y^2 are the variances, and σ_xy is the covariance. C_1 = (Lk_1)^2 and C_2 = (Lk_2)^2 are stabilizing constants, L is the dynamic range of the pixel values, k_1 = 0.01, and k_2 = 0.03. When the two images converge, SSIM gets close to 1; conversely, SSIM approaches 0. Thus, the loss function is defined as follows:

I_SSIM = 1 − SSIM(x, y)    (6)
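As a reference point, the SSIM-based loss of (5)–(6) can be sketched in PyTorch as below. This is a simplified, global (non-windowed) variant written for clarity; practical SSIM implementations usually compute the statistics over a sliding Gaussian window, so this should be read as an illustration of the formula rather than the exact training code used in this paper.

```python
import torch

def ssim_loss(x, y, dynamic_range=1.0, k1=0.01, k2=0.03):
    """1 - SSIM(x, y) per Eq. (5)-(6), computed globally per image and channel.

    x, y: tensors of shape (N, C, H, W) with values in [0, dynamic_range].
    """
    C1 = (k1 * dynamic_range) ** 2
    C2 = (k2 * dynamic_range) ** 2

    mu_x = x.mean(dim=(2, 3))                       # per-image, per-channel means
    mu_y = y.mean(dim=(2, 3))
    xc = x - mu_x[:, :, None, None]                 # centered images
    yc = y - mu_y[:, :, None, None]
    var_x = (xc ** 2).mean(dim=(2, 3))
    var_y = (yc ** 2).mean(dim=(2, 3))
    cov_xy = (xc * yc).mean(dim=(2, 3))

    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)) / (
        (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return 1.0 - ssim.mean()                        # Eq. (6), averaged over the batch
```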
C. CNN FOR IMAGE FUSION
Through the literature review, it is found that the existing CNN image fusion technology has some applications to infrared and visible images [27], [28], [29], but it is more often aimed at multi-focus image fusion [20], [21], [24], [25], [31], [35]. The main reason for this phenomenon is that CNN is easily applied to image classification or segmentation tasks in image analysis owing to its special convolutional characteristic. It is well known that the key point in multi-focus image fusion methods is seeking an optimized focus measure (FM), which can be regarded as a classification problem that discriminates focused and defocused maps. Enlightened by this idea, some researchers used the convolution property of CNN to learn an effective FM for an elaborate focus map, and greatly improved multi-focus image fusion performance. For example, Liu et al. [25] first introduced CNN as a sorting task to fuse multi-focus images. Tang et al. subsequently learned a CNN model that joins activity level measurement and fusion rule to combine multi-focus images [31]. In order to refine the focus map without post-processing, Zhang et al. designed an end-to-end fully convolutional neural network and achieved state-of-the-art results [30]. Amin-Naji et al. integrated three CNN models to construct an optimized segmented decision map for multi-focus image fusion [35]. Du and Gao [20] introduced the segmentation idea to construct a multi-focus image fusion model. Because multi-focus images have obvious blurred and clear areas, the above convolutional-neural-network-based methods have achieved astounding results in multi-focus image fusion. However, there are obvious differences between infrared and visible images deriving from their imaging modes. For instance, visible images mainly exhibit rich details and high spatial resolution but may weaken, or even silence, momentous targets, whereas infrared images highlight salient targets against the background but lack texture details. In addition, these features usually overlap in different areas between the infrared and visible images. Therefore, the fusion task for infrared and visible images cannot just be seen as a simple image classification or segmentation.

For the above reasons, some deep learning methods suitable for infrared and visible image fusion have been studied from the aspects of network structure design and image feature analysis. For example, Li et al. [27] extracted more deep features with a residual network and normalized them with zero-phase component analysis (ZCA) to acquire the initial weight coefficients; the weight map for integrating image feature information was then refined by a SoftMax operation. They later proposed, separately, a nest connection architecture with spatial/channel attention models and an end-to-end residual fusion network for infrared and visible images [28], [29], which immensely boost the fusion performance by improving the network structure. Jian et al. [38] overcome information redundancy with a symmetric encoder-decoder block network, but the middle-layer information is ignored. In order to retain significant infrared targets, Ma et al. [39] proposed an image fusion network based on salient target detection, where the target regions are marked by a salient target mask similar to a classifier. Inspired by transform-domain image fusion algorithms, Zhang et al. [30] introduced the concept of multi-scale to the convolutional neural network and achieved comparable or even better image fusion results; however, elementwise fusion rules are utilized to fuse the convolutional features of the multiple inputs, which inevitably leads to the loss of some feature information. Ren et al. [32] proposed a novel infrared and visible image fusion method based on an improved DenseNet, Max-Relevance and Min-Redundancy, and zero-phase component analysis, where the fusion strategy is optimized by elaborating activity level maps based on related feature processing. Si et al. [33] proposed a dual fusion path generative adversarial network for infrared and visible image fusion, and implemented a dual self-attention feature refinement module (DSAM) on the two fusion paths to refine their feature maps. Nevertheless, the instability of the fusion rules in these kinds of methods limits their practical application.
FIGURE 3. General convolutional layer structure. F_i is the input image, F_i+1 is the output image.
W_i is the convolutional kernel size, and D_i is the number of channels.
Based on the above representation, current CNN-based fusion models for infrared and visible images universally need to design a special feature weight allocation method and fusion rule similar to the transform-based methods, which is clearly not an easy task for sufficient feature fusion, because a single manual weight and fusion rule is not always effective for complex infrared and visible feature information and inevitably overshadows the image fusion performance. Meanwhile, some feature information is easily lost because these network models lack sufficient feature extraction and retention ability. Therefore, we propose a new CNN-based fusion method (MLCNN), which introduces an improved residual block into a multi-scale convolutional fusion model to determine the weight map and fusion mode adaptively, and adequately combines skip connections with multi-scale convolutional layers to supplement the image details lost in the pooling process and improve the utilization rate of the image convolution features. Based on this, the proposed model can fully exploit the data mining capability of convolutional neural networks to extract enough deep features and preserve more meaningful details during model training, and simultaneously realizes the integration of deep features adaptively without a manual fusion rule.

III. PROPOSED FUSION METHOD
As reported in previous literature, multi-layer convolutional filters have superior ability to traditional multi-scale filters in feature information extraction [40]. Encouragingly, the weight coefficients for integrating the source images can be acquired and optimized adaptively by convolutional filters; in contrast, in the transform domain the weight coefficients can only be fixed stiffly through pre-set fusion rules. Therefore, inspired by the idea of multi-scale decomposition and the resounding success of IFCNN, an efficient visible and infrared image fusion model based on a multi-layer convolutional neural network (abbreviated as MLCNN) is proposed, which is an end-to-end fully convolutional structure without preprocessing and has great adaptability for determining the weight coefficients. Similar to the image fusion process based on multi-scale decomposition, MLCNN can be divided roughly into three components according to the role of each part: the multi-scale feature extraction strategy, the feature fusion strategy and the image reconstruction strategy, as shown in Figure 1. The specific details of each strategy in MLCNN are explained in the subsequent subsections.

A. MULTI-SCALES FEATURE EXTRACTION STRATEGY
As shown in Figure 3, the convolutional layer, which is the core of CNN, can pick out the feature information of an image with the help of the training dataset. Therefore, a reasonable and appropriate convolution kernel (CK) is critical for feature extraction. It is interesting to note that a CK with a small size is sensitive to low-frequency and small detail information, while a CK with a large size is favorable for capturing high-frequency and large detail information [41]. According to the above facts, a multi-features extraction block (MFE), in which multiple sizes of CK are inserted in one convolutional layer, is introduced in this paper. The specific structure of the CK is shown simplistically in Figure 4.

In order to extract the low- and high-level features separately and specifically, the sizes of the CK are set to 3 × 3, 5 × 5 and 7 × 7 in our network model, respectively. The feature maps of each input source image are subsequently concatenated along the channel dimension. As is well known, a CK with a large size requires the network to train more parameters, which means more time and a slower algorithm. For the sake of improving execution efficiency, the convolutions of sizes 7 × 7 and 5 × 5 are converted into three connected 3 × 3 convolutions and two connected 3 × 3 convolutions, respectively, which greatly reduces the number of parameters and speeds up the network training [42]. For more refined extraction of image features, four MFEs are incorporated, and one rectified linear unit (ReLU) layer is added after each convolution, which increases the nonlinear relationship between the layers and reduces the dependence between parameters. In consideration of the fact that the input training images are large, a max-pooling layer of size 2 × 2 is applied after each MFE to decrease the training parameters by reducing the image size and thus optimize the model performance.
FIGURE 4. Multi-features extraction structure. F_i is the input image, F_i+1 is the output image. W_1, W_2, W_3 are different
size convolution kernels, and D_i is the number of channels.
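A minimal PyTorch sketch of such an MFE block and the four-stage extractor is given below. The branch structure (one 3 × 3 convolution, two stacked 3 × 3, and three stacked 3 × 3, each followed by ReLU, with a 2 × 2 max-pooling after every block) follows the description above, while the channel widths are assumptions, since the paper does not list them explicitly.

```python
import torch
import torch.nn as nn

class MFE(nn.Module):
    """Multi-features extraction block: parallel receptive fields of 3x3, 5x5
    (two stacked 3x3) and 7x7 (three stacked 3x3), each conv followed by ReLU."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        def conv3(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))
        self.branch3 = conv3(in_ch, branch_ch)                  # 3x3 receptive field
        self.branch5 = nn.Sequential(conv3(in_ch, branch_ch),   # two 3x3 ~ 5x5
                                     conv3(branch_ch, branch_ch))
        self.branch7 = nn.Sequential(conv3(in_ch, branch_ch),   # three 3x3 ~ 7x7
                                     conv3(branch_ch, branch_ch),
                                     conv3(branch_ch, branch_ch))

    def forward(self, x):
        # concatenate the three scales along the channel dimension
        return torch.cat([self.branch3(x), self.branch5(x), self.branch7(x)], dim=1)

class Extractor(nn.Module):
    """Four MFE blocks, each followed by a 2x2 max-pooling layer (Section III-A)."""
    def __init__(self, in_ch=3, branch_ch=16):
        super().__init__()
        blocks, ch = [], in_ch
        for _ in range(4):
            blocks += [MFE(ch, branch_ch), nn.MaxPool2d(2)]
            ch = 3 * branch_ch
        self.body = nn.Sequential(*blocks)

    def forward(self, x):
        return self.body(x)
```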
B. MULTI-SCALES FEATURE FUSION STRATEGY
After the feature extraction operation is complete, two columns of multi-dimensional and multi-scale features, from the infrared and visible images respectively, are obtained. Although the two sub-networks own a similar architecture, their corresponding feature maps are different. Therefore, a reasonable method is required to integrate the convolutional features of the two images from the sub-networks. In the field of CNN-based information fusion, researchers usually adopt one of two tactics to integrate these convolutional features: (1) the same-layer convolutional features from the different images are first concatenated along the channel dimension, and the stacked convolutional features are then consolidated by a proper convolution; (2) the same-layer convolutional features from the different images are directly combined by elementwise fusion rules (such as elementwise-maximum, elementwise-sum and elementwise-mean) [30]. Although the elementwise fusion rules are widely used in CNN-based information fusion, they inevitably tend to submerge or smooth some useful and important features of the source images, which makes the fusion image exhibit halos or jitter [26]. In view of the diversity of image backgrounds and details, the adaptivity of this tactic is not ensured. Hence, to prevent an artificially selected fusion strategy from potentially degrading the performance and adaptiveness of the proposed model, the concatenation method is utilized to integrate the extracted convolutional features in this paper.

When the dimension of the concatenated feature map is subsequently reduced, part of the feature information can be lost or overwhelmed. In order to avoid this problem in the training process, the ResNet structure [27], as shown in Figure 5, is employed in this paper. The introduction of residual blocks can achieve stable cross-channel information fusion to a certain extent, which is in favor of reducing information loss. Furthermore, with the help of the ResNet structure, the input information can flow directly from any low layer to a high layer in forward propagation, which is beneficial for averting network degradation. Meanwhile, the error information can be transferred directly to the lower layers without any intermediate weight matrix transformation in backpropagation, which strongly counteracts the gradient disappearance problem. To sum up, the ResNet structure makes the forward and backward propagation more unhindered and strengthens the ability to capture deep feature information.

FIGURE 6. Feature fusion structure. M is the image size, W_1 and W_2 are different-size convolution kernels, and D_1 and D_2 are the numbers of image channels, respectively.
The specific feature fusion structure with the ResNet network is shown in Figure 6. A module similar to the residual block structure is added to the multi-feature fusion strategy, which not only avoids the problem of feature detail loss but also deals with the problem of network gradient disappearance. The structure of the residual block is expressed as:

Φ_{i+1} = g((W_i + 1) ∗ Φ_i + b_i)    (7)

where W_i and b_i denote the convolution kernel and bias of the i-th layer, respectively, Φ_i is the output feature map of the i-th convolution layer, g(·) represents the activation function, and '∗' denotes the convolution operation.
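The following sketch illustrates this fusion strategy in PyTorch: the two feature columns are concatenated along the channel dimension and then passed through a residual module in the spirit of (7). The exact layer widths of Figure 6 are not reproduced here; they are placeholders.

```python
import torch
import torch.nn as nn

class ResidualFusion(nn.Module):
    """Fusion by channel concatenation followed by a residual block (Eq. 7):
    the identity path carries the merged features across the convolution,
    so channel reduction does not wash out either modality."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = nn.Sequential(                 # merge the stacked channels
            nn.Conv2d(2 * channels, channels, 1), nn.ReLU(inplace=True))
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, feat_ir, feat_vis):
        x = torch.cat([feat_ir, feat_vis], dim=1)    # concatenation, not elementwise rules
        x = self.reduce(x)
        return self.act(self.conv(x) + x)            # Phi_{i+1} = g((W_i + 1) * Phi_i + b_i)
```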
C. IMAGE RECONSTRUCTION STRATEGY
The image reconstruction strategy, as shown in Figure 7, consists of four trainable convolutional layers. In consideration of the fact that the max-pooling layers reduce the size of the image during the multi-scale feature extraction process, an up-sampling operation, which gradually restores the pooled feature maps to the source image size, is performed on the fusion layers by means of transposed convolution layers. The transposed convolution layer is described by (8)–(10), where X and Y represent the input and output image matrices (square matrices), m and n represent the matrix scales with m = n/2, K represents the convolution kernel parameters of the transposed convolution layer, C represents the sparse matrix of K, and C^T represents its transpose.

X = \begin{bmatrix} x_{11} & \cdots & x_{1n} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{nn} \end{bmatrix}_{n \times n}, \qquad K = \begin{bmatrix} w_{11} & w_{12} & w_{13} \\ w_{21} & w_{22} & w_{23} \\ w_{31} & w_{32} & w_{33} \end{bmatrix}    (8)

C = \begin{bmatrix}
w_{11} & w_{12} & w_{13} & 0 & w_{21} & w_{22} & w_{23} & 0 & w_{31} & w_{32} & w_{33} & 0 & 0 & 0 & 0 & 0 \\
0 & w_{11} & w_{12} & w_{13} & 0 & w_{21} & w_{22} & w_{23} & 0 & w_{31} & w_{32} & w_{33} & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & w_{11} & w_{12} & w_{13} & 0 & w_{21} & w_{22} & w_{23} & 0 & w_{31} & w_{32} & w_{33} & 0 \\
0 & 0 & 0 & 0 & 0 & w_{11} & w_{12} & w_{13} & 0 & w_{21} & w_{22} & w_{23} & 0 & w_{31} & w_{32} & w_{33}
\end{bmatrix}    (9)

Y = \begin{bmatrix} y_{11} & \cdots & y_{1m} \\ \vdots & \ddots & \vdots \\ y_{m1} & \cdots & y_{mm} \end{bmatrix}_{m \times m}, \qquad C^{T} \times \begin{bmatrix} y_{11} \\ \vdots \\ y_{mm} \end{bmatrix} = \begin{bmatrix} x_{11} \\ \vdots \\ x_{nn} \end{bmatrix}    (10)

The transposed convolutional layer can only restore the original size of the output image, but cannot recover the image pixel values. To solve this problem, the network structure is optimized by skip connections linking the multi-scale feature extraction layers with the reconstruction layers, which is conducive to supplementing the details missing from the pooling process, preserves the edge information of the source images, and also helps avoid gradient disappearance.
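A minimal sketch of one reconstruction stage is shown below, assuming a transposed convolution that doubles the spatial size (m = n/2 in (8)–(10)) and a skip connection concatenated from the corresponding extraction stage; the channel counts are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One reconstruction stage: a transposed convolution doubles the spatial
    size, and a skip connection from the matching extraction stage re-injects
    the details removed by max-pooling."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.fuse = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.ReLU(inplace=True))

    def forward(self, x, skip):
        x = self.up(x)                               # restore the pooled resolution
        x = torch.cat([x, skip], dim=1)              # skip connection from the encoder
        return self.fuse(x)
```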
FIGURE 8. Test images. IR indicates infrared images and IV indicates visible images.
FIGURE 9. Visible and infrared original images. a1 is the infrared image "Car"; a2 is the visible image "Car"; b1 is the
infrared image "Human"; b2 is the visible image "Human"; c1 is the infrared image "Wilderness"; c2 is the visible
image "Wilderness"; d1 is the infrared image "Factory"; d2 is the visible image "Factory."
IV. EXPERIMENTAL RESULTS AND ANALYSIS
A. EXPERIMENT PREPARATION
In order to adequately validate the effectiveness of the proposed convolutional neural network model, twelve groups of infrared and visible images, as shown in Figure 8, are used to test the proposed algorithm. These test images were acquired in different experimental environments, which is sufficient to demonstrate the stability and adaptability of the proposed algorithm. Meanwhile, substantial subjective and objective analyses are given with eleven state-of-the-art reference image fusion algorithms. These comparison algorithms are deep learning fusion (DLF) [26], the residual neural network (ResNet) [27], RfnNet [28], NestNet [29], the convolutional neural network (CNN) [41], guided filtering (GFF) [43], gradient transfer and total variation minimization (GTF) [44], anisotropic diffusion (ADF) [45], multi-resolution singular value decomposition (MSVD) [46], the saliency-based method (TIF) [47], and the hybrid model (VSMWL) [48]. The parameter settings of all reference algorithms are strictly consistent with the original literature. All algorithms used in this paper are executed on the same computer with an Intel i5-1035G1 CPU (1 GHz) and a 2 GB GPU. The proposed fusion model, MLCNN for short, is implemented with PyTorch 1.8.1 on Python 3.9.4. 18,800 pairs of multi-focused images are trained with an image size of 256 × 256 and a batch size of 32 in the training process, and the learning rate was set to 0.001 using the Adam optimizer [39].
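For orientation, the training configuration described above can be sketched as follows. The dataset and model arguments are placeholders for the multi-focus pairs of Section II-A and the MLCNN network of Figure 1, the number of epochs is an assumption, and the loss function is passed in explicitly (e.g., the SSIM loss sketched in Section II-B).

```python
import torch
from torch.utils.data import DataLoader

def train(model, dataset, loss_fn, epochs=10, device="cuda"):
    """Training sketch: Adam, lr 0.001, batch size 32, 256x256 patches (Sec. IV-A)."""
    loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=4)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    model.to(device).train()

    for epoch in range(epochs):                    # epoch count is a placeholder
        for i1, i2, gt in loader:                  # two defocused inputs and ground truth
            i1, i2, gt = i1.to(device), i2.to(device), gt.to(device)
            fused = model(i1, i2)                  # fused prediction
            loss = loss_fn(fused, gt)              # e.g. the SSIM loss of Eq. (6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```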
B. OBJECTIVE EVALUATION METRICS
Objective evaluation is an important measure of image fusion quality besides subjective visual analysis; it can effectively make quantitative comparisons based on the characteristics of the fusion images. At present, plentiful objective evaluation criteria have been proposed for different types of image quality analysis. Considering that a ground-truth fusion image does not exist for the visible and infrared image fusion task, and in order to reveal the details and other characteristic information of the fusion images and verify the performance of the proposed fusion model, five objective image metrics, namely the average gradient (AG), information entropy (IE), spatial frequency (SF), edge information retention (QAB/F) and Piella [44], [49], are adopted to assess the quality of the various fusion results. Larger values of the above five metrics indicate that the corresponding fusion results contain more valuable information.
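The exact formulations of these metrics vary slightly across the literature; the NumPy sketch below shows one common definition of AG and SF purely as an illustration of how such quality indices are computed, not as the evaluation code used here.

```python
import numpy as np

def average_gradient(img):
    """Average gradient (AG): mean magnitude of local intensity changes."""
    img = img.astype(np.float64)
    gx = np.diff(img, axis=1)[:-1, :]      # horizontal differences
    gy = np.diff(img, axis=0)[:, :-1]      # vertical differences
    return np.mean(np.sqrt((gx ** 2 + gy ** 2) / 2.0))

def spatial_frequency(img):
    """Spatial frequency (SF): combined row and column frequencies."""
    img = img.astype(np.float64)
    rf = np.sqrt(np.mean(np.diff(img, axis=1) ** 2))   # row frequency
    cf = np.sqrt(np.mean(np.diff(img, axis=0) ** 2))   # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)
```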
C. RESULTS AND DISCUSSION
Limited by the paper length, the fusion results of four groups of images with obvious feature differences, as shown in Figure 9 — "Car" (a1 and a2), "Human" (b1 and b2), "Wilderness" (c1 and c2), and "Factory" (d1 and d2) — under the algorithms mentioned above are discussed in detail in this section. According to the fusion results acquired from the various algorithms, this paper presents a comparative analysis from the perspectives of visual effects and objective evaluation. The best, second-best and third-best values are indicated in bold, red italics and blue italics in Table 1–Table 5, respectively.

The fusion results of the twelve algorithms are shown in Figures 10–13 to validate the effectiveness of the proposed CNN-based method. Firstly, we compare the fusion performance of the different algorithms from the perspectives of visual effects and objective evaluation in Figure 10. Figure 10 exhibits the difference between the proposed algorithm and the reference algorithms. It is intuitive that the proposed method owns the best visual effect, with harmonized visible details and infrared features. Conversely, some visible details are lost or the infrared target is annihilated in Figure 10 (a)-(e) and (g)-(k). Taking the car in the red box as an example, the car's infrared signature is preserved reasonably against the bright visible light in Figure 10 (f) and (j). However, the car target has almost disappeared in Figure 10 (a), (b), (c), (e), (g), (i), (j) and (k), and Figure 10 (d), (h) and (i) lose many visible details and show obvious halo phenomena in some areas, which severely affects human vision.

The values in Table 1 corresponding to the fusion results in Figure 10 reveal the fusion performance of the various algorithms from an objective perspective, complementing the subjective visuals. It can be observed that the proposed method is significantly better than the reference algorithms in terms of the AG and SF values, which indicates that the fusion result of the proposed method possesses richer texture details than the other algorithms. Although the model has lower IE, Piella and QAB/F values than CNN, GFF and TIF, the values that the proposed method achieves are acceptable, which denotes appropriate information fidelity and evident edge information. By reason of the foregoing, the proposed algorithm acquires better fusion performance for the visible and infrared image fusion of "Car" from a comprehensive perspective.

The fusion results of the various algorithms with "Human" as the source images are displayed in Figure 11. A perfect fusion result should include the distinctive human features from the infrared image and the clear background details from the visible image. Figure 11 (a), (c), (f) and (i) show that ADF, DLF, TIF and ResNet can integrate the available information from the source images to a reasonable extent, but there is a distinct halo around the edge of the person, and the infrared signature tends to be dim compared to the source image. Figure 11 (b), (e) and (j) make clear that these methods are invalid when processing the background information, which causes some visible details to be obscured. Although GFF and VSMWL can merge the mutual information of the original images in Figure 11 (d) and (g), the results are visually unnatural due to information distortion in some areas. Obviously, the fusion result presented in Figure 11 (h) tends to be rayless because many visible details and infrared features are equalized. The fusion result based on RfnNet shows a dim visible background and also a distinct halo around the edge of the person, so the visual effect is severely affected. On the whole, the proposed method provides the most observable fusion result, with abundant visible details and infrared features, as shown in Figure 11 (l).

As can be noticed from the objective metrics in Table 2, the proposed method achieves excellent values although some values are ostensibly lower than others. For example, TIF gains greater evaluation values than the proposed method in AG and IE, and CNN gets the best SF value, which is mainly caused by the inconsistent fluctuation of the image gray level as shown in Figure 11. Although the corresponding values of the proposed method are not optimal, they are obviously higher than those of the other reference methods, including five kinds of classical neural network algorithms. This shows that the fusion result of the proposed method is equipped with rich detailed information from the source images. In addition, the proposed method delivers the best QAB/F value compared with all of the reference algorithms, which illustrates that the proposed method can preserve the edge information of the target from the source images well. Meanwhile, the Piella value in Table 2 reflects that the fusion result of the proposed method is highly correlated with the original images and obtains minimal brightness and contrast distortion. In summary, it is evident that the proposed model has a clear advantage over the reference algorithms in terms of the "Human" VIF.

Figure 12 shows the fusion results of the various methods on "Wilderness". The source visible image has abundant texture information such as grass piles, trees and houses, and the matched infrared image reveals a prominent target. As shown in the red boxes of the various fusion results, the infrared targets are almost lost and the visible details are badly weakened in Figure 12 (e). Although the visible details are preserved to a certain extent in Figure 12 (a), (c), (i) and (k), the infrared target tends to be dim. In general, the proposed method properly integrates the information from the source images, and its visual effect is as good as the results in Figure 12 (b), (d), (f), (g) and (j).

Similar to the objective values in Table 1 and Table 2, the proposed method in Table 3 achieves a creditable evaluation consistent with the subjective vision, although some values seem lower than those of other reference methods. The proposed algorithm seizes the best Piella and SF values, which indicates that the fusion image has the highest correlation with the source images and the optimal brightness and contrast. TIF and CNN obtain the maximum values in AG and IE, which is basically consistent with the visual detail perception shown in Figure 12 (b) and (f). Although the AG, IE and QAB/F values of NestNet are better than those of the proposed method, the image fusion performance of the proposed method is perfectly acceptable. Therefore, it is shown that the proposed model has excellent information integration performance in terms of the "Wilderness" VIF in general.

FIGURE 13. Visible and infrared fusion image of the "Factory". Red boxes indicate highlighted areas of detail.

The final specific comparison experiment, as shown in Figure 13, takes the visible and infrared images of the "Factory" as the object. It is obvious in terms of visual perception that the proposed method acquires a splendid fusion effect. Figure 13 (a), (c), (h) and (i) miss some details and suppress the infrared signature. The original infrared information is well preserved in Figure 13 (b), (g) and (j), but the visible background details tend to be dim, which results in inharmonious visual effects. Apparently, serious fusion failures occur in Figure 13 (d), (e) and (k); these fusion images result in the loss of important information and severe unnatural distortions. As for Figure 13 (f), the infrared information of the person in the red box appears to display a non-uniform distribution. In conclusion, the proposed method gains the optimal visual perception, which supplies abundant visible details and remarkable infrared features naturally, as shown in Figure 13 (l).

As shown in Table 4, the proposed algorithm still seizes the best Piella and SF values, which indicates that the fusion image is highly correlated with the source images and its edge information is more abundant. Although NestNet and VSMWL acquire better values in QAB/F and AG than the proposed method, this is mainly caused by unreasonable visible information loss as shown in Figure 12 (j) and (g). A similar situation exists in DLF, GFF and GTF. As a whole, the proposed algorithm has better capabilities and more obvious advantages in the VIF of "Factory".

To analyze the performance of the various algorithms more sufficiently, Figure 14 presents the fusion results of the other eight groups of tested images for further comparison. It fully illustrates that the proposed method displays excellent and acceptable fusion results compared to the other reference methods, which reflects that the proposed algorithm has excellent fusion performance and substantial stability. Meanwhile, the average objective metrics of the twelve groups of fusion results using the different fusion strategies are listed in Table 5 and Figure 15. The values marked in bold are the best values for all evaluation criteria. Similar to the objective values from Table 1 to Table 4, the proposed algorithm gains the best Piella value and the second-best SF and AG values, which declares that the proposed method can hold the correlation between the fusion result and the source images, and can maintain the brightness and contrast of the corresponding fusion result. Although some reference algorithms obtain better objective values, this phenomenon is mainly caused by the incongruity and irrationality in their fusion images, such as the visual perception of the fusion images of CNN, GFF and RfnNet, etc. Therefore, it can be summarized that the proposed model has excellent fusion performance and stability.
TABLE 6. Total running time (RT) based on testing images (unit: seconds).
TABLE 7. Number of network model parameters (NP) (unit: MB).
FIGURE 15. Quantitative comparison using the mean value of each metric: (a) AG, (b) IE, (c) SF, (d) QAB/F, and (e) Piella.
V. CONCLUSION
In this paper, a novel and efficient visible and infrared image fusion network model based on CNN is proposed. The model has three main advantages compared with current CNN-based VIF methods: (1) A multi-scale convolutional fusion model with an improved residual block is proposed, which can explicitly integrate deep features without manual weight selection and fusion rule design. (2) In order to better train the proposed model, this paper uses the COCO dataset to reasonably generate a training dataset by means of high-resolution, large-scale multi-focus images with ground-truth fusion images. This is significant for optimizing the image fusion model in regression, and helps the loss function constrain the network to focus on the informative regions of the source images effectively, facilitating the retention of useful information. (3) An effective image reconstruction structure combining affluent skip connections with multi-scale convolutional layers is designed to supplement the image details lost in the pooling process. The model is fully convolutional, so it can be trained in an end-to-end manner without a pre-processing step. It has been verified by numerous experiments that the proposed model owns progressive execution performance for infrared and visible image fusion problems compared with the current neural-network-based and popular multi-scale transformation-based methods.

REFERENCES
[1] J. Ji, Y. Zhang, Z. Lin, Y. Li, C. Wang, Y. Hu, F. Huang, and J. Yao, "End to end infrared and visible image fusion with texture details and contrast information," IEEE Access, vol. 10, pp. 92410–92425, 2022.
[2] H. Adeel, M. M. Riaz, and S. S. Ali, "De-fencing and multi-focus fusion using Markov random field and image inpainting," IEEE Access, vol. 10, pp. 35992–36005, 2022.
[3] D. Lei, M. Bai, L. Zhang, and W. Li, "Convolution neural network with edge structure loss for spatiotemporal remote sensing image fusion," Int. J. Remote Sens., vol. 43, no. 3, pp. 1015–1036, Feb. 2022.
[4] L. Ren, Z. Pan, J. Cao, J. Liao, and Y. Wang, "Infrared and visible image fusion based on weighted variance guided filter and image contrast enhancement," Infr. Phys. Technol., vol. 114, May 2021, Art. no. 103662.
[5] X. Jin, Q. Jiang, S. Yao, D. Zhou, R. Nie, J. Hai, and K. He, "A survey of infrared and visual image fusion methods," Infr. Phys. Technol., vol. 85, pp. 478–501, Sep. 2017.
[6] Z. Liu, E. Blasch, and V. John, "Statistical comparison of image fusion algorithms: Recommendations," Inf. Fusion, vol. 36, pp. 251–260, Jul. 2017.
[7] J. Ma, Y. Ma, and C. Li, "Infrared and visible image fusion methods and applications: A survey," Inf. Fusion, vol. 45, pp. 153–178, Jan. 2019.
[8] D. Xu, Y. Wang, X. Zhang, N. Zhang, and S. Yu, "Infrared and visible image fusion using a deep unsupervised framework with perceptual loss," IEEE Access, vol. 8, pp. 206445–206458, 2020.
[9] R. Chen, S. Liu, Z. Miao, and F. Li, "GFSNet: Generalization-friendly Siamese network for thermal infrared object tracking," Infr. Phys. Technol., vol. 123, Jun. 2022, Art. no. 104190.
[10] X. Liu, J. Li, X. Yang, and H. Huo, "Infrared and visible image fusion based on cross-modal extraction strategy," Infr. Phys. Technol., vol. 124, Aug. 2022, Art. no. 104205.
[11] Q. Pan, L. Zhao, S. Chen, and X. Li, "Fusion of low-quality visible and infrared images based on multi-level latent low-rank representation joint with Retinex enhancement and multi-visual weight information," IEEE Access, vol. 10, pp. 2140–2153, 2022.
[12] D. Zhu, W. Zhan, Y. Jiang, X. Xu, and R. Guo, "MIFFuse: A multi-level feature fusion network for infrared and visible images," IEEE Access, vol. 9, pp. 130778–130792, 2021.
[13] Y. Zhang, X. Bai, and T. Wang, "Boundary finding based multi-focus image fusion through multi-scale morphological focus-measure," Inf. Fusion, vol. 35, pp. 81–101, May 2017.
[14] J. Chen, X. Li, L. Luo, X. Mei, and J. Ma, "Infrared and visible image fusion based on target-enhanced multiscale transform decomposition," Inf. Sci., vol. 508, pp. 64–78, Jan. 2020.
[15] X. Wang, J. Yin, K. Zhang, S. Li, and J. Yan, "Infrared weak-small targets fusion based on latent low-rank representation and DWT," IEEE Access, vol. 7, pp. 112681–112692, 2019.
[16] Y. Yang, S. Tong, S. Huang, P. Lin, and Y. Fang, "A hybrid method for multi-focus image fusion based on fast discrete curvelet transform," IEEE Access, vol. 5, pp. 14898–14913, 2017.
[17] M. Asikuzzaman, H. Mareen, N. Moustafa, K. R. Choo, and M. R. Pickering, "Blind camcording-resistant video watermarking in the DTCWT and SVD domain," IEEE Access, vol. 10, pp. 15681–15698, 2022.
[18] P. Singh, M. Diwakar, V. Singh, S. Kadry, and J. Kim, "A new local structural similarity fusion-based thresholding method for homomorphic ultrasound image despeckling in NSCT domain," J. King Saud Univ. Comput. Inf. Sci., vol. 35, no. 7, Jul. 2023, Art. no. 101607.
[19] S. Li, B. Yang, and J. Hu, "Performance comparison of different multi-resolution transforms for image fusion," Inf. Fusion, vol. 12, no. 2, pp. 74–84, Apr. 2011.
[20] C. Du and S. Gao, "Image segmentation-based multi-focus image fusion through multi-scale convolutional neural network," IEEE Access, vol. 5, pp. 15750–15761, 2017.
[21] R. Lai, Y. Li, J. Guan, and A. Xiong, "Multi-scale visual attention deep convolutional neural network for multi-focus image fusion," IEEE Access, vol. 7, pp. 114385–114399, 2019.
[22] T. Yao, Y. Luo, J. Hu, H. Xie, and Q. Hu, "Infrared image super-resolution via discriminative dictionary and deep residual network," Infr. Phys. Technol., vol. 107, Jun. 2020, Art. no. 103314.
[23] Z. Qu, S.-Y. Wang, L. Liu, and D.-Y. Zhou, "Visual cross-image fusion using deep neural networks for image edge detection," IEEE Access, vol. 7, pp. 57604–57615, 2019.
[24] Y. Liu, X. Chen, Z. Wang, Z. J. Wang, R. K. Ward, and X. Wang, "Deep learning for pixel-level image fusion: Recent advances and future prospects," Inf. Fusion, vol. 42, pp. 158–173, Jul. 2018.
[25] Y. Liu, X. Chen, H. Peng, and Z. Wang, "Multi-focus image fusion with a deep convolutional neural network," Inf. Fusion, vol. 36, pp. 191–207, Jul. 2017.
[26] H. Li, X.-J. Wu, and J. Kittler, "Infrared and visible image fusion using a deep learning framework," in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 2705–2710.
[27] H. Li, X.-J. Wu, and T. S. Durrani, "Infrared and visible image fusion with ResNet and zero-phase component analysis," Infr. Phys. Technol., vol. 102, Nov. 2019, Art. no. 103039.
[28] H. Li, X.-J. Wu, and J. Kittler, "RFN-nest: An end-to-end residual fusion network for infrared and visible images," Inf. Fusion, vol. 73, pp. 72–86, Sep. 2021.
[29] H. Li, X.-J. Wu, and T. Durrani, "NestFuse: An infrared and visible image fusion architecture based on nest connection and spatial/channel attention models," IEEE Trans. Instrum. Meas., vol. 69, no. 12, pp. 9645–9656, Dec. 2020.
[30] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, and L. Zhang, "IFCNN: A general image fusion framework based on convolutional neural network," Inf. Fusion, vol. 54, pp. 99–118, Feb. 2020.
[31] H. Tang, B. Xiao, W. Li, and G. Wang, "Pixel convolutional neural network for multi-focus image fusion," Inf. Sci., vols. 433–434, pp. 125–141, Apr. 2018.
[32] K. Ren, D. Zhang, M. Wan, X. Miao, G. Gu, and Q. Chen, "An infrared and visible image fusion method based on improved DenseNet and mRMR-ZCA," Infr. Phys. Technol., vol. 115, Jun. 2021, Art. no. 103707.
[33] S. Yi, J. Li, and X. Yuan, "DFPGAN: Dual fusion path generative adversarial network for infrared and visible image fusion," Infr. Phys. Technol., vol. 119, Dec. 2021, Art. no. 103947.
[34] A. Fang, X. Zhao, J. Yang, B. Qin, and Y. Zhang, "A light-weight, efficient, and general cross-modal image fusion network," Neurocomputing, vol. 463, pp. 198–211, Nov. 2021.
[35] M. Amin-Naji, A. Aghagolzadeh, and M. Ezoji, "Ensemble of CNN for multi-focus image fusion," Inf. Fusion, vol. 51, pp. 201–214, Nov. 2019.
[36] Y. Liu, J. Sun, H. Yu, Y. Wang, and X. Zhou, "An improved grey wolf optimizer based on differential evolution and OTSU algorithm," Appl. Sci., vol. 10, no. 18, p. 6343, Sep. 2020.
[37] L. Li, Z. Xia, H. Han, G. He, F. Roli, and X. Feng, "Infrared and visible image fusion using a shallow CNN and structural similarity constraint," IET Image Process., vol. 14, no. 14, pp. 3562–3571, Dec. 2020.
[38] L. Jian, X. Yang, Z. Liu, G. Jeon, M. Gao, and D. Chisholm, "SEDRFuse: A symmetric encoder–decoder with residual block network for infrared and visible image fusion," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–15, 2021.
[39] J. Ma, H. Zhang, Z. Shao, P. Liang, and H. Xu, "GANMcC: A generative adversarial network with multiclassification constraints for infrared and visible image fusion," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–14, 2021.
[40] H. Yan, X. Yu, Y. Zhang, S. Zhang, X. Zhao, and L. Zhang, "Single image depth estimation with normal guided scale invariant deep convolutional fields," IEEE Trans. Circuits Syst. Video Technol., vol. 29, no. 1, pp. 80–92, Jan. 2019.
[41] Y. Liu, X. Chen, J. Cheng, H. Peng, and Z. Wang, "Infrared and visible image fusion with convolutional neural networks," Int. J. Wavelets, Multiresolution Inf. Process., vol. 16, no. 3, May 2018, Art. no. 1850018.
[42] K. Wang, D. J. Dou, Q. Kemao, J. Di, and J. Zhao, "Y-Net: A one-to-two deep learning framework for digital holographic reconstruction," Opt. Lett., vol. 44, pp. 4765–4768, Oct. 2019.
[43] S. Li, X. Kang, and J. Hu, "Image fusion with guided filtering," IEEE Trans. Image Process., vol. 22, no. 7, pp. 2864–2875, Jul. 2013.
[44] J. Ma, C. Chen, C. Li, and J. Huang, "Infrared and visible image fusion via gradient transfer and total variation minimization," Inf. Fusion, vol. 31, pp. 100–109, Sep. 2016.
[45] D. P. Bavirisetti and R. Dhuli, "Fusion of infrared and visible sensor images based on anisotropic diffusion and Karhunen-Loeve transform," IEEE Sensors J., vol. 16, no. 1, pp. 203–209, Jan. 2016.
[46] V. P. S. Naidu, "Image fusion technique using multi-resolution singular value decomposition," Defence Sci. J., vol. 61, no. 5, p. 479, Sep. 2011.
[47] D. P. Bavirisetti and R. Dhuli, "Two-scale image fusion of visible and infrared images using saliency detection," Infr. Phys. Technol., vol. 76, pp. 52–64, May 2016.
[48] J. Ma, Z. Zhou, B. Wang, and H. Zong, "Infrared and visible image fusion based on visual saliency map and weighted least square optimization," Infr. Phys. Technol., vol. 82, pp. 8–17, May 2017.
[49] G. Piella and H. Heijmans, "A new quality metric for image fusion," in Proc. Int. Conf. Image Process., vol. 3, Sep. 2003, p. 173.

ZHU PAN received the Ph.D. degree in optical engineering from Tianjin University, in 2017. He is mainly engaged in research in the fields of photoelectric detection, image acquisition, and processing. As the project leader, he has undertaken a national project and provincial research projects. He has published more than seven SCI papers in Infrared Physics and Technology, Measure, Optics & Laser Technology, and Optical Review.

WANQI OUYANG received the bachelor's degree in optoelectronic information science and engineering from the Hubei University of Science and Technology, in 2019. He is currently pursuing the master's degree in instrument science and technology with the Wuhan University of Science and Technology. He has participated in the writing of several journal/conference papers. His research interests include image processing and deep learning algorithms.