
Infrared and visible image fusion (IVIF) is devoted to extracting and integrating useful complementary information from multi-modal source images. Current fusion methods usually require a large number of paired images to train their models in a supervised or unsupervised way. In this paper, we propose CTFusion, a convolutional neural network (CNN)-Transformer-based IVIF framework that uses self-supervised learning. The whole framework is based on an encoder-decoder network, where the encoders are endowed with strong local and global dependency modeling ability via the CNN-Transformer-based feature extraction (CTFE) module design. Thanks to self-supervised learning, model training requires only a simple pretext task rather than ground-truth fusion images. We design a mask reconstruction task tailored to the characteristics of IVIF, through which the network learns the characteristics of both infrared and visible images and extracts more generalized features. We evaluated our method against five competitive traditional and deep learning-based methods on three IVIF benchmark datasets. Extensive experimental results demonstrate that CTFusion achieves the best performance among the compared state-of-the-art methods in both subjective and objective evaluations.
Citation: Keying Du, Liuyang Fang, Jie Chen, Dongdong Chen, Hua Lai. CTFusion: CNN-transformer-based self-supervised learning for infrared and visible image fusion[J]. Mathematical Biosciences and Engineering, 2024, 21(7): 6710-6730. doi: 10.3934/mbe.2024294
Due to the hardware limitations of imaging, a single sensor type or setup is often unable to fully represent an imaging scene [1,2]. For example, visible images contain rich texture details, but they are susceptible to extreme environments and occlusion, leading to target loss in the scene. In contrast, infrared sensors image a scene by capturing the thermal radiation emitted by objects, which effectively highlights pedestrians, vehicles, and other significant targets but lacks detailed descriptions of the scene [3]. To represent the scene accurately and effectively, image fusion has been developed to integrate the complementary features of multiple source views of the same scene, thus generating a high-quality image for downstream high-level tasks or human perception [4]. Specifically, infrared and visible image fusion (IVIF) aims to integrate complementary information from the source images and generate a high-contrast fusion image that both highlights salient objects and contains rich texture details [5]. In the early years, traditional methods usually relied on mathematical transformations and manually designed fusion rules to realize image fusion [3], including wavelets [6], pyramids [7], and sparse representation [8]. However, these manually designed feature representations and fusion rules poorly capture the inherent information of images, which limits their ability to mine the statistical characteristics of large samples and their generalizability.
Recently, deep learning has dominated the development of computer vision with its powerful feature extraction and expression capabilities. It has been used in fields such as image classification [9], object detection [10,11,12], semantic segmentation [13], and image fusion [14]. To overcome the shortcomings of traditional algorithms, researchers have explored a large number of deep learning-based image fusion methods, which can be divided into convolutional neural network (CNN)-based and generative adversarial network (GAN)-based IVIF frameworks. Nevertheless, these models exhibit a major weakness: the lack of ground-truth fused images. Some algorithms [15,16] generate ground-truth fusion results with existing state-of-the-art (SOTA) fusion methods, so their fusion quality cannot be guaranteed, since it depends heavily on the quality of the produced "ground truth". Moreover, most existing deep learning-based IVIF methods rely on CNNs for feature extraction, but CNNs fail to model long-range dependencies owing to their small receptive fields, which is an inherent limitation [17].
In a word, there are two major problems in most existing IVIF methods. On the one hand, ground-truth fused images are lacking. As mentioned above, although some methods try to generate so-called ground-truth fused images using other SOTA IVIF approaches, the quality of those produced ground truths cannot be guaranteed, which largely affects the resulting fusion. On the other hand, the failure of CNNs to capture long-range dependencies due to their small receptive fields is a weakness of most deep learning-based IVIF methods. To address these issues, we propose a CNN-Transformer-based IVIF framework utilizing self-supervised learning, dubbed CTFusion. The masked image reconstruction pretext task helps the network better extract features from the source images, and the CNN-Transformer-based encoder structure can utilize both local and global information. Inspired by [18], we adopt an image augmentation strategy that masks some patches of the original images $I_{ir}$ and $I_{vis}$ with noise to generate two "source images", $\tilde{I}_{ir}$ and $\tilde{I}_{vis}$. They are then fed into the CNN-Transformer-based encoders to excavate the intrinsic features $f_{ir}$ and $f_{vis}$ of the source images, and two decoders $D_{ir}$ and $D_{vis}$ produce the repaired versions of $\tilde{I}_{ir}$ and $\tilde{I}_{vis}$. In addition, a self-cross perceptual feature fusion (S-CPFF) strategy is elaborated, through which we combine the features $f_{ir}$ and $f_{vis}$ to generate the fusion result of $I_{ir}$ and $I_{vis}$. The idea of our proposed method can be applied to common image fusion scenarios, since the two abovementioned problems are universal in the image processing field.
The main contributions of this paper are summarized as follows:
● We propose a self-supervised IVIF framework built on a designed mask reconstruction task, which no longer requires ground truth and better excavates the intrinsic information lying in infrared and visible source images.
● To compensate for the defect in establishing long-range dependencies in CNN-based architectures, we design an encoder that combines a CNN-Block with a Transformer-Block, which enables the network to utilize both local and global information during feature extraction.
● The S-CPFF module is devised to help enhance the extracted modality-specific and modality-common features, obtaining a final fusion result with high quality.
● Extensive experiments conducted on three publicly available datasets demonstrate the effectiveness of our method, as well as show its superior performance when compared with other state-of-the-art (SOTA) models.
Traditional fusion frameworks usually realize image fusion in the transform domain or the spatial domain by designing appropriate feature extraction schemes and fusion rules; they generally fall into two major categories: multi-scale transform-based methods [19,20] and sparse representation-based methods [21,22,23,24,25].
Multi-scale transform-based methods first decompose the source images into several levels as feature extraction, then fuse the corresponding layers with particular rules, and finally reconstruct the target image accordingly. Popular transforms used for decomposition and reconstruction include the wavelet [26], pyramid [27], and curvelet [28] transforms and their revised versions. However, these methods tend to leave out image details and introduce halos or undesirable artifacts in the fused results due to the fixed bases they use. The key to sparse representation-based methods is to build overcomplete dictionaries from a large number of natural images so that the source images can be represented by linear combinations of sparse bases. Although sparse representation-based methods have achieved promising performance, a limited number of dictionaries cannot reflect the full information of the input images, obscuring details such as edges and textures in the source images.
In deep learning-based algorithms, two source images from different modalities are directly input into a fusion network, which then outputs the fused image. Specifically, Liu et al. [29] proposed a convolutional neural network-based method that handles activity level measurement and weight assignment in IVIF as a whole to overcome the difficulty of manual design. To extract more useful features from the source images, Li and Wu [15] presented a novel encoding network combining convolutional layers, a fusion layer, and a dense block in which the output of each layer is connected to every other layer. With the development of generative adversarial networks (GANs), more and more GAN-based IVIF methods have appeared. Although CNNs have made great achievements in supervised learning, they have progressed less in unsupervised learning. To fill this gap, Ma et al. [30] established an adversarial game between a generator and a discriminator, which enables the final fused image to simultaneously keep the thermal radiation of the infrared image and the textures of the visible image. Meanwhile, generic image fusion frameworks have also achieved surprising performance. Li et al. [31] proposed a meta learning-based deep framework for the fusion of infrared and visible images that can accept source images of different resolutions and generate a fused image of arbitrary resolution with a single learned model. Zhang and Ma [32] proposed a squeeze-and-decomposition network named SDNet to realize multi-modal and digital photography image fusion in real time. Xu et al. [33] used feature extraction and information measurement to automatically estimate the importance of the corresponding source images and derive adaptive information preservation degrees, solving different fusion problems. Recently, Tang et al. [34] and Liu et al. [35] bridged the gap between image fusion and high-level vision tasks, facilitating the latter with their proposed frameworks.
Although existing approaches devise complicated fusion rules and loss functions to achieve effective fusion, they still fail to effectively learn the characteristics of infrared and visible images because they do not devise a specific task to explore the intrinsic features of the source images. It is a consensus that feature extraction is a pivotal step in image fusion: if the extracted features cannot comprehensively represent the rich characteristics of the source images, the quality of the fusion results will inevitably suffer. In contrast, we design a self-supervised mask reconstruction task to deeply excavate the intrinsic characteristics of infrared and visible images so that our network is able to achieve high-quality IVIF.
The Transformer was first proposed by Vaswani et al. [36] for machine translation. Its ability to extract features at the global level and effectively depict the correlations between features at different locations has attracted wide attention in the community. Researchers have since achieved great success in introducing the Transformer into computer vision tasks such as image processing [37], object detection [38], and semantic segmentation [39]. Dosovitskiy et al. [40] introduced the Transformer to image classification for the first time, proposing the Vision Transformer (ViT), and a series of ViT variants were subsequently proposed to improve its performance [41,42]. Particularly, in the field of image fusion, Vs et al. [43] proposed a Transformer-based IVIF method that uses a Transformer encoder to extract image features, obtains the fused features with a Spatial-Transformer, and finally reconstructs the fused image through a Transformer decoder.
Since Transformer has a stronger ability to model long-range dependencies, it is suitable to extract global image features. In contrast, CNNs are apt to capture local image features and describe low-level visual features such as structure and texture details, for they extract image features through convolutional kernels, whose receptive fields are limited. To integrate their advantages, we design an encoder that combines a CNN-Block with a Transformer-Block, which enables the network to utilize both local and global information during feature extraction.
As mentioned above, the lack of ground-truth fused images remains a tricky problem in IVIF tasks. Therefore, we propose CTFusion to achieve fusion in a self-supervised way, with an elaborated pretext reconstruction task aimed at image understanding. More specifically, a CNN-Transformer-based encoder is devised to compensate for the difficulty of establishing long-range dependencies in CNN-based architectures. After the encoder parameters are optimized, we fix them during the fusion phase and then fuse the extracted feature maps through the tailored fusion network S-CPFF to generate the fused image.
The pipeline of the encoder training process is shown in Figure 1(a). We use the network to perform a self-supervised image reconstruction task, i.e., to reconstruct the original image from the masked input image, which enables the encoders to extract the intrinsic features lying in the source images. Concretely, given an original image $I_{in}\ (in \in \{ir, vis\}) \in \mathbb{R}^{H \times W \times 3}$, the masked image $\tilde{I}_{in}$ is generated by masking several non-overlapping patches with noise. We then feed the masked images into the respective encoders $E_{ir}$ and $E_{vis}$ to obtain the corresponding embeddings; each encoder consists of a CNN-Transformer feature extraction (CTFE) module and a feature enhancement (FE) module. The CTFE module integrates the advantages of the CNN and the Transformer to model both global and local dependencies, while FE aggregates and enhances the features extracted from the CNN-Block and the Transformer-Block. Finally, the image features extracted by the encoders are sent to the respective decoders $D_{ir}$ and $D_{vis}$ to reconstruct the images $I_{recon\_ir}, I_{recon\_vis} \in \mathbb{R}^{H \times W \times 3}$.
After training the encoders, we use them for image fusion with their parameters fixed, as shown in Figure 1(b). Specifically, the two source images $I_{ir}$ and $I_{vis}$ are first input to the trained encoders $E_{ir}$ and $E_{vis}$ to extract features, and the fusion result is then obtained by fusing the extracted features with the well-designed fusion network S-CPFF.
In general, the goal of image fusion is to integrate complementary information from different source images into a synthetic image. Moreover, feature dependency excavation is also key in image fusion, since relation understanding is important in feature extraction, and features with rich semantic and structural information are ideal for obtaining high-quality fusion results. We divide the input image into non-overlapping patches and then use a random mask $M$ and Gaussian noise $n$ to force the encoders to excavate the intrinsic information lying in the source images:
$$\tilde{I}_{in} = M(I_{in}) + \overline{M}(n), \quad in \in \{ir, vis\}, \tag{3.1}$$

where $\overline{M}(\cdot)$ is the logical negation of the mask $M$.
For each source image pair $I_{ir}$ and $I_{vis}$, the two images share part of the scene information while each retains some unique information. By randomly masking patches and filling them with random noise, the encoders are forced to extract more information from the source images and to better understand the relations between pixels. After pre-training, the encoders are able to extract more comprehensive features, which can be directly used for the subsequent image fusion task.
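A minimal PyTorch sketch of the masking operation in Eq. (3.1) is given below; the patch size and mask ratio are illustrative assumptions rather than values specified in this paper.

```python
import torch

def mask_with_noise(img, patch=16, mask_ratio=0.5):
    """Corrupt `img` following Eq. (3.1): keep pixels where the random mask M is true
    and replace the masked patches with Gaussian noise n.
    `patch` and `mask_ratio` are illustrative choices, not values from the paper."""
    _, h, w = img.shape                                  # img: (C, H, W), H and W divisible by `patch`
    keep = torch.rand(h // patch, w // patch) > mask_ratio            # patch-level mask M
    m = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)  # pixel-level mask (H, W)
    noise = torch.randn_like(img)                        # Gaussian noise n
    return torch.where(m, img, noise)                    # M(I) + M_bar(n)

# Example: corrupt an infrared patch before feeding it to the encoder E_ir
i_ir = torch.rand(3, 256, 256)
i_ir_masked = mask_with_noise(i_ir)
```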
Given the source images $I_{ir}$ and $I_{vis}$, we first randomly mask subregions and fill them with noise to form $\tilde{I}_{ir}$ and $\tilde{I}_{vis}$, which are then sent to the CNN-Transformer-based encoders $E_{ir}$ and $E_{vis}$. Each encoder contains a CTFE module and an FE module, whose detailed architectures are shown in Figure 2.
Given that CNNs are adept at modeling local dependencies in images, while the Transformer specializes in modeling global dependencies, we propose CTFE, which combines the CNN and Transformer architectures to model both local and global dependencies. Specifically, the CNN-Block contains a residual dense block following the residual dense network [44]. As for the Transformer-Block, the masked image $\tilde{I}_{in}\ (in \in \{ir, vis\}) \in \mathbb{R}^{H \times W \times 3}$ is first divided into a grid of $\frac{H}{P} \times \frac{W}{P}$ patches, i.e., $N = \frac{HW}{P^2}$ patches in total, where $P$ is the patch size. Passing the patches through a linear patch-embedding projection and $L$ Transformer layers, we obtain the Transformer-embedded feature $f^{tf}_{in}\ (in \in \{ir, vis\})$. Figure 2 illustrates the architecture of one Transformer layer, which consists of a multi-head self-attention (MSA) block and a multi-layer perceptron (MLP) block, where layer normalization (LN) is applied before every block and residual connections are applied after every block. The MLP block consists of two linear layers with a Gaussian Error Linear Unit (GELU) activation function. To better integrate the local and global features extracted by the CNN and Transformer blocks, we devise the FE module to aggregate and enhance the feature maps $f^{cnn}_{in}$ and $f^{tf}_{in}\ (in \in \{ir, vis\})$. Concretely, we concatenate the two feature maps from the CNN-Block and the Transformer-Block in CTFE and send them into four sequentially connected ConvBlock layers to achieve FE, as shown in Figure 2:
$$f^{en}_{in} = \big(\mathrm{ConvBlock}([f^{cnn}_{in}, f^{tf}_{in}])\big) \times 4, \quad in \in \{ir, vis\}, \tag{3.2}$$

where each ConvBlock consists of two convolutional layers with a kernel size of $3 \times 3$, a padding of 1, and two Rectified Linear Unit (ReLU) activation layers, and $[\cdot]$ denotes channel-wise concatenation.
We then feed the obtained feature maps $f^{en}_{ir}$ and $f^{en}_{vis}$ to the decoders $D_{ir}$ and $D_{vis}$, each composed of two convolutional layers with a kernel size of $3 \times 3$, a padding of 1, and one ReLU activation layer, to reconstruct the corresponding image.
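A condensed PyTorch sketch of one such encoder is shown below. It follows the described structure (a CNN branch, a ViT-style Transformer branch over patch embeddings with pre-norm MSA/MLP and GELU, and an FE stage of four ConvBlocks as in Eq. (3.2)), but the channel width, patch size, and Transformer depth are assumptions, and the residual dense block is replaced by a plain convolutional stand-in.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    # Two 3x3 convolutions, each followed by ReLU, as described for the FE module
    def __init__(self, c_in, c_out):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.body(x)

class CTFEEncoder(nn.Module):
    """Sketch of one encoder: a CNN branch plus a ViT-style Transformer branch (CTFE),
    followed by feature enhancement (FE) with four ConvBlocks (Eq. 3.2).
    Channel width, patch size, and depth are illustrative assumptions."""
    def __init__(self, c=64, patch=16, depth=4, heads=4):
        super().__init__()
        self.cnn = nn.Sequential(                      # stand-in for the residual dense block [44]
            nn.Conv2d(3, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))
        self.embed = nn.Conv2d(3, c, patch, stride=patch)         # patch embedding projection
        layer = nn.TransformerEncoderLayer(c, heads, dim_feedforward=4 * c,
                                           activation="gelu", batch_first=True,
                                           norm_first=True)       # LN before MSA / MLP
        self.transformer = nn.TransformerEncoder(layer, depth)    # L Transformer layers
        self.fe = nn.Sequential(*[ConvBlock(2 * c if i == 0 else c, c) for i in range(4)])

    def forward(self, x):                              # x: (B, 3, H, W), H and W divisible by patch
        f_cnn = self.cnn(x)                            # local features
        tokens = self.embed(x).flatten(2).transpose(1, 2)          # (B, N, c), N = HW / P^2
        f_tf = self.transformer(tokens)                # global features
        b, n, c = f_tf.shape
        side = int(n ** 0.5)                           # assumes a square patch grid
        f_tf = f_tf.transpose(1, 2).reshape(b, c, side, side)
        f_tf = nn.functional.interpolate(f_tf, size=f_cnn.shape[-2:], mode="bilinear",
                                         align_corners=False)     # back to image resolution
        return self.fe(torch.cat([f_cnn, f_tf], dim=1))            # FE, Eq. (3.2)
```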
In the mask reconstruction task, we encourage the network to not only learn the pixel-level image reconstruction but also capture the structural and gradient information in the image. The loss of the reconstruction task in each branch can be formalized as follows:
$$\ell_{reconstruction} = \ell_{pixel} + \lambda_1 \ell_{structure} + \lambda_2 \ell_{TV}, \tag{3.3}$$

where $\ell_{pixel}$ is the L1 loss, $\ell_{structure}$ is the structural similarity (SSIM) loss, and $\ell_{TV}$ is the total variation loss. $\lambda_1$ and $\lambda_2$ are two hyperparameters, both empirically set to 20.
$\ell_{pixel}$ ensures pixel-level reconstruction:

$$\ell_{pixel} = \|I_{recon\_in} - I_{in}\|_1, \quad in \in \{ir, vis\}, \tag{3.4}$$

where $I_{recon\_in}$ is the reconstructed output image and $I_{in}$ is the input unmasked source image.
To better help the model learn structural information from images, we use the structure loss:
$$\ell_{structure} = 1 - \mathrm{SSIM}(I_{recon\_in}, I_{in}), \quad in \in \{ir, vis\}. \tag{3.5}$$
Furthermore, the total variation loss $\ell_{TV}$ from VIF-Net [45] is used to facilitate gradient preservation in the source images and eliminate noise. It is formulated as follows:

$$\ell_{TV} = \sum_{x,y} \big( \|R(x, y+1) - R(x, y)\|_2 + \|R(x+1, y) - R(x, y)\|_2 \big), \tag{3.6}$$

where $R(x, y) = I_{recon\_in}(x, y) - I_{in}(x, y)\ (in \in \{ir, vis\})$ denotes the difference between the reconstructed image and the input image, $\|\cdot\|_2$ is the L2 norm, and $x$ and $y$ are the horizontal and vertical coordinates of the image's pixels, respectively.
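A compact sketch of this reconstruction loss is given below, assuming images normalized to [0, 1] and an external differentiable SSIM implementation (e.g., the pytorch-msssim package); the TV term is averaged over pixels here, a common normalization of the summed form in Eq. (3.6).

```python
import torch.nn.functional as F
from pytorch_msssim import ssim   # any differentiable SSIM implementation works here

def reconstruction_loss(recon, target, lam1=20.0, lam2=20.0):
    """Eq. (3.3): L1 pixel term + (1 - SSIM) structure term + TV term on the residual.
    lam1 = lam2 = 20 as stated in the paper."""
    l_pixel = F.l1_loss(recon, target)                          # Eq. (3.4)
    l_structure = 1.0 - ssim(recon, target, data_range=1.0)     # Eq. (3.5), images in [0, 1]
    r = recon - target                                          # residual R in Eq. (3.6)
    l_tv = ((r[:, :, :, 1:] - r[:, :, :, :-1]).abs().mean() +   # horizontal differences
            (r[:, :, 1:, :] - r[:, :, :-1, :]).abs().mean())    # vertical differences
    return l_pixel + lam1 * l_structure + lam2 * l_tv
```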
How to effectively fuse feature maps to obtain the final fusion result remains challenging. Here, we propose S-CPFF, which highlights both the self- and mutually-attended parts of the feature maps, naturally improving the quality of the fused images. Concretely, as shown in Figure 3, S-CPFF has two self-attention (SA) and two cross-attention (CA) modules, whose detailed structures are shown in Figure 4. Mathematically, the SA process is denoted as:
$$f_{ir} = \mathrm{softmax}\!\left(\frac{Q_{ir} K_{ir}^{T}}{\sqrt{d}}\right) V_{ir}, \qquad f_{vis} = \mathrm{softmax}\!\left(\frac{Q_{vis} K_{vis}^{T}}{\sqrt{d}}\right) V_{vis}, \tag{3.7}$$

where $Q_{ir}, Q_{vis} \in \mathbb{R}^{H \times W \times C}$, $K_{ir}, K_{vis} \in \mathbb{R}^{H \times W \times C}$, and $V_{ir}, V_{vis} \in \mathbb{R}^{H \times W \times C}$ are obtained by passing $f^{en}_{ir}$ and $f^{en}_{vis}$ through $1 \times 1$ convolutions, respectively, $\sqrt{d}$ is a normalization factor, and $T$ denotes the transpose operation. The CA process is:
$$f_{ir \to vis} = \mathrm{softmax}\!\left(\frac{Q_{vis} K_{ir}^{T}}{\sqrt{d}}\right) V_{ir}, \qquad f_{vis \to ir} = \mathrm{softmax}\!\left(\frac{Q_{ir} K_{vis}^{T}}{\sqrt{d}}\right) V_{vis}, \tag{3.8}$$

where $vis \to ir$ denotes information flow from the visible modality to the infrared modality (and analogously for $ir \to vis$). Then, we concatenate the outputs of the respective SA and CA modules to obtain the self-cross perceptual features:
$$f_{ir}' = [f_{ir}, f_{vis \to ir}], \qquad f_{vis}' = [f_{vis}, f_{ir \to vis}]. \tag{3.9}$$
Note that in the fusion phase, we fix the parameters of the encoders $E_{ir}$ and $E_{vis}$ learned in the mask reconstruction task. We then concatenate $f_{ir}'$ and $f_{vis}'$ and feed the result to a $3 \times 3$ convolution to integrate all the features. The integrated feature is then sent to the decoder $D_{fuse}$ to obtain the fused image:

$$I_{fuse} = D_{fuse}\big(\mathrm{Conv}_{3 \times 3}([f_{ir}', f_{vis}'])\big), \tag{3.10}$$

where the detailed structure of $D_{fuse}$ is shown in Figure 1.
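The sketch below outlines one possible PyTorch realization of S-CPFF, with the four attention paths of Eqs. (3.7) and (3.8), the concatenations of Eq. (3.9), and the 3×3 convolution preceding $D_{fuse}$ in Eq. (3.10). The channel width is an assumption, the attention runs over flattened spatial positions with $d$ set to the channel dimension, and the real model may operate on downsampled features to keep the attention maps tractable.

```python
import torch
import torch.nn as nn

class SCPFF(nn.Module):
    """Sketch of self-cross perceptual feature fusion (Eqs. 3.7-3.10).
    Q, K, V come from 1x1 convolutions; attention runs over flattened spatial
    positions with d set to the channel count. The channel width is an assumption."""
    def __init__(self, c=64):
        super().__init__()
        self.q_ir, self.k_ir, self.v_ir = (nn.Conv2d(c, c, 1) for _ in range(3))
        self.q_vis, self.k_vis, self.v_vis = (nn.Conv2d(c, c, 1) for _ in range(3))
        self.fuse_conv = nn.Conv2d(4 * c, c, 3, padding=1)   # 3x3 conv after concatenation

    @staticmethod
    def _attend(q, k, v):
        b, c, h, w = q.shape
        q, k, v = (t.flatten(2).transpose(1, 2) for t in (q, k, v))    # (B, HW, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        return (attn @ v).transpose(1, 2).reshape(b, c, h, w)

    def forward(self, f_ir, f_vis):
        sa_ir = self._attend(self.q_ir(f_ir), self.k_ir(f_ir), self.v_ir(f_ir))        # Eq. (3.7)
        sa_vis = self._attend(self.q_vis(f_vis), self.k_vis(f_vis), self.v_vis(f_vis))
        ca_ir2vis = self._attend(self.q_vis(f_vis), self.k_ir(f_ir), self.v_ir(f_ir))  # Eq. (3.8)
        ca_vis2ir = self._attend(self.q_ir(f_ir), self.k_vis(f_vis), self.v_vis(f_vis))
        f_ir_p = torch.cat([sa_ir, ca_vis2ir], dim=1)         # Eq. (3.9)
        f_vis_p = torch.cat([sa_vis, ca_ir2vis], dim=1)
        return self.fuse_conv(torch.cat([f_ir_p, f_vis_p], dim=1))   # then fed to D_fuse, Eq. (3.10)
```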
To retain rich edge and texture information in the fused image, we adopt a joint gradient loss $\ell_{JGrad}$, which is formulated as

$$\ell_{JGrad} = \big\| O\big(\max(|\nabla I_{ir}|, |\nabla I_{vis}|)\big) - \nabla I_{fused} \big\|_1, \tag{3.11}$$

where $\nabla$ is the Laplacian gradient operator, $\max(\cdot)$ denotes taking the maximum value, and $O(|x|) = x$ denotes restoring the original (signed) gradient value before its absolute value was taken.
We also introduce an intensity loss to preserve the salient targets of the two input images, which can be expressed as

$$\omega_{ir} = \frac{S_{I_{ir}}}{S_{I_{ir}} + S_{I_{vis}}}, \quad \omega_{vis} = 1 - \omega_{ir}, \quad \ell_{int} = \big\| (\omega_{ir} \odot I_{ir} + \omega_{vis} \odot I_{vis}) - I_{fused} \big\|_1, \tag{3.12}$$

where $S_{I_{ir}}$ and $S_{I_{vis}}$ denote the saliency matrices of $I_{ir}$ and $I_{vis}$, which can be computed according to [46], $\omega_{ir}$ and $\omega_{vis}$ are the weight maps for $I_{ir}$ and $I_{vis}$, respectively, and $\odot$ represents element-wise multiplication.
The overall fusion loss is computed by

$$\ell_{fuse} = \ell_{int} + \lambda_{JG}\, \ell_{JGrad}, \tag{3.13}$$

where $\lambda_{JG}$ is a hyperparameter set to 20.
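A hedged sketch of this fusion loss follows; the saliency maps are assumed to be precomputed according to [46], the $\|\cdot\|_1$ terms are averaged rather than summed, and how $O(\cdot)$ restores the sign of the dominant gradient is an implementation choice (here the signed gradient with the larger magnitude is kept).

```python
import torch
import torch.nn.functional as F

LAPLACIAN = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian(img):
    # Laplacian gradient of a single-channel image batch (B, 1, H, W)
    return F.conv2d(img, LAPLACIAN.to(img.device), padding=1)

def fusion_loss(i_fused, i_ir, i_vis, s_ir, s_vis, lam_jg=20.0):
    """Eqs. (3.11)-(3.13); s_ir and s_vis are precomputed saliency maps [46]."""
    g_ir, g_vis = laplacian(i_ir), laplacian(i_vis)
    joint = torch.where(g_ir.abs() >= g_vis.abs(), g_ir, g_vis)   # O(max(|grad I_ir|, |grad I_vis|))
    l_jgrad = F.l1_loss(laplacian(i_fused), joint)                # Eq. (3.11)
    w_ir = s_ir / (s_ir + s_vis + 1e-8)                           # Eq. (3.12): weights sum to one
    w_vis = 1.0 - w_ir
    l_int = F.l1_loss(i_fused, w_ir * i_ir + w_vis * i_vis)
    return l_int + lam_jg * l_jgrad                               # Eq. (3.13)
```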
In this section, we evaluate the qualitative and quantitative performance of our proposed method by comparing it to five SOTA methods: IFCNN [47], PMGI [48], CrossFuse [49], RFN-Nest [50], and FusionGAN [30]. We also conduct several ablation studies to validate the effectiveness of the proposed modules.
Dataset: First, 300 multi-modality images from the M3FD [35] benchmark are selected and randomly cropped into 360k patches of $256 \times 256$ pixels, which are augmented and used as the training set in this study. M3FD is a multi-modal dataset covering multiple scenarios, in which 4,200 aligned image pairs are divided into four typical types, i.e., Day, Cloudy, Night, and Challenge. We perform qualitative and quantitative experiments on three datasets (i.e., RoadScene, TNO, and MSRS). RoadScene is a widely used dataset for cross-modality image fusion. The TNO dataset contains multi-spectral nighttime imagery of various military-relevant scenarios in grayscale. The MSRS dataset contains 1444 pairs of high-quality aligned infrared and visible images.
Evaluation metrics: For quantitative evaluation, five statistical metrics are selected to objectively assess the fusion performance: the correlation coefficient (CC) [51], cross entropy (CE), QCV [52], the sum of correlations of differences (SCD) [53], and structural similarity (SSIM) [54]. CC evaluates the degree of linear correlation between the fused image and the source images. CE reflects the difference in grayscale information between the fused image and the source images; the smaller the CE value, the smaller the difference between the images and the better the fusion quality. QCV uses the Sobel operator to extract edge information from the source images and the fusion result to obtain an edge intensity map $G$; a smaller QCV is more in line with human visual perception. SCD reflects how well the information transmitted to the fused image correlates with the corresponding source images. SSIM approximates perceived image distortion. In addition, larger CC, SCD, and SSIM values indicate better fusion performance.
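For reference, a small NumPy sketch of two of these metrics (CC and SCD) as they are commonly defined is given below; the exact implementations used for evaluation follow the cited references, and the inputs are assumed to be float arrays.

```python
import numpy as np

def corr(a, b):
    # Pearson correlation coefficient between two images (float arrays)
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a, b = a - a.mean(), b - b.mean()
    return float((a * b).sum() / (np.sqrt((a ** 2).sum() * (b ** 2).sum()) + 1e-12))

def cc_metric(fused, ir, vis):
    # CC [51]: average correlation between the fused image and the two source images
    return 0.5 * (corr(fused, ir) + corr(fused, vis))

def scd_metric(fused, ir, vis):
    # SCD [53]: correlations of the difference images with the complementary source
    return corr(fused - vis, ir) + corr(fused - ir, vis)
```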
Implementation details: The Adam optimizer [55] ($\beta_1 = 0.9$, $\beta_2 = 0.999$) is used to update the network parameters, with an initial learning rate of 0.001 that decreases to $10^{-4}$ after 100 epochs. The self-supervised mask reconstruction task and the training of S-CPFF both run for 300 epochs with a batch size of 4. Our framework is implemented in PyTorch on an NVIDIA 3090 GPU. Note that the source images in all the abovementioned datasets are converted to grayscale before fusion.
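The optimizer and schedule described above can be reproduced with standard PyTorch utilities, as in the self-contained sketch below; the model and data are placeholders (a toy network and random patches), not the actual CTFusion modules.

```python
import torch
import torch.nn as nn

# Placeholder model and data; only the optimizer/schedule mirrors the described setup.
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[100], gamma=0.1)  # 1e-3 -> 1e-4

for epoch in range(300):                               # 300 epochs, batch size 4
    batch = torch.rand(4, 1, 256, 256)                 # stands in for 256x256 grayscale patches
    optimizer.zero_grad()
    loss = nn.functional.l1_loss(model(batch), batch)  # stands in for the reconstruction / fusion loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```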
Qualitative results on the RoadScene benchmark are reported in Figure 5, where we highlight two regions in each example. As can be seen, RFN-Nest and FusionGAN suffer from blurred edges and backgrounds, and IFCNN, PMGI, and CrossFuse all lose texture details to some extent. In contrast, our method achieves the best image contrast and clear structural information.
Quantitative comparisons are shown in Table 1, where we use five metrics, i.e., CC, CE, QCV, SCD, and SSIM, to evaluate all comparison methods. Our method ranks first in CC, CE, SCD, and SSIM, indicating that the generated fusion results are more similar to the source images. For QCV, our method also achieves comparable results, which implies that our fusion images are more natural. Our model uses the Laplacian gradient operator to detect image edges, while QCV uses the Sobel gradient operator, which might be the reason why our method performs suboptimally in QCV.
Table 1. Quantitative comparison of different methods on the RoadScene dataset.

| Methods | CC | CE | QCV | SCD | SSIM |
| --- | --- | --- | --- | --- | --- |
| IFCNN [47] | 0.622 | 0.976 | 589.5 | 1.245 | 0.693 |
| PMGI [48] | 0.596 | 1.328 | 1019.6 | 1.218 | 0.644 |
| CrossFuse [49] | 0.614 | 1.397 | 943.3 | 1.397 | 0.687 |
| RFN-Nest [50] | 0.582 | 0.922 | 983.2 | 1.373 | 0.603 |
| FusionGAN [30] | 0.577 | 2.308 | 1371.2 | 0.889 | 0.615 |
| Ours | 0.641 | 0.763 | 820.9 | 1.447 | 0.702 |
We select five pairs of infrared and visible images to visually assess the fusion performance of different algorithms on the TNO dataset; the visualized results are shown in Figure 6. As shown in the first column of Figure 6, our method keeps the best image contrast. It can be seen from the zoomed-in areas that our method combines the complementary as well as the modality-common information of the source images to the greatest extent. Meanwhile, the edges of salient targets from the infrared images are clear, and the texture details from the visible images are well preserved in our fusion results.
For example, in the second row of Figure 6, though the image contrast of CrossFuse is better than ours, it still loses cloud information. In the first and third rows, people in the foreground are bright in our fused results, while FusionGAN has blurred target edges.
The quantitative performance of our method on the TNO dataset is similar to that on RoadScene, as shown in Table 2.
Table 2. Quantitative comparison of different methods on the TNO dataset.

| Methods | CC | CE | QCV | SCD | SSIM |
| --- | --- | --- | --- | --- | --- |
| IFCNN [47] | 0.643 | 1.734 | 392.5 | 1.215 | 0.695 |
| PMGI [48] | 0.651 | 1.781 | 496.1 | 1.222 | 0.676 |
| CrossFuse [49] | 0.669 | 1.694 | 943.3 | 1.349 | 0.682 |
| RFN-Nest [50] | 0.622 | 1.792 | 533.6 | 1.361 | 0.689 |
| FusionGAN [30] | 0.554 | 2.380 | 968.7 | 1.451 | 0.628 |
| Ours | 0.675 | 1.499 | 433.6 | 1.476 | 0.701 |
To visually evaluate the fusion performance of different algorithms on the MSRS dataset, three pairs of infrared and visible images are selected, as depicted in Figure 7. As illustrated by the red and green boxes in the images, our proposed method favorably maintains the textures of the source images while keeping the clearest salient edges.
We conduct quantitative comparisons on 40 image pairs from the MSRS dataset to verify the effectiveness of our method, as presented in Table 3. It can be seen that our method ranks first in four metrics and second in CC. The CE, QCV, SCD, and SSIM metrics demonstrate that our results contain more realistic information. As for CC, it directly matches images by their intensity without any analysis of image structure; hence, CC is sensitive to intensity changes in the image. In general, image noise, changes in lighting intensity during imaging, and the use of different imaging equipment all cause changes in image intensity, which further affect CC.
Table 3. Quantitative comparison of different methods on the MSRS dataset.

| Methods | CC | CE | QCV | SCD | SSIM |
| --- | --- | --- | --- | --- | --- |
| IFCNN [47] | 0.551 | 0.936 | 873.9 | 1.219 | 0.657 |
| PMGI [48] | 0.488 | 1.127 | 1292.3 | 1.336 | 0.628 |
| CrossFuse [49] | 0.527 | 1.248 | 865.7 | 1.245 | 0.663 |
| RFN-Nest [50] | 0.495 | 0.891 | 823.2 | 1.328 | 0.614 |
| FusionGAN [30] | 0.463 | 2.185 | 1587.8 | 0.713 | 0.605 |
| Ours | 0.537 | 0.766 | 743.5 | 1.423 | 0.675 |
In conclusion, our method is fully capable of excavating the inherent important features of the source images and integrating them into the fused images. Therefore, our method is superior to the other SOTA approaches and obtains high-quality fused images.
To validate the generalization ability of our method, we conduct experiments on datasets for other image fusion tasks, including LLVIP [57] for color image fusion and CT-MRI [58] for medical image fusion. The fusion results are shown in Figure 8. From the qualitative results, we can see that our proposed model handles these other fusion tasks well, which strongly demonstrates its generalization ability.
As shown in Table 4, a complexity evaluation is introduced to assess the efficiency of our method from two aspects, i.e., the number of training parameters and the runtime. It is worth pointing out that although our method does not rank best in model complexity and inference time, owing to the elaborate design of its modules, the proposed CTFusion remains comparable to the best SOTA methods. This indicates the efficiency of our CTFusion, which can serve practical vision tasks well while offering better visual performance.
Table 4. Complexity comparison: model size (millions of parameters) and runtime (seconds).

| Methods | IFCNN [47] | PMGI [48] | CrossFuse [49] | RFN-Nest [50] | FusionGAN [30] | Ours |
| --- | --- | --- | --- | --- | --- | --- |
| SIZE (M) | 0.084 | 0.042 | 1.161 | 30.097 | 0.926 | 2.119 |
| TIME (s) | 0.013 | 0.052 | 1.076 | 0.358 | 1.179 | 0.019 |
In the ablation study, we demonstrate the effectiveness of the self-supervised mask reconstruction task, CNN-Transformer-based encoder, and the proposed S-CPFF. The experimental results are shown in Figure 9 and Table 5.
Table 5. Ablation study results.

| Configuration | CC | CE | QCV | SCD | SSIM |
| --- | --- | --- | --- | --- | --- |
| w/o mask | 0.653 | 1.571 | 485.9 | 1.427 | 0.626 |
| w/o Transformer-Block | 0.669 | 1.534 | 499.5 | 1.413 | 0.651 |
| w/o S-CPFF | 0.655 | 1.522 | 479.8 | 1.388 | 0.665 |
| Ours | 0.675 | 1.499 | 433.6 | 1.476 | 0.701 |
First, we remove the mask reconstruction pretext task and simply train a complete encoder-decoder framework. The results show that our proposed self-supervised mask reconstruction task improves the framework's ability to excavate intrinsic information. To verify the effectiveness of the CNN-Transformer-based encoder, we conduct an ablation study in which the encoders contain only the CNN-Block. The results show that adding the Transformer-Block to the encoders always improves the fusion performance, regardless of whether the proposed self-supervised mask reconstruction task is used. To further prove that our proposed S-CPFF is effective, we replace the fusion network with a simple feature concatenation operation. The ablation results also show that S-CPFF highlights the salient regions of the source images and further enhances the texture details.
The source images in this paper are all registered before fusion, which is a common data preprocessing step in IVIF. However, in practical scenarios, although source images can be aligned to a certain extent by carefully adjusting the installation positions of the infrared and visible light sensors, it remains impossible to achieve accurate alignment by manual installation alone [16,59]. In other words, images captured by different sensors are difficult to align strictly at the pixel level. In the future, we will focus our research on misaligned IVIF. Since the source images are misaligned and of different modalities, we need to reduce the modality discrepancy between them so that feature alignment can be achieved more easily. Once the features are aligned, the fusion process will no longer be a problem.
In this paper, we present CTFusion, a CNN-Transformer-based IVIF framework using self-supervised mask reconstruction. The CNN-Transformer-based encoder integrates the advantages of both the CNN and the Transformer so that the network can focus on both local and global information and better understand the dependencies in images. In addition, the designed mask reconstruction task is naturally adapted to the intrinsic information excavation required in IVIF. Extensive experiments on three infrared-visible image datasets demonstrate the effectiveness of the proposed method.
The authors declare they have not used artificial intelligence (AI) tools in the creation of this article. All authors reviewed the manuscript.
This work was supported by the 2023 Opening Research Fund of Yunnan Key Laboratory of Digital Communications (YNJTKFB-20230686, YNKLDC-KFKT-202301).
The authors declare there is no conflict of interest.
[1] Y. Liu, X. Chen, Z. Wang, Z. Wang, R. K. Ward, X. Wang, Deep learning for pixel-level image fusion: Recent advances and future prospects, Inform. Fusion, 42 (2018), 158–173. https://doi.org/10.1016/j.inffus.2017.10.007
[2] H. Zhang, H. Xu, X. Tian, J. Jiang, J. Ma, Image fusion meets deep learning: A survey and perspective, Inform. Fusion, 76 (2021), 323–336. https://doi.org/10.1016/j.inffus.2021.06.008
[3] J. Ma, Y. Ma, C. Li, Infrared and visible image fusion methods and applications: A survey, Inform. Fusion, 45 (2019), 153–178. https://doi.org/10.1016/j.inffus.2018.02.004
[4] C. Yang, J. Zhang, X. Wang, X. Liu, A novel similarity based quality metric for image fusion, Inform. Fusion, 9 (2008), 156–160. https://doi.org/10.1016/j.inffus.2006.09.001
[5] X. Zhang, P. Ye, G. Xiao, VIFB: a visible and infrared image fusion benchmark, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), (2020), 468–478. https://doi.org/10.1109/CVPRW50498.2020.00060
[6] L. J. Chipman, T. M. Orr, L. N. Graham, Wavelets and image fusion, in International Conference on Image Processing, (1995), 248–251. https://doi.org/10.1109/ICIP.1995.537627
[7] A. V. Vanmali, V. M. Gadre, Visible and NIR image fusion using weight-map-guided Laplacian-Gaussian pyramid for improving scene visibility, Sādhanā, 42 (2017), 1063–1082. https://doi.org/10.1007/s12046-017-0673-1
[8] L. Sun, Y. Li, M. Zheng, Z. Zhong, Y. Zhang, MCnet: Multiscale visible image and infrared image fusion network, Signal Process., 208 (2023), 108996. https://doi.org/10.1016/j.sigpro.2023.108996
[9] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 770–778. https://doi.org/10.1109/CVPR.2016.90
[10] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), (2016), 779–788. https://doi.org/10.1109/CVPR.2016.91
[11] Y. Zhang, Y. Wang, H. Li, S. Li, Cross-compatible embedding and semantic consistent feature construction for sketch re-identification, in Proceedings of the 30th ACM International Conference on Multimedia, (2022), 3347–3355. https://doi.org/10.1145/3503161.3548224
[12] H. Li, N. Dong, Z. Yu, D. Tao, G. Qi, Triple adversarial learning and multi-view imaginative reasoning for unsupervised domain adaptation person re-identification, IEEE Trans. Circuits Syst. Video Technol., 32 (2022), 2814–2830. https://doi.org/10.1109/TCSVT.2021.3099943
[13] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015, 9351 (2015), 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
[14] L. Tang, H. Huang, Y. Zhang, G. Qi, Z. Yu, Structure-embedded ghosting artifact suppression network for high dynamic range image reconstruction, Knowl.-Based Syst., 263 (2023), 110278. https://doi.org/10.1016/j.knosys.2023.110278
[15] H. Li, X. Wu, DenseFuse: A fusion approach to infrared and visible images, IEEE Trans. Image Process., 28 (2019), 2614–2623. https://doi.org/10.1109/TIP.2018.2887342
[16] H. Li, J. Liu, Y. Zhang, Y. Liu, A deep learning framework for infrared and visible image fusion without strict registration, Int. J. Comput. Vision, (2024), 1625–1644. https://doi.org/10.1007/s11263-023-01948-x
[17] L. Qu, S. Liu, M. Wang, Z. Song, TransMEF: A transformer-based multi-exposure image fusion framework using self-supervised multi-task learning, preprint, arXiv: 2112.01030.
[18] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision learners, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), 15979–15988. https://doi.org/10.1109/CVPR52688.2022.01553
[19] S. Li, X. Kang, L. Fang, J. Hu, H. Yin, Pixel-level image fusion: A survey of the state of the art, Inform. Fusion, 33 (2017), 100–112. https://doi.org/10.1016/j.inffus.2016.05.004
[20] H. Li, X. Qi, W. Xie, Fast infrared and visible image fusion with structural decomposition, Knowl.-Based Syst., 204 (2020), 106182. https://doi.org/10.1016/j.knosys.2020.106182
[21] Q. Zhang, Y. Liu, R. S. Blum, J. Han, D. Tao, Sparse representation based multi-sensor image fusion for multi-focus and multi-modality images: A review, Inform. Fusion, 40 (2018), 57–75. https://doi.org/10.1016/j.inffus.2017.05.006
[22] M. Xie, J. Wang, Y. Zhang, A unified framework for damaged image fusion and completion based on low-rank and sparse decomposition, Signal Process.: Image Commun., 98 (2021), 116400. https://doi.org/10.1016/j.image.2021.116400
[23] H. Li, Y. Wang, Z. Yang, R. Wang, X. Li, D. Tao, Discriminative dictionary learning-based multiple component decomposition for detail-preserving noisy image fusion, IEEE Trans. Instrum. Meas., 69 (2020), 1082–1102. https://doi.org/10.1109/TIM.2019.2912239
[24] W. Xiao, Y. Zhang, H. Wang, F. Li, H. Jin, Heterogeneous knowledge distillation for simultaneous infrared-visible image fusion and super-resolution, IEEE Trans. Instrum. Meas., 71 (2022), 1–15. https://doi.org/10.1109/TIM.2022.3149101
[25] Y. Zhang, M. Yang, N. Li, Z. Yu, Analysis-synthesis dictionary pair learning and patch saliency measure for image fusion, Signal Process., 167 (2020), 107327. https://doi.org/10.1016/j.sigpro.2019.107327
[26] Y. Niu, S. Xu, L. Wu, W. Hu, Airborne infrared and visible image fusion for target perception based on target region segmentation and discrete wavelet transform, Math. Probl. Eng., 2012 (2012), 1–10. https://doi.org/10.1155/2012/275138
[27] D. M. Bulanon, T. F. Burks, V. Alchanatis, Image fusion of visible and thermal images for fruit detection, Biosyst. Eng., 103 (2009), 12–22. https://doi.org/10.1016/j.biosystemseng.2009.02.009
[28] M. Choi, R. Y. Kim, M. Nam, H. O. Kim, Fusion of multispectral and panchromatic satellite images using the curvelet transform, IEEE Geosci. Remote Sens. Lett., 2 (2005), 136–140. https://doi.org/10.1109/LGRS.2005.845313
[29] Y. Liu, X. Chen, J. Cheng, H. Peng, Z. Wang, Infrared and visible image fusion with convolutional neural networks, Int. J. Wavelets, Multiresolution Inf. Process., 16 (2018), 1850018. https://doi.org/10.1142/S0219691318500182
[30] J. Ma, W. Yu, P. Liang, C. Li, J. Jiang, FusionGAN: A generative adversarial network for infrared and visible image fusion, Inform. Fusion, 48 (2019), 11–26. https://doi.org/10.1016/j.inffus.2018.09.004
[31] H. Li, Y. Cen, Y. Liu, X. Chen, Z. Yu, Different input resolutions and arbitrary output resolution: A meta learning-based deep framework for infrared and visible image fusion, IEEE Trans. Image Process., 30 (2021), 4070–4083. https://doi.org/10.1109/TIP.2021.3069339
[32] H. Zhang, J. Ma, SDNet: A versatile squeeze-and-decomposition network for real-time image fusion, Int. J. Comput. Vision, 129 (2021), 2761–2785. https://doi.org/10.1007/s11263-021-01501-8
[33] H. Xu, J. Ma, J. Jiang, X. Guo, H. Ling, U2Fusion: A unified unsupervised image fusion network, IEEE Trans. Pattern Anal. Mach. Intell., 44 (2020), 502–518. https://doi.org/10.1109/TPAMI.2020.3012548
[34] L. Tang, J. Yuan, J. Ma, Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network, Inform. Fusion, 82 (2022), 28–42. https://doi.org/10.1016/j.inffus.2021.12.004
[35] J. Liu, X. Fan, Z. Huang, G. Wu, R. Liu, W. Zhong, et al., Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection, in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2022), 5792–5801. https://doi.org/10.1109/CVPR52688.2022.00571
[36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in Proceedings of the 31st International Conference on Neural Information Processing Systems, (2017), 6000–6010.
[37] H. Chen, Y. Wang, T. Guo, C. Xu, Y. Deng, Z. Liu, et al., Pre-trained image processing transformer, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2021), 12294–12305. https://doi.org/10.1109/CVPR46437.2021.01212
[38] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable DETR: Deformable transformers for end-to-end object detection, preprint, arXiv: 2010.04159.
[39] S. Zheng, J. Lu, H. Zhao, X. Zhu, Z. Luo, Y. Wang, et al., Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers, in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2021), 6877–6886. https://doi.org/10.1109/CVPR46437.2021.00681
[40] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, et al., An image is worth 16×16 words: Transformers for image recognition at scale, in International Conference on Learning Representations (ICLR 2021), (2021).
[41] K. Han, A. Xiao, E. Wu, J. Guo, C. Xu, Y. Wang, Transformer in transformer, in 35th Conference on Neural Information Processing Systems (NeurIPS 2021), (2021), 1–12.
[42] C. Chen, R. Panda, Q. Fan, RegionViT: Regional-to-local attention for vision transformers, preprint, arXiv: 2106.02689.
[43] V. Vs, J. M. J. Valanarasu, P. Oza, V. M. Patel, Image fusion transformer, in 2022 IEEE International Conference on Image Processing (ICIP), (2022), 3566–3570. https://doi.org/10.1109/ICIP46576.2022.9897280
[44] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, Y. Fu, Residual dense network for image restoration, IEEE Trans. Pattern Anal. Mach. Intell., 43 (2021), 2480–2495. https://doi.org/10.1109/TPAMI.2020.2968521
[45] R. Hou, D. Zhou, R. Nie, D. Liu, L. Xiong, Y. Guo, et al., VIF-Net: An unsupervised framework for infrared and visible image fusion, IEEE Trans. Comput. Imaging, 6 (2020), 640–651. https://doi.org/10.1109/TCI.2020.2965304
[46] J. Liu, Y. Wu, Z. Huang, R. Liu, X. Fan, SMoA: Searching a modality-oriented architecture for infrared and visible image fusion, IEEE Signal Process. Lett., 28 (2021), 1818–1822. https://doi.org/10.1109/LSP.2021.3109818
[47] Y. Zhang, Y. Liu, P. Sun, H. Yan, X. Zhao, L. Zhang, IFCNN: A general image fusion framework based on convolutional neural network, Inform. Fusion, 54 (2020), 99–118. https://doi.org/10.1016/j.inffus.2019.07.011
[48] H. Zhang, H. Xu, Y. Xiao, X. Guo, J. Ma, Rethinking the image fusion: A fast unified image fusion network based on proportional maintenance of gradient and intensity, in Proceedings of the AAAI Conference on Artificial Intelligence, 34 (2020), 12797–12804. https://doi.org/10.1609/aaai.v34i07.6975
[49] H. Li, X. Wu, CrossFuse: A novel cross attention mechanism based infrared and visible image fusion approach, Inform. Fusion, 103 (2024), 102147. https://doi.org/10.1016/j.inffus.2023.102147
[50] H. Li, X. Wu, J. Kittler, RFN-Nest: An end-to-end residual fusion network for infrared and visible images, Inform. Fusion, 73 (2021), 72–86. https://doi.org/10.1016/j.inffus.2021.02.023
[51] M. Deshmukh, U. Bhosale, Image fusion and image quality assessment of fused images, Int. J. Image Process., 4 (2010), 484–508.
[52] H. Chen, P. K. Varshney, A human perception inspired quality metric for image fusion based on regional information, Inform. Fusion, 8 (2007), 193–207. https://doi.org/10.1016/j.inffus.2005.10.001
[53] V. Aslantas, E. Bendes, A new image quality metric for image fusion: The sum of the correlations of differences, AEU-International J. Electron. Commun., 69 (2015), 1890–1896. https://doi.org/10.1016/j.aeue.2015.09.004
[54] Z. Wang, A. C. Bovik, A universal image quality index, IEEE Signal Process. Lett., 9 (2002), 81–84. https://doi.org/10.1109/97.995823
[55] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, preprint, arXiv: 1412.6980.
[56] A. Toet, The TNO multiband image data collection, Data Brief, 15 (2017), 249–251. https://doi.org/10.6084/m9.figshare.1008029.v2
[57] X. Jia, C. Zhu, M. Li, W. Tang, W. Zhou, LLVIP: A visible-infrared paired dataset for low-light vision, in 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), (2021), 3489–3497. https://doi.org/10.1109/ICCVW54120.2021.00389
[58] K. A. Johnson, J. A. Becker, The Whole Brain, 2024. Available from: http://www.med.harvard.edu/AANLIB/home.html.
[59] H. Li, J. Zhao, J. Li, Z. Yu, G. Liu, Feature dynamic alignment and refinement for infrared–visible image fusion: Translation robust fusion, Inform. Fusion, 95 (2023), 26–41. https://doi.org/10.1016/j.inffus.2023.02.011