
A novel dictionary learning-based approach for Ultrasound Elastography denoising


  • Received: 08 June 2022 Revised: 25 July 2022 Accepted: 11 August 2022 Published: 11 August 2022
  • Ultrasound Elastography is a relatively new ultrasound imaging technique mainly used to diagnose tumors and diffuse diseases that cannot be detected by traditional ultrasound imaging. However, artifact noise, speckle noise, low contrast and a low signal-to-noise ratio make disease diagnosis from these images a challenging task. Medical image denoising, as the first step in the subsequent processing of medical images, has therefore received wide attention. With the widespread use of deep learning techniques in research, dictionary learning methods are once again receiving attention. Dictionary learning, as a traditional machine learning method, requires a smaller sample size, has high training efficiency, and can describe images well. In this work, we present a novel strategy based on K-clustering with singular value decomposition (K-SVD) and principal component analysis (PCA) to reduce noise in Ultrasound Elastography images. At the dictionary training stage, we use a PCA method to transform the way dictionary atoms are updated in K-SVD. Finally, we reconstruct the image from the dictionary atoms and sparse coefficients to obtain the denoised image. We applied the presented method to datasets of clinical Ultrasound Elastography images of lung cancer from Nanjing First Hospital, and compared the results of the presented method and the original method. Subjective and objective evaluations demonstrate that the presented approach achieves a satisfactory denoising effect, and this research provides a new technical reference for computer-aided diagnosis.

    Citation: Yihua Song, Chen Ge, Ningning Song, Meili Deng. A novel dictionary learning-based approach for Ultrasound Elastography denoising[J]. Mathematical Biosciences and Engineering, 2022, 19(11): 11533-11543. doi: 10.3934/mbe.2022537




    As social networks expand their scope, more fake news emerges in online communities, which is detrimental to community stability and growth. On social media, fake news is information presented as a news statement whose veracity cannot be quickly confirmed, or ever confirmed at all [1]. Such fake news frequently appears in politics, economics, and public safety, where it is tremendously hazardous to society. As an illustration, during the COVID-19 outbreak almost 800 people died because of the widespread misconception that consuming large amounts of alcohol could disinfect the body [2]. With the amount of online fake news on the rise, traditional manual verification can no longer handle the increasingly large volume of data. As a result, automated detection methods are gaining attention from academia and industry alike.

    Traditional approaches are designed for text-only news. However, the prevalence of fake news with images on social media has sparked interest in methods that take multimodal inputs [3]. Many such methods have been proposed [3,4,5,6,7] in recent years, as fake news has evolved from text-only posts to multimedia posts with photos or videos [8]. Still, the following issues remain with current detection methods: 1) Many multimodal detection models [6,9,10,11] extract visual features with VGG-19 [12] or ResNet-50 [14] pre-trained on ImageNet [13], which limits their ability to generate high-quality intermediate features and location information and leads to unsatisfactory detection results. 2) Current multimodal fake news detection methods [5,6,15] essentially detect fake news by simply concatenating text and image features, ignoring how important each modality is to a given piece of news. 3) Existing multimodal fake news detection approaches [16,17,18] consider the joint influence of different modal features but neglect the degree of feature similarity between the modalities.

    To address the aforementioned challenges, we propose a novel multimodal framework called TGA that leverages the transformer [19] to fully capture visual features, fuse multimodal features effectively, and exploit the feature similarity between modalities. Specifically, we use a transformer and a vision transformer to extract text and image features, respectively, which are then fused by an attention mechanism to obtain the news representation. The news representation is fed into our detector to detect rumors. To further improve detection, we map the feature vectors of the two modalities into the same space for alignment and compute their feature similarity. If the distance between the two modal features exceeds a threshold, we consider the text and image of the news mismatched and increase the predicted probability that the news is fake. We thus adjust the detector's output according to the feature similarity between the two modalities, thereby improving detection performance. The main contributions of this paper are as follows:

    ● We introduce the TGA model, which utilizes various types of transformers to extract and represent visual and text features of news.

    ● We use attention mechanisms to fuse the representations of different modalities and calculate the degree of feature similarity between different modalities to obtain more robust representations and improve the performance of TGA.

    ● We evaluate the effectiveness of TGA on public datasets, demonstrating its superior performance compared to other state-of-the-art methods in detecting fake news.

    In the field of fake news detection, existing methods mainly include three categories: 1) textual content-based, 2) visual content-based and 3) multimodal-based. In this section, we briefly review the work in recent years and explain the novelty of our method accordingly.

    The textual content-based supervised fake news detection method uses the textual content of news as input. Ma et al. [20] first applied deep learning to fake news detection by feeding textual content into RNNs, LSTMs, and GRUs. Yu et al. [21] first used a convolutional neural network to model news. Ma et al. [22] applied the idea of multi-task learning for the first time, jointly training rumor detection and stance classification with the help of an RNN. Ma et al. [23] used adversarial learning to detect fake news, improving the robustness and classification accuracy of the model. Vaibhav et al. [24] modeled article sentences as graphs and utilized a GCN to detect fake news, achieving positive results. Cheng et al. [25] used a variational autoencoder (VAE) to encode textual content into an embedded news representation and performed multi-task learning on the obtained news vectors to improve the model. Considering the temporal characteristics of rumors, [26] detects rumors with graph neural networks by leveraging the dynamic propagation structure of rumors. Moreover, many fake news detection methods now utilize temporal graphs [27] to construct graph structures. [28] used graph neural networks to extract text features while also utilizing user information and interaction information. However, previous works applied traditional RNN-based models to extract text features, which cannot be parallelized and whose feature extraction is hard to interpret. To solve these problems, our work uses a transformer to extract text features, which not only enables parallelization but also offers stronger interpretability through its self-attention mechanism [19].

    News contains textual and visual content, such as images and videos. Recently, visual content has been shown to be an essential indicator for detecting fake news [29,30]. Traditional statistics-based methods detect fake news using the number of attached images, image popularity, and image type. With the rise of deep learning, many models use CNN, ResNet, and AlexNet to extract news features. However, traditional convolutional neural networks can only recognize pixel-level features of images; they cannot identify semantic features, so they cannot detect whether images have been manipulated. Given that fake and real images can differ greatly in both physical and semantic aspects, literature [31] proposes a fake image discriminator, MVNN, which can effectively detect fake images. How best to model image features has not been studied well in past work, most of which has focused on extracting text features. Many works use CNN-based models for image feature extraction, while our work introduces the vision transformer to fake news detection for the first time. The vision transformer obtains global features from shallow layers and retains more spatial information than ResNet, and thus has a stronger ability to extract image features.

    Currently, more and more works use both textual and visual content to detect fake news. With the rise of deep learning, many powerful feature extractors have emerged, such as RNN, BERT, and Transformer for text, and CNN, ResNet, and AlexNet for images. The extracted text and visual features can then be fused for fake news detection.

    Most works [6,9,10,16,27] directly concatenate the textual and image features obtained from the extractors to detect fake news. For instance, literature [9] utilized VGG-19 to extract visual content and XLNet to extract text content. Literature [10] used LSTM to model the text content of posts and the text embedded in images, and used VGG to model visual content. Literature [6] used VGG to extract visual features and Text-CNN to extract text features. Literature [16] extracts image features using VGG and text features via a bi-directional LSTM.

    Some works [27,32] used the contrast between modalities to detect fake news, under the assumption that news is fake if the visual and textual content do not match. Based on this assumption, some approaches encode the image and text information of the news and then calculate the similarity between the two. If the similarity is high, the text and visual information match and the news is considered real; if the similarity is low, the text and visual information do not match and the news is considered fake. For example, literature [33] maps textual and visual information into the same vector space and compares their similarity to detect fake news. Literature [32] uses BERT to model textual information and ResNet to model visual information and calculates the similarity between them. Inspired by these works, we map the feature vectors of the two modalities into a new space and calculate their similarity.

    There are also works [17,27,34,35,36,37,38] that use multimodal information enhancement to detect fake news, where textual information helps to understand visual information and vice versa; this mutual enhancement between the two modalities can be applied to fake news detection. For example, literature [17] first proposed using attention between modalities to enhance cross-modal information; literature [27] employed the attention mechanism to obtain a visually enhanced representation of textual information to understand multimodal information better. Literature [35] designed a two-layer image-text co-attention to better fuse visual and textual information, and literature [36] utilized a co-attention approach to learn more robust feature representations in which text and images reinforce each other. However, these works ignore the fact that different modalities contribute differently to fake news detection, so the model should attend to the most informative modality to improve its detection ability. Literature [37] uses BERT to extract both text and visual features so as to strengthen the mutual reinforcement between the two modal features. Literature [38] uses a cross-modal alignment module to transform heterogeneous unimodal features into a shared semantic space. Inspired by the fusion of different modal features in literature [17], our model uses an attention mechanism to combine the two modal features in a late-fusion stage.

    In this section, we will introduce the TGA model proposed in this paper. TGA is a transformer-based multimodal approach consisting of four key components: text feature extractor, image feature extractor, late fusion, and classifier. These components work together to extract and fuse text and image features, generating a comprehensive representation of the news that is then passed to the classifier for the task of rumor detection.

    The framework of TGA is illustrated in Figure 1. We start by obtaining word embeddings with GloVe and then use a transformer to generate the original vector set of the text. Next, we extract the original vector set of the image by processing different regions of the news image. We then pool the original vectors of the text and the image and employ an attention mechanism to fuse the guidance vectors of the two modalities, resulting in the final representation of the news. Finally, the news representation is fed into an MLP, while the guidance vectors of both modalities are mapped to a new target space to predict feature matching. The output of the MLP, combined with the appropriately weighted feature similarity value, gives the final prediction.

    Figure 1.  The overall framework of TGA. First, the text features and visual features of each news item are extracted by different types of transformers. Then, the features of the two modalities are fused by an attention mechanism. Finally, the multimodal feature similarity is added for further detection of fake news.

    We obtain word embeddings from the pre-trained GloVe model [9] after using the Jieba lexicon to segment the news texts. Given the transformer encoder's effectiveness in aggregating text features, we use it to extract text features. The word vectors obtained from the GloVe model are used as input to the transformer encoder. When encoding, we add position embeddings (PE) to the word vector of each word. Specifically, we use sine and cosine position encodings, generated by applying sine and cosine functions of different frequencies to each position, which are then added to the corresponding word vectors. The calculation formula for PE is as follows:

    $PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$ (1)
    $PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$ (2)

    where $PE \in \mathbb{R}^{L \times d_{model}}$, L is the sentence length (75 in this paper), $d_{model}$ denotes the dimension of the word vector (512 in this paper), pos denotes the absolute position of the word in the sentence (pos = 0, 1, 2, ...), and i indicates the dimension index in the word vector. The input word vector is added to the position embedding, and the calculation formula is as follows:

    $X = \mathrm{GloveEmbedding}(X) + PE$ (3)

    where $X \in \mathbb{R}^{L \times d_{model}}$ denotes the word embeddings of a news article, and GloveEmbedding is the operation that obtains word embeddings from the GloVe model. The word embeddings obtained from Eq (3) are used as input to the transformer encoder. The transformer encoder comprises N block structures, as illustrated in Figure 2. Each block consists of a multi-head attention layer, a residual connection, a normalization layer, a feedforward layer, another residual connection, and another normalization layer. In the first step, the word embeddings pass through the multi-head attention layer as follows:

    $Q = XW_Q,\quad K = XW_K,\quad V = XW_V$ (4, 5, 6)
    $X_a = \mathrm{SelfAttention}(Q, K, V)$ (7)
    $\mathrm{SelfAttention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q^T K}{\sqrt{d_k}}\right)V$ (8)
    Figure 2.  The structure of Transformer Encoder.

    where $W_Q$, $W_K$, $W_V$ are three weight matrices, $d_k$ denotes the dimension of the matrix $W_K$, and $Q^T$ is the transpose of Q. In the second step, $X_a$ obtained from Eq (7) is added to X through a residual connection and then normalized, as follows:

    $X_a = X + X_a$ (9)
    $X_a = \mathrm{LayerNorm}(X_a)$ (10)

    where LayerNorm denotes the layer normalization operation. In the third step, the normalized word embeddings are passed to the feedforward layer, which consists of two linear layers and an activation function:

    $X_h = \mathrm{Activate}(\mathrm{Linear}(\mathrm{Linear}(X_a)))$ (11)

    where Activate denotes the activation function and Linear denotes a fully connected layer. In the fourth step, the output of the feedforward layer is fed into another residual connection and normalization layer to obtain the final output $X_h \in \mathbb{R}^{L \times d_{model}}$ of an encoding block:

    $X_h = X_a + X_h$ (12)
    $X_h = \mathrm{LayerNorm}(X_h)$ (13)

    Equations (4)–(13) are repeated N times; in this paper, N = 6. We refer to the hidden state vectors at the different positions as the original vector set of the text. As mentioned earlier, the text guidance vector $V_{text}$ is the result of pooling the original vector set of the text, as shown in Eq (14).

    $V_{text} = \sum_{i=1}^{L} X_i^{hidden}$ (14)

    where L is the sentence length, which is set to 75 in this paper.
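    The text branch can be summarized in a few lines of PyTorch. The following is a minimal sketch, not the authors' code: a simple embedding lookup stands in for the pre-trained GloVe table, and PyTorch's built-in transformer encoder is used for Eqs (4)–(13); names such as TextEncoder and sinusoidal_pe are illustrative assumptions.

    import math
    import torch
    import torch.nn as nn

    def sinusoidal_pe(L: int, d_model: int) -> torch.Tensor:
        """PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...), Eqs (1)-(2)."""
        pe = torch.zeros(L, d_model)
        pos = torch.arange(L, dtype=torch.float).unsqueeze(1)            # (L, 1)
        div = torch.exp(torch.arange(0, d_model, 2).float()
                        * (-math.log(10000.0) / d_model))                # (d_model/2,)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        return pe                                                        # (L, d_model)

    class TextEncoder(nn.Module):
        def __init__(self, vocab_size, d_model=512, n_heads=8, n_blocks=6, max_len=75):
            super().__init__()
            # Stand-in for the pre-trained GloVe lookup table.
            self.embed = nn.Embedding(vocab_size, d_model)
            self.register_buffer("pe", sinusoidal_pe(max_len, d_model))
            layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=n_blocks)  # Eqs (4)-(13)

        def forward(self, token_ids):                                    # (B, L)
            x = self.embed(token_ids) + self.pe[: token_ids.size(1)]     # Eq (3)
            h = self.encoder(x)                                          # (B, L, d_model)
            return h.sum(dim=1)                                          # Eq (14): V_text

    v_text = TextEncoder(vocab_size=30000)(torch.randint(0, 30000, (2, 75)))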

    The supplied image is resized to 448 × 448 and sliced into 196 regions arranged in a 14 × 14 grid, each region measuring 32 × 32 pixels. The regions are denoted by $I_i$ (i = 1, 2, ..., 196). We employ the Vision Transformer (ViT) [39] to fully extract the visual elements of the news. As a pre-trained model, ViT outperforms state-of-the-art image classification models on various image classification datasets at relatively low cost. Moreover, when pre-trained on large-scale datasets and transferred to classification tasks on small and medium-sized datasets, ViT outperforms CNNs [39]. Therefore, we use the pre-trained ViT model to obtain the feature vector $V_{region_i}$ of each region $I_i$, as shown in Eq (15). The ViT computation is the same as for the transformer encoder, with N = 12. These region feature vectors are referred to as the original vector set of the image.

    $V_{region_i} = \mathrm{ViT}(I_i)$ (15)

    As mentioned earlier, the image guidance vector is the result of pooling all the original vectors, as shown in Eq (16).

    $V_{image} = \frac{1}{N_r}\sum_{i=1}^{N_r} V_{region_i}$ (16)

    where Nr is the number of regions, which is set to 196 in this paper.
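    A hedged sketch of the image branch is given below. The region slicing and pooling of Eqs (15) and (16) are shown directly; vit_encode is only a placeholder for the pre-trained Vision Transformer, since the exact ViT variant and feature-extraction interface are not specified here.

    import torch
    import torch.nn.functional as F

    def vit_encode(region: torch.Tensor, dim: int = 1024) -> torch.Tensor:
        """Placeholder for ViT(I_i) in Eq (15); returns one feature vector per region."""
        return torch.randn(region.size(0), dim)        # (B, dim), illustrative only

    def image_guidance_vector(image: torch.Tensor, grid: int = 14) -> torch.Tensor:
        """image: (B, 3, H, W) -> V_image: (B, dim)."""
        x = F.interpolate(image, size=(448, 448), mode="bilinear",
                          align_corners=False)                  # resize to 448 x 448
        patch = 448 // grid                                      # 32-pixel regions
        x = x.unfold(2, patch, patch).unfold(3, patch, patch)    # (B, 3, 14, 14, 32, 32)
        regions = x.permute(0, 2, 3, 1, 4, 5).reshape(
            -1, grid * grid, 3, patch, patch)                    # (B, 196, 3, 32, 32)
        feats = torch.stack([vit_encode(regions[:, i])
                             for i in range(grid * grid)], dim=1)  # Eq (15): (B, 196, dim)
        return feats.mean(dim=1)                                 # Eq (16): V_image

    v_image = image_guidance_vector(torch.rand(2, 3, 600, 800))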

    To obtain the final feature representation of the news, we need to fuse the feature representations of the different modalities. Instead of simply concatenating them, we employ an attention mechanism to fully integrate the textual and visual representations into a multimodal representation. The attention mechanism has become a widely used component in deep learning for emphasizing the information most important to the current task among several inputs while ignoring insignificant information. Specifically, we compute an attention weight for each modality and create the final representation of the news via weighted averaging. The attention weight for modality m is computed with a two-layer feedforward network:

    $\tilde{\alpha}_m = \mathrm{softmax}(W_{m2}\tanh(W_{m1}v_m + b_{m1}) + b_{m2})$ (17)

    where $v_m \in \{V_{text}, V_{image}\}$ represents the feature representation of modality m, $\tilde{\alpha}_m$ represents the attention weight of modality m, $W_{m1}$ and $W_{m2}$ are weight matrices, and $b_{m1}$ and $b_{m2}$ are bias terms. The feature representation of modality m is then converted into a fixed-length form $v_m$ with the following formula:

    $v_m = \tanh(W_{m2}v_m + b_{m2})$ (18)

    The news feature representation $v_f$ is then created as the weighted average of the feature representations of all modalities, using the formula below:

    $v_f = \sum_{m \in \{text,\,image\}} \tilde{\alpha}_m v_m$ (19)
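    The attention-based late fusion of Eqs (17)–(19) can be sketched as follows. The layer sizes (256 for text, 1024 for image, 128 after projection, taken from the experimental settings) and the use of separate projection weights for Eq (18) are illustrative assumptions, not the authors' implementation.

    import torch
    import torch.nn as nn

    class LateFusion(nn.Module):
        def __init__(self, text_dim=256, image_dim=1024, hidden=128, fused_dim=128):
            super().__init__()
            # Two-layer scoring networks (W_m1, b_m1, W_m2, b_m2 in Eq (17)).
            self.score_text = nn.Sequential(nn.Linear(text_dim, hidden), nn.Tanh(),
                                            nn.Linear(hidden, 1))
            self.score_image = nn.Sequential(nn.Linear(image_dim, hidden), nn.Tanh(),
                                             nn.Linear(hidden, 1))
            # Projections to a fixed-length form (Eq (18)).
            self.proj_text = nn.Linear(text_dim, fused_dim)
            self.proj_image = nn.Linear(image_dim, fused_dim)

        def forward(self, v_text, v_image):
            scores = torch.cat([self.score_text(v_text),
                                self.score_image(v_image)], dim=1)       # (B, 2)
            alpha = torch.softmax(scores, dim=1)                          # Eq (17)
            p_text = torch.tanh(self.proj_text(v_text))                   # Eq (18)
            p_image = torch.tanh(self.proj_image(v_image))
            # Eq (19): weighted average of the two modal representations.
            return alpha[:, :1] * p_text + alpha[:, 1:] * p_image         # (B, fused_dim)

    v_f = LateFusion()(torch.randn(2, 256), torch.randn(2, 1024))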

    The classifier is a three-layer MLP that takes the news feature representation $v_f$ obtained by late fusion as input for final classification. We denote the classifier as $G_r(v_f, \Theta_r)$, where $\Theta_r$ denotes all parameters in the classifier, and its output $\tilde{y}_f$ is the probability that the news is fake.

    $\tilde{y}_f = G_r(v_f, \Theta_r)$ (20)

    The sigmoid activation function is used in the output layer to restrict the output to values between 0 and 1. By examining a significant amount of fake news detection data, we found that the texts and images of many fake news items are unrelated, because many fake news writers use captivating images that have nothing to do with the text to attract readers. Therefore, we believe that computing the similarity of features across different modalities can enhance fake news detection, given the considerable differences between text and image features in such cases. To determine the degree of feature similarity, we map the feature representations of text and images into a new target space as follows:

    $S(V_{text}, V_{image}) = \left\| M_1(V_{text}) - M_2(V_{image}) \right\|_2$ (21)

    where S is the Euclidean distance between the two modal features in the target space, and $M_1(V_{text})$ and $M_2(V_{image})$ are two mapping functions, each consisting of a two-layer MLP that maps the text and image feature representations into the new target space. We denote the final predicted value as:

    $\tilde{y}_f = \begin{cases} \tilde{y}_f + \alpha S(V_{text}, V_{image}), & \text{if } S(V_{text}, V_{image}) > \beta \\ \tilde{y}_f, & \text{if } S(V_{text}, V_{image}) \le \beta \end{cases}$ (22)

    If the Euclidean distance between the two modalities is greater than the threshold β, the classifier's prediction plus α times $S(V_{text}, V_{image})$ is used as the final result, where β and α are hyperparameters. The most effective values found in our experiments are β = 0.65 and α = 0.1. If the final prediction value is greater than or equal to 0.5, we predict the news as fake; otherwise, we predict it as real. For the classification loss, we use cross entropy, calculated as follows:

    $L_r(\Theta_r) = -y\log\tilde{y}_f - (1-y)\log(1-\tilde{y}_f)$ (23)

    where y denotes the ground truth.
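    Putting Eqs (20)–(23) together, a minimal sketch of the classifier, the similarity mapping, and the threshold adjustment looks as follows. The layer sizes, the ReLU activations, and the clamp that keeps the adjusted prediction below 1 are assumptions made so the example runs end to end.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Detector(nn.Module):
        def __init__(self, fused_dim=128, text_dim=256, image_dim=1024,
                     map_dim=128, alpha=0.1, beta=0.65):
            super().__init__()
            self.alpha, self.beta = alpha, beta
            self.classifier = nn.Sequential(                     # G_r in Eq (20)
                nn.Linear(fused_dim, 64), nn.ReLU(),
                nn.Linear(64, 32), nn.ReLU(),
                nn.Linear(32, 1), nn.Sigmoid())
            # M_1 and M_2 in Eq (21): two-layer MLPs mapping into the target space.
            self.map_text = nn.Sequential(nn.Linear(text_dim, map_dim), nn.ReLU(),
                                          nn.Linear(map_dim, map_dim))
            self.map_image = nn.Sequential(nn.Linear(image_dim, map_dim), nn.ReLU(),
                                           nn.Linear(map_dim, map_dim))

        def forward(self, v_f, v_text, v_image):
            y_hat = self.classifier(v_f).squeeze(1)                          # Eq (20)
            s = torch.norm(self.map_text(v_text) - self.map_image(v_image),
                           dim=1)                                            # Eq (21)
            adjust = torch.where(s > self.beta, self.alpha * s,
                                 torch.zeros_like(s))                        # Eq (22)
            # Clamp is a practical safeguard (not part of Eq (22)) so Eq (23) stays finite.
            return (y_hat + adjust).clamp(max=1.0 - 1e-6)

    model = Detector()
    pred = model(torch.randn(2, 128), torch.randn(2, 256), torch.randn(2, 1024))
    loss = F.binary_cross_entropy(pred, torch.tensor([1.0, 0.0]))            # Eq (23)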

    In this section, we first introduce the dataset and parameter settings. Then we compare our proposed model TGA with several baselines and analyze the results of comparative experiments. Finally, we verify the effectiveness of each module of TGA by ablation study and dissect the impact of hyper-parameters by parameter sensitivity experiments.

    The Weibo dataset utilized in this paper is retrieved from the DataFountain website (datafountain.cn). The multi-modal dataset is provided by the Beijing Municipal Bureau of Economy and Information Technology and the Big Data Expert Committee of the Chinese Computer Society and includes various fields such as Weibo texts, comments, images, and labels for three categories: "no judgment required", "fake news", and "real news". We selected only two labels: "fake news" and "real news". To clean up the dataset, we preserved only the Chinese characters of the Weibo text and removed content like emojis and meaningless symbols.

    We also removed duplicate and low-quality images to ensure the dataset's quality. Since this work focuses on text and images, text-only tweets were deleted, and only one image was kept for tweets with multiple images. After processing, the dataset contained 17,848 items of real and fake news across eight categories: science and technology, politics, military, finance and business, social life, sports and entertainment, medical and health, and education and examination. Due to the limited amount of data in the other fields, we used data from only four fields (finance, society, entertainment and health; see Table 1). All the data in these four categories were merged and randomly split into a training set (80%), a validation set (10%), and a test set (10%), totaling 16,417 items. Table 1 displays the dataset's specifics.

    Table 1.  Dataset statistics.
    Dataset   Domain/Statistics   Fake news   Real news   Total
    Weibo     Finance             428         350         778
              Society             5642        5409        11,051
              Entertainment       556         733         1299
              Health              1756        1533        3289
    Twitter   Training set        6840        5007        11,847
              Test set            564         427         991


    The Twitter dataset [40] was released for the Verifying Multimedia Use task at MediaEval. In our experiments, we keep the same data split scheme as the benchmark [40]. The training set contains 6840 real tweets and 5007 fake tweets, and the test set contains 991 posts, including 564 real tweets and 427 fake tweets. We follow the same steps as for the Weibo dataset to remove duplicate and low-quality images and ensure the quality of the entire dataset.
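    A hedged sketch of the Weibo preprocessing described above (keeping only Chinese characters, discarding text-only posts, and splitting 80%/10%/10%) is given below; the item fields (text, image, label) are assumed names used only for illustration.

    import re
    import random

    def clean_text(text: str) -> str:
        """Keep only Chinese characters; drops emojis and meaningless symbols."""
        return "".join(re.findall(r"[\u4e00-\u9fa5]", text))

    def split_dataset(items, seed=42):
        """items: list of dicts with 'text', 'image', 'label' keys."""
        items = [dict(it, text=clean_text(it["text"]))
                 for it in items if it.get("image")]        # drop text-only posts
        random.Random(seed).shuffle(items)
        n = len(items)
        train = items[: int(0.8 * n)]
        val = items[int(0.8 * n): int(0.9 * n)]
        test = items[int(0.9 * n):]
        return train, val, test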

    The text feature extractor and image feature extractor produce output dimensions of 256 and 1024, respectively. The mapping functions generate output dimensions of 128 for both text and image features. Furthermore, the text transformer uses multi-head attention with eight heads, while the image transformer uses 16 heads. During training, we use a batch size of 32, a learning rate of 0.001, and the Adam optimizer. To achieve faster convergence, we use a dynamic learning rate schedule: we record the F1-Score after each epoch and reduce the learning rate to 80% of its previous value if the F1-Score does not improve. Finally, we evaluate model performance using precision, recall, accuracy, and F1-Score.
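    The training procedure can be sketched as follows; model, train_loader (batch size 32), and evaluate_f1 are assumed to exist, num_epochs is illustrative, and the loop only illustrates the Adam optimizer with the dynamic learning-rate rule described above.

    import torch
    import torch.nn.functional as F

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    best_f1, num_epochs = 0.0, 50
    for epoch in range(num_epochs):
        model.train()
        for v_f, v_text, v_image, labels in train_loader:
            optimizer.zero_grad()
            loss = F.binary_cross_entropy(model(v_f, v_text, v_image), labels)
            loss.backward()
            optimizer.step()
        f1 = evaluate_f1(model)               # validation F1-Score after each epoch
        if f1 <= best_f1:                     # F1 did not improve
            for group in optimizer.param_groups:
                group["lr"] *= 0.8            # 80% of the previous epoch's rate
        best_f1 = max(best_f1, f1)

    An equivalent schedule could also be expressed with torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.8, patience=0).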

    To verify the effectiveness of our multimodal model, we compare it with the following baselines:

    Unimodal Models

    CNN [41]: A CNN-based model which uses CNN to extract image features and employs a three-layer neural network for classification.

    LSTM [42]: A textual model that uses an LSTM to extract the text features of news.

    Multimodal Models

    EANN [6]: A model that uses a CNN-based extractor to extract text features and a VGG-19 network to extract image features.

    MVAE [4]: A model that extracts the text and image features of news and reconstructs the original image and text from the hidden-layer vectors.

    Spotfake+ [9]: A multimodal model that utilizes transfer learning to capture semantic and contextual information from news texts and their associated images.

    Att-RNN [17]: Combines textual, visual, and social context features via an attention mechanism.

    MCAN [35]: An end-to-end model that uses multiple co-attention layers to fuse image and text features and can learn the interdependencies between modalities.

    HMCAN [43]: Models the multimodal features of news with a multimodal contextual attention network so that information from different modalities complements each other.

    Table 2 shows the performance of baselines and our model; we can obtain the following points from the experimental results:

    Table 2.  Comparative experiments.
    Dataset   Model       Precision   Recall   Accuracy   F1-Score
    Weibo     CNN         0.592       0.806    0.556      0.683
              LSTM        0.757       0.590    0.647      0.663
              EANN        0.925       0.736    0.740      0.820
              MVAE        0.931_      0.708    0.723      0.804
              SpotFake+   0.871       0.871    0.870      0.871
              att-RNN     0.741       0.777    0.772      0.723
              MCAN        0.899       0.899    0.902_     0.900_
              HMCAN       0.888       0.885    0.885      0.885
              TGA         0.969       0.886_   0.922      0.925
    Twitter   CNN         0.452       0.539    0.425      0.479
              LSTM        0.554       0.431    0.511      0.523
              EANN        0.745       0.748    0.745      0.744
              MVAE        0.697       0.627    0.688      0.639
              att-RNN     0.691       0.692    0.662      0.682
              MCAN        0.841       0.847    0.889_     0.849
              HMCAN       0.876_      0.888    0.878      0.875_
              TGA         0.912       0.854_   0.918      0.918


    ● Multi-modal models perform significantly better than unimodal models, which indicates the effectiveness of detecting fake news using multi-modal information.

    ● Spotfake+ outperforms att-RNN by utilizing pre-trained feature extractors, because pre-training typically improves a model's generalization ability and expedites its convergence on the target task.

    ● HMCAN is superior to Spotfake+ after modality augmentation with a contextual attention network, which indicates the effectiveness of the attention mechanism in fake news detection.

    ● On both datasets, the performance of MCAN is noticeably better than that of HMCAN. MCAN uses two feature extractors to fully extract image features, which not only highlights the significance of attention mechanisms in multi-modal fusion but also emphasizes the large contribution of image features to rumor detection.

    ● Our proposed model TGA outperforms the best baseline MCAN. Although MCAN uses co-attention in multimodal fusion, it ignores the importance of the degree of feature similarity between modalities for rumor detection, so it does not detect fake news as well as TGA. This further proves that our feature extractors are superior to traditional CNN- and RNN-based extractors and illustrates the significant role of multimodal feature similarity in rumor detection.

    For a more visual representation of the comparison experiment results, we plotted the line graphs depicted in Figures 3 and 4, where the horizontal axis shows the comparison models and the vertical axis represents the values of the four evaluation metrics.

    Figure 3.  Comparison of the four assessment results of the experiment (Weibo).
    Figure 4.  Comparison of the four assessment results of the experiment (Twitter).

    In order to verify the effectiveness of each module of TGA, we compare each of the following variants with TGA:

    TGA-T: Only text is used, the image feature part is deleted.

    TGA-I: Only the image is used, the text feature part is deleted.

    TGA-A: The attention-based fusion is removed and the features of the two modalities are directly concatenated.

    TGA-L: The transformer is replaced with LSTM in the text feature extractor.

    TGA-R: The ViT is replaced with ResNet-50 in the image feature extractor.

    TGA-M: The impact of feature similarity calculation results is removed from the experiment.

    Table 3 shows the experimental results of several variants and we can obtain the following points:

    Table 3.  Ablation experiments.
    Dataset   Model   Precision   Recall   Accuracy   F1-Score
    Weibo     TGA-T   0.669       0.670    0.660      0.664
              TGA-I   0.895       0.867    0.880      0.880
              TGA-A   0.914_      0.914    0.906_     0.914_
              TGA-L   0.905       0.878    0.891      0.892
              TGA-R   0.790       0.823    0.906_     0.806
              TGA-M   0.886       0.859    0.867      0.872
              TGA     0.969       0.886_   0.922      0.925
    Twitter   TGA-T   0.548       0.557    0.548      0.550
              TGA-I   0.745       0.769    0.772      0.776
              TGA-A   0.887       0.891    0.884      0.896_
              TGA-L   0.896_      0.847    0.897_     0.902
              TGA-R   0.735       0.796    0.870      0.756
              TGA-M   0.842       0.793    0.814      0.857
              TGA     0.912       0.854_   0.918      0.918


    ● TGA outperforms all variants, which indicates the effectiveness of each module of TGA.

    ● TGA-T and TGA-I have the worst performance among all variants, proving that multimodal detection is superior to unimodal detection.

    ● TGA-I is superior to TGA-T, which illustrates that the image-based model is more effective than the text-based model. This is because it is difficult to distinguish between true and false news from the text content alone, as texts in the same field usually contain many similar field-specific terms. However, fake news is often created by attaching images unrelated to the content to attract attention; when the images in a news item do not match the field to which the news belongs, the news is easily identified as fake. For this reason, image-based detection is often more effective than text-based detection when dealing with fake news in the same field.

    ● TGA outperforms TGA-A, which indicates the effectiveness of the attention mechanism in multimodal fusion, since the attention mechanism helps the model find the most important information.

    ● TGA-L is inferior to TGA, indicating that the transformer extracts text features better than traditional RNN-based extractors. Similarly, TGA-R is inferior to TGA, which shows that ViT extracts image features better than traditional CNN-based extractors.

    ● TGA outperforms TGA-M, because the multimodal feature similarity provides the degree of matching between modalities and enhances the model's capability; this also demonstrates that the level of semantic matching between modalities significantly impacts fake news detection.

    Additionally, we utilized bar charts, as presented in Figures 5 and 6, to illustrate the results of the ablation experiment in a clearer manner.

    Figure 5.  Results of ablation experiments on four benchmarks (Weibo).
    Figure 6.  Results of ablation experiments on four benchmarks (Twitter).

    The results of our experiments are highly sensitive to the chosen hyperparameters. To provide insight into their effects on the experimental outcomes, we show selected hyperparameter results in Figures 7–9. Notably, we conducted all hyperparameter experiments exclusively on the Weibo dataset.

    Figure 7.  Effect of α on model performance.
    Figure 8.  Effect of β on model performance.
    Figure 9.  Effect of word embedding dimension on model performance.

    Figure 7 shows the impact of the hyperparameter α on the experimental results. α controls the weight of the feature similarity value in the final prediction: the final result is the classifier's prediction plus α times the feature similarity value. The experimental results show that the larger α is, the greater the influence of the feature similarity. Setting α to 0.1 gives the best model performance.

    Figure 8 shows the impact of the threshold β on the experimental results. Our experiments show that the optimal performance is achieved when β is set to 0.65. When the feature distance between the two modalities exceeds β, we consider the features of the two modalities to be significantly mismatched, and we add α times the feature similarity value to the classifier's prediction to assess whether the news is fake.

    In Figure 9, we illustrate the impact of the word embedding dimension on our experimental outcomes. We observed that a word embedding dimension of 32 yields the best results for our model. Our analysis suggests that when the word embedding dimension is below 32, the vector representation of the words is insufficient to capture word features accurately. As we increase the word embedding dimension beyond 32, the language's inherent ambiguity amplifies, leading to overfitting.

    In this paper, we propose a transformer-based multimodal model, TGA, to study the problem of detecting multimodal fake news. Specifically, we use different types of transformers to extract textual and image features and employ an attention mechanism to fuse the multimodal features at a late stage. In addition, we calculate the semantic matching degree of the multimodal features to improve detection. Experimental results on real datasets show that our proposed model outperforms existing multimodal models. In future work, we will consider extending TGA to cross-domain news detection.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was supported by the Natural Science Foundation of Heilongjiang Province in China (No. LH2020F043).

    The authors declare there is no conflict of interest.



    [1] C. G. Wang, W. Liu, D. Q. Liu, Noise suppression method for ultrasound strain imaging based on coded excitation, Appl. Res. Comput., 30 (2013), 1596-1600. https://doi.org/10.3969/j.jssn.1001-3695.2013.05.083 doi: 10.3969/j.jssn.1001-3695.2013.05.083
    [2] Y. He, S. Cao, H. Zhang, H. Sun, L. Lu, Dynamic PET image denoising with deep learning-based joint filtering, IEEE Access, 9 (2021), 41998-42012. https://doi.org/10.1109/ACCESS.2021.3064926 doi: 10.1109/ACCESS.2021.3064926
    [3] P. Liu, M. D. E. Basha, Y. Li, Y. Xiao, P. C. Sanelli, R. Fang, Deep evolutionary networks with expedited genetic algorithms for medical image denoising, Med. Image Anal., 54 (2019), 306-315. https://doi.org/10.1016/j.media.2019.03.004 doi: 10.1016/j.media.2019.03.004
    [4] C. Broaddus, A. Krull, M. Weigert, U. Schmidt, G. Myers, Removing structured noise with self-supervised blind-spot networks, in IEEE 17th International Symposium on Biomedical Imaging (ISBI), IEEE, (2020), 159-163. https://doi.org/10.1109/ISBI45749.2020.9098336
    [5] M. Green, E. M. Marom, E. Konen, N. Kiryati, A. Mayer, Learning real noise for ultra-low dose lung CT denoising, in Patch-Based Techniques in Medical Imaging Patch-MI 2018 (eds. W. Bai, G. Sanroma, G. Wu, B. Munsell, Y. Zhan, P. Coupé), Springer, Cham, (2018), 3-11. https://doi.org/10.1007/978-3-030-00500-9_1
    [6] L. Tao, C. Zhu, G. Xiang, Y. Li, H. Jia, X. Xie, Llcnn: A convolutional neural network for low-light image enhancement, in 2017 IEEE Visual Communications and Image Processing (VCIP), IEEE, (2018), 1-4. https://doi.org/10.1109/VCIP.2017.8305143
    [7] D. Wu, H. Ren, Q. Li, Self-supervised dynamic CT perfusion image denoising with deep neural networks, IEEE Trans. Radiat. Plasma Med. Sci., 5 (2021), 350-361. https://doi.org/10.48550/arXiv.2005.09766 doi: 10.48550/arXiv.2005.09766
    [8] A. Ouahabi, Signal and Image Multiresolution Analysis, John Wiley & Sons, 2012. https://doi.org/10.1002/9781118568767
    [9] A. Ouahabi, A review of wavelet denoising in medical imaging, in 2013 8th International Workshop on Systems, Signal Processing and their Applications (WoSSPA), IEEE, (2013), 19-26. https://doi.org/10.1109/WoSSPA.2013.6602330
    [10] H. Jomaa, R. Mabrouk, N. Khlifa, F. Morain-Nicolier, Denoising of dynamic pet images using a multi-scale transform and non-local means filter, Biomed. Signal Process. Control, 41 (2017), 69-80. https://doi.org/10.1016/j.bspc.2017.11.002 doi: 10.1016/j.bspc.2017.11.002
    [11] A. Gupta, V. Bhateja, A. Srivastava, A. Gupta, S. C. Satapathy, Speckle noise suppression in Ultrasound images by using an improved non-local mean filter, in Soft Computing and Signal Processing, Springer, Singapore, (2019), 13-19. https://doi.org/10.1007/978-981-13-3393-4_2
    [12] F. Baselice, G. Ferraioli, V. Pascazio, A. Sorriso, Denoising of MR images using Kolmogorov-Smirnov distance in a non local framework, Magn. Reson. Imaging, 57 (2019), 176-193. https://doi.org/10.1016/j.mri.2018.11.022 doi: 10.1016/j.mri.2018.11.022
    [13] M. Xu, X. Xie, An efficient feature-preserving PDE algorithm for image denoising based on a spatial-fractional anisotropic diffusion equation, preprint, arXiv: 2101.01496.
    [14] H. Wang, S. Cao, K. Jiang, H. Wang, Q. Zhang, Seismic data denoising for complex structure using BM3D and local similarity, J. Appl. Geophys., 170 (2019), 103759. https://doi.org/10.1016/j.jappgeo.2019.04.018 doi: 10.1016/j.jappgeo.2019.04.018
    [15] C. Feng, D. Zhao, M. Huang, Image segmentation using CUDA accelerated non-local means denoising and bias correction embedded fuzzy c-means (BCEFCM), Signal Process., 122 (2016), 164-189. https://doi.org/10.1016/j.sigpro.2015.12.007 doi: 10.1016/j.sigpro.2015.12.007
    [16] C. Feng, M. Huang, D. Zhao, Segmentation of longitudinal brain MR images using bias correction embedded fuzzy c-means with non-locally spatio-temporal regularization, J. Visual Commun. Image Represent., 38 (2016), 517-529. https://doi.org/10.1016/j.jvcir.2016.03.027 doi: 10.1016/j.jvcir.2016.03.027
    [17] C. Feng, W. Li, J. Hu, K. Yu, D. Zhao, BCEFCM_S: Bias correction embedded fuzzy c-means with spatial constraint to segment multiple spectral images with intensity inhomogeneities and noises, Signal Process., 168 (2020), 107347. https://doi.org/10.1016/j.sigpro.2019.107347 doi: 10.1016/j.sigpro.2019.107347
    [18] S. Valiollahzadeh, H. Firouzi, M. Babaie-Zadeh, C. Jutten, Image denoising using sparse representations, in International Conference on Independent Component Analysis and Signal Separation, (2009), 557-564. https://doi.org/10.1007/978-3-642-00599-2_70
    [19] H. R. Shahdoosti, S. M. Hazavei, A new compressive sensing based image denoising method using block-matching and sparse representations over learned dictionaries, Multimedia Tools Appl., 78 (2018), 12561-12582. https://doi.org/10.1007/s11042-018-6818-3 doi: 10.1007/s11042-018-6818-3
    [20] F. I. Miertoiu, B. Dumitrescu, Sparse representation and denoising for images affected by generalized Gaussian noise, U.P.B. Sci. Bull., Ser. C, 84 (2022), 75-86.
    [21] L. Nasser, T. Boudier, A novel generic dictionary-based denoising method for improving noisy and densely packed nuclei segmentation in 3D time-lapse fluorescence microscopy images, Sci. Rep., 9 (2019), 1-13. https://doi.org/10.1038/s41598-019-41683-3 doi: 10.1038/s41598-019-41683-3
    [22] H. Haneche, A. Ouahabi, B. Boudraa, New mobile communication system design for Rayleigh environments based on compressed sensing-source coding, IET Commun., 13 (2019), 2375-2385. https://doi.org/10.1049/iet-com.2018.5348 doi: 10.1049/iet-com.2018.5348
    [23] A. E. Mahdaoui, A. Ouahabi, M. S. Moulay, Image denoising using a compressive sensing approach based on regularization constraints, Sensors, 22 (2022), 2199. https://doi.org/10.3390/s22062199 doi: 10.3390/s22062199
    [24] H. Zhu, L. Han, R. Chen, Seismic data denoising method combining principal component analysis and dictionary learning, Global Geol., 39 (2020), 656-662. https://doi.org/10.3969/j.issn.1004-5589.2020.03.015 doi: 10.3969/j.issn.1004-5589.2020.03.015
  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
