
As social networks expand their scope, more fake news emerges in online communities, which is detrimental to community stability and growth. On social media, fake news is an information statement, in whatever form, whose veracity cannot be quickly, or ever, confirmed [1]. Such fake news frequently appears in politics, economics, and public safety, and is tremendously hazardous to society. As an illustration, during the COVID-19 outbreak almost 800 people died because of the widespread misconception that consuming large amounts of alcohol could disinfect the body [2]. With the amount of online fake news on the rise, traditional manual methods can no longer handle the increasingly large volume of data. As a result, automated detection methods are gaining attention from academia and industry alike.
Traditional approaches are designed for text-only news. However, the prevalence of fake news with images on social media has sparked interest in methods that take multimodal inputs [3]. Many such methods have been proposed [3,4,5,6,7] in recent years, as fake news has evolved from text-only posts to multimedia posts with photos or videos [8]. Still, the following issues remain with current detection methods: 1) Many multimodal detection models [6,9,10,11] extract visual features from news using VGG-19 [12] pre-trained on ImageNet [13] or ResNet-50 [14], which limits their ability to generate high-quality intermediate features and location information, resulting in unsatisfactory detection results. 2) Current multimodal fake news detection methods [5,6,15] essentially detect fake news by simply concatenating text and image features, ignoring the differing importance of the modalities to the news. 3) Existing multimodal fake news detection approaches [16,17,18] neglect the degree of feature similarity between the modalities, despite considering the joint influence of different modal features.
To address the aforementioned challenges, we propose a novel multimodal framework called TGA that leverages transformers [19] to fully capture visual features, fuse multimodal features effectively, and make use of the feature similarity between modalities. Specifically, we use a transformer and a vision transformer to extract text and image features, respectively, which are then fused by an attention mechanism to obtain the news representation. Finally, we feed the news representation into our detector to detect rumors. To further improve detection performance, we also map the feature vectors of the two modalities into the same space for alignment and compute the feature similarity between them. If the feature similarity between the two modalities of a news item falls below a threshold, we consider the two modalities mismatched and increase the predicted probability that the item is fake news. We thus adjust the detector's output according to the cross-modal feature similarity, thereby improving detection performance. The main contributions of this paper are as follows:
● We introduce the TGA model, which utilizes various types of transformers to extract and represent visual and text features of news.
● We use attention mechanisms to fuse the representations of different modalities and calculate the degree of feature similarity between different modalities to obtain more robust representations and improve the performance of TGA.
● We evaluate the effectiveness of TGA on public datasets, demonstrating its superior performance compared to other state-of-the-art methods in detecting fake news.
In the field of fake news detection, existing methods mainly include three categories: 1) textual content-based, 2) visual content-based and 3) multimodal-based. In this section, we briefly review the work in recent years and explain the novelty of our method accordingly.
Textual content-based supervised fake news detection methods use the textual content of the news as input. Ma et al. [20] first applied deep learning to fake news detection by feeding textual content into RNNs, LSTMs, and GRUs. Yu et al. [21] first used a convolutional neural network to model news. Ma et al. [22] were the first to apply multi-task learning, jointly training rumor detection and stance classification with an RNN. Ma et al. [23] used adversarial learning to detect fake news, improving the robustness and classification accuracy of the model. Vaibhav et al. [24] modeled article sentences as graphs, utilizing a GCN to detect fake news and achieving positive results. Cheng et al. [25] used a variational autoencoder (VAE) to encode textual content into an embedded news representation and performed multi-task learning on the obtained vectors to improve the model. Considering the temporal characteristics of rumors, [26] detects rumors with graph neural networks by leveraging their dynamic propagation structure. Moreover, many fake news detection methods now utilize temporal graphs [27] to construct graph structures. [28] used graph neural networks to extract text features while also utilizing user and interaction information. However, previous works applied traditional RNN-based models to extract text features, which cannot be parallelized and whose feature extraction lacks a clear physical interpretation. To address these problems, our work uses a transformer to extract text features, which not only enables parallelization but also offers stronger interpretability through its self-attention mechanism [19].
News contains textual and visual content, such as images and videos. Recently, visual content has been demonstrated to be an essential indicator for detecting fake news [29,30]. Traditional statistics-based methods detect fake news using the number of attached images, image popularity, and image type. With the rise of deep learning, many models use CNN, ResNet, or AlexNet to extract image features. However, traditional convolutional neural networks only capture pixel-level features of images; they cannot identify the semantic features of images and therefore cannot detect whether images have been manipulated. Given that fake and real images can differ greatly in both physical and semantic aspects, literature [31] proposes a fake image discriminator, MVNN, which can effectively detect fake images. How best to model image features has not been well studied in past work, which has focused mostly on extracting text features. Many works use CNN-based models for image feature extraction, while our work introduces the vision transformer to fake news detection for the first time. The vision transformer obtains global features from shallow layers and retains more spatial information than ResNet, and thus has a stronger ability to extract image features.
Currently, more and more works consider using both textual and visual content to detect fake news. With the rise of deep learning, many powerful feature extractors have emerged, such as the text feature extractors RNN, BERT, and Transformer, and the image feature extractors CNN, ResNet, and AlexNet. Text features and visual features are extracted separately and then fused for fake news detection.
Most works [6,9,10,16,27] directly concatenate the textual and image features obtained from the extractors to detect fake news. For instance, literature [9] utilized VGG-19 to extract visual content and XLNet to extract text content. Literature [10] used LSTMs to model the text content and the text embedded in images, and used VGG to model the visual content. Literature [6] used VGG to extract visual features and Text-CNN to extract textual features. Literature [16] extracts image features using VGG and text features via a bi-directional LSTM.
Some works [27,32] use the contrast between modalities to detect fake news, on the assumption that news is fake if the visual and textual content do not match. Based on this assumption, some approaches encode the image and text information of the news and then calculate the similarity between the two. If the similarity is high, the textual and visual information match, suggesting real news; if the similarity is low, they do not match, suggesting fake news. For example, literature [33] maps textual and visual information into the same vector space and compares their similarity to detect false news. Literature [32] uses BERT to model textual information and ResNet to model visual information and calculates the similarity between them. Inspired by these works, we map the feature vectors of the two modalities into a new space to calculate their similarity after the feature vectors are obtained.
There are also works [17,27,34,35,36,37,38] that use multimodal information enhancement to detect fake news, where textual information helps in understanding visual information and vice versa; this mutual enhancement between the two modalities can be applied to detect fake news. For example, literature [17] first proposed using attention between modalities to enhance cross-modal information; literature [27] employed the attention mechanism to obtain a visual representation enhanced by textual information so as to understand multimodal information better. Literature [35] designed a two-layer image-text co-attention to better fuse visual and textual information, and literature [36] utilized the co-attention approach to learn more robust feature representations in which textual and visual information enhance each other. However, these works ignore the fact that different modalities contribute differently to fake news detection, so the model should be guided to attend to the more informative modality. According to [37], after using BERT to extract text features, the model further utilizes BERT to jointly represent text and visual features, strengthening the mutual reinforcement between the two modalities. [38] uses a cross-modal alignment module to transform heterogeneous unimodal features into a shared semantic space. Inspired by the fusion of different modal features in literature [17], our model uses an attention mechanism to combine the two modal features in a late-fusion stage.
In this section, we will introduce the TGA model proposed in this paper. TGA is a transformer-based multimodal approach consisting of four key components: text feature extractor, image feature extractor, late fusion, and classifier. These components work together to extract and fuse text and image features, generating a comprehensive representation of the news that is then passed to the classifier for the task of rumor detection.
The framework of TGA is illustrated in Figure 1. We start by obtaining word embeddings with GloVe and then use a transformer encoder to generate the original vector set of the text. Next, we extract the original vector set of the image by processing different regions of the news image. We then pool the original vectors of text and image and employ an attention mechanism to fuse the guidance vectors of the two modalities, resulting in the final representation of the news. Finally, the news representation is fed into an MLP, while the guidance vectors of both modalities are mapped into a new target space to predict feature matching. The MLP output, combined with an appropriately weighted feature-similarity value, gives the final prediction result.
We obtain word embeddings from the pre-trained GloVe model [9] after segmenting the news texts with the Jieba tokenizer. Given the transformer encoder's effectiveness in aggregating text features, we use it to extract text features. The word vectors obtained from the GloVe model are used as input to the transformer encoder. When encoding, we add position embeddings (PE) to the word vector of each word. Specifically, we use sine and cosine position encoding, generated by applying sine and cosine functions of different frequencies to each position and adding them to the corresponding word vectors. The calculation formula for PE is as follows:
$PE_{(pos,\,2i)} = \sin\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$ | (1) |
$PE_{(pos,\,2i+1)} = \cos\!\left(\dfrac{pos}{10000^{2i/d_{model}}}\right)$ | (2) |
where $PE \in \mathbb{R}^{L \times d_{model}}$, L is the sentence length (75 in this paper), $d_{model}$ denotes the dimension of the word vector (512 in this paper), pos denotes the absolute position of the word in the sentence (pos = 0, 1, 2, ...), and i indicates the dimension index within the word vector. The input word vector is added to the position embedding, calculated as follows:
X=GloveEmbedding(X)+PE | (3) |
where $X \in \mathbb{R}^{L \times d_{model}}$ denotes the word embedding of a news article, and GloveEmbedding is the operation that obtains word embeddings from the GloVe model. After obtaining the word embedding from Eq (3), it is used as input to the transformer encoder. The transformer encoder comprises N block structures, as illustrated in Figure 2. Each block consists of a multi-head attention layer, a residual connection, a normalization layer, a feedforward layer, another residual connection, and another normalization layer. In the first step, the word embeddings pass through the multi-head attention layer, calculated as follows:
$Q = XW_Q, \quad K = XW_K, \quad V = XW_V$ | (4, 5, 6) |
$X_a = \mathrm{SelfAttention}(Q, K, V)$ | (7) |
$\mathrm{SelfAttention}(Q, K, V) = \mathrm{softmax}\!\left(\dfrac{QK^{T}}{\sqrt{d_k}}\right)V$ | (8) |
where $W_Q$, $W_K$, $W_V$ are the three weight matrices, $d_k$ denotes the dimension of $K$, and $K^{T}$ is the transpose of $K$. In the second step, $X_a$ obtained from Eq (7) is added to $X$ through a residual connection and the result is normalized, as follows:
Xa=X+Xa | (9) |
Xa=LayerNorm(Xa) | (10) |
where LayerNorm denotes the layer normalization operation. In the third step, the normalized word embeddings are passed to the feedforward layer, which consists of two linear layers and an activation function, calculated as follows:
Xh=Activate(Linear(Linear(Xa))) | (11) |
where Activate denotes the activation function and Linear denotes a fully connected layer. In the fourth step, the output of the feedforward layer is fed into the residual connection and layer normalization to obtain the final output $X_h \in \mathbb{R}^{L \times d_{model}}$ of an encoding block, calculated as follows:
Xh=Xa+Xh | (12) |
Xh=LayerNorm(Xh) | (13) |
Equations (4)–(13) are repeated N times; in this paper, N = 6. We refer to the hidden state vectors at the different positions as the original vector set of the text. As mentioned earlier, the text guidance vector $V_{text}$ is the result of pooling the original vector set of the text, as shown in Eq (14).
$V_{text} = \sum_{i=1}^{L} X^{i}_{hidden}$ | (14) |
where L is the sentence length, which is set to 75 in this paper.
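For illustration, the following is a minimal PyTorch sketch of the text branch described above (Eqs (1)–(3) and (14)). It assumes the GloVe lookup is done elsewhere, uses `torch.nn.TransformerEncoder` as a stand-in for the N-block encoder, and adopts L = 75, d_model = 512 and N = 6 from the description above; it is a sketch rather than the exact released implementation.

```python
import torch
import torch.nn as nn

def sinusoidal_position_encoding(L: int, d_model: int) -> torch.Tensor:
    """Sine/cosine position embeddings of Eqs (1)-(2)."""
    pos = torch.arange(L, dtype=torch.float32).unsqueeze(1)        # (L, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)           # even dimension indices
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)    # pos / 10000^(2i/d_model)
    pe = torch.zeros(L, d_model)
    pe[:, 0::2] = torch.sin(angle)                                 # Eq (1)
    pe[:, 1::2] = torch.cos(angle)                                 # Eq (2)
    return pe

class TextEncoder(nn.Module):
    """Transformer encoder over GloVe word vectors, sum-pooled into V_text (Eq 14)."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6, max_len=75):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.register_buffer("pe", sinusoidal_position_encoding(max_len, d_model))

    def forward(self, glove_embeddings: torch.Tensor) -> torch.Tensor:
        # glove_embeddings: (batch, L, d_model), produced by the GloVe lookup
        x = glove_embeddings + self.pe.unsqueeze(0)                # Eq (3)
        hidden = self.encoder(x)                                   # original vector set of the text
        return hidden.sum(dim=1)                                   # V_text, Eq (14)

# Example: TextEncoder()(torch.randn(2, 75, 512)) -> tensor of shape (2, 512)
```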
The supplied image is resized to 448 × 448 and sliced into 196 regions arranged in a 14 × 14 grid; the regions are denoted by $I_i$ (i = 1, 2, ..., 196). We employ the Vision Transformer (ViT) [39] to fully extract the visual features of the news image. As a pre-trained model, ViT outperforms state-of-the-art image classification models on various image classification datasets and is relatively cost-effective. Moreover, when pre-trained on large-scale datasets and transferred to classification tasks on small and medium-sized datasets, ViT outperforms CNNs [39]. Therefore, we use the pre-trained ViT model to obtain the feature vector $V_{region_i}$ of each region $I_i$, as shown in Eq (15). The ViT computation follows the same process as the transformer encoder, with N = 12 in ViT. These region feature vectors are referred to as the original vector set of the image.
$V_{region_i} = \mathrm{ViT}(I_i)$ | (15) |
As mentioned earlier, the image guidance vector is the result of pooling all the original vectors, as shown in Eq (16).
$V_{image} = \dfrac{1}{N_r}\sum_{i=1}^{N_r} V_{region_i}$ | (16) |
where Nr is the number of regions, which is set to 196 in this paper.
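As a rough sketch of this step, the snippet below obtains region features from a pre-trained ViT via the HuggingFace `transformers` library and mean-pools them into $V_{image}$ (Eqs (15)–(16)). The checkpoint name `google/vit-base-patch16-224` and the 224 × 224 input are illustrative substitutes: they happen to yield the same number of regions (196) as the 448 × 448 slicing described above, but they are not necessarily the configuration used in the paper.

```python
import torch
from transformers import ViTModel  # HuggingFace transformers package

# Illustrative checkpoint; a 224x224 input with 16x16 patches gives 196 patch tokens.
vit = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit.eval()

@torch.no_grad()
def image_guidance_vector(pixel_values: torch.Tensor) -> torch.Tensor:
    # pixel_values: (batch, 3, 224, 224), assumed already normalized
    tokens = vit(pixel_values=pixel_values).last_hidden_state   # (batch, 1 + 196, hidden)
    region_feats = tokens[:, 1:, :]                             # per-region vectors V_region_i, Eq (15)
    return region_feats.mean(dim=1)                             # V_image, Eq (16)

# Example: image_guidance_vector(torch.randn(2, 3, 224, 224)) -> tensor of shape (2, 768)
```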
To obtain the final feature representation of the news, we need to fuse the feature representations of the different modalities. Instead of simply concatenating the representations, we employ an attention mechanism to fully integrate the textual and visual representations into a multimodal representation. The attention mechanism has become a widely used component in deep learning for emphasizing the information most important to the current task among several inputs while ignoring insignificant information. Specifically, we compute an attention weight for each modality and create the final representation of the news via weighted averaging. To calculate the attention weight for modality m, we use a two-layer feedforward network with the following formula:
$\tilde{\alpha}_m = \mathrm{softmax}\big(W_{m2} \cdot \tanh(W_{m1} \cdot v_m + b_{m1}) + b_{m2}\big)$ | (17) |
where $v_m \in \{V_{text}, V_{image}\}$ represents the feature representation of modality m, $\tilde{\alpha}_m$ represents the attention weight of modality m, $W_{m1}$ and $W_{m2}$ are weight matrices, and $b_{m1}$ and $b_{m2}$ are bias terms. The feature representation of modality m is then converted into a fixed-length form $v'_m$ with the following formula:
$v'_m = \tanh(W_{m2} \cdot v_m + b_{m2})$ | (18) |
The news feature representation $v_f$ is then created as a weighted average of the feature representations of all modalities, using the formula below:
$v_f = \sum_{m \in \{text,\, image\}} \tilde{\alpha}_m v'_m$ | (19) |
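A minimal sketch of this late-fusion step (Eqs (17)–(19)) is shown below. The hidden size, the output size, and the use of separate projection weights for Eq (18) are illustrative choices, and the softmax is taken across the modalities; it is a sketch, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Attention-weighted fusion of per-modality guidance vectors (Eqs 17-19)."""
    def __init__(self, dims: dict, hidden: int = 128, out_dim: int = 256):
        super().__init__()
        # Two-layer scoring network per modality: W_m1, b_m1, W_m2, b_m2 in Eq (17)
        self.score = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, 1))
            for m, d in dims.items()})
        # Projection to a fixed-length vector v'_m, Eq (18)
        self.project = nn.ModuleDict({m: nn.Linear(d, out_dim) for m, d in dims.items()})

    def forward(self, feats: dict) -> torch.Tensor:
        scores = torch.cat([self.score[m](feats[m]) for m in feats], dim=-1)   # (batch, M)
        alpha = torch.softmax(scores, dim=-1)                                  # attention weights, Eq (17)
        projected = torch.stack(
            [torch.tanh(self.project[m](feats[m])) for m in feats], dim=1)     # (batch, M, out_dim)
        return (alpha.unsqueeze(-1) * projected).sum(dim=1)                    # v_f, Eq (19)

# Example:
# fusion = AttentionFusion({"text": 256, "image": 1024})
# v_f = fusion({"text": torch.randn(2, 256), "image": torch.randn(2, 1024)})  # -> (2, 256)
```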
The classifier is a three-layer MLP that takes the news feature representation $v_f$ obtained by late fusion as input for the final classification. We denote the classifier as $G_r(v_f, \Theta_r)$, where $\Theta_r$ denotes all parameters in the classifier, and its output $\tilde{y}_f$ is the probability that the news is fake.
$\tilde{y}_f = G_r(v_f, \Theta_r)$ | (20) |
The sigmoid activation function is used in the output layer to restrict the output to values between 0 and 1. By examining a significant amount of fake news detection data, we found that the texts and images of many fake news items are unrelated, because fake news writers often use captivating images that have nothing to do with the text in order to attract readers. Therefore, we believe that computing the similarity of features across modalities can enhance fake news detection, given the considerable differences between text and image features in such cases. To determine the degree of feature similarity, we map the feature representations of text and images to a new target space and compute:
$S(V_{text}, V_{image}) = \lVert M_1(V_{text}) - M_2(V_{image}) \rVert$ | (21) |
where S is the Euclidean distance between the two modal features in the target space, and $M_1$ and $M_2$ are two mapping functions, each consisting of a two-layer MLP, that map the text and image feature representations into the new target space. We denote the final predicted value as:
$\tilde{y}_f = \begin{cases} \tilde{y}_f + \alpha\, S(V_{text}, V_{image}) & \text{if } S(V_{text}, V_{image}) > \beta \\ \tilde{y}_f & \text{if } S(V_{text}, V_{image}) \le \beta \end{cases}$ | (22) |
If the Euclidean distance between the two modalities is greater than the threshold β, the classifier's prediction plus α times $S(V_{text}, V_{image})$ is used as the final prediction, where β and α are hyperparameters. The most effective values found in our experiments are β = 0.65 and α = 0.1. If the final prediction value is greater than or equal to 0.5, we predict the item as fake news; otherwise, we predict it as real news. For the classification loss we use cross entropy, calculated as follows:
$L_r(\Theta_r) = -y \log \tilde{y}_f - (1-y)\log(1-\tilde{y}_f)$ | (23) |
where y denotes the ground truth.
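Putting the classifier, the similarity mapping, and the threshold rule together, a minimal sketch of Eqs (20)–(23) could look as follows. The MLP layer sizes, the structure of the mapping networks $M_1$ and $M_2$, and the final clamp that keeps the adjusted score a valid probability for the cross-entropy loss are illustrative assumptions; α = 0.1 and β = 0.65 follow the values reported above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityAdjustedDetector(nn.Module):
    """Three-layer MLP classifier (Eq 20) with the distance-based adjustment of Eqs (21)-(22)."""
    def __init__(self, fused_dim=256, text_dim=256, image_dim=1024, map_dim=128,
                 alpha: float = 0.1, beta: float = 0.65):
        super().__init__()
        self.classifier = nn.Sequential(                 # G_r(v_f, Theta_r), Eq (20)
            nn.Linear(fused_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())
        # Two-layer MLP mappings M1 and M2 into the shared target space (illustrative sizes)
        self.map_text = nn.Sequential(nn.Linear(text_dim, map_dim), nn.ReLU(), nn.Linear(map_dim, map_dim))
        self.map_image = nn.Sequential(nn.Linear(image_dim, map_dim), nn.ReLU(), nn.Linear(map_dim, map_dim))
        self.alpha, self.beta = alpha, beta

    def forward(self, v_f, v_text, v_image):
        y_hat = self.classifier(v_f).squeeze(-1)                                   # Eq (20)
        s = torch.norm(self.map_text(v_text) - self.map_image(v_image), dim=-1)    # Eq (21)
        adjusted = torch.where(s > self.beta, y_hat + self.alpha * s, y_hat)       # Eq (22)
        return adjusted.clamp(0.0, 1.0)        # keep the score in [0, 1] (a sketch choice)

def detection_loss(y_pred: torch.Tensor, y_true: torch.Tensor) -> torch.Tensor:
    return F.binary_cross_entropy(y_pred, y_true.float())                          # cross entropy, Eq (23)
```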
In this section, we first introduce the dataset and parameter settings. Then we compare our proposed model TGA with several baselines and analyze the results of comparative experiments. Finally, we verify the effectiveness of each module of TGA by ablation study and dissect the impact of hyper-parameters by parameter sensitivity experiments.
The Weibo dataset utilized in this paper is retrieved from the DataFountain website (datafountain.cn). The multi-modal dataset is provided by the Beijing Municipal Bureau of Economy and Information Technology and the Big Data Expert Committee of the Chinese Computer Society and includes various fields such as Weibo texts, comments, images, and labels for three categories: "no judgment required", "fake news", and "real news". We selected only two labels: "fake news" and "real news". To clean up the dataset, we preserved only the Chinese characters of the Weibo text and removed content like emojis and meaningless symbols.
We also removed duplicate and low-quality images to ensure the dataset's quality. In this work, we focus on text and images, so text-only tweets were deleted, and only one image was kept for tweets with multiple images. After processing, the dataset contained 17,848 items of real and fake news in eight categories: science and technology, politics, the military, finance and business, social life, sports and entertainment, medical and health, and education and examination. Due to the limited amount of data in the last four fields, we used data from the first four fields only. All data in these four categories were merged, totaling 16,417 items, and randomly split into a training set (80%), a validation set (10%), and a test set (10%). Table 1 displays the dataset's specifics.
| Domain/Statistics | Fake news | Real news | Total |
|---|---|---|---|
| Finance | 428 | 350 | 778 |
| Society | 5642 | 5409 | 11,051 |
| Entertainment | 556 | 733 | 1299 |
| Health | 1756 | 1533 | 3289 |
| Training set | 6840 | 5007 | 11,847 |
| Test set | 564 | 427 | 991 |
The Twitter dataset [40] was released for the Verifying Multimedia Use task at MediaEval. In our experiments, we keep the same data split scheme as the benchmark [40]: the training set contains 6840 real tweets and 5007 fake tweets, and the test set contains 991 posts, including 564 real tweets and 427 fake tweets. We follow the same steps as for the Weibo dataset to remove duplicate and low-quality images and ensure the quality of the dataset.
The text feature extractor and image feature extractor produce output dimensions of 256 and 1024, respectively. The mapping function generates output dimensions of 128 for both text and image features. Furthermore, the text transformer implements multi-headed attention with eight heads, while the image transformer utilizes 16 heads. During training, we employ a batch size of 32, a learning rate of 0.001, and optimize the loss function using the Adam optimizer. To achieve faster convergence, we use a dynamic learning rate method. We record the F1-Score after each epoch and adjust the learning rate to 80% of the previous epoch's rate if the F1-Score does not improve from the previous epoch. Finally, we evaluate model performance using precision, recall, accuracy, and F1-Score.
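The dynamic learning-rate rule described above (shrink the learning rate to 80% of its previous value whenever the validation F1-Score fails to improve) can be reproduced, for example, with PyTorch's `ReduceLROnPlateau` scheduler; the tiny placeholder model and the hypothetical F1 values below only illustrate the scheduling behaviour and are not the actual training code.

```python
import torch
import torch.nn as nn

model = nn.Linear(256, 1)                     # placeholder standing in for the TGA network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.8, patience=0)   # 0.8x shrink as soon as F1 stops improving

f1_per_epoch = [0.85, 0.88, 0.88, 0.90]       # hypothetical validation F1-Scores per epoch
for f1 in f1_per_epoch:
    # ... one training epoch over batches of 32 with the Eq (23) loss would run here ...
    scheduler.step(f1)                        # monitor F1; reduce lr when it does not improve
    print(optimizer.param_groups[0]["lr"])
```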
To verify the effectiveness of our multimodal model, we compare it with the following baselines:
Unimodal Models
● CNN [41]: A CNN-based model that uses a CNN to extract image features and a three-layer neural network for classification.
● LSTM [42]: A text-only model that uses an LSTM to extract the textual features of news.
Multimodal Models
● EANN [6]: A model that uses a CNN-based extractor to extract text features and a VGG-19 network to extract image features.
● MVAE [4]: A model that extracts the text and image features of news and reconstructs the original image and text from the hidden-layer vectors.
● Spotfake+ [9]: A multimodal model that utilizes transfer learning to capture semantic and contextual information from news texts and their associated images.
● Att-RNN [17]: Combines textual, visual, and social context features via an attention mechanism.
● MCAN [35]: An end-to-end model that uses multiple co-attention layers to fuse image and text features and learn the interdependencies between the modalities.
● HMCAN [43]: Models the multimodal features of news with a multimodal contextual attention network so that information from different modalities complements each other.
Table 2 shows the performance of the baselines and our model on the Weibo and Twitter datasets; we can draw the following points from the experimental results:
| Dataset | Model | Precision | Recall | Accuracy | F1-Score |
|---|---|---|---|---|---|
| Weibo | CNN | 0.592 | 0.806 | 0.556 | 0.683 |
| Weibo | LSTM | 0.757 | 0.590 | 0.647 | 0.663 |
| Weibo | EANN | 0.925 | 0.736 | 0.740 | 0.820 |
| Weibo | MVAE | 0.931_ | 0.708 | 0.723 | 0.804 |
| Weibo | SpotFake+ | 0.871 | 0.871 | 0.870 | 0.871 |
| Weibo | att-RNN | 0.741 | 0.777 | 0.772 | 0.723 |
| Weibo | MCAN | 0.899 | 0.899 | 0.902_ | 0.900_ |
| Weibo | HMCAN | 0.888 | 0.885 | 0.885 | 0.885 |
| Weibo | TGA | 0.969 | 0.886_ | 0.922 | 0.925 |
| Twitter | CNN | 0.452 | 0.539 | 0.425 | 0.479 |
| Twitter | LSTM | 0.554 | 0.431 | 0.511 | 0.523 |
| Twitter | EANN | 0.745 | 0.748 | 0.745 | 0.744 |
| Twitter | MVAE | 0.697 | 0.627 | 0.688 | 0.639 |
| Twitter | att-RNN | 0.691 | 0.692 | 0.662 | 0.682 |
| Twitter | MCAN | 0.841 | 0.847 | 0.889_ | 0.849 |
| Twitter | HMCAN | 0.876_ | 0.888 | 0.878 | 0.875_ |
| Twitter | TGA | 0.912 | 0.854_ | 0.918 | 0.918 |
● Multi-modal models perform significantly better than Unimodal models, which indicates the effectiveness of detecting fake news using multi-modal information.
● Spotfake+ outperforms att-RNN by utilizing pre-trained feature extractors, because pre-training typically improves a model's generalization ability and speeds up its convergence on the target task.
● HMCAN is superior to Spotfake+ after modality augmentation with a contextual attention network, which indicates the effectiveness of the attention mechanism in fake news detection.
● We can observe that, on both datasets, MCAN performs noticeably better than HMCAN. MCAN uses two feature extractors to fully extract image features, which not only highlights the significance of attention mechanisms in multi-modal fusion but also emphasizes the large contribution of image features to rumor detection.
● Our proposed model TGA outperforms the best baseline, MCAN. Although MCAN uses co-attention in multimodal fusion, it ignores the importance of the degree of feature similarity between modalities for rumor detection, so it does not detect fake news as well as TGA. This further proves that our feature extractors are superior to traditional CNN- and RNN-based extractors and illustrates the significant role of multimodal feature similarity in rumor detection.
For a more visual representation of the comparison experiment results, we plotted the line graphs depicted in Figures 3 and 4, where the horizontal axis shows the comparison models and the vertical axis represents the values of the four evaluation metrics.
In order to verify the effectiveness of each module of TGA, we compare each of the following variants with TGA:
● TGA-T: Only text is used, the image feature part is deleted.
● TGA-I: Only the image is used, the text feature part is deleted.
● TGA-A: The attention-based fusion is removed and the features of the two modalities are directly concatenated.
● TGA-L: The transformer is replaced with an LSTM in the text feature extractor.
● TGA-R: The ViT is replaced with ResNet-50 in the image feature extractor.
● TGA-M: The feature-similarity adjustment is removed.
Table 3 shows the experimental results of several variants and we can obtain the following points:
| Dataset | Model | Precision | Recall | Accuracy | F1-Score |
|---|---|---|---|---|---|
| Weibo | TGA-T | 0.669 | 0.670 | 0.660 | 0.664 |
| Weibo | TGA-I | 0.895 | 0.867 | 0.880 | 0.880 |
| Weibo | TGA-A | 0.914_ | 0.914 | 0.906_ | 0.914_ |
| Weibo | TGA-L | 0.905 | 0.878 | 0.891 | 0.892 |
| Weibo | TGA-R | 0.790 | 0.823 | 0.906_ | 0.806 |
| Weibo | TGA-M | 0.886 | 0.859 | 0.867 | 0.872 |
| Weibo | TGA | 0.969 | 0.886_ | 0.922 | 0.925 |
| Twitter | TGA-T | 0.548 | 0.557 | 0.548 | 0.550 |
| Twitter | TGA-I | 0.745 | 0.769 | 0.772 | 0.776 |
| Twitter | TGA-A | 0.887 | 0.891 | 0.884 | 0.896_ |
| Twitter | TGA-L | 0.896_ | 0.847 | 0.897_ | 0.902 |
| Twitter | TGA-R | 0.735 | 0.796 | 0.870 | 0.756 |
| Twitter | TGA-M | 0.842 | 0.793 | 0.814 | 0.857 |
| Twitter | TGA | 0.912 | 0.854_ | 0.918 | 0.918 |
● TGA outperforms all variants, which indicates the effectiveness of each module of TGA.
● TGA-T and TGA-I have the worst performance among all variants, proving that multimodal detection is superior to unimodal detection.
● TGA-I is superior to TGA-T, which shows that the image-based variant is more effective than the text-based variant. This is because it is difficult to distinguish true from false news from the text alone, as texts in the same field usually contain many similar field-specific terms. In contrast, fake news is often created by attaching images unrelated to the content in order to attract attention; when the images in a news item do not match the field the item belongs to, the item is easily identified as fake. For this reason, image-based detection is often more effective than text-based detection when dealing with fake news within the same field.
● TGA outperforms TGA-A, which indicates the effectiveness of the attention mechanism in multimodal fusion, since the attention mechanism helps the model focus on the most important information.
● TGA-L is inferior to TGA, indicating that the transformer is better than the traditional RNN-based extractor for extracting text features. Similarly, TGA-R is inferior to TGA, which shows that ViT is better than the traditional CNN-based extractor for extracting image features.
● TGA outperforms TGA-M, because multimodal feature similarity provides the degree of matching between modalities and thus enhances the model's capability, also demonstrating that the level of semantic matching between modalities significantly impacts fake news detection.
Additionally, we utilized bar charts, as presented in Figures 5 and 6, to illustrate the results of the ablation experiment in a clearer manner.
The results of our experiments are highly sensitive to the chosen hyperparameters. To provide insights into their effects on the experimental outcomes, we showcase selected hyperparameter results in Figures 7–9. Notably, we conducted all hyperparameter experiments exclusively on the Weibo dataset.
Figure 7 shows the impact of the weight α on the experimental results. α controls how strongly the feature-similarity value of the two modal features contributes to the final result: the final prediction is the classifier's output plus α times the feature-similarity value. The experimental results show that the larger α is, the greater the influence of the feature similarity. Setting α to 0.1 gives the best model performance.
Figure 8 shows the impact of the threshold β on the experimental results. Our experiments show that the optimal performance is achieved when β is set to 0.65. When the feature-similarity value of the two modalities exceeds β, we consider the features of the two modalities to be significantly mismatched, and we add α times the feature-similarity value to the classifier's prediction when judging whether the news is fake.
In Figure 9, we illustrate the impact of the word embedding dimension on our experimental outcomes. We observed that a word embedding dimension of 32 yields the best results for our model. Our analysis suggests that when the word embedding dimension is below 32, the vector representation of the words is insufficient to capture word features accurately. As we increase the word embedding dimension beyond 32, the language's inherent ambiguity amplifies, leading to overfitting.
In this paper, we propose a transformer-based multimodal model, TGA, to study the problem of detecting multimodal fake news. Specifically, we use different types of transformers to extract textual and image features and employ an attention mechanism to fuse the multimodal features at a late stage. In addition, we calculate the semantic matching degree between the modal features to improve detection. Experimental results on real datasets show that our proposed model outperforms existing multimodal models. In future work, we will consider improving TGA for cross-domain news detection.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work was supported by the Natural Science Foundation of Heilongjiang Province in China (No. LH2020F043).
The authors declare there is no conflict of interest.