
Multimodal sentiment analysis is an important area of artificial intelligence. It integrates multiple modalities such as text, audio, video and image into a compact multimodal representation and obtains sentiment information from them. In this paper, we improve two modules, i.e., feature extraction and feature fusion, to enhance multimodal sentiment analysis and finally propose an attention-based two-layer bidirectional GRU (AB-GRU, gated recurrent unit) multimodal sentiment analysis method. For the feature extraction module, we use a two-layer bidirectional GRU network and connect two layers of attention mechanisms to enhance the extraction of important information. The feature fusion part uses low-rank multimodal fusion, which can reduce the multimodal data dimensionality and improve the computational rate and accuracy. The experimental results demonstrate that the AB-GRU model can achieve 80.9% accuracy on the CMU-MOSI dataset, which exceeds the same model type by at least 2.5%. The AB-GRU model also possesses a strong generalization capability and solid robustness.
Citation: Jun Wu, Xinli Zheng, Jiangpeng Wang, Junwei Wu, Ji Wang. AB-GRU: An attention-based bidirectional GRU model for multimodal sentiment fusion and analysis[J]. Mathematical Biosciences and Engineering, 2023, 20(10): 18523-18544. doi: 10.3934/mbe.2023822
[1] | Ziyue Wang, Junjun Guo . Self-adaptive attention fusion for multimodal aspect-based sentiment analysis. Mathematical Biosciences and Engineering, 2024, 21(1): 1305-1320. doi: 10.3934/mbe.2024056 |
[2] | Yi Liu, Jiahuan Lu, Jie Yang, Feng Mao . Sentiment analysis for e-commerce product reviews by deep learning model of Bert-BiGRU-Softmax. Mathematical Biosciences and Engineering, 2020, 17(6): 7819-7837. doi: 10.3934/mbe.2020398 |
[3] | Hongli Niu, Kunliang Xu . A hybrid model combining variational mode decomposition and an attention-GRU network for stock price index forecasting. Mathematical Biosciences and Engineering, 2020, 17(6): 7151-7166. doi: 10.3934/mbe.2020367 |
[4] | Ning Huang, Zhengtao Xi, Yingying Jiao, Yudong Zhang, Zhuqing Jiao, Xiaona Li . Multi-modal feature fusion with multi-head self-attention for epileptic EEG signals. Mathematical Biosciences and Engineering, 2024, 21(8): 6918-6935. doi: 10.3934/mbe.2024304 |
[5] | Quan Zhu, Xiaoyin Wang, Xuan Liu, Wanru Du, Xingxing Ding . Multi-task learning for aspect level semantic classification combining complex aspect target semantic enhancement and adaptive local focus. Mathematical Biosciences and Engineering, 2023, 20(10): 18566-18591. doi: 10.3934/mbe.2023824 |
[6] | Xiaotong Ji, Dan Liu, Ping Xiong . Multi-model fusion short-term power load forecasting based on improved WOA optimization. Mathematical Biosciences and Engineering, 2022, 19(12): 13399-13420. doi: 10.3934/mbe.2022627 |
[7] | Zhijing Xu, Yang Gao . Research on cross-modal emotion recognition based on multi-layer semantic fusion. Mathematical Biosciences and Engineering, 2024, 21(2): 2488-2514. doi: 10.3934/mbe.2024110 |
[8] | Yuanyao Lu, Kexin Li . Research on lip recognition algorithm based on MobileNet + attention-GRU. Mathematical Biosciences and Engineering, 2022, 19(12): 13526-13540. doi: 10.3934/mbe.2022631 |
[9] | Mohanad A. Deif, Ahmed A. A. Solyman, Mehrdad Ahmadi Kamarposhti, Shahab S. Band, Rania E. Hammam . A deep bidirectional recurrent neural network for identification of SARS-CoV-2 from viral genome sequences. Mathematical Biosciences and Engineering, 2021, 18(6): 8933-8950. doi: 10.3934/mbe.2021440 |
[10] | Qiao Xiang, Tianhong Huang, Qin Zhang, Yufeng Li, Amr Tolba, Isack Bulugu . A novel sentiment analysis method based on multi-scale deep learning. Mathematical Biosciences and Engineering, 2023, 20(5): 8766-8781. doi: 10.3934/mbe.2023385 |
Artificial intelligence has been a focus of Internet technology development in recent years, enabling computers to imitate human behavior more intelligently. Within this field, sentiment analysis is a technical difficulty that artificial intelligence urgently needs to overcome. It allows computers to go beyond purely mechanical processing and more closely resemble human thinking patterns. Sentiment analysis shapes the development prospects of human-computer interaction and opens the way forward for information technology in the new era. However, going from simple binary computation to complex and variable human thinking is difficult. Common sentiment analysis methods are built on text data because textual content reflects emotional value well. Some scholars have used parallel convolutional neural networks (CNN) [1] and recurrent neural networks (RNN) [2] for sentiment analysis of text. Wang et al. proposed an iterative algorithm called SentiDiff to predict sentiment polarities expressed in Twitter messages [3]. Hassonah et al. offered a hybrid machine learning approach to enhance sentiment analysis [4].
However, a single text modality can no longer provide complete information in the face of complex data types. On social platforms such as Weibo, Twitter and WeChat Moments, people share a large amount of information, such as text, emoticons, images, audio and video, to express their emotions in multiple ways. This information also provides a rich data source for multimodal sentiment analysis. Audio data contains information such as the volume and tone of the voice. Image data contains information such as people's facial expressions and the color tones of images. All of this can help the text content express human emotions better and improve the accuracy with which computers judge emotional polarity. Huddar et al. presented a novel attention-based multimodal contextual fusion strategy that extracts contextual information among the utterances before fusion [5]. Jiang et al. proposed a model that uses an interactive information fusion mechanism to interactively learn the visual-specific and textual-specific visual representations [6]. Zadeh et al. proposed the multi-attention recurrent network (MARN) [7], a novel neural structure for understanding human communication using multi-headed attention modules and mixed long- and short-term memory networks.
Multimodal sentiment analysis methods still have many problems. Regarding feature extraction, the commonly used reinforcement learning methods often operate at the word level, which ignores the information interaction between words. Most neural network models are also based on single-layer LSTM (long short-term memory), which struggles to extract deep features from complex data. Regarding feature fusion, there is a heterogeneity gap between modalities: their data lie in different distribution spaces and are difficult to compare directly. Existing models are often limited by the exponential growth in computational and memory costs associated with tensor representations, and it is challenging to extend fusion to multiple modalities while maintaining reasonable model complexity. This paper proposes an attention-based two-layer bidirectional GRU (AB-GRU) multimodal sentiment analysis method to address these problems.
First, we preprocess the text, audio and image data and input the processed data vectors separately into the two-layer bidirectional GRU network for feature extraction. We then pass the GRU outputs to the attention module to extract the important information and perform single-modality feature fusion. Second, we input the extracted multimodal feature vectors into low-rank multimodal fusion (LMF) [8] for feature fusion. We append a constant 1 to the feature vector of each of the three modalities and combine the three vectors into a three-dimensional Cartesian product tensor, which is then mapped back to a low-dimensional output vector. Finally, we map the result into the sample space to obtain the sentiment polarity.
This paper aims to learn human emotion polarity using feature extraction techniques and multimodal fusion techniques to learn how emotions are expressed from features such as the content of the text, the priority of audio and human facial expressions in images to achieve more efficient and accurate emotion analysis. The main contributions of this paper are as follows:
1) A two-layer bidirectional GRU network is used to extract multimodal features, which can effectively learn the association of ordered data like text and audio on time series and improve the accuracy of feature extraction. GRU can streamline the gating mechanism and enhance learning efficiency.
2) Connecting two attention mechanisms after a two-layer bidirectional GRU network can capture important information in the feature vector and enhance the learning efficiency of modal features.
3) LMF converts multiple inputs into a high-dimensional tensor and maps them back to a low-dimensional vector, which can effectively improve the efficiency of operations. Different modalities are decoupled from each other so that the model can extend to data with an arbitrary number of modalities.
The world comprises countless complex and varied elements and humans can perceive them through sight, hearing, smell, taste and touch to obtain rich knowledge and information. With the development of Internet technology, scholars are also working on making computers learn to imitate this unique way of human information reception, which is also the research direction of artificial intelligence. The research in this direction has obtained excellent results and has been successfully applied in many fields, such as natural language processing, image recognition, recommendation systems and target detection.
Combining the unimodal learning algorithms and techniques of artificial intelligence in different fields, scholars have opened the research on multimodal fusion methods. Multimodal fusion [9,10] aims to understand and process several different kinds of modal information, including different modalities such as text, audio, image and video, by machine learning to achieve the task of prediction or classification. In the process of data processing by machines, the data of a single modality usually cannot contain complete information and it is challenging for the learning of a single modality to achieve accurate prediction or quickly produce local optimal solutions, so multiple modal data are introduced for fusion learning to improve the learning efficiency. The basic principle of multimodal fusion is the fusion of features of different modal data, i.e., the features of the input data are extracted first. Fusion methods fuse the extracted features of different modalities. Finally, the fused features are input into models such as classification or prediction according to the requirements to obtain the output results. As shown in Figure 1, multimodal fusion methods are divided into early, late and hybrid fusion methods according to the fusion time.
There are already mature research results on multimodal fusion techniques for different needs. For example, Radford et al. proposed the CLIP model [11], whose structure consists mainly of a text encoder and an image encoder and which matches text with images by calculating the similarity between their vectors. Zadeh et al. proposed the tensor fusion model (TFN) [12], which takes unimodal features as input and explicitly models unimodal, bimodal and trimodal interactions through a 3-fold Cartesian product of the modality embeddings. The memory fusion network (MFN) [13] assigns each view its own LSTM component, encodes each view independently and models the interactions across views over time.
The use of multimodal fusion techniques for sentiment analysis is also the focus of this research paper. Textual content usually expresses human emotions directly but not comprehensively. Human language is very complex. For example, irony, mockery, rhetorical questions and other emotionally contradictory statements are complex for computers to understand accurately. We, therefore, resort to audio and image data to assist computers in understanding and classifying emotions. The voice can reflect whether the speaker is anxious or relaxed, and the tone can reflect whether the speaker is angry or calm. All information is contained in the audio data. Vision data can visually represent people's facial expressions and body movements and even the color shades of photos can reflect the photographer's emotion. This information together forms the database of multimodal technology and achieves a more intelligent and accurate multimodal emotion analysis.
In multimodal sentiment analysis methods, the effectiveness of feature extraction directly affects the downstream tasks. Commonly used feature extraction models are CNN [1], RNN [14,15], LSTM [16,17] and the newer transformer [18] and BERT [19,20]. The sentiment analysis task is mainly based on the text modality. Compared to other modalities, the text modality often contains the richest and most specific sentiment information, while the other modalities play an auxiliary and corrective role. Therefore, among multimodal sentiment analysis methods, the most commonly used feature extraction method is LSTM, which determines the current output by introducing state variables that store past information alongside the current inputs. Through its gating mechanism, LSTM also mitigates the problem that RNNs are dominated by short-term memory, so that long sequences can rarely be retained completely and essential information is easily missed. Wei et al. proposed a BiLSTM model with multi-polarity orthogonal attention for implicit sentiment analysis [21]. Zhang et al. proposed a recurrent attention LSTM neural network to achieve sentiment analysis by iteratively locating attention regions covering key sentiment words [22].
The GRU [23,24,25] network used in this paper simplifies the internal structure of LSTM, yielding a smaller network model with fewer parameters while improving accuracy.
The structure of GRU is shown in Figure 2, which is mainly composed of a reset gate and an update gate.
The reset gate determines how much information from the hidden state of the previous moment is forgotten, and thereby how the retained information is combined with the new input.
$$r_t = \sigma\left(W_r \cdot [h_{t-1}, x_t]\right) \tag{2.1}$$
$$\tilde{h}_t = \tanh\left(W_{\tilde{h}} \cdot [r_t \times h_{t-1}, x_t]\right) \tag{2.2}$$
where $x_t$ is the current input, $h_{t-1}$ is the hidden state saved at the previous moment, $W$ denotes the weights and $\sigma$ is the sigmoid activation function, which compresses values to between 0 and 1. $\tanh$ is the hyperbolic tangent activation function, which compresses values to between –1 and 1.
$r_t$ scales how much of the previous hidden state $h_{t-1}$ enters the candidate state alongside the input $x_t$. Its value ranges from 0 to 1: the smaller the value, the more of the previous state is forgotten and the more the candidate state relies on the current input. $\tilde{h}_t$ is the candidate hidden state. Reset gates help to capture short-term dependencies in the sequence.
Update gates are used to process long-term information, decide how much information from the hidden state of the last moment needs to be remembered and pass the remembered information down the line.
$$z_t = \sigma\left(W_z \cdot [h_{t-1}, x_t]\right) \tag{2.3}$$
$$h_t = (1 - z_t) \times h_{t-1} + z_t \times \tilde{h}_t \tag{2.4}$$
$z_t$ adjusts how much historical information is preserved. It takes values from 0 to 1: the smaller the value, the more of the previous hidden state $h_{t-1}$ is preserved. $h_t$ is the new hidden state, which carries the retained information on to the next time step.
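The gate equations (2.1) to (2.4) can be traced in a few lines of code. The following is a minimal sketch using only the Python standard library; the helper names (`sigmoid`, `matvec`, `gru_step`), the toy weights and the omission of bias terms are illustrative assumptions, not the implementation used in the experiments.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, x, W_r, W_z, W_h):
    """One GRU step following Eqs. (2.1)-(2.4); bias terms are omitted (assumption)."""
    concat = h_prev + x                                  # [h_{t-1}, x_t]
    r = [sigmoid(a) for a in matvec(W_r, concat)]        # reset gate, Eq. (2.1)
    z = [sigmoid(a) for a in matvec(W_z, concat)]        # update gate, Eq. (2.3)
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x   # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(a) for a in matvec(W_h, gated)]  # candidate state, Eq. (2.2)
    # New hidden state, Eq. (2.4): small z_t keeps more of h_{t-1}.
    return [(1 - zi) * hp + zi * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

# Toy example (hypothetical values): hidden size 2, input size 1.
W_r = [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]]
W_z = [[0.2, 0.1, -0.3], [-0.4, 0.0, 0.5]]
W_h = [[0.5, -0.5, 0.2], [0.3, 0.1, -0.2]]
h = [0.0, 0.0]
for x_t in ([1.0], [0.5], [-1.0]):
    h = gru_step(h, x_t, W_r, W_z, W_h)
print(h)
```

Because $h_t$ is a convex combination of $h_{t-1}$ and a tanh output, every component stays inside (–1, 1) when the initial state is zero. Driving the update gate toward 0 leaves the hidden state essentially unchanged, which matches the description of $z_t$ above.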
The two-layer bidirectional GRU model used in this paper extends the ordinary GRU model in two ways: it combines two propagation directions, positive (forward) and negative (backward), and it stacks two bidirectional GRU layers. The text and audio modalities are sequential, and earlier content affects the expression of later content to a certain extent, so the content captured by positive and negative propagation differs. The bidirectional propagation method used in this paper learns the positive and negative propagation features separately and saves the corresponding hidden states. In the attention module, we combine positive features with the positive hidden state and negative features with the negative hidden state to focus on the critical information and obtain more targeted modal features.
For complex modal information, this paper stacks two bidirectional GRU layers to achieve more accurate feature extraction. Because the internal structure of GRU is simpler than that of LSTM, the two-layer stack still retains a high computing rate.
Different fusion methods fit various tasks. Common multimodal tasks are cross-modal retrieval, sentiment analysis, and audio-visual recognition [26]. CLIP targets cross-modal retrieval tasks, which enables image and text matching. GLCM [27] is a self-supervised method for learning audiovisual representations, which can generalize to both the tasks which require global semantic information and the tasks that require fine-grained spatio-temporal information. In this paper, we investigate the sentiment analysis of multimodal book data [28] and propose an attention-based two-layer bi-directional GRU model, which outperforms most of its current counterparts on sentiment classification tasks.
Compared with existing models, AB-GRU has better classification accuracy and model complexity and can extend to data with an arbitrary number of modalities. Compared with the current outstanding CLIP [11], AB-GRU can better target data with more than two modalities and has low model complexity, few parameters and a high training rate. Compared with traditional TFN [12], AB-GRU uses a stacked GRU network in the feature extraction module and connects two attention layers to enhance the capture of important information. In the feature fusion module, AB-GRU decomposes the weights into low-rank factors to reduce the number of parameters in the model and improve the computation rate. GRU [29] has a wide range of applications in the field of deep learning. For text data and audio data with temporal characteristics, GRU can better learn its features, has a simple structure and a small number of parameters. GRU can greatly improve the operation rate in complex multimodal tasks.
The AB-GRU model used in this paper is shown in Figure 3, which consists of a combination of four main modules: input module, feature extraction module, feature fusion module and output module.
1) The input module is used to pre-process the multimodal sentiment analysis data, and the data types used in this paper include text, audio and image data.
2) The feature extraction module, which is the focus of improvement in this paper, uses a two-layer bidirectional GRU model based on attention [30] to extract features from the data of the three modalities and obtain the corresponding three feature vectors.
3) The feature fusion module uses LMF for feature fusion to obtain the fused 3D model, which is then mapped back to the low-dimensional output vector.
4) The output module also contains a decision layer. The low-dimensional output vector obtained in the previous step is mapped to the decision layer to obtain the final output by passing the corresponding single-valued output through the fully connected layer.
We preprocess the text, audio and image data and then use P2FA to align the three modalities at word granularity. This yields the text data vector $T = (t_1, t_2, \ldots, t_n)$, where $n$ is the length of the text vector; the audio data vector $A = (a_1, a_2, \ldots, a_m)$, where $m$ is the length of the audio vector; and the vision data vector $V = (v_1, v_2, \ldots, v_l)$, where $l$ is the length of the vision vector.
The first step in the feature extraction module is to input three modalities, text T=(t1,t2,…,tn), audio A=(a1,a2,…,am), and vision V=(v1,v2,…,vl), into the attention-based two-layer bidirectional GRU network. Figure 4 shows the feature extraction process of the text.
In the second step, we input the text vector $T = (t_1, t_2, \ldots, t_n)$ into the bidirectional GRU network for learning. The input information undergoes update and forget operations in each GRU cell. The result is then fed into a second GRU network layer, which repeats the above steps. These processes are shown in the second and third modules of Figure 4. Finally, we obtain the positive hidden layer state $h_t^{+}$ of the text, the negative hidden layer state $h_t^{-}$ and the output $G_T = (G_{t1}, G_{t2}, \ldots, G_{tn})$ of the text after the GRU.
Since this paper uses a bidirectional GRU network, output GT comprises positive and negative propagation processes, so GT can be decomposed into positive output GT+ and negative output GT−.
Similarly, the audio vector passes through the double-layer bidirectional GRU network to obtain the positive hidden layer state ha+, the negative hidden layer state ha− and the output GA=(Ga1,Ga2,…,Gam), which can be decomposed into the positive output GA+ and the negative output GA−. The vision vector passes through the two-layer bidirectional GRU network to obtain the positive hidden layer state hv+, the negative hidden layer state hv− and the output GV=(Gv1,Gv2,…,Gvl), which can be decomposed into the positive output GV+ and the negative output GV−.
In the third step, we put ht+, ht−, GT+ and GT− into the attention module, the third module in Figure 4. The input content goes through the first layer of attention mechanism Attention1 for unimodal feature fusion: positive hidden features are combined with positive output and negative hidden features with negative output. Then, the attention mechanism learns the critical information of positive and negative directions respectively. Finally, the positive features of the text and the negative features of the text are obtained as follows:
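$$F_T^{+} = \sum \mathrm{softmax}\left[\mathrm{relu}\left(h_t^{+} \times W_T^{+}\right)\right] \times \left[\mathrm{relu}\left(\tanh\left(G_T^{+} \times W_T^{+}\right)\right)\right] \times G_T^{+} \tag{3.1}$$
$$F_T^{-} = \sum \mathrm{softmax}\left[\mathrm{relu}\left(h_t^{-} \times W_T^{-}\right)\right] \times \left[\mathrm{relu}\left(\tanh\left(G_T^{-} \times W_T^{-}\right)\right)\right] \times G_T^{-} \tag{3.2}$$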
where $F_T^{+}$ is the positive feature obtained from the text feature vector after Attention1 and $F_T^{-}$ is the negative feature obtained from the text feature vector after Attention1. $W_T^{+}$ and $W_T^{-}$ are the parameter matrices to be learned, and relu and tanh are the activation functions.
Then FT+ and FT− are input into the second layer of attention mechanism Attention2, to combine the positive and negative features and learn the weights of positive and negative features to obtain the full text features:
$$F_T = F_T^{+} \times \theta_T + F_T^{-} \times (1 - \theta_T) \tag{3.3}$$
where $F_T$ is the final text feature obtained from the text vector by the attention-based two-layer bidirectional GRU model and $\theta_T$ is the weight to be learned.
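The two attention layers for one modality can be sketched as follows. Equations (3.1) to (3.3) do not fix the tensor shapes, so this sketch adopts one possible reading as an assumption: the hidden state is a $d$-dimensional vector, the GRU output is a sequence of $d$-dimensional vectors, the parameter matrix is $d \times d$, all products other than the matrix products are element-wise, and the sum runs over time steps. The function names, the toy values and the use of the last (or first) output as the hidden state are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def softmax(v):
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]

def vecmat(v, W):
    # Row vector times matrix.
    return [sum(v[i] * W[i][j] for i in range(len(v))) for j in range(len(W[0]))]

def attention1(h, G, W):
    """One direction of Eq. (3.1)/(3.2) under the shape assumptions stated above."""
    weights = softmax(relu(vecmat(h, W)))        # softmax[relu(h x W)]
    out = [0.0] * len(h)
    for g_t in G:                                # sum over time steps
        gate = relu([math.tanh(a) for a in vecmat(g_t, W)])  # relu(tanh(G x W))
        for j in range(len(out)):
            out[j] += weights[j] * gate[j] * g_t[j]
    return out

def attention2(f_pos, f_neg, theta):
    """Eq. (3.3): weighted combination of the positive and negative features."""
    return [p * theta + n * (1.0 - theta) for p, n in zip(f_pos, f_neg)]

# Toy example with d = 2 and two time steps (hypothetical values).
W = [[0.5, -0.2], [0.1, 0.3]]
G_pos = [[0.2, 0.4], [0.6, -0.1]]
G_neg = [[0.3, 0.1], [-0.2, 0.5]]
f_pos = attention1(G_pos[-1], G_pos, W)  # last forward output stands in for h+
f_neg = attention1(G_neg[0], G_neg, W)   # first backward output stands in for h-
f_text = attention2(f_pos, f_neg, theta=0.6)
print(f_text)
```

In this reading, the first layer weights each direction by its own hidden state, and the second layer reduces to a single learned scalar that trades off the two directions.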
For the audio modality, we combine the audio feature GA=(Ga1,Ga2,…,Gam), the positive output GA+ and the negative output GA− of the audio feature and the positive hidden layer state ha+ and the negative hidden layer state ha− of the audio into the first attention mechanism Attention1: Combining the positive hidden feature with the positive output and the negative hidden feature with the negative output. Finally we obtain the positive feature and the negative feature of the audio:
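$$F_A^{+} = \sum \mathrm{softmax}\left[\mathrm{relu}\left(h_a^{+} \times W_A^{+}\right)\right] \times \left[\mathrm{relu}\left(\tanh\left(G_A^{+} \times W_A^{+}\right)\right)\right] \times G_A^{+} \tag{3.4}$$
$$F_A^{-} = \sum \mathrm{softmax}\left[\mathrm{relu}\left(h_a^{-} \times W_A^{-}\right)\right] \times \left[\mathrm{relu}\left(\tanh\left(G_A^{-} \times W_A^{-}\right)\right)\right] \times G_A^{-} \tag{3.5}$$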
where $F_A^{+}$ is the positive feature obtained from the audio feature vector after Attention1 and $F_A^{-}$ is the negative feature obtained from the audio feature vector after Attention1. $W_A^{+}$ and $W_A^{-}$ are the parameter matrices to be learned, and relu and tanh are the activation functions.
Then FA+ and FA− are input into the second layer of attention mechanism Attention2, to combine the positive and negative features and learn the weights of positive and negative features to get the complete audio features:
$$F_A = F_A^{+} \times \theta_A + F_A^{-} \times (1 - \theta_A) \tag{3.6}$$
where FA is the final audio feature obtained by passing the audio vector through the attention-based two-layer bidirectional GRU model and θA is the weight to be learned.
For the vision modality, we combine the vision feature GV=(Gv1,Gv2,…,Gvl), the positive output GV+ and the negative output GV− of the vision feature and the positive hidden layer state hv+ and the negative hidden layer state hv− of the vision into the first attention mechanism Attention1: combining the positive hidden feature with the positive output and the negative hidden feature with the negative output. Finally, we obtain the positive feature and the negative feature of the vision:
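$$F_V^{+} = \sum \mathrm{softmax}\left[\mathrm{relu}\left(h_v^{+} \times W_V^{+}\right)\right] \times \left[\mathrm{relu}\left(\tanh\left(G_V^{+} \times W_V^{+}\right)\right)\right] \times G_V^{+} \tag{3.7}$$
$$F_V^{-} = \sum \mathrm{softmax}\left[\mathrm{relu}\left(h_v^{-} \times W_V^{-}\right)\right] \times \left[\mathrm{relu}\left(\tanh\left(G_V^{-} \times W_V^{-}\right)\right)\right] \times G_V^{-} \tag{3.8}$$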
where FV+ is the positive feature obtained from the vision feature vector after Attention1, FV− is the negative feature obtained from the vision feature vector after Attention1, WV is the parameter matrix to be learned and relu and tanh are the activation functions.
Then $F_V^{+}$ and $F_V^{-}$ are input into the second layer of attention mechanism Attention2 to combine the positive and negative features and learn their weights, yielding the complete vision features:
$$F_V = F_V^{+} \times \theta_V + F_V^{-} \times (1 - \theta_V) \tag{3.9}$$
where FV is the final image feature obtained by passing the image vector through the attention-based two-layer bidirectional GRU model and θV is the weight to be learned.
The final feature extraction module gets the outputs: text feature vector FT = (Ft1, Ft2, …, Ftn), audio feature vector FA=(Fa1,Fa2,…,Fam) and vision feature vector FV=(Fv1,Fv2,…,Fvl).
In particular, a fully connected layer is added after the text modality's attention module to reduce the text features' dimensionality. The size of the fully connected layer is the same as the FT dimension and uses Sigmoid as the activation function.
The low-rank multimodal fusion (LMF) used in the feature fusion module is a method that uses low-rank weight tensors to make multimodal fusion efficient without compromising performance. Tensors are highly expressive and can model the alignment and fusion between different modalities well. LMF builds on the tensor fusion model (TFN) [12]; it differs from TFN in that it decomposes the weights applied to the fused multidimensional tensor into low-rank factors, which reduces the number of parameters in the model. By decomposing the low-rank weight tensor and the input tensor in parallel, tensor-based fusion can be computed efficiently; it is more efficient than simple concatenation or pooling and scales linearly with the number of modalities.
In LMF, multimodal fusion can be described as a multilinear function of:
$$f: D_1 \times D_2 \times \cdots \times D_N \to H \tag{3.10}$$
where $D_1, D_2, \cdots, D_N$ are the vector spaces of the input modalities, $N$ is the number of modalities and $H$ is the output vector space.
Multimodal fusion aims to encode the unimodal information of $N$ different modalities and assemble it into a compact multimodal representation. In this paper, we use tensor fusion: a constant 1 is appended to each modality vector so that the multimodal interaction information is stored, and a high-dimensional tensor $Z_N$ containing all the modalities is then obtained. The tensor is usually computed as the outer product of the input modalities.
$$Z_N = \bigotimes_{n=1}^{N} z_n \tag{3.11}$$
where $\bigotimes_{n=1}^{N}$ denotes the tensor outer product over a set of vectors indexed by $n$, and $z_n$ is the input representation of the $n$-th modality with the constant 1 appended. As shown in Figure 5, two modalities with the appended 1 form a two-dimensional tensor $Z_2$, which is then multiplied by the weight tensor $W_2$ and mapped to a low-dimensional output vector.
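A small numerical example shows why the appended 1 matters. The sketch below builds the two-modality tensor $Z_2$ for toy feature vectors; the helper names and the feature values are illustrative assumptions.

```python
def append_one(v):
    # Append the constant 1 to a modality feature vector.
    return list(v) + [1.0]

def outer(u, v):
    # Outer product of two vectors as a nested list (rows indexed by u).
    return [[a * b for b in v] for a in u]

text = [2.0, 3.0]   # hypothetical text features
audio = [5.0]       # hypothetical audio feature

Z2 = outer(append_one(text), append_one(audio))
for row in Z2:
    print(row)
```

The last column of `Z2` reproduces the text features, the last row reproduces the audio feature and the remaining entries are the pairwise text-audio products, so the unimodal terms and the bimodal interactions are kept in a single tensor.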
The feature fusion module in this paper is shown in Figure 6, where the text feature vector FT=(Ft1,Ft2,…,Ftn), the audio feature vector FA=(Fa1,Fa2,…,Fam), and the vision feature vector FV=(Fv1,Fv2,…,Fvl) are input into the low-rank tensor fusion model (LMF) for feature fusion:
A vector with feature value of 1 is appended to each modal feature to store the information interactions between different modalities and to obtain the vector representation ZT for text features, ZA for audio features and ZV for vision features, respectively.
Using the additional vector 1 as the intersection point, we then construct the three modes into a three-dimensional Cartesian product model:
$$Z = Z_T \otimes Z_A \otimes Z_V \tag{3.12}$$
where Z denotes the three-dimensional tensor obtained by fusing the three modalities.
The three-dimensional tensor Z is then mapped back to a low-dimensional vector space to obtain the output of the feature fusion module h:
$$h = g(Z; W, b) = W \cdot Z + b \tag{3.13}$$
where $g(\cdot)$ is the linear layer function, $h$ is the vector produced from $Z$ by the linear layer, $W$ is the weight tensor to be learned and $b$ is the bias.
In LMF, we must map the fused multidimensional tensor back to a low-dimensional output vector to improve the fusion efficiency and facilitate the downstream tasks. In this paper, we parameterize $g(\cdot)$ as a set of modality-specific low-rank factors from which the low-rank weight tensor can be recovered. By decomposing the weights into a set of low-rank factors and exploiting the fact that the tensor $Z$ can be decomposed into $\{Z_n\}_{n=1}^{N}$, we can compute the output vector $h$ directly, without forming $Z$ explicitly. This reduces the number of parameters involved in the tensorization and lowers the computational complexity from that of an $N$-dimensional tensor to linear in the number of modalities.
Thus, the vector h can be decomposed as:
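$$h = \left(\sum_{i=1}^{r} W_T^{(i)} \otimes W_A^{(i)} \otimes W_V^{(i)}\right) \cdot Z = \left(\sum_{i=1}^{r} W_T^{(i)} \cdot Z_T\right) \circ \left(\sum_{i=1}^{r} W_A^{(i)} \cdot Z_A\right) \circ \left(\sum_{i=1}^{r} W_V^{(i)} \cdot Z_V\right) \tag{3.14}$$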
where r is the minimum rank that makes the decomposition valid, WT is the weight tensor of the text modality, WA is the weight tensor of the audio modality and WV is the weight tensor of the vision modality.
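The saving from the low-rank factors can be checked numerically on a toy case. The sketch below assumes a scalar output, rank $r = 2$ and hypothetical factor values; it contracts the explicitly built three-dimensional tensor with the recovered weight tensor and compares the result with the factorized computation. In the factorized version, the product across modalities is taken per rank index $i$ and then summed over $i$, which is the form for which the identity with the full contraction holds exactly.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def full_fusion(z_t, z_a, z_v, factors):
    """Contract Z = z_t (x) z_a (x) z_v with W = sum_i w_T^(i) (x) w_A^(i) (x) w_V^(i)."""
    total = 0.0
    for j, k, l in product(range(len(z_t)), range(len(z_a)), range(len(z_v))):
        w_jkl = sum(wt[j] * wa[k] * wv[l] for wt, wa, wv in factors)
        total += w_jkl * z_t[j] * z_a[k] * z_v[l]
    return total

def low_rank_fusion(z_t, z_a, z_v, factors):
    """Same result without building Z: one dot product per modality and rank."""
    return sum(dot(wt, z_t) * dot(wa, z_a) * dot(wv, z_v) for wt, wa, wv in factors)

# Modality vectors with the constant 1 already appended (hypothetical values).
z_t = [1.0, 2.0, 1.0]
z_a = [0.5, 1.0]
z_v = [-1.0, 3.0, 1.0]

# Rank-2 factors: one (w_T, w_A, w_V) triple per rank index i.
factors = [
    ([1.0, 0.0, 1.0], [2.0, 1.0], [0.0, 1.0, 1.0]),
    ([0.5, -1.0, 0.0], [1.0, -1.0], [1.0, 0.0, 2.0]),
]

print(full_fusion(z_t, z_a, z_v, factors))
print(low_rank_fusion(z_t, z_a, z_v, factors))
```

The full contraction touches every entry of the tensor, so its cost grows with the product of the modality dimensions, whereas the factorized version only needs $r$ dot products per modality.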
Decision classification will be performed in the output module and the output sentiment polarity will be obtained.
We connect three fully connected layers and a decision layer after LMF. We input the vector $h$ obtained from the feature fusion module into the classification module, where the three fully connected layers reduce its dimensionality layer by layer until a single-valued output $\rho$ is obtained. $\rho$ is then passed to the decision layer and mapped to the sample space: the sentiment polarity is positive when $\rho \geq 0$ and negative when $\rho < 0$.
In this paper, we use the multimodal sentiment analysis datasets CMU-MOSI [31] and CMU-MOSEI as the experimental datasets. The CMU-MOSI dataset is a collection of 93 opinion videos from YouTube movie reviews, each consisting of multiple opinion clips annotated by five workers, with the annotations averaged. The sentiment value of each segment ranges from strongly negative to strongly positive on a linear scale from –3 to +3. The CMU-MOSEI dataset is the largest multimodal sentiment and emotion recognition dataset available and contains 23,453 annotated video clips covering 250 topics from 1000 different speakers. Each of these video clips contains alignment with audio down to the phoneme level.
Each video is divided into clips based on its transcript. Each paragraph corresponds to the audio and vision of that period to obtain a multimodal sentiment dataset consisting of three modalities: text, audio and vision.
Preprocessing operations are performed for each of the three modalities. The text data are truncated or filled to a length of 50, and word embedding is performed using a 300-dimensional Glove to encode the text sequences into word vector sequences. Enhancement and noise reduction are performed on audio data, and audio features are extracted using the COVAREP acoustic analysis framework. Enhancement and noise reduction are performed on image data using the Facet1 library for extracting visual features.
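The fixed-length step for the text modality can be sketched as follows. This is only an illustration: the text states the target length of 50, but the padding index (`PAD_ID = 0`) and the choice to pad at the end of the sequence rather than the beginning are assumptions, and `pad_or_truncate` is a hypothetical helper name.

```python
from typing import List

MAX_LEN = 50  # sequence length stated in the text
PAD_ID = 0    # assumption: index 0 is reserved for padding


def pad_or_truncate(token_ids: List[int],
                    max_len: int = MAX_LEN,
                    pad_id: int = PAD_ID) -> List[int]:
    """Return a copy of token_ids with exactly max_len entries.

    Longer sequences are cut after max_len tokens; shorter ones are
    filled with pad_id at the end (post-padding, an assumption).
    """
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


short_clip = pad_or_truncate([12, 7, 431])
long_clip = pad_or_truncate(list(range(1, 81)))
print(len(short_clip), len(long_clip))
```

Each fixed-length index sequence would then be looked up in the 300-dimensional embedding table to give a 50 × 300 input per clip.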
In this paper, Accuracy (ACC) and F1-score are used as the evaluation metrics of the model. Accuracy is a primary metric to evaluate the classification task and is the ratio of correct samples to the total number of samples in the classification result:
$$\mathrm{Acc} = \frac{n_{correct}}{n_{total}} \tag{4.1}$$
where $n_{correct}$ is the number of correctly classified samples, and $n_{total}$ is the total number of samples.
The F1-score is the harmonic mean of precision and recall:
$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{4.2}$$
where precision is the proportion of samples predicted as positive that are truly positive, and recall is the proportion of truly positive samples that are predicted as positive. Precision reflects the model's ability to distinguish negative samples: the higher the value, the fewer negative samples are mistaken for positive ones. Recall reflects the model's ability to identify positive samples: the higher the value, the more positive samples are found. The F1-score combines the two, and the higher the F1-score, the more robust the model.
The F1-score in this paper is the weighted F1-score. For baselines whose F1-score entry is empty, the F1-score was not computed with the weighting method.
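As an illustration of Eqs (4.1) and (4.2), the sketch below computes accuracy and a weighted F1-score from continuous sentiment scores. Two points are assumptions rather than details given in the text: scores are turned into binary polarity with the same ρ≥0 rule as the decision layer, and "weighted" is taken to mean that each class's F1-score is weighted by the number of true samples in that class. The sample scores are made up.

```python
from typing import List, Sequence


def to_polarity(scores: Sequence[float]) -> List[int]:
    """Map continuous sentiment scores to 1 (positive, score >= 0) or 0 (negative)."""
    return [1 if s >= 0 else 0 for s in scores]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Eq (4.1): correctly classified samples over all samples."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> float:
    """Eq (4.2), treating cls as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def weighted_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Average the per-class F1-scores, weighted by class support (assumption)."""
    labels = list(y_true)
    n = len(labels)
    return sum(labels.count(c) / n * f1_for_class(labels, y_pred, c)
               for c in set(labels))


# Made-up label and prediction scores on the [-3, +3] scale.
true_scores = [1.2, -0.4, 2.0, -1.5, 0.0, -2.2]
pred_scores = [0.8, 0.3, 1.1, -0.9, -0.1, -1.7]
y_true = to_polarity(true_scores)
y_pred = to_polarity(pred_scores)
acc = accuracy(y_true, y_pred)
wf1 = weighted_f1(y_true, y_pred)
print(acc, wf1)
```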
To enhance the credibility of the experiments, this paper also uses the mean absolute error (MAE) and the correlation coefficient (Corr) as evaluation metrics of the model, and the AdamW optimizer to train the network.
$$\mathrm{MAE} = \frac{\sum_{i} |y_i - y_i^{p}|}{n} \tag{4.3}$$
where MAE denotes the mean absolute error, $y_i$ denotes the sentiment value of the sample label, $y_i^{p}$ denotes the predicted value and $n$ denotes the total number of samples.
$$\mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}} \tag{4.4}$$
where Cov(X, Y) is the covariance between X and Y, Var[X] is the variance of X and Var[Y] is the variance of Y.
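Eqs (4.3) and (4.4) translate directly into a few lines of plain Python. The sketch below is illustrative only; the function names and the sample values are made up, and population (divide-by-n) covariance and variance are used, which leaves the correlation unchanged because the normalization cancels.

```python
import math
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Eq (4.3): average absolute difference between labels and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Eq (4.4): covariance divided by the square root of the variance product."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov / math.sqrt(var_x * var_y)


# Made-up sentiment labels and predictions on the [-3, +3] scale.
labels = [1.0, -2.0, 3.0, 0.0]
predictions = [0.5, -1.0, 2.0, 0.5]
mae = mean_absolute_error(labels, predictions)
corr = correlation(labels, predictions)
print(mae, corr)
```

A lower MAE and a higher Corr both indicate predictions that track the labels more closely.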
The experimental setting of this paper is shown in Table 1.
Experimental environment | Configuration |
Operating system | Windows 10 |
Processor | Intel(R) Core(TM) i7-10875H CPU @ 2.30GHz |
torch | 1.9.0+cpu |
torchvision | 0.10.0+cpu |
Programming language | Python 3.8 |
Deep learning framework | PyTorch |
In this paper, the loss function used in the experiments is L1Loss, the optimizer is AdamW and the learning rate is 0.001. The activation function used for the text features is Sigmoid, and the activation function for the vision features is tanh. The dropout values for the three modalities (text, audio and vision) are all 0.5. The experimental parameters on the CMU-MOSI dataset are as follows: the embedding dimensions of the text, audio and vision modalities are 300, 5 and 20, respectively, and the corresponding hidden dimensions in the model are 128, 4 and 16, respectively. The batch size is set to 128 and the number of training epochs to 20. Because the CMU-MOSEI dataset is large, we ran the model on a GPU and set the experimental parameters as follows: the embedding dimensions of the text, audio and vision modalities are 300, 35 and 74, respectively, and the corresponding hidden dimensions in the model are 128, 16 and 32, respectively. The batch size is set to 128 and the number of training epochs to 30.
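For reference, the settings listed above can be gathered in one place. The sketch below simply restates the values from the text as a small Python configuration object; the class name `ExperimentConfig` and its field names are hypothetical and do not come from the authors' code.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters as stated in the text; all names are illustrative."""
    embed_dims: Tuple[int, int, int]   # (text, audio, vision) input dimensions
    hidden_dims: Tuple[int, int, int]  # (text, audio, vision) hidden dimensions
    epochs: int
    batch_size: int = 128
    learning_rate: float = 1e-3
    dropout: float = 0.5               # same value for all three modalities
    loss: str = "L1Loss"
    optimizer: str = "AdamW"
    text_activation: str = "sigmoid"
    vision_activation: str = "tanh"


MOSI_CONFIG = ExperimentConfig(
    embed_dims=(300, 5, 20), hidden_dims=(128, 4, 16), epochs=20)
MOSEI_CONFIG = ExperimentConfig(
    embed_dims=(300, 35, 74), hidden_dims=(128, 16, 32), epochs=30)

print(MOSI_CONFIG)
print(MOSEI_CONFIG)
```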
When feature extraction is performed on the text data, an additional fully connected layer of size 128 × 64 reduces the dimensionality of the text features. The three fully connected layers after the feature fusion module reduce the dimensionality of the fusion vector; their sizes are (4+1)×(16+1)×(64+1) × 128, 128 × 128 and 128 × 1, which yields the final single-valued output.
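The layer sizes quoted above can be checked with a little arithmetic. The sketch below computes the dimension of the fused representation from the CMU-MOSI hidden sizes (each extended by the constant 1) and compares the number of weights of a dense layer on that representation with the number of weights in a set of low-rank factors. The rank value of 4 is a hypothetical choice for illustration, since the rank used in the experiments is not given in this section, and the variable names are made up.

```python
from math import prod

# Hidden sizes per modality on CMU-MOSI (text after the extra 128 -> 64 layer).
hidden_sizes = {"audio": 4, "vision": 16, "text": 64}
fc_out = 128  # output width of the first fully connected layer

# Each modality vector is extended by a constant 1 before fusion.
augmented = [d + 1 for d in hidden_sizes.values()]

fused_dim = prod(augmented)               # (4+1) * (16+1) * (64+1)
dense_weights = fused_dim * fc_out        # weights of a dense layer on the full tensor


def low_rank_weights(rank: int) -> int:
    """Weights in `rank` factors per modality, each of shape (d_m + 1) x fc_out."""
    return rank * sum(augmented) * fc_out


rank = 4  # hypothetical rank, not taken from the paper
print(fused_dim, dense_weights, low_rank_weights(rank))
```

The dense count grows with the product of the modality dimensions, while the factorized count grows with their sum, which is where the parameter saving of low-rank fusion comes from.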
To validate the performance of the attention-based two-layer bidirectional GRU model proposed in this paper, we compare it with other multimodal fusion models on the CMU-MOSI dataset and CMU-MOSEI dataset.
AB-GRU: The attention-based two-layer bidirectional GRU multimodal sentiment analysis model proposed in this paper.
LMF [8]: Low-rank multimodal fusion, which decomposes the weights into low-rank factors to reduce the number of parameters in the model.
TFN [12]: The tensor fusion network is tailored to address the instability of spoken language and accompanying gestures and speech in online videos. It can learn intra-modal and inter-modal dynamics end-to-end.
TFN+: The attention-based two-layer bidirectional GRU network of this paper combined with the tensor fusion model.
GME-LSTM [32]: Gated multimodal embedding can solve the fusion challenge when noise is present in the modalities. LSTM with temporal attention can perform word-level fusion with better fusion resolution.
MARN [7]: Multi-attention recurrent network, which discovers interactions between modalities using a neural component called the multi-attention block (MAB) and stores them in the hybrid memory of a recurrent component called the long-short term hybrid memory (LSTHM).
MFN [13]: Memory fusion network, which explicitly accounts for both view-specific and cross-view interactions in its neural architecture and models them continuously over time.
MFM [33]: Multimodal factorization model, which optimizes a joint generative-discriminative objective across multimodal data and labels by factorizing the representation into two independent sets of factors: multimodal discriminative factors and modality-specific generative factors.
RMFN [34]: Recurrent multi-stage fusion network, which decomposes the fusion problem into multiple stages, each focusing on a subset of multimodal signals, for specialized and efficient fusion.
We evaluated the AB-GRU model on the CMU-MOSI dataset, and the results are shown in Figure 8. After about 10 training epochs, the ACC and loss values of the model level off. Over several experiments, the final result is 80.9% for ACC and 93.0% for MAE. The currently popular multimodal sentiment analysis methods were run on CMU-MOSI under the same experimental environment and parameters and compared with our model. The results are shown in Table 2, where AB-GRU is the attention-based two-layer bidirectional GRU model proposed in this paper. Compared with the other multimodal sentiment analysis models, AB-GRU shows clear improvements in both ACC and F1-score, reaching 80.9% and 81.0%, respectively. Compared with the original LMF model, our improved model increases classification accuracy by 4.5 percentage points. LMF uses LSTM networks for feature extraction. Our experiments indicate that bidirectional GRU networks improve the efficiency of feature extraction, that stacking GRU layers improves accuracy without affecting the running speed, and that the attention mechanism focuses on the important information in each modality, extracting deeper features for the downstream fusion task.
Model | ACC/% | F1-score/% | MAE/% | Corr/% |
AB-GRU | 80.9 | 81.0 | 93.0 | 65.8 |
TFN+ | 80.1 | 80.1 | 91.9 | 69.7 |
LMF[8] | 76.4 | 75.7 | 91.2 | 66.8 |
TFN[12] | 77.1 | 77.9 | 95.6 | 67.2 |
GME-LSTM[32] | 76.5 | —— | 102.0 | 62.1 |
MARN[7] | 77.1 | 77.0 | 96.8 | 63.2 |
MFN[13] | 77.4 | 77.3 | 97.1 | 62.5 |
MFM[33] | 78.1 | 78.0 | 94.5 | 60.7 |
RMFN[34] | 78.4 | 78.0 | 92.9 | 67.3 |
Compared with RMFN, the best of the baseline models, AB-GRU improves accuracy by 2.5 percentage points. Most current multimodal sentiment classification models concentrate on improving the modality fusion method. The attention-based two-layer bidirectional GRU model proposed in this paper instead exploits the characteristics of the different modal data to improve the feature extraction module, and chooses a more suitable low-rank tensor fusion model for feature fusion, so that overall performance improves while a high computing speed is maintained.
The comparison of AB-GRU with the other models on the CMU-MOSI and CMU-MOSEI datasets is shown in Table 3.
Model | CMU-MOSI ACC/% | CMU-MOSI F1-score/% | CMU-MOSEI ACC/% | CMU-MOSEI F1-score/% |
AB-GRU | 80.7 | 80.9 | 80.3 | 80.1 |
TFN+ | 80.1 | 88.0 | 78.3 | 78.3 |
LMF[8] | 76.4 | 75.7 | 75.2 | 75.0 |
TFN[12] | 77.1 | 77.9 | 76.2 | 76.1 |
GME-LSTM[32] | 76.5 | —— | 75.6 | —— |
MARN[7] | 77.1 | 77.0 | 75.9 | 75.8 |
MFN[13] | 77.4 | 77.3 | 76.0 | 76.0 |
MFM[33] | 78.1 | 78.0 | 76.8 | 76.5 |
RMFN[34] | 78.4 | 78.0 | 76.7 | 76.9 |
Because the CMU-MOSEI dataset is more complex, the performance of all models is somewhat lower on it, but the AB-GRU model still outperforms the other sentiment analysis models.
The experiments show that the AB-GRU model achieves satisfactory performance on both the CMU-MOSI dataset and the CMU-MOSEI dataset. This indicates that the model has good generalization and can adapt to different sentiment analysis tasks and achieve good results on different datasets.
In this section, ablation experiments are set up to verify the importance of the different modules in the AB-GRU model. The experimental results are shown in Table 4, and Figure 9 is generated from them. The histogram compares the results of AB-GRU with those obtained before each module was improved and shows that AB-GRU achieves superior results. Because LMF is an improvement on TFN, we also combined the AB-GRU model with TFN. The results, listed as TFN+ in Table 4, are a significant improvement over TFN and again verify the effectiveness of the attention-based two-layer bidirectional GRU model for multimodal sentiment classification.
Model | ACC/% | F1-score/% | MAE/% | Corr/% |
AB-GRU | 80.9 | 81.0 | 93.0 | 65.8 |
TFN+ | 80.1 | 80.1 | 91.9 | 69.7 |
single-layer GRU | 80.3 | 79.9 | 91.8 | 67.1 |
LSTM | 79.6 | 79.5 | 97.0 | 67.9 |
ReLU | 80.6 | 81.1 | 93.2 | 65.7 |
Add Noisy | 80.7 | 80.7 | 92.7 | 67.8 |
Recurrent neural networks perform well on temporal data such as text and audio. To choose between LSTM and GRU for feature extraction, we verified that LSTM combined with LMF achieves an accuracy of 79.6% for multimodal sentiment classification, which exceeds most similar models but is still lower than the AB-GRU model used in this paper. In addition, the stacked two-layer GRU improves accuracy by 0.6 percentage points over the single-layer GRU model at the cost of only a somewhat lower training speed.
We also tested the effect of different activation functions on performance, and the Sigmoid function finally adopted performed slightly better than the ReLU function. In addition, the text modality has high dimensionality and is harder to process, so overfitting occurs easily when the GRU network is used for feature extraction. We therefore considered adding noise to the text data to improve the model's generalization performance. The results are shown in the "Add Noisy" row of Table 4: the ACC value did not improve significantly.
The different modalities are decoupled from each other in the low-rank tensor fusion model, so it can be extended to data with any number of modalities. To explore the effect of the number and type of modalities on performance, we designed a set of experiments, and the results are shown in Table 5. The training results for the different modality combinations as a function of the training epoch are shown in Figure 10.
Model | Modality | ACC/% | F1-score/% |
AB-GRU | T+A+V | 80.9 | 81.0 |
AB-GRU | T+A | 79.1 | 78.9 |
AB-GRU | T+V | 75.9 | 75.6 |
AB-GRU | A+V | 56.3 | 56.7 |
It can be seen from Figure 10 that the classification performance of the A+V combination is significantly lower than that of the other groups, which shows that text data plays a crucial role in the multimodal emotion classification task, while audio and image data play a supporting role. Audio data is more compatible with text: the T+A combination reaches an ACC of 79.1% and an F1-score of 78.9%, which handles the sentiment classification task well. Adding images further improves the accuracy. Although the improvement is slight, it plays a crucial role in the face of complex and redundant information.
To address the heterogeneity gap between different modalities and improve the efficiency of feature extraction, this paper proposed an attention-based two-layer bidirectional GRU multimodal sentiment analysis model. The two-layer bidirectional GRU used in this model can effectively learn the temporal features of text and audio with a simple structure and fast learning speed. The connected attention layers allow better extraction of essential features. In addition, the LMF model reduces the dimensionality of the multimodal data, improves the operation rate and increases the accuracy. Experimental results show that the accuracy of the AB-GRU model proposed in this paper is at least 2.5 percentage points higher than that of the other multimodal sentiment analysis models.
In our future work, we will conduct more in-depth research on applying multimodal sentiment analysis methods in different fields. In the medical field, a patient's speech, voice and facial expression can be monitored to analyze their condition, so that timely feedback and treatment can be given. For short videos, multimodal methods can be used for classification, integration and recommendation. Moreover, as technology develops, we will continue to improve multimodal sentiment analysis methods, including the feature extraction, feature fusion and data pre-processing modules, to improve the model's efficiency.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
This work is supported by the National Natural Science Foundation of China (Grant Nos. 61602161, 61772180), Hubei Province Science and Technology Support Project (Grant No. 2020BAB012), Hubei Provincial Science and Technology Program Project (Grant No. 2023BCB041), and The Fundamental Research Funds for the Research Fund of Hubei University of Technology (HBUT: 2021046, 21060, 21066).
The authors declare there is no conflict of interest.
[1] | T. Uyen Tran, H. H. T. Thanh, P. H. Dang, M. Riveill, Multitask aspect_based sentiment analysis with integrated bidirectional LSTM & CNN model, in Proceedings of the 4th International Conference on Future Networks and Distributed Systems (ICFNDS), (2020), 1–7. https://doi.org/10.1145/3440749.3442656 |
[2] | A. Agarwal, P. Dey, S. Kumar, Sentiment analysis using modified GRU, in Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing, (2022), 356–361. https://doi.org/10.1145/3549206.3549270 |
[3] | L. Wang, J. Niu, S. Yu, SentiDiff: Combining textual information and sentiment diffusion patterns for twitter sentiment analysis, IEEE Trans. Knowl. Data Eng., 32 (2020), 2026–2039. https://doi.org/10.1109/TKDE.2019.2913641 |
[4] | M. A. Hassonah, R. Al-Sayyed, A. Rodan, A. M. Al-Zoubi, I. Aljarah, H. Faris, An efficient hybrid filter and evolutionary wrapper approach for sentiment analysis of various topics on Twitter, Knowl. Based Syst., 192 (2020), 105353. https://doi.org/10.1016/j.knosys.2019.105353 |
[5] | M. G. Huddar, S. S. Sannakki, V. S. Rajpurohit, Attention-based multimodal contextual fusion for sentiment and emotion classification using bidirectional LSTM, Multim. Tools Appl., 80 (2021), 13059–13076. https://doi.org/10.1007/s11042-020-10285-x |
[6] | T. Jiang, J. Wang, Z. Liu, Y. Ling, Fusion-extraction network for multimodal sentiment analysis, in Proceedings of the Advances in Knowledge Discovery and Data Mining: 24th Pacific-Asia Conference, (2020), 785–797. https://doi.org/10.1007/978-3-030-47436-2_59 |
[7] | A. Zadeh, P. P. Liang, S. Poria, P. Vij, E. Cambria, L. P. Morency, Multi-attention recurrent network for human communication comprehension, preprint, arXiv: 1802.00923. |
[8] | Z. Liu, Y. Shen, V. B. Lakshminarasimhan, P. P. Liang, A. Zadeh, L. P. Morency, Efficient low-rank multimodal fusion with modality-specific factors, preprint, arXiv: 1806.00064. |
[9] | L. N. Zúñiga-Morales, J. Á. González-Ordiano, J. E. Quiroz-Ibarra, S. J. Simske, Impact evaluation of multimodal information on sentiment analysis, in Proceedings of the Advances in Computational Intelligence: 21st Mexican International Conference on Artificial Intelligence, (2022), 18–29. https://doi.org/10.1007/978-3-031-19496-2_2 |
[10] | D. Zeng, Y. Yu, K. Oyama, Deep triplet neural networks with cluster-CCA for audio-visual cross-modal retrieval, in ACM Transaction on Multimedia Computing Communication and Applications (TOMCCAP), (2020), 1–23. https://doi.org/10.1145/3387164 |
[11] | A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, et al., Learning transferable visual models from natural language supervision, preprint, arXiv: 2103.00020. |
[12] | A. Zadeh, M. Chen, S. Poria, E. Cambria, L. P. Morency, Tensor fusion network for multimodal sentiment analysis, preprint, arXiv: 1707.07250. |
[13] | A. Zadeh, P. P. Liang, N. Mazumder, S. Poria, E. Cambria, L. P. Morency, Memory fusion network for multi-view sequential learning, preprint, arXiv: 1802.00927. |
[14] | G. Van Houdt, C. Mosquera, G. Nápoles, A review on the long short-term memory model, Artif. Intell. Rev., 53 (2020), 5929–5955. https://doi.org/10.1007/s10462-020-09838-1 |
[15] | A. P. Rodrigues, R. Fernandes, A. Shetty, K. Lakshmanna, R. M. Shafi, Real-time twitter spam detection and sentiment analysis using machine learning and deep learning techniques, Comput. Intell. Neurosci., (2022). https://doi.org/10.1155/2022/5211949 |
[16] | A. Londhe, P. V. R. D. P. Rao, Aspect based sentiment analysis–an incremental model learning approach using LSTM-RNN, in Proceedings of the Advances in Computing and Data Sciences: 5th International Conference, (2021), 677–689. https://doi.org/10.1007/978-3-030-81462-5_59 |
[17] | H. Jelodar, Y. Wang, R. Orji, S. Huang, Deep sentiment classification and topic discovery on novel coronavirus or COVID-19 online discussions: NLP using LSTM recurrent neural network approach, IEEE J. Biomed. Health Inform., 24 (2020), 2733–2742. https://doi.org/10.1109/JBHI.2020.3001216 |
[18] | F. Wang, S. Tian, L. Yu, J. Liu, J. Wang, K. Li, et al., TEDT: Transformer-based encoding-decoding translation network for multimodal sentiment analysis, Cogn. Comput., (2022), 1–15. https://doi.org/10.1007/s12559-022-10073-9 |
[19] | J. Wu, T. Zhu, J. Zhu, T. Li, C. Wang, A Optimized BERT for Multimodal Sentiment Analysis, ACM Trans. Multim. Comput. Commun. Appl., 19 (2023), 1–12. https://doi.org/10.1080/09540091.2022.2155614 |
[20] | A. Bello, S. C. Ng, M. F. Leung, A BERT framework to sentiment analysis of tweets, Sensors, 23 (2023), 506. https://doi.org/10.3390/s23010506 |
[21] | J. Wei, J. Liao, Z. Yang, S. Wang, Q. Zhao, BiLSTM with multi-polarity orthogonal attention for implicit sentiment analysis, Neurocomputing, 383 (2020), 165–173. https://doi.org/10.1016/j.neucom.2019.11.054 |
[22] | Y. Zhang, J. Wang, X. Zhang, Conciseness is better: Recurrent attention LSTM model for document-level sentiment analysis, Neurocomputing, 462 (2021), 101–112. https://doi.org/10.1016/j.neucom.2021.07.072 |
[23] | J. Hassan, U. Shoaib, Multi-class review rating classification using deep recurrent neural network, Neural Process. Letters, 51 (2020), 1031–1048. https://doi.org/10.1007/s11063-019-10125-6 |
[24] | A. Zouzou, I. E. Azami, Text sentiment analysis with CNN & GRU model using GloVe, in Proceedings of the 2021 Fifth International Conference On Intelligent Computing in Data Sciences, (2021), 1–5. https://doi.org/10.1109/ICDS53782.2021.9626715 |
[25] | A. G. Eker, K. Eker, N. Duru, Multi-class sentiment analysis from turkish tweets with RNN, in Proceedings of the 2021 6th International Conference on Computer Science and Engineering (UBMK), (2021), 560–564. https://doi.org/10.1109/UBMK52708.2021.9558958 |
[26] | L. Zhu, Z. Zhu, C. Zhang, Y. Xu, X. Kong, Multimodal sentiment analysis based on fusion methods: A survey, Inform. Fusion, (2023), 306–325. https://doi.org/10.1016/j.inffus.2023.02.028 |
[27] | S. Ma, Z. Zeng, D. McDuff, Y. Song, Contrastive self-supervised learning of global-local audio-visual representations, 2021. |
[28] | L. Zhu, M. Xu, Y. Bao, Y. Xu, X. Kong, Deep learning for aspect-based sentiment analysis: A review, PeerJ Comput. Sci., (2022), e1044. https://doi.org/10.7717/peerj-cs.1044 |
[29] | X. Liu, J. You, Y. Wu, T. Li, L. Li, Z. Zhang, et al., Attention-based bidirectional GRU networks for efficient HTTPS traffic classification, Inform. Sci., (2020), 297–315. https://doi.org/10.1016/j.ins.2020.05.035 |
[30] | J. Wu, T. Zhu, J. Zhu, T. Li, C. Wang, Hierarchical multiples self-attention mechanism for multi-modal analysis, Multim. Syst., (2023). https://doi.org/10.1016/j.ins.2020.05.035 |
[31] | A. Zadeh, R. Zellers, E. Pincus, L. P. Morency, Mosi: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos, preprint, arXiv: 1606.06259. |
[32] | M. Chen, S. Wang, P. P. Liang, T. Baltrušaitis, A. Zadeh, L. P. Morency, Multimodal sentiment analysis with word-level fusion and reinforcement learning, in Proceedings of the 19th ACM international conference on multimodal interaction, (2017), 163–171. https://doi.org/10.1145/3136755.3136801 |
[33] | Y. H. H. Tsai, P. P. Liang, A. Zadeh, L. P. Morency, R. Salakhutdinov, Learning factorized multimodal representations, preprint, arXiv: 1806.06176. |
[34] | P. P. Liang, Z. Liu, A. Zadeh, L. P. Morency, Multimodal language analysis with recurrent multistage fusion, in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, (2018), 150–161. https://doi.org/10.18653/v1/D18-1014 |