
Deep learning has achieved significant success in image recognition, image segmentation and target detection. However, deep learning models often require a large amount of labeled data for training, and labeling data is time-consuming. Some scholars have therefore proposed zero-shot learning. Zero-shot learning uses a model trained on the seen classes to infer the classes of unseen samples, which can largely reduce the labeling effort [1,2]. Zero-shot learning mimics the process by which humans recognize new things. For example, a person who knows an animal such as a horse through pictures and linguistic descriptions, and who is told that a zebra looks like a horse with black and white stripes on its body, can recognize a zebra from this description the first time one is seen [3]. In zero-shot learning, semantic features are needed in addition to sample features. Semantic features are linguistic descriptions of the samples, such as color, size and other characteristics; word vectors extracted by Word2Vec [4] are often used as semantic features. The model is trained on the seen class samples together with their semantic features to find the relationship between the two, and this relationship is then transferred to infer the categories of the unseen class samples.
Zero-shot learning falls into two categories: conventional zero-shot learning and generalized zero-shot learning. In conventional zero-shot learning, the test set contains only unseen class samples, whereas in generalized zero-shot learning it contains both seen and unseen class samples. Many approaches are devoted to finding the relationship between the training set samples and the training set semantic features, such as mapping the training samples to the semantic feature space, mapping the semantic features to the sample space [5], mapping both into a common space [6,7] or mapping each into the other's space [8]. However, the training set contains only seen class samples, and its classes differ from the unseen classes, which can lead to inaccurate classification of the unseen class samples when these models are applied to the test set.
To address this problem, some scholars use generative models, such as the Variational Autoencoder (VAE) [9] and the Generative Adversarial Network (GAN) [10], to generate pseudo samples of the unseen classes and feed them to the classifier during training, which alleviates the misclassification of unseen class samples. It has been noted in the literature [11,12] that VAE and GAN suffer from posterior collapse and training instability, and that using the Wasserstein Auto-Encoder (WAE) or the Wasserstein Generative Adversarial Network (WGAN) to generate pseudo samples can alleviate these issues. However, the unseen class samples generated by these methods are easily biased toward the features of the seen class samples, leading to inaccurate classification of the real unseen class samples. To address these problems, we propose the following:
1) Different from the above generative models, we use an autoencoder to generate the unseen class samples. To make the sample features in the latent space more distinguishable and representative, a classifier is applied to the latent-space sample features.
2) To reduce the bias of the unseen class pseudo sample features toward the seen class sample features, we propose new sample features and use them together with the unseen class semantic features in the cross-reconstruction loss function.
3) The proposed method is validated on three datasets, AWA1, AWA2 and aPY, and achieves good results.
The rest of this paper is organized as follows. Section 2 reviews the related works, Section 3 describes the proposed method, Section 4 discusses the experiments and Section 5 concludes the paper.
Embedding-based methods are a common approach to zero-shot learning. However, they suffer from the domain shift problem and tend to misclassify unseen class samples [13].
There are three families of solutions to the problem that unseen class samples are easily misclassified into the seen classes: calibrated stacking, generative models and detection of the unseen class samples. Calibrated stacking [14] adds a calibration term to the classifier so that the scores of the seen classes are reduced during classification and the scores of the unseen classes are correspondingly favored. The calibrated stacking rule is as follows:
$$\hat{y} = \mathop{\arg\max}_{c \in \mathcal{T}} \left( f_c(x) - \gamma\, \mathbb{I}[c \in \mathcal{S}] \right)$$
where γ is the calibration factor and $\mathbb{I}[\cdot]$ is the indicator function, which indicates whether $c$ belongs to a seen class: if $c$ is a seen class, its value is 1, and otherwise it is 0. The classifier of APN [15] used class embeddings and added calibrated stacking to alleviate the misclassification of the unseen class samples.
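As an aside, the calibrated stacking rule above can be sketched in a few lines of NumPy; the function name, array names and toy scores below are illustrative only.

```python
import numpy as np

def calibrated_stacking_predict(scores, seen_mask, gamma=0.7):
    """Calibrated stacking: subtract gamma from the scores of the seen
    classes before taking the argmax.

    scores:    (n_samples, n_classes) classifier scores f_c(x)
    seen_mask: (n_classes,) boolean array, True where class c is seen
    gamma:     calibration factor
    """
    adjusted = scores - gamma * seen_mask.astype(scores.dtype)
    return adjusted.argmax(axis=1)

# toy usage: three classes, the first two are seen
scores = np.array([[2.0, 1.5, 1.4],
                   [0.9, 2.1, 2.0]])
seen_mask = np.array([True, True, False])
print(calibrated_stacking_predict(scores, seen_mask, gamma=0.7))
```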
The generative-model approach uses generative models to produce pseudo samples as substitutes for real unseen class samples [16]. The training set and the pseudo samples are then used to train the classifier, so that the unseen class samples are less biased toward the seen classes. The Multi-modal Feature Fusion algorithm (MFF) [17] used visual principal component features to compensate for the limited descriptive power of semantic features alone, and then combined GAN and VAE to generate high-quality pseudo samples. Cross- and Distribution Aligned VAE (CADA-VAE) [18] is a VAE-based method that uses latent-space distribution alignment and cross-alignment to ensure agreement between the two modalities of sample features and semantic features. To make the generated samples close to the real samples, Over-Complete Distribution using Conditional Variational Autoencoder (OCD-CVAE) [19] used an over-complete distribution to generate pseudo samples. Chen et al. [12] proposed using a WAE to generate pseudo samples and an aggregated posterior distribution in the latent space to align the manifold structure of the sample features and the semantic features. f-CLSWGAN [20] used a WGAN to generate pseudo samples and a classifier to make them more discriminative. Building on f-CLSWGAN, the Adaptive Bias-Aware GAN (ABA-GAN) [21] proposed adaptive adversarial and domain loss functions to make the generated pseudo samples more meaningful and to distinguish the seen classes from the unseen classes. Li et al. [11] used a WGAN to generate pseudo samples together with a multimodal cyclic loss function and a bi-directional autoencoder. Because GANs are difficult to train and the pseudo samples generated by VAEs are of low quality, Dual VAEGAN [22] used a combination of GAN and VAE.
The third family of solutions detects samples of the unseen classes: it first distinguishes whether a sample belongs to the seen or the unseen classes and then assigns it to a specific class. GatingAE [3] first used the latent space and the cross-reconstruction space to detect samples belonging to the unseen classes, and then applied a linear classifier to the seen class samples and a nearest-neighbor classifier to the unseen class samples. Chen et al. [23] proposed determining whether a sample belongs to the seen or the unseen classes by computing the cosine similarity between its latent-space features and the mean of each class. However, the models in these methods are trained on samples from the seen classes, so when they are transferred to samples from the unseen classes, the classification results remain biased toward the seen classes.
Cao et al. [24] achieved zero-shot traffic sign recognition using an autoencoder. Different from [24], we use an autoencoder to generate unseen class samples in order to alleviate their misclassification. To prevent the generated unseen class samples from being biased toward the features of the seen class samples and to improve the classification accuracy of the unseen classes, we add the information of both the unseen class semantic features and the proposed sample features.
In zero-shot learning, the training set can be denoted as $S=\{X_S, A_S, Y_S\}$ and the unseen classes as $U=\{X_U, A_U, Y_U\}$, where $X$ denotes sample features, $A$ denotes semantic features and $Y$ denotes labels. For conventional zero-shot learning, the classifier predicts the class of $X_U$: $X_U \to Y_U$; for generalized zero-shot learning, the classifier predicts the class of $X$: $X \to Y_S \cup Y_U$.
In this study, we use an autoencoder to generate the pseudo samples of the unseen classes; the model is shown in Figure 1. In Figure 1, E1 and E2 represent the encoders, and D1 and D2 represent the decoders. The sample features and semantic features are encoded into latent features of the same dimension.
According to the autoencoder, for the training set, the generated sample features $\tilde{X}_S$ and the generated seen class semantic features $\tilde{A}_S$ need to approximate the input features $X_S$ and $A_S$. Assuming that there are $m$ samples, the reconstruction loss function can be written as:
$$L_{\text{recon1}} = \frac{1}{m}\sum_{i=1}^{m}\left|x_{Si}-\tilde{x}_{Si}\right| + \frac{1}{m}\sum_{i=1}^{m}\left|a_{Si}-\tilde{a}_{Si}\right| \tag{1}$$
We use the lowercase $x_{Si}$, $\tilde{x}_{Si}$, $a_{Si}$ and $\tilde{a}_{Si}$ to denote one sample feature in $X_S$, one generated sample feature in $\tilde{X}_S$, one semantic feature in $A_S$ and one generated semantic feature in $\tilde{A}_S$, respectively. We also want the features of these two modalities to be aligned in the latent space. We use $Z_S$ to represent the latent sample features and $Z_{AS}$ to represent the latent seen class semantic features.
$$L_{\text{latent-recon}} = \frac{1}{m}\sum_{i=1}^{m}\left|z_{Si}-z_{ASi}\right| \tag{2}$$
In Eq (2), the lowercase $z_{Si}$ denotes one sample feature in $Z_S$ and $z_{ASi}$ denotes one seen class semantic feature in $Z_{AS}$. Beyond the reconstruction loss in Eq (1), zero-shot learning involves two different modalities, sample features and semantic features, and aligning them can reduce the domain shift problem [25]. Inspired by GatingAE [3], Chen et al. [12], CADA-VAE [18] and the Discriminative Cross-Aligned Variational Autoencoder (DCA-VAE) [25], we use a cross-reconstruction loss function. The features $\bar{X}_S$ and $\bar{A}_S$ are obtained by passing the latent semantic features through decoder D1 and the latent sample features through decoder D2, respectively. The cross-reconstruction loss function is as follows:
$$L_{\text{cross-recon1}} = \frac{1}{m}\sum_{i=1}^{m}\left|x_{Si}-\bar{x}_{Si}\right| + \frac{1}{m}\sum_{i=1}^{m}\left|a_{Si}-\bar{a}_{Si}\right| \tag{3}$$
Here, $\bar{x}_{Si}$ denotes one feature in $\bar{X}_S$ and $\bar{a}_{Si}$ denotes one feature in $\bar{A}_S$. Although the model could be trained with Eqs (1), (2) and (3) and then used to generate samples of the unseen classes, these equations involve only seen class samples and semantic features, which leads to pseudo samples biased toward the seen classes. To address this problem, we add the unseen class semantic features $A_{US}$ to the model and propose the new sample features $\hat{X}$.
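Before turning to the unseen-class terms, the seen-class losses of Eqs (1)-(3) can be sketched as follows. This is a minimal PyTorch sketch: it assumes E1/E2/D1/D2 are callable modules and uses an elementwise L1 loss as a stand-in for the absolute-difference terms.

```python
import torch
import torch.nn.functional as F

def seen_class_losses(x_s, a_s, E1, E2, D1, D2):
    """Seen-class loss terms of Eqs (1)-(3).

    x_s: (m, d_x) seen-class sample features
    a_s: (m, d_a) seen-class semantic features
    E1/E2 encode samples/semantics into the latent space,
    D1/D2 decode latent codes back to sample/semantic space.
    """
    z_s, z_as = E1(x_s), E2(a_s)

    # Eq (1): reconstruct each modality through its own decoder
    l_recon1 = F.l1_loss(D1(z_s), x_s) + F.l1_loss(D2(z_as), a_s)

    # Eq (2): align the two modalities in the latent space
    l_latent = F.l1_loss(z_s, z_as)

    # Eq (3): cross-reconstruction, decoding each latent code
    # with the *other* modality's decoder
    l_cross1 = F.l1_loss(D1(z_as), x_s) + F.l1_loss(D2(z_s), a_s)

    return l_recon1, l_latent, l_cross1

# toy usage with stand-in linear encoders/decoders
E1, E2, D1, D2 = (torch.nn.Linear(2048, 128), torch.nn.Linear(85, 128),
                  torch.nn.Linear(128, 2048), torch.nn.Linear(128, 85))
x, a = torch.randn(4, 2048), torch.randn(4, 85)
print(seen_class_losses(x, a, E1, E2, D1, D2))
```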
The unseen class semantic features $A_{US}$ are obtained as follows. The sample features of the training set are mapped to the semantic feature space using the following objective:
$$\min_{W} \left\|X_S - W^{T}A_S\right\|_F^2 + \alpha\left\|W\right\|_F^2 \tag{4}$$
$\|\cdot\|_F$ in Eq (4) denotes the Frobenius norm. The mapping matrix $W$ is obtained as follows:
$$W = X_S^{T}A_S\left(A_S^{T}A_S + \alpha I\right)^{-1} \tag{5}$$
where $I$ in Eq (5) is the identity matrix and α is an adjustable parameter. The sample features of the training set are then mapped to the semantic feature space through the mapping matrix $W$, and for each of them the nearest unseen class semantic feature is found; these nearest neighbors constitute $A_{US}$.
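The construction of $A_{US}$ can be sketched in NumPy as below. The orientation of $W$ and the use of $X_S W$ as the projection into the semantic space are one reasonable reading of Eqs (4) and (5); the function and array names are illustrative.

```python
import numpy as np

def build_unseen_semantics(X_s, A_s, A_unseen, alpha=1.0):
    """Sketch of Eqs (4)-(5): closed-form ridge-style mapping W, projection
    of each training sample into the semantic space, and nearest-neighbor
    lookup among the unseen-class semantic vectors to form A_US.

    X_s:      (m, d_x) seen-class sample features
    A_s:      (m, d_a) semantic features of the corresponding seen classes
    A_unseen: (u, d_a) semantic vectors of the unseen classes
    """
    d_a = A_s.shape[1]
    # Eq (5): W = X_S^T A_S (A_S^T A_S + alpha I)^(-1)
    W = X_s.T @ A_s @ np.linalg.inv(A_s.T @ A_s + alpha * np.eye(d_a))

    # project the samples into the semantic space through W (assumed reading)
    A_proj = X_s @ W                                   # (m, d_a)

    # nearest unseen-class semantic vector for every training sample
    dists = np.linalg.norm(A_proj[:, None, :] - A_unseen[None, :, :], axis=2)
    return A_unseen[dists.argmin(axis=1)]              # (m, d_a) -> A_US

# toy usage with random features
rng = np.random.default_rng(0)
A_US = build_unseen_semantics(rng.normal(size=(100, 2048)),
                              rng.normal(size=(100, 85)),
                              rng.normal(size=(10, 85)))
print(A_US.shape)   # (100, 85)
```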
After obtaining $A_{US}$, we input $A_{US}$ into the autoencoder to obtain the generated unseen class semantic features $\tilde{A}_{US}$, and the reconstruction loss between $\tilde{A}_{US}$ and $A_{US}$ is:
$$L_{\text{recon2}} = \frac{1}{m}\sum_{i=1}^{m}\left|a_{USi}-\tilde{a}_{USi}\right| \tag{6}$$
Here, $a_{USi}$ denotes one unseen class semantic feature in $A_{US}$ and $\tilde{a}_{USi}$ denotes one generated unseen class semantic feature in $\tilde{A}_{US}$. In addition to the reconstruction loss for $A_{US}$, we also want to align the unseen class semantic features with sample features, but the training set lacks unseen class samples. We therefore obtain a cross-reconstruction loss as follows: we compute the difference between the latent unseen class semantic features and the latent seen class sample features, which represents the relationship between the two in the latent space, and pass this difference through the decoder D1 to obtain $\theta$. We use $Z_S$ and $Z_{AU}$ to represent the latent features of the seen class samples and of the unseen class semantic features, respectively. The formula is as follows:
$$\theta = D_1\left(Z_S - Z_{AU}\right) \tag{7}$$
Then $\theta$ is subtracted from the sample features of the training set to obtain the features $\hat{X}$:
$$\hat{X} = X_S - \theta \tag{8}$$
The cross-reconstruction loss function for unseen class semantic features can be written as:
$$L_{\text{cross-recon2}} = \frac{1}{m}\sum_{i=1}^{m}\left|\hat{x}_{i}-\bar{x}_{USi}\right| + \beta\,\frac{1}{m}\sum_{i=1}^{m}\left|a_{USi}-\bar{a}_{Si}\right| \tag{9}$$
β in Eq (9) is an adjustable parameter, $\hat{x}_i$ denotes one feature in $\hat{X}$, and $\bar{x}_{USi}$ denotes one feature obtained by passing one unseen class semantic feature through the decoder D1. The reason for using $\hat{X}$ instead of $X_S$ is that $\hat{X}$ reduces the seen class information in the loss function, which alleviates the similarity between the unseen class pseudo samples and the seen class samples.
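The unseen-class terms of Eqs (6)-(9), including the features $\theta$ and $\hat{X}$, can be sketched in the same style as the seen-class loss sketch above; here `a_us` is assumed to hold the nearest unseen-class semantic feature for each training sample.

```python
import torch
import torch.nn.functional as F

def unseen_cross_losses(x_s, a_us, E1, E2, D1, D2, beta=0.01):
    """Sketch of Eqs (6)-(9) for the unseen-class semantic features A_US.

    x_s:  (m, d_x) seen-class sample features
    a_us: (m, d_a) nearest unseen-class semantic features (A_US)
    """
    z_s, z_au = E1(x_s), E2(a_us)

    # Eq (6): reconstruction of the unseen-class semantics
    l_recon2 = F.l1_loss(D2(z_au), a_us)

    # Eqs (7)-(8): decode the latent difference and shift the seen samples
    theta = D1(z_s - z_au)
    x_hat = x_s - theta

    # Eq (9): cross-reconstruction with the proposed features x_hat
    x_bar_us = D1(z_au)          # unseen semantics decoded to sample space
    a_bar_s = D2(z_s)            # seen samples decoded to semantic space
    l_cross2 = F.l1_loss(x_hat, x_bar_us) + beta * F.l1_loss(a_us, a_bar_s)

    return l_recon2, l_cross2
```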
To better capture the relationship between the unseen class semantic features and the training set samples in Eq (7), and to make the latent sample features distinguishable and representative, the latent sample features are classified using the cross-entropy loss function:
$$L_{\text{classifier}} = -\sum_{i=1}^{m} y_{Si}\log \tilde{y}_{Si} \tag{10}$$
In Eq (10), $\tilde{y}_{Si}$ is the predicted label and $y_{Si}$ is the true label of the latent sample features.
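The text specifies only a softmax/cross-entropy classifier on the latent sample features; a single linear layer is one minimal realization (the layer type is an assumption).

```python
import torch.nn as nn

latent_dim, num_seen_classes = 128, 40   # e.g., 40 seen classes for AWA1/AWA2
latent_clf = nn.Linear(latent_dim, num_seen_classes)  # logits for Eq (10)
```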
Combining Eqs (1), (2), (3), (6), (9) and (10), the objective function is:
$$L = L_{\text{recon1}} + L_{\text{latent-recon}} + L_{\text{recon2}} + L_{\text{cross-recon1}} + L_{\text{cross-recon2}} + L_{\text{classifier}} \tag{11}$$
After the model is trained according to Eq (11), samples are generated from the sample features $X_S$ and the semantic features $A_{US}$. For generalized zero-shot classification, all the generated seen class samples and unseen class samples are input to the classifier for training; for conventional zero-shot classification, only the generated unseen class samples are used.
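Putting the pieces together, a hedged sketch of one training step for Eq (11) and of the subsequent pseudo-sample generation is shown below. It reuses the loss sketches above and assumes a single Adam optimizer over all modules and the linear latent classifier; how the decoders are combined to produce pseudo samples is one reasonable reading of the procedure described in the text.

```python
import torch
import torch.nn.functional as F

def train_step(x_s, a_s, a_us, y_s, E1, E2, D1, D2, latent_clf, optimizer, beta=0.01):
    """One optimization step for the full objective of Eq (11), reusing the
    seen_class_losses and unseen_cross_losses sketches above.
    y_s: (m,) integer labels of the seen-class samples for Eq (10)."""
    optimizer.zero_grad()
    l_recon1, l_latent, l_cross1 = seen_class_losses(x_s, a_s, E1, E2, D1, D2)
    l_recon2, l_cross2 = unseen_cross_losses(x_s, a_us, E1, E2, D1, D2, beta=beta)
    l_clf = F.cross_entropy(latent_clf(E1(x_s)), y_s)                    # Eq (10)
    loss = l_recon1 + l_latent + l_recon2 + l_cross1 + l_cross2 + l_clf  # Eq (11)
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def generate_pseudo_samples(x_s, a_us, E1, E2, D1):
    """After training: decode seen sample features and unseen-class semantic
    features into the sample space; the latter serve as unseen-class pseudo
    samples for training the final softmax classifier."""
    return D1(E1(x_s)), D1(E2(a_us))
```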
Three datasets, AWA1, AWA2 and aPY, are used in our study.
1) AWA1 [26]: The seen class contains 40 categories, and the unseen class contains 10 categories. The number of samples in the seen class is 19832, the number of samples in the unseen class is 5685 and the dimension of the semantic features is 85.
2) AWA2 [27]: The seen class contains 40 categories, and the unseen class contains 10 categories. The number of samples in the seen class is 23527, the number of samples in the unseen class is 7913 and the dimension of the semantic features is 85.
3) aPY [28]: The seen class contains 20 categories, and the unseen class contains 12 categories. The number of samples in the seen class is 5932, the number of samples in the unseen class is 7924, and the dimension of the semantic features is 64.
The sample features and semantic features used in our study are taken from the literature [27]. Following the literature [12], the input dimension of encoder E1 is 2048, the output of its first layer is 512 dimensions and the latent space has 128 dimensions; the output of the first layer of encoder E2 is 128 dimensions. The output of the first layer of decoder D1 is 256 dimensions and its output dimension is 2048; the output of the first layer of decoder D2 is 256 dimensions. We use the Adam algorithm for optimization with a learning rate of 0.001 and a batch size of 256.
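A sketch of these layer sizes is given below; the choice of ReLU activations is an assumption, since the text only specifies the layer dimensions.

```python
import torch
import torch.nn as nn

def build_model(sem_dim=85, latent_dim=128):
    """Encoders/decoders with the dimensions reported above: E1 maps the
    2048-d sample features through a 512-d layer to the 128-d latent space,
    E2 maps the semantic features (85-d for AWA1/AWA2, 64-d for aPY) through
    a 128-d layer, and D1/D2 decode the latent code through 256-d layers
    back to the sample and semantic spaces."""
    E1 = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Linear(512, latent_dim))
    E2 = nn.Sequential(nn.Linear(sem_dim, 128), nn.ReLU(), nn.Linear(128, latent_dim))
    D1 = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 2048))
    D2 = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, sem_dim))
    return E1, E2, D1, D2

E1, E2, D1, D2 = build_model()
params = [p for m in (E1, E2, D1, D2) for p in m.parameters()]
optimizer = torch.optim.Adam(params, lr=0.001)   # Adam with lr = 0.001 as reported
```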
We use the evaluation criteria proposed in the literature [27]. For the conventional zero-shot classification, only the accuracy of classification needs to be calculated:
$$acc = \frac{1}{|C|}\sum_{i=1}^{|C|}\frac{\#\,\text{correct predictions in class } i}{\#\,\text{samples in class } i}$$
For generalized zero-shot classification, the harmonic mean must be calculated in addition to the classification accuracies of the seen and unseen classes. Denoting the classification accuracy of the seen class samples as $acc_{tr}$ and that of the unseen class samples as $acc_{ts}$, the harmonic mean can be written as:
$$H = \frac{2 \times acc_{tr} \times acc_{ts}}{acc_{tr} + acc_{ts}}$$
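For reference, the two evaluation measures can be computed as follows; this is a straightforward NumPy sketch of the formulas above.

```python
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies (the acc formula above)."""
    classes = np.unique(y_true)
    return np.mean([(y_pred[y_true == c] == c).mean() for c in classes])

def harmonic_mean(acc_tr, acc_ts):
    """Harmonic mean H of the seen-class and unseen-class accuracies."""
    return 2 * acc_tr * acc_ts / (acc_tr + acc_ts)

# e.g., the proposed method's AWA1 row of Table 1: tr = 63.9, ts = 62.4 -> H ≈ 63.1
print(round(harmonic_mean(63.9, 62.4), 1))
```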
The generalized zero-shot classification results and the conventional zero-shot classification results are shown in Tables 1 and 2, where the results of the Semantic Autoencoder (SAE) [8], Direct Attribute Prediction (DAP) [26], Indirect Attribute Prediction (IAP) [26] and Structured Joint Embedding (SJE) [29] are taken from the literature [27]. In Table 1, "ts" denotes the classification accuracy of the unseen classes and "tr" denotes that of the seen classes. From Table 1, the proposed method is 1% below CADA-VAE [18] on the AWA1 dataset. On the AWA2 dataset, the proposed method is 0.5% better than Chen et al. [23]. On the aPY dataset, the proposed method is 3.1% higher than DAP [26] and 4.9% higher than the generative model of Chen et al. [12]. The accuracy of the proposed method on the unseen classes is higher than that of the other methods.
AWA1 | AWA2 | aPY | |||||||
ts | tr | H | ts | tr | H | ts | tr | H | |
SAE [8] | 1.8 | 77.1 | 3.5 | 1.1 | 82.2 | 2.2 | 0.4 | 80.9 | 0.9 |
DAP [26] | 46.5 | 68.5 | 55.4 | 43.7 | 70.2 | 53.3 | 27.6 | 55.8 | 37.0 |
IAP [26] | 2.1 | 78.2 | 4.1 | 0.9 | 87.6 | 1.8 | 5.7 | 65.6 | 10.4 |
SJE [29] | 11.3 | 74.6 | 19.6 | 8.0 | 73.9 | 14.4 | 3.7 | 55.7 | 6.9 |
Preserving Semantic Relations (PSR) [30] | 20.7 | 73.8 | 32.3 | 13.5 | 51.4 | 21.4 | |||
f-CLSWGAN [20] | 57.9 | 61.4 | 59.6 | ||||||
Zhang et al. [31] | 20.7 | 67.9 | 38.6 | 16.1 | 66.9 | 25.9 | |||
Li et al. [11] | 54.9 | 71.7 | 62.2 | ||||||
CADA-VAE [18] | 57.3 | 72.8 | 64.1 | 55.8 | 75.0 | 63.9 | |||
Chen et al. [23] | 54.7 | 72.7 | 62.4 | 55.6 | 76.9 | 64.2 | |||
Chen et al. [12] | 54.5 | 72.8 | 62.3 | 55.2 | 73.5 | 63.0 | 26.7 | 51.5 | 35.2 |
The proposed method | 62.4 | 63.9 | 63.1 | 60.6 | 69.5 | 64.7 | 31.5 | 55.3 | 40.1 |
Table 2 shows the conventional zero-shot classification results. On the AWA1 dataset, the proposed method is slightly lower than f-CLSWGAN [20] and Li et al. [11], which use GAN models. On the aPY dataset, the proposed method is slightly lower than that of Zhang et al. [31] and more accurate than the other methods. On the AWA2 dataset, the accuracy of the proposed method is higher than that of the other methods.
AWA1 | AWA2 | aPY | |
SAE [8] | 53.0 | 54.1 | 8.3 |
DAP [26] | 44.1 | 46.1 | 33.8 |
IAP [26] | 35.9 | 35.9 | 36.6 |
SJE [29] | 65.6 | 61.9 | 32.9 |
PSR [30] | 63.8 | 38.4 | |
Cross-Class Sample Synthesis (CCSS) [32] | 56.3 | 63.7 | 35.5 |
f-CLSWGAN [20] | 69.9 | ||
Zhang et al. [31] | 68.8 | 41.3 | |
Li et al. [11] | 69.9 | ||
CADA-VAE [18] | 58.8 | 60.3 | |
Chen et al. [12] | 65.2 | 65.5 | 32.7 |
The proposed method | 67.1 | 66.1 | 39.8 |
The parameters involved in the model are α, β and the dimensionality of the latent space, where we denote the dimensionality of the latent space as d. The effects of taking different values of α, β and d on the generalized zero-shot classification and the conventional zero-shot classification are shown in Figures 2, 3 and 4.
Figure 2 shows the effect of the parameter α on the zero-shot classification results. The parameter α is used to prevent overfitting, and we take α as 0.1, 1, 10 and 100. As Figure 2 shows, the classification results on the aPY dataset first decrease and then increase as α grows. On the AWA2 dataset, the conventional zero-shot accuracy keeps increasing, while the generalized zero-shot results first decrease and then increase. On the AWA1 dataset, the conventional zero-shot accuracy first decreases and then increases, and the harmonic mean always increases.
The values of β are taken as 0.001, 0.01, 0.1 and 1. β regulates the relationship between the training set samples and the generated unseen class semantic features, and it is kept small because the training set samples are not real unseen class samples. From Figure 3, the conventional zero-shot accuracy on the aPY dataset is almost unaffected by β, but the generalized zero-shot results decrease as β increases. The conventional zero-shot accuracies on AWA1 and AWA2 are also almost unaffected by β, while the harmonic mean first increases and then decreases as β increases.
Figure 4 shows the effect of the latent dimension d on the zero-shot classification results, with d taking values of 64, 128 and 256. For the aPY dataset, the conventional zero-shot accuracy first increases and then decreases as d increases, while the harmonic mean keeps decreasing. For the AWA1 dataset, the results first increase and then decrease with increasing d, except for the classification results of the seen classes. For the AWA2 dataset, most of the zero-shot classification results first increase and then decrease with increasing d.
Figure 5 shows t-SNE visualizations for generalized zero-shot classification on the aPY dataset, where (a) and (b) show the training set samples and the unseen class samples, respectively, and (c) and (d) show the generated training set samples and generated unseen class samples.
For the training set samples, the distribution of the generated samples is almost the same as that of the original samples, and the generated samples are more dispersed between categories and more concentrated within classes. For the unseen class samples, more samples appear in orange in the original data; because $A_{US}$ is used to generate the unseen class samples in this paper, some classes are over-represented and others under-represented among the generated samples. Apart from this imbalance in sample numbers, the distribution of most generated samples is similar to that of the real samples.
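Panels such as those in Figure 5 are two-dimensional embeddings of the original and generated features; a sketch of how such plots are typically produced with scikit-learn's t-SNE is shown below (the exact t-SNE settings used for Figure 5 are not stated in the text, so the parameters here are assumptions).

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def tsne_panel(features, labels, title):
    """Project feature vectors to 2-D with t-SNE and color points by class."""
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    plt.figure()
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, s=3, cmap="tab20")
    plt.title(title)
    plt.show()

# e.g., tsne_panel(X_U, Y_U, "original unseen class samples")
```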
The ablation experiments cover the following cases: a. only Eq (1) is retained as the loss function of the model; b. $L_{\text{latent-recon}}$ is added to Eq (1); c. $L_{\text{cross-recon1}} \neq 0$ on the basis of b; d. $L_{\text{classifier}} \neq 0$ on the basis of c; e. based on d, the second term of $L_{\text{cross-recon2}}$ is non-zero; f. based on e, $L_{\text{recon2}} \neq 0$. The proposed method adds the first term of $L_{\text{cross-recon2}}$ on the basis of f. The harmonic mean H and the conventional zero-shot classification accuracy acc are shown in Table 3.
AWA1 | AWA2 | aPY | ||||
acc | H | acc | H | acc | H | |
a | 45.1 | 30.0 | 45.1 | 17.9 | 30.7 | 22.0 |
b | 52.6 | 32.3 | 59.4 | 38.4 | 38.8 | 25.0 |
c | 56.5 | 33.3 | 57.6 | 23.4 | 35.2 | 26.3 |
d | 59.7 | 48.3 | 59.7 | 40.2 | 36.8 | 34.4 |
e | 61.1 | 51.4 | 60.2 | 43.8 | 36.9 | 35.1 |
f | 61.1 | 49.8 | 61.0 | 45.7 | 37.0 | 36.4 |
The proposed method | 67.1 | 63.1 | 66.1 | 63.1 | 39.8 | 40.1 |
As can be seen in Table 3, most of the classification results improve as terms are added to the loss function. However, for the AWA1 dataset, when moving from e to f the conventional zero-shot accuracy does not change and the harmonic mean decreases slightly. Compared with the other variants, the proposed method does not improve the conventional zero-shot results particularly much, especially on the aPY dataset, where variant b is better than all variants except the proposed method. For AWA2, the results decrease when moving from b to c, especially the harmonic mean: adding $L_{\text{cross-recon1}}$ increases the seen class information, and the accuracy on the unseen classes decreases. However, for generalized zero-shot classification, the proposed method provides some information about the unseen classes during training and reduces the similarity between the generated unseen class samples and the seen class samples.
Replacing $\hat{X}$ with $X_S$ in the loss function $L_{\text{cross-recon2}}$ gives the results numbered (1) in Table 4, while the results of the proposed model are numbered (2). From Table 4, when $X_S$ is used instead of $\hat{X}$, all the zero-shot classification results decrease, especially for generalized zero-shot classification. This is because the loss function then contains more information about the seen class samples, so the results are easily biased toward the seen classes.
AWA1 | AWA2 | aPY | ||||||||||
acc | ts | tr | H | acc | ts | tr | H | acc | ts | tr | H | |
(1) | 55.1 | 37.0 | 71.8 | 48.8 | 55.6 | 34.8 | 73.0 | 47.1 | 35.8 | 27.0 | 52.1 | 35.6 |
(2) | 67.1 | 62.3 | 63.9 | 63.1 | 66.1 | 60.6 | 69.5 | 64.7 | 39.8 | 31.5 | 55.3 | 40.1 |
In this study, an autoencoder approach is used to generate samples of the unseen classes in zero-shot learning. To address the problem that the generated unseen class sample features are biased toward the seen class features, we add the unseen class semantic features together with the proposed new sample features to the cross-reconstruction loss function. This reduces the seen class information, makes the generated unseen class samples closer to the real ones and improves the classification accuracy of the unseen classes. Experimental results on three datasets verify that the proposed method achieves good results.
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.
The authors declare there is no conflict of interest.