Citation: Zhaoyan Meng, Shuting Lyu, Mengqing Zhang, Xining Li, Qimin Zhang. Sufficient and necessary conditions of near-optimal controls for a stochastic listeriosis model with spatial diffusion[J]. Electronic Research Archive, 2024, 32(5): 3059-3091. doi: 10.3934/era.2024140
Abstract
Random environmental disturbances and human activities have important effects on the survival of Listeria. In this paper, treating infected people and removing bacteria from the environment as control strategies, we developed a listeriosis model that considers random noise and spatial diffusion. By constructing a Lyapunov function, we demonstrated the existence and uniqueness of the global positive solution of the model. However, it is a challenging task to realize optimal control of the model at the lowest control cost by solving the Pontryagin stochastic maximum principle. Therefore, our study of near-optimal controls is of great significance for controlling the spread of listeriosis. Initially, we give the adjoint equations and some a priori estimates. Subsequently, the Pontryagin stochastic maximum principle is utilized to establish the sufficient and necessary conditions for achieving near-optimal controls. Ultimately, the theoretical findings are corroborated through numerical analysis.
1. Introduction
Compared to traditional offline education, e-learning offers greater flexibility and convenience. Students can access learning resources anytime and anywhere through the internet. They are not constrained by time or place, enabling them to study at their own pace and convenience. In addition, e-learning breaks down geographical barriers, which provides students with global learning opportunities. They can participate in international online courses, interact with other students and access educational resources from different cultures. The combination of e-learning and offline education is an effective way to achieve education equity [1]. However, it is essential to acknowledge that e-learning encounters challenges, including technological demands and the lack of face-to-face interaction. More importantly, teachers find it challenging to ascertain the genuineness of the emotional states conveyed by students [2].
Studies have demonstrated that emotions not only provide teachers with valuable insights into students' learning status but also significantly enhance their learning outcomes [3]. According to the Yerkes-Dodson law, solving complex algebra problems becomes feasible when emotions are maintained at a calm level; a moderately aroused state helps with simple arithmetic problems; and a highly aroused state, such as anger, makes it difficult to solve difficult problems. Therefore, it is important to recognize students' trusted emotions. In contrast to traditional offline education, the primary challenge in e-learning lies in the absence of emotional communication, potentially resulting in less favorable learning outcomes.

Emotion arises from the dynamic interplay between internal feelings and the external environment. Physiological signals, such as EEG [4], ECG [5], and EMG [6], can truly reflect emotional changes. However, collecting physiological signals in e-learning by having students wear professional equipment is not practical: such equipment is extremely expensive, which limits the scope of application, and, more importantly, wearing it may affect students' learning state. To broaden the application scenarios and reduce interference, videos collected during the learning process can be used to analyze students' learning states, i.e., facial expressions and micro-expressions [7], eye states [8], gaze points [9], head posture [10], and non-contact physiological signals [11]. These states can be extracted from videos and further used as cues to monitor students' learning states. Compared with other methods, video can record the learning state unobtrusively, providing a true, natural, and objective reflection of the student's emotional state.
2. The development of trusted emotion recognition in the field of education
2.1. The development of trusted emotion recognition
In 1997, Professor Rosalind Picard of the Massachusetts Institute of Technology (MIT) introduced the concept of affective computing, which emphasizes the connection between computing and emotions [12]. In the 1990s, Japan pioneered research on Kansei engineering by combining aesthetics with engineering. Kansei engineering aims to enhance user satisfaction by incorporating human emotional needs into the process of product design and manufacturing. During the same period, countries and companies in Europe also started research on facial expression recognition, measurement of emotional information, and wearable computing, such as the emotion research laboratory led by Klaus Scherer at the University of Geneva and the emotion robotics research group led by Cañamero at the Free University of Brussels. In terms of market applications, a multimodal shopping assistant based on the EMBASSI system was proposed in 2001 with the support of the German Ministry of Education and Research and the participation of more than 20 universities and companies; this shopping assistant takes into account the psychological and environmental needs of consumers.
2.2. The development of trusted emotion recognition in education
The research focus of affective computing is to capture signals generated by physiological indicators or behavioral characteristics using various types of sensors. Once the signals are obtained, models can be established. The signals include facial expressions, micro-expressions, speech, body gestures, hand gestures, electrocardiograms, electroencephalograms, etc. Affective computing has been applied in many fields. In the field of driving, affective computing can reduce accidents by alerting drivers when they are not concentrating or are fatigued [13]. In the field of e-commerce, affective computing automatically senses the user's purchase intention and makes accurate recommendations [14]. In the medical field, it detects possible psychological and mental anomalies by analyzing changes in patients' emotional and psychological states to assist doctors' diagnoses [15]. Emotional fluctuation refers to the changes in students' emotional states over a period of time and can reflect changes in learners' psychology and state. In the educational context, students' emotions can be influenced by various factors such as the difficulty of the learning content, the effectiveness of teaching methods, and individual factors. Integrating artificial intelligence with education enables the analysis of students' emotional fluctuations during lessons, facilitating teachers in promptly adapting teaching methods, strategies, and content. This, in turn, enhances students' learning effectiveness and outcomes. Teaching adjustment involves adapting and modifying teaching strategies, content, and support based on students' emotional fluctuations and learning needs. Understanding students' emotional fluctuations is crucial for teachers because emotions can impact students' learning experiences and outcomes. By observing and analyzing students' emotional fluctuations, teachers can make appropriate teaching adjustments to better meet students' needs and facilitate effective learning.
In the field of education, interviews and questionnaires are two commonly used methods to detect students' emotions. In an interview, students are asked face-to-face questions about a set of relevant issues, and the results are analyzed to identify their emotions. Interviews are the simplest form of emotion recognition, but this method has the following problems. First, students may try to hide their true emotions. Second, emotion is temporary by nature; since students are evaluated after the lesson, disengagement from the teaching situation can lead to emotional forgetting, which may distort the evaluation results. Questionnaires are another subjective method of emotion recognition. Participants are asked to complete a questionnaire designed to collect their emotional state, and emotion recognition is achieved by analyzing the answers [16]. The Academic Emotions Scale for University Students (AEQ) is widely used for questionnaire-based emotion recognition; it combines the arousal and valence dimensions to determine the intensity of continuous emotions. The questionnaire-based emotion recognition method has the following shortcomings. First, the reliability of the survey outcomes is a concern. Second, the number of questionnaire items influences the analysis findings: with too few questions, the results do not always reflect the true emotions; with too many questions, students get bored, which affects the final results. Third, the response rate of the questionnaire is not guaranteed. Therefore, traditional emotion recognition based on interviews and questionnaires cannot reflect trusted emotions.
Compared to these traditional methods, physiological signals can truly reflect emotional fluctuations and are therefore widely used in the field of emotion recognition [17,18,19,20]. However, professional equipment is needed to collect physiological signals, which may not only affect the learner's state but also increase the cost; thus, it can only be used in laboratory environments. Apart from physiological information, speech [21], textual [22], and body [23] information has also been used for emotion recognition tasks. However, it is difficult to obtain much speech and body movement data during e-learning.
3. Trusted emotion recognition based on video
Established multimodal emotion recognition methods need to incorporate multiple types of state data, including image, audio, text, and physiological state. In order to establish a multimodal emotion recognition model, different devices need to be utilized to obtain raw data, which may increase the cost. In addition, there is a difference in sampling frequency between different collection devices. Therefore, complex preprocessing operations are required for the alignment of the raw data. The accuracy of data alignment affects the results of emotion recognition.
Unlike emotion recognition based on multi-source state data, emotion recognition based on video data has several advantages. First, it can be achieved at a low cost. Second, only video is used, which does not require complex data alignment operations. Finally, video can record the student's state during the learning process.
Image, speech, and non-contact physiological signals can be obtained from video, which can reflect emotional changes. Furthermore, different state data can be aligned under the same equipment and the complementarity between different state data can be used to improve the accuracy of emotion recognition. The state data mainly includes a one-dimensional time series and a two-dimensional image series. The one-dimensional time series includes speech, head posture, non-contact physiological signals, eye states, and so on. The one-dimensional time series can be defined as:
$$ y = f_1(t) \qquad (1) $$
where $t$ represents time and $y$ represents the signal amplitude.
The two-dimensional image series includes facial expression and micro-expression, which can be defined as:
$$ z = f_2(x, y) \qquad (2) $$
where $x, y$ represent the row and column of the image, and $z$ represents the pixel value at $(x, y)$.
In the following section, we will summarize the relevant state data that can be extracted from videos and analyze the advantages and limitations of performing authentic emotion recognition based on different types of state data.
3.1. Trusted emotion recognition based on facial expressions
3.1.1. Facial expression recognition database
Facial expression recognition databases can be divided into three types, i.e., images collected under laboratory conditions, images collected from the internet, and images collected in specific environments. The commonly used facial expression recognition databases are listed in Table 1. Images collected in the laboratory have good quality due to the control of external factors such as lighting conditions and posture. Images collected from the internet are rich in variety, comprising different poses, occlusions, and lighting conditions. In addition, there are databases for specific application environments; for example, the NIR KMU-FED database is used for drivers' emotion recognition.
Table 1.
Commonly used facial expression recognition databases.
Facial expressions are commonly used for emotion recognition tasks. In 1976, a facial action coding system (FACS) was proposed by Ekman. FACS divides the facial region into several independent and interrelated action units (AUs). By analyzing the movement characteristics of these AUs, the controlled facial regions, and the corresponding facial expressions, standard facial actions can be derived [35]. A trained human can recognize different emotions by analyzing the movements of AUs. However, manually calibrated facial units require a lot of manpower and time resources. With the development of artificial intelligence and computer vision, automated facial expression recognition has received widespread attention. Facial expression recognition can be divided into traditional methods and deep learning methods.
The flowchart of facial expression recognition methods based on traditional features is shown in Figure 1. First, it is necessary to extract features from the image. Then, a classifier is selected to classify the extracted features. Traditional features include LBP [36], SIFT [37], HOG [38], and color features [39]. LBP features are resilient to variations in lighting and pose; therefore, they are widely used in facial expression recognition tasks. The LBP feature can be expressed as:
$$ \mathrm{LBP}(x_c, y_c) = \sum_{p=0}^{P-1} 2^p\, s(i_p - i_c) \qquad (3) $$
Figure 1.
Facial expression recognition based on traditional features.
where $(x_c, y_c)$ is the central pixel with intensity $i_c$, $i_p$ is the intensity of the neighboring pixel $p$, and $s$ is the sign function defined as:
$$ s(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise} \end{cases} \qquad (4) $$
However, the primitive LBP features have high dimensionality, which require significant memory resources and have slow computational speed. In order to overcome the shortcomings of primitive LBP features, researchers have successively proposed double LBP [40] and Riu-LBP [36]. Double LBP and Riu-LBP are defined as (5) and (8), respectively.
$$ \mathrm{LBP}^{\mathrm{Double}}_{P,R}(x) = \left\{ \mathrm{LBP}^{+}_{P,R}(x),\, \mathrm{LBP}^{-}_{P,R}(x) \right\} \qquad (5) $$
$\mathrm{LBP}^{+}_{P,R}(x)$ and $\mathrm{LBP}^{-}_{P,R}(x)$ can be expressed as:
$$ \mathrm{LBP}^{+}_{P,R}(x) = \sum_{p=0}^{P-1} s(g_p - g_x - n)\, 2^p, \quad s(u) = \begin{cases} 1, & u > 0 \\ 0, & u \le 0 \end{cases} \qquad (6) $$
$$ \mathrm{LBP}^{-}_{P,R}(x) = \sum_{p=0}^{P-1} s(g_p - g_x - n)\, 2^p, \quad s(u) = \begin{cases} 1, & u < 0 \\ 0, & u \ge 0 \end{cases} \qquad (7) $$
$$ \mathrm{LBP}^{\mathrm{Riu}}_{P,R} = \begin{cases} \sum_{p=0}^{P} s\big(g(p) - g(c)\big), & \text{if the pattern is uniform} \\ P + 1, & \text{otherwise} \end{cases} \qquad (8) $$
where $c$ is the center pixel, $g(\cdot)$ denotes the gray level of a pixel, and $s(\cdot)$ is the sign function. However, external factors such as illumination, pose changes, occlusion, and image quality can affect the results of facial expression recognition based on traditional features.
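To make Eq (3) and Eq (4) concrete, the following is a minimal NumPy sketch of the basic 8-neighbor LBP operator; the function name `basic_lbp` and the histogram-based feature vector are illustrative assumptions, not code from the cited works.

```python
import numpy as np

def basic_lbp(gray: np.ndarray) -> np.ndarray:
    """Basic 8-neighbor LBP of Eq (3): threshold each neighbor against the
    center pixel with the sign function of Eq (4) and pack the bits."""
    g = gray.astype(np.int32)
    center = g[1:-1, 1:-1]
    # 8 neighbors, ordered clockwise starting from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for p, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes += (neighbor >= center).astype(np.int32) << p  # s(i_p - i_c) * 2^p
    return codes  # one LBP code in [0, 255] per interior pixel

# Usage: the histogram of the codes serves as the expression feature vector.
# hist, _ = np.histogram(basic_lbp(image), bins=256, range=(0, 256))
```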
Unlike traditional features, deep learning-based methods can automatically extract features. The flowchart of facial expression recognition based on deep learning methods is shown in Figure 2. First, it is necessary to collect a large number of diverse facial expression images as inputs. To increase the robustness of the model, data augmentation strategies are applied to enrich the input images. Then, deep learning models are designed for feature extraction. Finally, the model is trained by minimizing the loss function to achieve facial expression recognition. Compared with traditional methods, deep learning-based methods can effectively overcome the influence of external factors on recognition results. In [41], sparse representation and extreme learning are combined to overcome the impact of head and lighting variations on facial expression recognition accuracy. Reference [42] proposed the IERN model and eliminated the effect of background and dataset bias on recognition accuracy from the perspective of causal inference. In [43], EASE was proposed to overcome the effect of image quality and inaccurate training data labeling on model accuracy and robustness; the model can accurately recognize emotions with ambiguity in noisy data. Reference [44] proposed the AffMen model, which integrates the common features of emotions with personality features learned in real time during continuous emotion recognition, reducing the impact of cultural background, gender, and personality differences on recognition accuracy. The accuracy of facial expression recognition in mask-obscured scenes is relatively low due to the facial information hidden by the mask. To solve this problem, TKNN was proposed in [45], which integrates eyebrow and eye state information in mask-obscured scenes. The algorithm utilizes facial feature points in the eyebrow and eye regions to calculate various relative distances and angles, capturing the state information of the eyebrows and eyes.
Figure 2.
Facial expression recognition based on deep learning.
In 2020, researchers at the University of California, Berkeley, analyzed the extent to which 16 types of facial expressions occurred in thousands of situations in 6 million videos from 144 countries and territories. They discovered that the 16 facial expressions appeared in similar contexts around the world, and that each specific facial expression was uniquely associated with a set of similar contexts. About 70 percent of these context-expression associations were preserved across 12 world regions. Their findings reveal a fine-grained pattern of human facial expressions preserved throughout the modern world [46]. Most research work is based on the basic discrete emotions proposed by Ekman. Facial expressions are complex, as the same type of expression at different intensities can reflect different emotional states, such as smile versus laughter, or sadness versus grief. Therefore, references [47,48] investigated fine-grained facial expression recognition and classified each basic expression into multiple categories according to its intensity. Moreover, they implemented fine-grained facial expression recognition based on convolutional neural networks and graph neural networks, respectively. Compared to convolutional neural networks, graph neural networks achieve better performance.
Although facial expressions are commonly used for emotion recognition, they cannot always truly reflect a student's emotional changes, for facial expressions can be deceptive. When students attempt to hide their true emotions, relying solely on facial expressions may not accurately reflect them. In addition, facial expressions are diverse and may not capture differences between cultures and backgrounds. Furthermore, facial expressions can be ambiguous, as the same expression can convey different emotions; for example, a smile can express joy and friendliness, but it can also express concealment or insincerity. Finally, existing facial expression recognition algorithms primarily focus on basic facial expressions, which may not comprehensively reflect students' emotions. Therefore, emotion recognition based on facial expressions has inherent limitations.
3.2. Trusted emotion recognition based on micro-expressions
3.2.1. Micro-expression recognition database
Micro-expressions can express the true emotions that people are trying to suppress and hide. However, they are difficult for untrained observers to recognize due to their short duration and low intensity. Therefore, most databases are captured using high-speed cameras in laboratory environments. The commonly used micro-expression recognition databases are listed in Table 2.
Table 2.
Commonly used micro-expression recognition databases.
The CASME, CASME Ⅱ, and CASME Ⅲ databases have been created successively by the Institute of Psychology, Chinese Academy of Sciences. The CASME Ⅲ database introduced depth information and physiological information for the first time to provide multi-dimensional data support for comprehensive and accurate analysis of micro-expression changes.
3.2.2. Micro-expression recognition methods
Compared to facial expressions, micro-expressions can reveal the true emotions that people are trying to hide. Micro-expression recognition methods can be divided into traditional methods and deep learning methods.
Micro-expression recognition based on traditional methods cannot achieve satisfactory performance due to the short duration and low intensity of micro-expressions. The reason is that there are no significant differences between the traditional features of different micro-expressions (results can be seen in Figure 3). Therefore, different micro-expressions cannot be accurately identified based on traditional features.
Figure 3.
Feature extraction based on traditional methods for micro-expressions.
Compared with traditional algorithms, deep learning methods can effectively improve classification performance. Convolutional neural networks are widely used in micro-expression recognition tasks. In [58], DTSCNN was proposed for micro-expression recognition and an accuracy of 66.67% was achieved on the CASME micro-expression dataset. Reference [59] proposed a deep neural network for micro-expression intensity recognition based on temporal-spatial features; it generates distinguishable temporal-spatial features and achieved 60.98% classification accuracy on the CASME Ⅱ dataset. Reference [60] proposed the LEARNet model, which can effectively detect small changes in facial muscles. Reference [61] proposed the TSCNN model, which consists of dynamic-temporal, static-spatial, and local-spatial components. These components are used to extract micro-expression time-varying features, appearance and contour features, and local features, respectively. Based on the fused features, an accuracy of 80.97% was achieved on the CASME Ⅱ dataset. In [62], deep recurrent convolutional neural networks were used to extract micro-expression time-varying features in terms of both appearance and shape, and an accuracy of 80.3% was achieved on the CASME Ⅱ dataset. There are few publicly available micro-expression datasets that can be used to train deep neural networks. To address this shortcoming, transfer learning was used in [63], and the results showed that the accuracy can be improved by 8% with transfer learning.
CNNs are commonly used for micro-expression recognition because they can extract temporal and spatial features that reflect micro-expression changes. Feature extraction based on deep learning is shown in Figure 4. The results show that, compared to traditional features, there are some differences in the deep features of different micro-expressions. Detailed features such as texture and contour can be extracted at shallow layers; as the number of model layers increases, the extracted features become more discriminative. Although the performance of micro-expression recognition can be improved by CNNs, they cannot model the relationships between different features. Furthermore, a large amount of training data is required to train the model. Therefore, micro-expression recognition based on CNNs still cannot obtain satisfactory performance.
Figure 4.
Feature extraction based on CNNs for micro-expressions.
Facial images reflect relatively stable local structures and local texture patterns. Therefore, facial images can be effectively represented by graphs, which possess strong processing capabilities for non-Euclidean structures. A graph can be expressed as $G=(V,E)$, where $V$ is the set of vertices and $E$ is the set of edges. An edge connecting $v_i$ and $v_j \in V$ is denoted as $e_{ij}$. If there is an edge connecting $v_i$ and $v_j$, then $v_i$ is a neighbor of $v_j$, and vice versa. The set of all neighbors of $v_i$ is denoted $N(v_i)$, given by $N(v_i)=\{v_j \mid \exists\, e_{ij}\in E \text{ or } e_{ji}\in E\}$. The number of edges with $v_i$ as an endpoint is called the degree of $v_i$, denoted $\deg(v_i)=|N(v_i)|$. In the graph structure, facial feature points are represented as nodes and the distances between feature points are represented as edges. The movement of facial muscles can then be represented as the movement of feature points. Different expressions cause feature points in different regions of the face to move with different tendencies. Therefore, micro-expression recognition can be achieved by modeling the motion of feature points. Micro-expression recognition based on a graph neural network is shown in Figure 5. First, a data augmentation strategy is used. Second, the graph structure is created based on landmarks. Third, feature extraction is performed based on the graph structure. Finally, micro-expression recognition is performed based on the extracted features. Micro-expressions are characterized by short duration and low amplitude, and their capture requires high-speed video cameras. When students use portable devices such as mobile phones, tablets, and computers for e-learning, these devices are equipped only with ordinary cameras, which makes it difficult to accurately detect micro-expressions in the captured videos.
Figure 5.
Feature extraction based on graph neural network for micro-expressions.
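As an illustration of the graph construction described above, the following is a minimal sketch that links each detected facial landmark to its k nearest neighbors to form an adjacency matrix; the function name, the choice of k, and the use of Euclidean distances as edge weights are assumptions for illustration only.

```python
import numpy as np

def build_landmark_graph(landmarks: np.ndarray, k: int = 4) -> np.ndarray:
    """landmarks: (N, 2) array of facial feature point coordinates.
    Returns an (N, N) adjacency matrix in which every node is connected to
    its k nearest landmarks; edge weights are the Euclidean distances."""
    n = landmarks.shape[0]
    diff = landmarks[:, None, :] - landmarks[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)             # pairwise distance matrix
    adj = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(dist[i])[1:k + 1]       # skip the node itself
        adj[i, nearest] = dist[i, nearest]
        adj[nearest, i] = dist[i, nearest]           # keep the graph undirected
    return adj

# A graph neural network would then take node features (e.g., landmark
# displacements between the onset and apex frames) plus this adjacency as input.
```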
3.3. Trusted emotion recognition based on eye states
3.3.1. Eye region structure database
Eye region structure databases can be divided into three types according to the auxiliary light source, that is, images collected under an infrared light source, an ordinary light source, and natural light, respectively. Commonly used eye region structure databases are listed in Table 3.
Table 3.
Commonly used eye region structure databases.
Note: 1, 2, and 3 stand for infrared light source, ordinary light source, and natural light, respectively. IRI and RGB stand for infrared image and colour image, respectively.
The eye consists of the pupil, iris, and sclera. In the facial image, the eye region is small, and the pupil and iris have similar colors. The distinguishability of the eye structures can be enhanced by using infrared light sources. Therefore, most eye structure datasets are acquired using specialized equipment with the assistance of infrared light sources. In order to improve the accuracy of eye structure segmentation, researchers have successively proposed eye structure databases based on ordinary light sources and without light source assistance.
3.3.2. Eye states recognition methods
Eye states can truly reflect emotion changes. Eye tracking is a method that can record users' real, natural, and objective behavior. By tracking the user's point of gaze and eye movements, eye tracking technology analyses the factors that attract the user's attention and emotional changes. Professional eye tracking devices are now widely used in education. The eye tracking methods can be classified as invasive and non-invasive methods.
Invasive methods were commonly used in the early stage, including the direct observation method, the mechanical recording method, the electromagnetic induction method, and optical recording. Invasive methods cannot obtain high accuracy and may cause some damage to the eye. Non-invasive methods can be divided into wearable and non-wearable methods. Wearable eye tracking systems require the user to wear special equipment such as helmets or glasses equipped with optical cameras. The weight of the helmet affects ease of use; however, wearable systems offer high accuracy and allow the head to move over a wide range. Non-wearable eye tracking systems acquire the user's facial image through a camera, then analyze and extract features from the face and the eye to obtain feature parameters that reflect changes in the line of sight. These eye feature parameters are transformed into the three-dimensional direction of the line of sight through a mapping model, so as to estimate the gaze direction and the position of the gaze point. Non-wearable eye tracking systems have the advantages of low interference, easy operation, and wide applicability. Current commercial eye tracking devices include Tobii, Polhemus, ASL, SMI, and EyeLink. Commercial devices are mainly based on the pupil-corneal reflection method, which first uses an infrared light source to produce a Purkinje spot on the cornea. Then, the direction vector between the centre of the pupil and the Purkinje spot is calculated. Finally, the direction of gaze is estimated based on this direction vector. This method requires an auxiliary light source, places high demands on the image collection equipment, and the equipment is expensive; it can only be used in experimental settings and is not easily generalized. For application environments such as e-learning, how to use ordinary cameras to acquire the region of interest during students' learning process has received the attention of relevant researchers.
Depending on the number of cameras and auxiliary light sources, gaze estimation detection methods based on ordinary cameras can be classified as single camera single light source, single camera multiple light sources, and multi-camera multiple light source methods.
The single camera single light source system needs to obtain the invariant parameters of the human eye from the acquired images. The hardware configuration of this system is simple; however, it requires a complex calibration process, and head movement may affect the accuracy of gaze estimation. The single camera multiple light source system produces multiple Purkinje spots in the eye region. Because the light sources are in fixed positions, gaze can be determined by the relative positional relationship between the Purkinje spots. The single camera multiple light source system reduces the sensitivity to head movements and allows the head to move within a certain range while maintaining gaze accuracy. For the multiple camera multiple light source system, parameters relevant to gaze detection, such as the centre of curvature of the cornea and the centre of the pupil, are obtained according to the principle of multi-camera stereo vision. The multiple camera multiple light source system mitigates the effects of head movement and light variations on the accuracy of gaze tracking. However, it requires a complex calibration process, which includes calibration of the light source, user position, display position, and camera. Moreover, auxiliary light can affect the student's learning state and increase usage costs. Therefore, how to accurately estimate gaze using ordinary cameras without disturbing the user's state or increasing the usage cost has become an urgent problem [75].
The region of interest may change as learning content changes. Although it cannot be directly used for emotion recognition, it can provide supportive clues for emotion recognition, as shown in Figure 6.
Figure 6.
Emotion recognition based on gaze estimation.
In addition to gaze points, gaze duration, gaze counts, pupil size, eye closure time, and blink frequency can also be used to reflect emotional changes. However, when using ordinary cameras for eye state analysis, exogenous factors such as lighting, posture, occlusion, and image quality can affect the accuracy, because the eye occupies only a small part of the face region and the pupil and iris have similar colors. Current eye state detection requires auxiliary light sources to improve the distinguishability of the eye region structure, and professional equipment such as head-mounted cameras, infrared cameras, or depth cameras to capture high-resolution eye images. It is therefore difficult to analyze the eye state based on image sequences captured by ordinary cameras, as shown in Figure 7.
Figure 7.
Eye region image captured from (a) near-infrared camera, (b) visible light camera (daylight), (c) visible light camera (night).
Moreover, when employing deep learning techniques for eye region structure segmentation, the training data's proximity to the real-world application scenario directly impacts the model's training efficacy and the accuracy of the segmentation results. However, most of the datasets (as shown in Table 3) are collected in laboratories or under controlled conditions using professional equipment with the assistance of light sources. Therefore, the insufficiency of relevant datasets restricts the development of deep learning-based methods for eye region structure segmentation on images captured by ordinary cameras under natural light.
3.4. Trusted emotion recognition based on non-contact physiological signals
Physiological signals such as heart rate, oxygen saturation, respiratory rate, heart rate variability, and blood pressure can reflect emotional changes. Unlike the collection of physiological signals using specialized equipment, non-contact physiological signals can be extracted from video based on remote photoplethysmography (rPPG).
3.4.1. Non-contact physiological signals database
The databases of non-contact physiological signals are acquired in a laboratory environment: heart rate information is recorded using sensors while a camera records images of the subject's face. Commonly used non-contact physiological signal databases are shown in Table 4.
Table 4.
Commonly used non-contact physiological signals databases.
3.4.2. Trusted emotion recognition method based on non-contact physiological signals
Heart rate, heart rate variability, and pulse oximetry values can be extracted from video based on rPPG. The reflection from the skin recorded by the camera can be defined as a time-varying function of the color channel, which can be expressed as:
$$ C_k(t) = I(t) \times \big( V_s(t) + V_d(t) \big) + V_n(t) \qquad (9) $$
where $C_k(t)$ represents the value of the $k$-th pixel at time $t$; $I(t)$ represents the illumination intensity at time $t$, which can be affected by changes in illumination and by the distance between the light source, the skin, and the camera; and $V_s(t)$, $V_d(t)$, and $V_n(t)$ represent the specular reflection, the diffuse reflection, and the noise at time $t$, respectively.
When light reaches the skin, a large amount of it is reflected off the skin surface and only a small amount enters the tissue; therefore, the specular reflection does not contain pulse information. The specular reflection can be expressed as:
$$ V_s(t) = u_s \times \big( s_0 + s(t) \big) \qquad (10) $$
Unlike the specular reflection, the concentration of hemoglobin changes periodically with the heartbeat, so the intensity of the light passing through the hemoglobin also changes periodically, which can be used to reflect the pulse. The diffuse reflection can be expressed as:
$$ V_d(t) = u_d \times d_0 + u_p \times p(t) \qquad (11) $$
where $p(t)$ represents the pulse signal.
Taking into account both the specular and the diffuse reflection, $C_k(t)$ can be expressed as:
$$ C_k(t) = I(t) \times \big( u_s \times s_0 + u_s \times s(t) + u_d \times d_0 + u_p \times p(t) \big) + V_n(t) \qquad (12) $$
Define $u_c \times c_0 = u_s \times s_0 + u_d \times d_0$ and $I(t) = I_0 + I_0 \times i(t)$, where $u_c$ represents the unit color vector of the skin's intrinsic reflection, $c_0$ represents its intensity, $I_0$ represents the stationary light intensity, and $I_0 \times i(t)$ represents the varying light intensity. The skin reflection model can then be expressed as:
$$ C_k(t) = \big( I_0 + I_0 \times i(t) \big) \times \big( u_c \times c_0 + u_s \times s(t) + u_p \times p(t) \big) + V_n(t) \qquad (13) $$
According to the skin reflection model, the image sequence $C_k(t)$ collected by the camera contains the pulse signal $p(t)$, and a decomposition algorithm can be used to extract it. The flowchart of non-contact physiological signal detection can be seen in Figure 8. First, the face is captured by the camera; after face detection, the region of interest is extracted from the facial area. Second, in order to obtain better performance, signal preprocessing is conducted, including detrending, filtering, signal decomposition, etc. Then, the BVP signal is extracted from the preprocessed signal. Finally, the FFT is used to transform the BVP signal from the time domain to the frequency domain, and the heart rate can be calculated from the dominant frequency.
Figure 8.
The flowchart of non-contact physiological detection.
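The following is a minimal sketch of the pipeline in Figure 8, assuming the mean green-channel value of the facial ROI has already been extracted for each frame; the band limits (0.7-4 Hz), filter order, and function name are illustrative assumptions rather than the exact settings used in the cited studies.

```python
import numpy as np
from scipy.signal import butter, detrend, filtfilt

def estimate_heart_rate(green_means: np.ndarray, fps: float) -> float:
    """green_means: mean green-channel value of the facial ROI per frame.
    Returns the dominant pulse frequency converted to beats per minute."""
    signal = detrend(green_means)                        # remove slow trends
    # Band-pass roughly 0.7-4 Hz (about 42-240 bpm) to isolate the pulse band
    b, a = butter(3, [0.7 / (fps / 2), 4.0 / (fps / 2)], btype="band")
    bvp = filtfilt(b, a, signal)                         # rough BVP estimate
    spectrum = np.abs(np.fft.rfft(bvp)) ** 2
    freqs = np.fft.rfftfreq(len(bvp), d=1.0 / fps)
    band = (freqs >= 0.7) & (freqs <= 4.0)
    peak = freqs[band][np.argmax(spectrum[band])]        # dominant frequency
    return peak * 60.0                                   # Hz -> beats per minute

# Usage (hypothetical trace): hr = estimate_heart_rate(roi_green_trace, fps=30)
```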
Non-contact physiological signals have been applied to emotion recognition, medical care, state monitoring, biometric recognition, and other fields [86,87]. In [88], non-contact physiological signals were used to recognize micro-expressions. First, heart rate variability is extracted from the video. Then, time-domain, frequency-domain, and statistical features are extracted separately. Finally, the extracted features are fused and micro-expression recognition is performed based on the fused features. The accuracy is improved by 17.05% compared to micro-expression recognition based on facial images. In [89], pulse rate variability was extracted from videos for micro-expression recognition. The CASME Ⅱ micro-expression dataset was used to test the performance, and an accuracy of 60% was reached; compared to micro-expression recognition based on heart rate variability, the accuracy is improved by 0.21%. In [90], heart rate variability was extracted from video to monitor the player's emotions. Most research work performs physiological signal detection on frontal faces, because face-related regions can be stably extracted from frontal faces. References [91] and [92] addressed heart rate detection when face information is missing in educational scenarios; the proposed methods can obtain stable and accurate heart rate measurements when faces are missing.
Compared with measurement methods that require wearing sensors, such as ECG and EMG, rPPG-based measurement of physiological data does not affect the user's state. In addition, it expands the application scenarios by removing the dependence on professional equipment. However, the accuracy of the measurement results can be affected by many factors, such as light, posture, image resolution, region of interest, signal decomposition algorithm, length of the preprocessed video, and the initial value of the heart rate. In addition, physiological signals are private [93]. How to accurately measure the relevant physiological signals in a safe and reliable environment is the prerequisite and guarantee for emotion recognition. Although researchers have proposed more robust and more accurate algorithms [94,95] for measuring physiological parameters under different exogenous conditions, the accuracy still needs to be improved compared to contact methods.
3.5. Trusted emotion recognition based on speech
3.5.1. Speech emotion recognition database
The datasets for speech emotion recognition can be divided into discrete emotion recognition databases and continuous emotion recognition databases. Commonly used speech emotion recognition databases are shown in Table 5. In discrete emotion recognition databases, the emotion category is determined by annotators' votes. In continuous emotion recognition databases, values along different emotion dimensions are quantified with the help of the MAAT or SAM system.
Table 5.
Commonly used speech emotion recognition databases.
Speech signals are also widely used in the field of emotion recognition. The flowchart of emotion recognition based on speech signals is shown in Figure 9.
Figure 9.
The flowchart of speech emotion recognition.
Speech emotion recognition algorithms can be divided into traditional algorithms and deep learning algorithms. For traditional algorithms, acoustic features are first extracted from the original signals; then, a classifier is selected to classify the extracted features. The acoustic features include rhythm, power spectrum, acoustic, and non-linearity features, etc. Among them, MFCC features are widely used in speech recognition tasks. Let $x(n)$ denote the input speech signal. First, pre-emphasis is applied to the input signal, which enhances the higher-frequency components relative to the lower-frequency components. This can be achieved using the following formula:
$$ y(n) = x(n) - \alpha x(n-1) \qquad (14) $$
where y(n) is the pre-emphasized signal, x(n) is the original speech signal, and α is the pre-emphasis coefficient. Then, the pre-emphasized signal y(n) is divided into short frames of length N. Overlapping frames can be obtained using a frame shift of M with a desired overlap ratio R. Each frame can be represented as:
$$ x_i(n) = y(n + i \times M), \quad 0 \le i \le L - 1 \qquad (15) $$
where $L$ is the number of frames, $i$ is the frame index, and $n$ is the sample index within a frame. After frame segmentation, a window function $w(n)$ is applied to each frame signal. Commonly used window functions include the Hamming window. The windowed frame can be calculated as:
$$ x_i(n) = x_i(n) \times w(n), \quad 0 \le n \le N - 1 \qquad (16) $$
The fourth step is to apply the DFT (discrete Fourier transform) to each windowed frame $x_i(n)$ to obtain the frequency spectrum $X_i(k)$, and then a set of Mel filters is applied to the frequency spectrum. These filters are uniformly spaced on the Mel frequency scale. The frequency response of the $m$-th triangular Mel filter, used to compute each filter's output, is defined as:
$$ H_m(k) = \begin{cases} 0, & k < f(m-1) \\ \dfrac{k - f(m-1)}{f(m) - f(m-1)}, & f(m-1) \le k \le f(m) \\ \dfrac{f(m+1) - k}{f(m+1) - f(m)}, & f(m) < k \le f(m+1) \\ 0, & k > f(m+1) \end{cases} \qquad (17) $$
where $m$ is the index of the Mel filter, $k$ is the frequency bin index, and $f(m)$ is the center frequency of the $m$-th Mel filter. The center frequencies can be calculated using the inverse Mel frequency scale function:
$$ f(m) = f_{\mathrm{mel}}^{-1}(m) \qquad (18) $$
The sixth step is to compress the dynamic range by taking the logarithm of the output energy of each Mel filter, which can be computed as:
$$ s_i(m) = \log \left( \sum_{k=0}^{K-1} |X_i(k)|^2 \cdot |H_m(k)|^2 \right) \qquad (19) $$
where $s_i(m)$ is the output energy of the $m$-th Mel filter for the $i$-th frame and $K$ is the length of the spectrum. Then, the DCT (discrete cosine transform) is applied to the log-compressed energy spectrum $s_i(m)$ to obtain the cepstral coefficients. The DCT can be computed as:
$$ C_i(l) = \sum_{m=0}^{M-1} s_i(m) \cdot \cos\!\left[ \frac{\pi}{M} \, l \, (m + 0.5) \right] \qquad (20) $$
where $C_i(l)$ is the $l$-th cepstral coefficient of the $i$-th frame and $M$ is the number of Mel filters. Finally, only the first few cepstral coefficients are retained as the final MFCC features.
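As a hedged illustration of the MFCC pipeline described in Eqs (14)-(20), the sketch below applies pre-emphasis manually and then relies on the librosa library (an assumption; any equivalent implementation of framing, windowing, Mel filtering, and the DCT would do) to produce the coefficients.

```python
import numpy as np
import librosa  # assumed available; it implements framing, Mel filtering and DCT

def extract_mfcc(path: str, n_mfcc: int = 13, alpha: float = 0.97) -> np.ndarray:
    """Returns an (n_mfcc, n_frames) matrix of MFCC features for one utterance."""
    x, sr = librosa.load(path, sr=None)             # raw speech signal x(n)
    y = np.append(x[0], x[1:] - alpha * x[:-1])     # pre-emphasis, Eq (14)
    # Framing, windowing, DFT, Mel filterbank, log compression and DCT
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=512, hop_length=256)

# The per-frame coefficients (or their statistics over the utterance) are then
# fed to a classifier such as an SVM, GMM, HMM, or LSTM for emotion recognition.
```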
Reference [105] extracted MFCC features from speech signals and used an HMM as the classifier for emotion recognition. Reference [106] proposed a hierarchical speech emotion recognition framework: in the first level, emotion probabilities are obtained from MFCC features with SVM and GMM classifiers, respectively; these probability values are then used as inputs to an SVM classifier in the second level to obtain the final recognition results. Testing was performed on the Emo-DB dataset, and the accuracy was improved by 8% and 9% compared to single SVM and GMM classifiers, respectively. Environmental noise affects speech emotion recognition performance, and the functional paralanguage contained in speech (such as sighs, questioning, and laughter), although it carries a lot of emotional information, also affects the accuracy of speech emotion recognition due to the interference of feature bursts. Reference [107] proposed a speech emotion recognition method that fuses functional paralanguage. The functional paralanguage of the utterance to be recognized is detected by an automatic detection algorithm based on fixed-length segmentation and the SPS-DM model, yielding the pure speech signal and the functional paralanguage signal. The two types of signals are then fused using a confidence-based adaptive weight fusion method to improve the accuracy and robustness of recognition. A recognition accuracy of 67.41% is obtained on the PFSED dataset. Reference [108] performed feature extraction based on deep networks for emotional speech and functional paralanguage, respectively. Then, they discussed the effect of different feature fusion algorithms on the recognition results. For feature-level fusion, affective features and functional paralinguistic features are fused by concatenation, and the fused features are used as inputs to a Bi-LSTM model for emotion recognition. For model-level fusion, a cell-coupled LSTM is used instead of the Bi-LSTM; it contains two parts, one processing the affective features and the other processing the functional paralinguistic features. For decision-level fusion, two different classifiers are trained using the affective features and the functional paralinguistic features, respectively, and the results are fused using linear weighting. Tested on the NNIME dataset, decision-level fusion achieved the highest recognition accuracy of 61.92%.
Deep learning-based speech emotion recognition algorithms use an end-to-end approach that can effectively extract emotional features with high recognition accuracy and robustness. Since convolutional neural networks and recurrent neural networks can effectively perform feature extraction and capture contextual information in speech, they are widely used in deep learning-based speech emotion recognition. Reference [109] used a convolutional neural network and time-domain pyramid matching for feature extraction from speech signals and recognized the fused features based on an SVM. First, the Mel spectrogram of the speech signal is extracted and segmented according to the speech segments. Then, the high-level features of the Mel spectrogram of each segment are extracted using the AlexNet model, and the learned high-level features are fused using a time-domain pyramid. Finally, the fused features are classified using an SVM. The EMO-DB, RML, eNTERFACE05, and BAUM-1s datasets were used to test the performance, and all of them yielded high classification accuracy. Reference [110] performed feature extraction and feature fusion on speech spectrograms based on a convolutional neural network and multiple kernel learning, respectively, and then the support vector machine algorithm was used for emotion recognition; it was tested on the EMO-DB and CASIA datasets and achieved 86% and 88% recognition accuracy, respectively. Reference [111] extracted MFCC features and Mel spectrograms from speech signals and performed speech emotion recognition based on a bidirectional LSTM model. The bidirectional LSTM model consists of two parts, performing emotion recognition based on MFCC features and on two Mel spectrograms with different time-frequency resolutions; after obtaining the outputs of the two parts, feature fusion is performed at the decision level. Reference [112] performed emotion recognition based on 1D-CNN-LSTM and 2D-CNN-LSTM models using speech signals and Mel spectrograms, respectively. The results show that the 2D-CNN-LSTM is able to learn local features, global features, and temporal dependencies of emotions better than the 1D-CNN-LSTM; it was tested on the Emo-DB and IEMOCAP datasets and obtained 92.9% and 89.16% recognition accuracy, respectively. Reference [113] extracted spectrograms, MFCCs, cochleagrams, and fractal dimensions from speech signals, transformed the time sequences into image sequences, and used an end-to-end approach combining an attention mechanism with a 3D CNN-LSTM model to achieve emotion recognition; it was tested on the RAVDESS, RML, and SAVEE datasets and achieved 96.18%, 93.2%, and 87.5% recognition accuracy, respectively.
Feature extraction from speech spectrograms based on convolutional neural networks loses spatial information, which is closely related to low-level features such as formants and pitch. Capsule networks can extract spatial information from the speech spectrogram and transfer the information by dynamic routing. Reference [114] introduced capsule networks into the speech emotion recognition task. To improve the model's ability to capture contextual features, they introduced a recurrent neural network to capture the time-domain information in speech and achieved 72.73% accuracy on the IEMOCAP dataset. However, the capsule network has an internal iterative routing protocol algorithm, so it is slower; in particular, as the size of the dataset increases, the number of training epochs needs to be increased. Furthermore, model compression algorithms designed for convolutional neural networks cannot be directly applied to capsule networks. Therefore, capsule networks have higher complexity. Reference [115] proposed the DC-LSTM COMP-CapsNet capsule network model, which can reduce the model's computational complexity while ensuring recognition accuracy; it was tested on four public datasets and achieved 89.3% recognition accuracy.
Compared to convolutional neural networks and recurrent neural networks, the transformer has superior performance in feature extraction and long-sequence modeling. Therefore, the transformer was introduced into the field of speech emotion recognition. Reference [116] proposed an unsupervised domain adaptation approach based on the transformer and mutual information for cross-corpus speech emotion recognition. The transformer is used to extract dynamic speech features from the Mel spectrogram. Then, common and individual features of the different source corpora are learned from the extracted features based on maximum and minimum mutual information strategies to eliminate the differences between corpora. Finally, an interactive multi-head attention fusion strategy is proposed to learn the complementarity between different features, and speech emotion recognition is performed based on the fused features. Reference [117] proposed a transformer-like model for speech emotion recognition in order to reduce the excessive time and memory overhead of training the transformer model; the method achieves recognition accuracy similar to the original transformer while reducing the time and memory overheads during training. Transformer-based approaches usually require frame-by-frame computation of attention coefficients, which cannot capture local emotion information and is susceptible to noise interference. Reference [118] proposed the BAT model: by calculating the self-attention of the spectrogram after chunking, it can mitigate the effect of local noise generated by high-frequency energy while capturing real emotions. In addition, they proposed a cross-block attention mechanism to facilitate the information interaction between blocks, and they integrated the FCCE module to reduce attentional bias. The IEMOCAP and Emo-DB datasets were used to test the performance, and accuracies of 75.2% and 89% were obtained. Transformer-based speech emotion recognition models require a large amount of data for adequate training. In order to remedy the problem of insufficient training data, [119] proposed the ADAN model for generating training samples. They combined the generated data and tested on the Emo-DB and IEMOCAP datasets, achieving recognition accuracies of 84.49% and 66.92%, respectively.
3.6. Trusted emotion recognition based on other signals
Head posture, hand posture, and facial feature points can also be used for emotion recognition. Head posture, i.e., head motion, is characterized by calculating the pitch, yaw, and roll angles of the head in space. The flowchart of emotion recognition based on head posture is shown in Figure 10. First, feature point detection is performed on the facial image and the detected facial feature points are matched with a 3D face model. Second, the affine transformation matrix from the 3D face model to the 2D facial feature points is determined according to the relationship between the world coordinate system, the camera coordinate system, and the imaging plane. The Euler angles consist of yaw, roll, and pitch, which can be defined as $\alpha$, $\beta$, and $\gamma$, and the corresponding rotation matrix can be defined as follows:
$$ R_{wb} = \big( R_x(\alpha)\, R_y(-\beta)\, R_z(\gamma) \big)^{T} \qquad (21) $$
Figure 10.
The flowchart of emotion recognition based on head posture.
where
$$ R_x(\alpha) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\alpha & \sin\alpha \\ 0 & -\sin\alpha & \cos\alpha \end{bmatrix}, \quad R_y(-\beta) = \begin{bmatrix} \cos\beta & 0 & \sin\beta \\ 0 & 1 & 0 \\ -\sin\beta & 0 & \cos\beta \end{bmatrix}, \quad R_z(\gamma) = \begin{bmatrix} \cos\gamma & \sin\gamma & 0 \\ -\sin\gamma & \cos\gamma & 0 \\ 0 & 0 & 1 \end{bmatrix} $$
Thus, yaw, roll, and pitch can be calculated as follows:
$$ \alpha = \operatorname{arctan2}\big( R_{wb}(3,2),\, R_{wb}(3,3) \big) \qquad (22) $$
$$ \beta = \arcsin\big( R_{wb}(3,1) \big) \qquad (23) $$
$$ \gamma = \operatorname{arctan2}\big( R_{wb}(2,1),\, R_{wb}(1,1) \big) \qquad (24) $$
Third, the time series of head posture is obtained based on the Euler angles and quaternions. Finally, feature extraction of head posture is performed for emotion recognition.
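A minimal sketch of Eqs (22)-(24) is given below; it assumes a 3x3 head rotation matrix has already been estimated (for example, from facial landmarks via a PnP solver followed by cv2.Rodrigues), and the function name and degree conversion are illustrative choices.

```python
import numpy as np

def rotation_to_euler(R: np.ndarray) -> np.ndarray:
    """R: 3x3 head rotation matrix. Returns the three Euler angles of
    Eqs (22)-(24) in degrees, using 0-based indexing for the matrix entries."""
    alpha = np.arctan2(R[2, 1], R[2, 2])   # Eq (22)
    beta = np.arcsin(R[2, 0])              # Eq (23)
    gamma = np.arctan2(R[1, 0], R[0, 0])   # Eq (24)
    return np.degrees(np.array([alpha, beta, gamma]))

# Applying this per frame yields the head-posture time series from which
# features are extracted for emotion recognition.
```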
Facial feature points can also be extracted as features for emotion recognition. Reference [120] judged the user's blinking based on the positional relationship between the eye feature points. Moreover, they extracted the duration of blinking, frequency of blinking, average aspect ratio, and eye distance when eyes are open as eye movement features based on the obtained time series to judge the learning engagement of students. Reference [121] recognized four discrete emotions through the relative positional relationships between facial feature points and the angle relationships between feature points in different parts of the face. When the model detects negative emotions in the learning process of students with autism, it automatically regulates the learning content and attention by means of animation in order to enhance the online learning ability of students with autism.
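As an illustration of blink-related eye features of the kind used in [120], the sketch below computes the widely used eye aspect ratio from six eye-contour landmarks; the landmark ordering and the 0.2 closure threshold are assumptions and not necessarily the values used in the cited work.

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """eye: (6, 2) array of eye-contour landmarks (corner, two upper points,
    corner, two lower points, as in the common 68-point layout). The ratio
    drops sharply when the eye closes, so thresholding the per-frame value
    yields blink events, blink durations, and blink frequency."""
    v1 = np.linalg.norm(eye[1] - eye[5])   # first vertical distance
    v2 = np.linalg.norm(eye[2] - eye[4])   # second vertical distance
    h = np.linalg.norm(eye[0] - eye[3])    # horizontal distance
    return float((v1 + v2) / (2.0 * h))

# Example rule: count a blink when the ratio stays below ~0.2
# for several consecutive frames.
```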
In the online learning environment, although it is not possible to perform affective analysis on the learner's complete body posture, affective changes can be analyzed through Hand-over-Face (HOF) gestures, which often occur during the learning process. It has been shown that different hand movements and hand shapes imply different affective states, which in turn affect behavioral engagement [122,123], as shown in Figure 11.
Figure 11.
The relationship between HOF and emotions.
3.7. Trusted emotion recognition based on multiple signals
Emotion recognition based on a single signal often cannot achieve satisfactory performance, which limits its scope of application. The advantages and disadvantages of single-signal methods are summarized in Table 6.
Table 6. Summary of the emotion recognition based on signal feature.

Methods | Advantage | Disadvantage
3.1 | Facial images are less difficult to acquire | Does not always truly reflect changes in emotion
3.2 | Can truly reflect changes in emotions | Short duration, low amplitude and high difficulty
3.3 | Can truly reflect changes in emotions | Low accuracy of segmentation based on common equipment
3.4 | Can truly reflect changes in emotions | Easy to be disturbed by external factors, low accuracy
Compared with a single signal, the accuracy of emotion recognition can be improved by exploiting the complementarity among multiple types of signals. Established research methods combine video with audio, textual content, and physiological signals to achieve emotion recognition. The multimodal emotion recognition process is shown in Figure 12.
Figure 12.
The flowchart of emotion recognition based on multiple signals.
There are large differences in dimension and type between different signals. Therefore, data fusion is needed to take advantage of the complementarity between multimodal signals for emotion recognition. Fusion methods can be classified into data-level fusion, feature-level fusion, decision-level fusion, and model-level fusion. Data-level fusion, also known as sensor-level fusion, directly combines the original data collected by the sensors to generate a new dataset; it preserves the original information and maintains its integrity, but it is complicated and cumbersome. Feature-level fusion extracts features from the different raw data streams and then combines them into a new feature set, retaining the information of the original data to the greatest extent; it has been shown that feature-level fusion can achieve the best recognition performance [124]. Decision-level fusion makes a joint decision based on the outputs and credibility of each model and is easier to implement than feature-level fusion. The main strategy of model-level fusion is to learn more complex relationships by building deep network models, which can fit more complex features and increase the nonlinear fitting capability. Signals can be extracted from video as shown in Figure 13, including one-dimensional and two-dimensional signals. The one-dimensional signals include head posture, non-contact physiological signals, pupil size, and eye state; the two-dimensional signals include facial expression, micro-expression, and gaze estimation. Data fusion can be performed at the feature level (e.g., Figure 14(a)) and at the decision level (e.g., Figure 14(b)) for emotion detection.
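As an illustration of the difference between feature-level and decision-level fusion, the sketch below trains classifiers on synthetic "audio" and "visual" feature matrices; all shapes, feature names, and the choice of logistic regression are assumptions for demonstration.

```python
# A minimal sketch contrasting feature-level and decision-level fusion on
# synthetic "audio" and "visual" feature matrices (all names are illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
audio_feat = rng.normal(size=(n, 40))      # e.g., acoustic features per sample
visual_feat = rng.normal(size=(n, 60))     # e.g., facial-expression features per sample
labels = rng.integers(0, 4, size=n)        # four emotion classes

# Feature-level fusion: concatenate modality features and train one classifier.
fused = np.hstack([audio_feat, visual_feat])
clf_fused = LogisticRegression(max_iter=1000).fit(fused, labels)

# Decision-level fusion: train one classifier per modality, average their
# predicted class probabilities, then take the argmax.
clf_a = LogisticRegression(max_iter=1000).fit(audio_feat, labels)
clf_v = LogisticRegression(max_iter=1000).fit(visual_feat, labels)
proba = (clf_a.predict_proba(audio_feat) + clf_v.predict_proba(visual_feat)) / 2.0
decision_level_pred = proba.argmax(axis=1)

print(clf_fused.predict(fused)[:5], decision_level_pred[:5])
```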
Multimodal emotion recognition databases contain visual, speech, textual, and physiological data, where the visual data include facial expressions, body posture, eye state, etc.; the speech data contain spoken content that can reflect the subject's emotion; and the physiological data include EEG, EMG, heart rate, and so on. According to the emotion label category, these databases can be divided into discrete emotion recognition databases and continuous emotion recognition databases. Commonly used multimodal emotion recognition databases are shown in Table 7.
Table 7.
Commonly used multimodal emotion recognition databases.
Video and audio are commonly combined for emotion recognition tasks. Reference [132] proposed a bilinear pooled attention network model based on adaptive and multilayer decomposition for emotion recognition. For audio signals, an audio encoder was used to encode the spectrogram, and a convolutional network with an attention mechanism and global response normalization was designed to extract audio features. For video signals, a video encoder was used to encode the faces extracted from the video, and the encoding results were weighted using a one-dimensional attention mechanism. Then, the FBP method, based on the self-attention mechanism, was proposed to fuse the audio and video features. Finally, emotion recognition was performed based on the fused features. Results show that an accuracy of 61.1% was reached by combining video and audio data; compared with using a single modality, improvements of 9.03% and 26.11% were achieved, respectively. Reference [133] proposed a convolutional neural network model based on a two-stage fuzzy fusion strategy for emotion recognition using the facial images and speech extracted from video. Experiments were conducted on three multimodal datasets from which facial expressions and speech were extracted. The results show that, compared with using facial expression or speech alone, the accuracy can be improved by up to 17.55%, 26.36%, and 37.25% on the eNTERFACE'05, AFEW, and SAVEE datasets, respectively. Similar to [133], reference [134] detected not only facial images but also facial feature points in the video. They extracted HOG-TOP features from facial images and geometric features from facial feature points, and used both to represent visual features. For the speech signal, the relevant acoustic features were extracted. Finally, an SVM classifier was used to recognize emotion based on the fused features. Results show that an accuracy of 45.2% on the AFEW 4.0 database can be obtained based on dynamic, acoustic, and geometric features. When a user speaks, facial changes caused by speech can be confused with facial changes caused by emotional changes. Therefore, [135] first divided the face into upper and lower parts. Second, the pitch and content of the speech were combined with the upper and lower parts of the facial image, respectively, and the emotion at that moment was estimated separately. Third, emotion recognition was performed on the speech signal using an SVM classifier. Finally, the recognition results based on the upper and lower facial images and the speech signal were fused at the decision level for emotion recognition. The IEMOCAP and SAVEE datasets were used to evaluate the algorithm's performance, and accuracies of 67.22% and 86.01% were achieved, respectively. Further analysis found that the upper face region, particularly the eyebrow region, is highly associated with speech emphasis. Reference [136] combined facial expression, intonation, and speech content. Initially, facial features were extracted using FEA2. Then, features were extracted for intonation and speech content separately. Finally, emotion recognition was performed on the fused features using SVM. By combining visual, audio, and text information, an accuracy of 95.6% was reached; compared with each single signal, the accuracy was improved by up to 51.2%, 9.6%, and 21.7%, respectively.
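HOG-TOP extends HOG to three orthogonal planes of a video volume; as a simplified stand-in, the sketch below computes a plain per-frame HOG descriptor for synthetic face crops using scikit-image. It is not the feature pipeline of [134].

```python
# A simplified stand-in for HOG-TOP: compute a plain HOG descriptor for each
# (synthetic) face frame and stack them into a clip-level visual feature.
import numpy as np
from skimage.feature import hog

def clip_hog_features(frames):
    """frames: (T, H, W) grayscale face crops; returns one descriptor per frame."""
    descriptors = [
        hog(f, orientations=9, pixels_per_cell=(8, 8),
            cells_per_block=(2, 2), feature_vector=True)
        for f in frames
    ]
    return np.stack(descriptors)

frames = np.random.rand(16, 64, 64)          # 16 hypothetical 64x64 face frames
feats = clip_hog_features(frames)
print(feats.shape)                           # (16, descriptor_dim)
```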
To improve the performance of emotion recognition, text is also combined with video and audio. Due to the dimensional differences among different modal data, most existing research methods need to design different fusion networks, whose designs are based on these dimensional differences when performing multimodal data fusion. In [137], a modality-independent fusion network based on the Transformer was proposed, which can fuse video, audio, and textual features. The experiments were carried out on the Hume-Reaction dataset, and the performance was reported in terms of the Pearson correlation. The best performance was obtained by combining image, text, audio, and FAU features, and the difference from the other configurations is statistically significant. Reference [138] investigated the fusion of different modal features. They first used the Transformer for feature extraction from the three modalities. Then, fusion was performed based on the intermediate-layer features. The experimental results show that better recognition accuracy can be obtained based on intermediate-layer features than on deep features. Multimodal data fusion can take advantage of the complementarity between different data; however, while variations in features are expected among different modal data, they may also introduce redundant information, which leads to incomplete learned multimodal features. The experimental results on IEMOCAP and CMU-MOSEI show that, compared with all baseline models, multimodal data fusion obtains the best results, which demonstrates the effectiveness of sufficiently leveraging the multimodal interactions among the intermediate representations. Reference [139] proposed FDMER, a multimodal emotion recognition method based on decoupled representation learning. FDMER maps each modality to a modality-invariant subspace and a modality-specific subspace, which learn the common and individual features among the different modalities, respectively. The learned features are concatenated, and adaptive weights are learned using a cross-modal attention fusion module to obtain effective multimodal features. Furthermore, most multimodal models fail to capture the associations between modalities; since fine-grained information is ignored, discriminative features cannot be extracted, which limits generality. The CMU-MOSI, CMU-MOSEI, and UR_FUNNY datasets were used to test the performance. Compared with the single text, audio, and visual modalities, the best performance was obtained with multimodal data fusion, with accuracies of 84.2%, 83.9%, and 70.55% on the three datasets, respectively. Reference [140] proposed a feature fusion mechanism to encode temporal-spatial features and learn a more complete feature representation. Then, a late fusion strategy is used to capture fine-grained relationships between modalities, combined with an integration strategy to improve model performance. Results show that multimodal fusion is an efficient approach for further performance improvement by integrating complementary emotion-related information from each modality, and the ensemble method can effectively combine the advantages of the single models.
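A cross-modal attention fusion module of the kind described above can be sketched with a standard multi-head attention layer in which one modality provides the queries and another the keys and values. The code below is a minimal illustration with assumed dimensions, not the FDMER implementation.

```python
# A minimal sketch (not FDMER itself) of cross-modal attention fusion: text
# token features attend to audio frame features, and the attended output is
# pooled and concatenated with the pooled text representation. Sizes are assumed.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, text_feat, audio_feat):
        # text_feat: (B, T_text, d), audio_feat: (B, T_audio, d)
        attended, _ = self.attn(query=text_feat, key=audio_feat, value=audio_feat)
        fused = torch.cat([text_feat.mean(dim=1), attended.mean(dim=1)], dim=-1)
        return self.out(fused)            # (B, d_model) multimodal representation

fusion = CrossModalFusion()
rep = fusion(torch.randn(8, 20, 128), torch.randn(8, 300, 128))
print(rep.shape)  # torch.Size([8, 128])
```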
Besides audio and text, other signals such as touch and EEG are also combined with facial images for emotion recognition. Reference [141] combined facial expression with haptics. They collected pressure values under different emotions through pressure sensors and combined the pressure values with facial expressions for continuous emotion detection. The analysis shows that participants usually integrated both visual and touch information to evaluate emotional valence, and adding this type of tactile stimulation to multimodal emotional communication platforms might enhance the quality of the mediated emotional interactions. In [142], a DCN was applied to fuse EEG signals and facial expressions at the feature level to classify positive, neutral, and negative emotions for deaf subjects. First, the EEG signals and facial expressions were simultaneously collected while the subjects were watching emotional movie clips. Then, the EEG signals were preprocessed, and differential entropy (DE) features in different frequency bands were extracted to obtain 310-dimensional features. The first frame of each second was extracted from the facial video recordings to select a portion of the facial expression, and the image was resized to 30 × 30. Finally, the two modal features were fused as the input of the DCN to classify the three emotions. The classification accuracy is 98.35%, which is 0.14% higher than facial expression emotion recognition and 0.96% lower than EEG emotion recognition. Unlike [142], which extracted facial expression and EEG features separately, [143] first transformed the EEG data from time series into brain-map form. Then, they used the VGG-16 deep model for feature extraction. For facial expressions, 30-dimensional facial features were formed by measuring the distances between facial feature points. After PCA dimensionality reduction, the extracted EEG features and facial features were used as inputs to an LSTM model to achieve continuous emotion recognition based on EEG and facial expression. The classification accuracy is 54.22%, which is 10.7% and 5.39% higher than EEG-only and facial-expression-only emotion recognition, respectively.
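The band-wise differential entropy (DE) features mentioned above can be sketched as follows: band-pass filter each EEG channel and compute the DE of a Gaussian signal, 0.5·ln(2πeσ²), per band. The band edges, filter order, and sampling rate are common but assumed values; with 62 channels and 5 bands this yields 310 values, matching the dimensionality cited.

```python
# A minimal sketch of band-wise differential entropy (DE) features: band-pass
# each EEG channel, then compute DE = 0.5*ln(2*pi*e*var) per band and channel.
import numpy as np
from scipy.signal import butter, filtfilt

BANDS = {"delta": (1, 4), "theta": (4, 8), "alpha": (8, 14),
         "beta": (14, 31), "gamma": (31, 50)}   # assumed band edges

def differential_entropy(eeg, fs=200):
    """eeg: (channels, samples); returns (channels, n_bands) DE features."""
    feats = []
    for low, high in BANDS.values():
        b, a = butter(4, [low / (fs / 2), high / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, eeg, axis=1)
        var = filtered.var(axis=1)
        feats.append(0.5 * np.log(2 * np.pi * np.e * var))   # DE of a Gaussian signal
    return np.stack(feats, axis=1)

eeg = np.random.randn(62, 200 * 4)        # 62 channels, 4 s at 200 Hz (synthetic)
de = differential_entropy(eeg)
print(de.shape)                           # (62, 5) -> 310 values when flattened
```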
In [144], speech and text were combined for emotion recognition. For speech information, a CNN, an RNN, and an attention mechanism were used for speech encoding; for textual information, the BERT model was used to encode the textual content. After obtaining the speech and textual features, fusion was performed at the feature level. The results show that the combination is beneficial, performing 9.1% better than speech only and 1.7% better than text only. Reference [145] proposed a regularized deep fusion model for emotion recognition based on multiple types of physiological signals. First, feature extraction was performed on the different types of physiological signals, with feature embedding using a kernel matrix. Then, a specific representation of each physiological signal was learned from the feature embedding based on a deep model. Finally, a global fusion layer with regularization terms was designed for feature fusion. The experimental results show that the best performance was obtained when fusing all of the modalities, for both arousal and valence and irrespective of the dataset. Reference [146] also performed emotion recognition based on multiple types of physiological signals. The time-domain and frequency-domain features of physiological signals have different activation levels, where the activation level in the time domain can reflect brain activity. High and low activation levels correspond to positive and negative emotions, respectively, while in the frequency domain, negative emotions correspond to high activation. In addition, multimodal physiological data exhibit heterogeneity and correlation: heterogeneity refers to the differences between the properties of signals collected from different organs, and correlation refers to the relationships between channels of the same modality or different modalities. Existing methods struggle to utilize the complementarity, heterogeneity, and correlation of features in the time-space-frequency domain of multimodal time series for emotion detection.
To address this, researchers proposed HetEmotionNet, a heterogeneous graph neural network that aims to simultaneously model the complementarity, heterogeneity, and correlation of multimodal data features under a unified framework. The model uses a GTN and a GCN to model the heterogeneity and correlation of multimodal physiological data, respectively, and a GRU to extract the dependencies between the time domain and the frequency domain of the multimodal physiological data. The results indicate that using only EEG signals performs better than using only PPS, because EEG signals are the main physiological signals used in emotion recognition; moreover, fusing data of different modalities further improves the performance and achieves the best results. References [147] and [148] combined EEG and eye states for emotion recognition. Reference [147] used the eye-tracker X120 to capture changes in pupil size and gaze point as eye movement features. They fused the eye movement features with EEG features at the feature level and at the decision level, respectively, to verify the effect of different fusion strategies on the accuracy of emotion recognition. The best classification accuracies of 68.5% for three labels of valence and 76.4% for three labels of arousal were obtained using a modality fusion strategy and a support vector machine. Unlike [147], which used the telemetric eye-tracker X120 to acquire eye states, [148] used a head-mounted eye-tracker. The experimental results show the advantage of EEG features for the recognition of happy emotions, while the recognition of fear based on eye state is superior to that based on EEG. This shows that different types of data have complementary advantages, and emotion recognition based on multiple types of data can improve accuracy. Reference [149] combined EEG and speech for emotion recognition. For the EEG signals, 160-dimensional differential entropy features were extracted from different frequency bands. For the speech signals, MFCC coefficients together with their differential and acceleration coefficients were extracted to form 36-dimensional speech features. After feature extraction, feature fusion was performed at the feature level and the decision level, respectively, to achieve emotion recognition. The experimental results show that the accuracy of the multimodal approach is improved by 1.67% and 25.92%, respectively, compared with using EEG or speech alone. This demonstrates the effectiveness of the multimodal approach.
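The 36-dimensional speech feature described for [149] (MFCCs plus their delta and delta-delta coefficients) can be sketched with librosa as follows; the number of MFCCs, the sampling rate, and the audio path are assumptions.

```python
# A minimal sketch of a 36-dimensional speech feature: 12 MFCCs plus their delta
# and delta-delta coefficients, averaged over time. The file name is a placeholder.
import numpy as np
import librosa

def speech_features(path, sr=16000, n_mfcc=12):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (12, frames)
    delta = librosa.feature.delta(mfcc)                      # first-order differences
    delta2 = librosa.feature.delta(mfcc, order=2)            # acceleration coefficients
    frames = np.vstack([mfcc, delta, delta2])                # (36, frames)
    return frames.mean(axis=1)                               # utterance-level 36-dim vector

# feat = speech_features("utterance.wav")   # placeholder path
# print(feat.shape)                         # (36,)
```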
Although the accuracy of emotion recognition can be improved based on multimodal data, multimodal data require different devices for raw data acquisition, which increases the economic cost. Furthermore, because acquisition frequencies differ between devices, the collected state data must be aligned, and data alignment and annotation increase the manpower cost. Emotions are complex and can be simultaneously reflected in changes across different state data; if the data cannot be accurately aligned, the accuracy of emotion recognition suffers. In addition, a further review of existing research shows that it focuses mainly on the recognition of basic emotions, which cannot comprehensively reflect the changes in students' emotions during the learning process. Moreover, learning-related emotions that occur during the learning process, such as confusion and thinking, are not included among the basic emotions. Therefore, existing methods have some limitations. In future work, on the one hand, the basic emotion categories can be subdivided by combining the multiple types of state data extracted from video to achieve fine-grained emotion recognition; on the other hand, the basic emotion categories can be extended to recognize learning-related emotions.
4.
The application of emotion recognition technology in intelligent education
4.1. The application of contact physiological signal detection in the field of education
Contact physiological signals can be detected with a high degree of accuracy, providing data support for analyzing students' learning state, improving academic performance, detecting changes in students' emotional state during the learning process, and improving classroom participation. Heart rate variability can be used to recognize academic performance, cognitive load, learning anxiety, and the level of participation in physical activity. Reference [150] investigated the relationship among adolescents' personality, heart rate variability, and academic performance by analyzing the changes in heart rate variability of 91 seventh-grade students before, during, and after watching stress-inducing videos. Furthermore, students' ability to adapt to and recover from stress can be judged from the changes in HRV. The experimental results showed that heart rate variability decreased while watching the evoked video and increased during the recovery period after the video ended. Both extroverted and introverted students with greater increases in heart rate variability during the recovery period had better academic performance. Reference [151] investigated the relationship between changes in heart rate variability and academic performance through a game. Analyzing the trend of heart rate variability collected during the game revealed that being positively motivated by the game led to an increase in heart rate variability, indicating improved concentration. Heart rate variability is also associated with mental load: it decreases as mental load increases, and high-achieving students have lower mental load than underperforming students. Learning outcomes may be jointly influenced by cognitive load and prior knowledge; when cognitive load significantly exceeds prior knowledge, it can negatively affect students' learning outcomes. In [152], heart rate was used to detect students' and teachers' cognitive load during a chemistry course. The results showed that students' heart rate variability increased as cognitive load increased. In contrast, although teachers' cognitive load also increased, no significant association was found between changes in teachers' heart rate and their cognitive load.
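For readers unfamiliar with HRV analysis, the sketch below computes two standard HRV statistics, SDNN and RMSSD, from a synthetic series of RR (inter-beat) intervals; it is illustrative and not tied to the protocols of the cited studies.

```python
# A minimal sketch of two standard HRV metrics, SDNN and RMSSD, computed from
# RR (inter-beat) intervals in milliseconds. The interval values are synthetic.
import numpy as np

def hrv_metrics(rr_ms):
    rr = np.asarray(rr_ms, dtype=float)
    sdnn = rr.std(ddof=1)                       # overall variability of RR intervals
    diff = np.diff(rr)
    rmssd = np.sqrt(np.mean(diff ** 2))         # short-term (beat-to-beat) variability
    mean_hr = 60000.0 / rr.mean()               # average heart rate in beats per minute
    return {"SDNN": sdnn, "RMSSD": rmssd, "mean_HR": mean_hr}

rr_intervals = 800 + 50 * np.random.randn(300)  # ~75 bpm with variability (synthetic)
print(hrv_metrics(rr_intervals))
```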
In [153], the ATS emotional aid system was proposed to detect students' emotions through heart rate variability, which can improve students' learning outcomes. In [154], heart rate was collected to investigate whether anxiety could be relieved by performing positive thinking exercises via mobile phone. The results show that anxiety can be alleviated through such mindfulness practice.
Physiological signals can also be used to analyze a student's motion state. Reference [155] aimed to investigate whether physical education classes can increase the amount of physical activity among adolescent students, which can lead to a healthy lifestyle. An ActiGraph device was used during the experiment to record students' heart rate and step count. The results show that students who participated in physical education classes had higher total daily exercise and exercise intensity, and more of them met the required exercise standards, compared with students who did not participate. In addition, exercise duration can be effectively increased through human intervention. Increasing girls' participation and perceptual ability in physical education is a challenge: compared with boys, girls have lower levels of participation in team sports such as football and basketball. Reference [156] investigated whether a single-gender grouping strategy could enhance girls' participation and perceptual abilities in physical education. During the experiment, heart rate was monitored as a cue to judge activity intensity. The results showed a significant improvement in girls' perceptual abilities under the single-gender grouping strategy.
4.2. The application of eye states recognition in the field of education
Gaze points can truly reflect the regions of interest during class. The application of eye states can be divided into the following two categories. First, eye states can be used to analyze the impact of different ways of presenting the same knowledge content on students' learning outcomes, which can help optimize teaching methods.
The presentation mode of learning content may affect students' learning outcomes. Teachers can develop personalized learning styles and content presentations based on students' different learning preferences. Reference [157] investigated the effects of graphical and flux-based presentations on the understanding of the physical concept of divergence by analyzing students' lines of sight. The experimental results showed that better performance can be obtained by combining graphical and flux-based explanations; in addition, learning was more effective when students chose the learning style themselves than when it was assigned by the teacher. Reference [158] investigated the effect of different layouts of text and images in instructional materials on students' learning outcomes. The experimental results showed that the closer the text and image positions, the higher the learning efficiency; moreover, using text to supplement the content of images helps improve retention of the teaching content. Reference [159] investigated the effect of viewing visualized learning-process data on students' learning outcomes. Students were randomly divided into two groups, and a pre-test showed no difference in prior knowledge or comprehension between the groups. One group of students was able to view eye-tracking data from their own learning process, and then both groups reread the new material. The post-test results showed that the students who viewed the learning-process data had stronger text processing and comprehension abilities regarding the learning content. In [160], the joint impact of a virtual teacher and gaze guidance on learning outcomes in e-learning was investigated. The results show that learning performance was not affected by the virtual teacher, and students were able to effectively allocate their attention between the virtual teacher and the learning content. In [161], the impact of cognitive level on students' acceptance of knowledge was studied. Sixty-two students participated in the experiment; they were asked to read the original and revised mathematics textbooks, and their gaze points were then analyzed. The results show that students with prior knowledge exhibited superior cognitive processing of the revised materials compared with the original materials, while students lacking prior knowledge showed no significant differences.
Second, eye states can be used to analyze the differences between high-achieving and low-achieving students, so as to optimize learning habits. Academic performance can be affected by learning habits. By analyzing the gaze points of high-achieving and low-achieving students during the learning process, it is possible to provide learning guidance for low-achieving students and optimize their learning habits. The research in [162] studied the reading habits of elementary school students. The results show that high-achieving students paid more attention to graphical information and had greater learning efficiency and comprehension. In [163], changes in pupil diameter were used to analyze the relationship between problem difficulty and cognitive level; the results show that pupil diameter changes as problem difficulty increases. Reference [164] investigated the common and individual characteristics of the eye movement patterns of students with different reading habits. Analysis of the gaze points showed that students with high reading ability had smooth visual sweeps and high accuracy in locating key information.
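A simple gaze-based measure used in studies of this kind is the proportion of gaze samples that fall inside an area of interest (AOI), such as the graphical region of a page. The sketch below uses synthetic gaze coordinates and a hypothetical AOI rectangle.

```python
# A minimal sketch of an AOI dwell ratio: the fraction of gaze samples falling
# inside a rectangular area of interest (coordinates are synthetic).
import numpy as np

def aoi_dwell_ratio(gaze_xy, aoi):
    """gaze_xy: (N, 2) gaze points in pixels; aoi: (x_min, y_min, x_max, y_max)."""
    x, y = gaze_xy[:, 0], gaze_xy[:, 1]
    inside = (x >= aoi[0]) & (x <= aoi[2]) & (y >= aoi[1]) & (y <= aoi[3])
    return inside.mean()                      # fraction of samples spent on the AOI

gaze = np.column_stack([np.random.uniform(0, 1920, 1000),
                        np.random.uniform(0, 1080, 1000)])
print(aoi_dwell_ratio(gaze, aoi=(960, 0, 1920, 540)))   # roughly 0.25 for uniform gaze
```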
4.3. The application of facial expression recognition in the field of education
Emotions not only affect students' attention, memory, and decision making, but also influence learning motivation and interest. The main purpose of facial emotion recognition in the field of education is to detect students' engagement in the learning process, which can help teachers adjust the teaching content and strategy in a timely manner. Engagement includes emotional engagement, behavioral engagement, and learning engagement. Facial expressions are often used to determine students' emotional engagement in the learning process. Unlike the six basic emotions of anger, disgust, fear, happiness, sadness, and surprise proposed by Ekman, academic emotion refers to the emotional expressions that students show during the learning process, including confusion, boredom, sleepiness, concentration, and anxiety. Existing research has focused more on designing different deep models to detect the six basic emotions proposed by Ekman. The detection of basic emotions can provide some support for teachers in judging students' emotional engagement; however, basic emotions cannot fully reflect academic emotions and therefore cannot fully determine students' emotional engagement. Although basic emotion datasets are available for model training, there is a lack of learning-oriented emotion recognition datasets, which limits the development of academic emotion recognition. Micro-expressions can accurately reflect students' emotions, but they are short and of low amplitude, which makes them difficult to detect. Therefore, little research has analyzed emotional engagement based on micro-expressions in the field of education. Reference [165] proposed a micro-expression recognition model based on a hybrid neural network with a bimodal spatial-temporal feature representation for micro-expression recognition in e-learning. The model consists of two modalities. The first modality models the changing geometry and dynamics of the face: feature points are detected in the facial images, the facial feature points are combined into vectors, and the dynamic geometric features of the face are learned using a DNN. The second modality models the facial appearance and its dynamic changes: a CNN is used to extract appearance features, and the extracted appearance features are used as inputs to an LSTM to learn the dynamic features of micro-expression changes. Finally, the features learned from both modalities are fused at the feature level for micro-expression recognition.
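The appearance-dynamics idea behind the second modality (CNN features per frame fed to an LSTM) can be sketched as follows in PyTorch; the network sizes, clip length, and class count are assumptions, and this is not the model of [165].

```python
# A minimal sketch (not the model in [165]) of the appearance-dynamics idea:
# a small CNN encodes each frame of a micro-expression clip, and an LSTM models
# the temporal dynamics of the resulting frame embeddings. Sizes are assumed.
import torch
import torch.nn as nn

class CnnLstmMicroExpression(nn.Module):
    def __init__(self, n_classes=5, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, embed_dim))
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, clip):                       # clip: (B, T, 1, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)                # (B*T, 1, H, W)
        embeds = self.cnn(frames).view(b, t, -1)   # per-frame appearance features
        _, (h_n, _) = self.lstm(embeds)            # temporal dynamics of the clip
        return self.head(h_n[-1])                  # emotion logits per clip

logits = CnnLstmMicroExpression()(torch.randn(4, 20, 1, 48, 48))
print(logits.shape)  # torch.Size([4, 5])
```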
Head posture and eye state are combined with facial expression to recognize students' behavioral engagement. Gaze tracking records real, natural, and objective user behavior, which makes it possible to analyze students' behavioral engagement through head posture, gaze points, etc. In [166], blink frequency, body posture, and facial expression were extracted from videos recorded during online exams to recognize students' behavioral engagement. Different from [166], in [167], facial expressions, hand gestures, and body postures were extracted to analyze both individual and overall behavioral engagement.
A single type of engagement cannot fully reflect the learning state. In [168], emotional and behavioral engagement were combined to recognize learning engagement. First, facial landmarks were extracted from the video and head posture was calculated based on the extracted landmarks. Second, action units were detected from the images. Third, head posture and gaze points were combined to recognize behavioral engagement, while the action units were used to recognize emotional engagement. Finally, behavioral engagement and emotional engagement were combined to recognize learning engagement.
4.4. The real-world applications of emotion recognition in the field of education
The application of emotion recognition systems in the field of education carries significant importance. First, emotion recognition systems assist educators in understanding students' emotional states and needs, enabling them to provide personalized learning support. By monitoring students' emotional responses in real-time, the system can adjust learning resources, feedback, and teaching strategies to cater for individual learning needs, thereby enhancing learning outcomes and engagement. Second, emotion recognition systems help identify students' emotional issues and mental well-being. Educators can detect emotional distress early and provide appropriate support and guidance. This positively impacts students' emotion management, mental health, and overall well-being. Third, emotion recognition systems aid educators in refining teaching methods and assessment approaches. By understanding students' emotional responses during the learning process, educators can adapt their teaching strategies, provide more accurate assessment and feedback, cater to students' emotional needs, and improve assessment accuracy. Fourth, emotion recognition systems can be utilized to create emotionally intelligent learning environments. By recognizing students' emotional states, the system can automatically adjust factors such as music, colors, and ambiance within the learning environment to foster a positive emotional atmosphere, thereby enhancing students' emotional engagement and learning outcomes. Finally, emotion recognition systems contribute to increased student engagement and motivation. By promptly understanding students' emotional states, educators can provide relevant incentives and support based on their emotional needs, encouraging active participation in learning activities and enhancing students' interest and motivation.
There are many emotion recognition systems used in the field of education. EmotionTracker is an application that utilizes emotion recognition technology in the education field. It is a system designed to monitor and track students' emotional states in real time. Through various sensors, such as facial expression analysis, voice tone analysis, or physiological measurements, EmotionTracker captures data related to students' emotions during learning activities. The purpose of EmotionTracker is to provide educators with valuable insights into students' emotional experiences and needs. By analyzing the collected data, the system can identify patterns and trends, allowing educators to better understand how students are feeling during different learning tasks, lessons, or assessments. EmoLearn is a specific application that leverages emotion recognition technology in the education domain. It is a platform designed to deliver personalized learning experiences based on students' emotional states and needs. The primary goal of EmoLearn is to create a learning environment that takes students' emotions into account and tailors instructional content and strategies accordingly. By utilizing emotion recognition algorithms, the system identifies and analyzes students' emotions in real time during their learning activities and can dynamically adjust the learning materials, pacing, and difficulty level to match each student's emotional state. Affectiva is a company that specializes in emotion recognition technology and provides solutions for emotion AI. They have developed advanced algorithms and software tools to detect and analyze human emotions through facial expressions, voice tone, and physiological signals. Affectiva's emotion recognition technology can also assist in assessing students' emotional well-being and mental health. By detecting signs of stress, frustration, or other negative emotions, educators and counselors can intervene and provide appropriate support to students in need.
5.
Discussion and conclusions
In this paper, we have reviewed in detail the datasets and technologies used for emotion recognition. The performance of trusted emotion recognition can be improved by combining multiple types of signals. However, existing emotion recognition methods that combine physiological information, speech, micro-expressions, and posture have some limitations for e-learning. In the context of online education, video can unobtrusively record learners' behaviors during the learning process. Extracting features that authentically capture changes in emotion holds significant practical importance for the future implementation of e-learning. However, how to integrate different types of signals and leverage the complementary information between them remains a challenge to be addressed.
Apart from technical aspects, it is crucial to consider different student populations, including students with disabilities or from diverse cultural backgrounds, addressing the accessibility and inclusivity issues of trusted emotion recognition technologies in education. The application of emotion recognition technologies needs to consider the needs of students with disabilities to ensure their equal participation in the learning process. For instance, adaptive interfaces or assistive tools can be provided for students with visual or hearing impairments to interact and express emotions with the technology. Additionally, collaborating with representatives and experts from the disability community to understand their needs and perspectives is essential in ensuring the accessibility and effectiveness of the technology. The accuracy and applicability of emotion recognition technologies may vary across different cultural backgrounds. Interpretation of facial expressions and body language can differ in various cultures. Therefore, caution should be exercised in the use of emotion recognition technologies in education to avoid imposing specific cultural norms of emotional expression as the standard. Developing and training emotion recognition algorithms should account for diversity and cultural differences to ensure the applicability and accuracy of the technology across different cultural backgrounds. Collecting and utilizing diverse datasets is crucial to enhance the inclusivity of emotion recognition technologies. The datasets should include samples from diverse age groups, gender, races, cultures, and disability populations. This helps to reduce data biases and improve the universality of the technology. Additionally, protecting data privacy and ensuring the representativeness of the data are important considerations. Relevant training and education are essential to improve educators' and students' understanding and acceptance of emotion recognition technologies. Educators should learn about the limitations and potential biases of the technology and how to correctly interpret and use the results of emotion recognition. Moreover, educators should educate students about the diversity of emotional expression and foster a learning environment that is inclusive and respectful of different ways of expressing emotions. In the field of education, the accessibility and inclusivity of emotion recognition technologies are key to ensuring that every student gets benefits. Considering the needs of students with disabilities and those from diverse cultural backgrounds, along with incorporating diverse datasets, is crucial for addressing accessibility and inclusivity challenges in emotion recognition technologies. Additionally, offering adaptive tools and training can further enhance our ability to tackle these issues effectively in education. This ensures that all students can equally avail themselves of the benefits offered by these technologies.
With advancements in technology, video-based emotion recognition devices have made significant progress. These devices analyze facial expressions, speech, and body language to infer an individual's emotional state. Emotion recognition technology holds great potential for various fields, including education, healthcare, marketing, and human-computer interaction. First, attention should be given to the applicability of emotion recognition equipment. While emotion recognition technology has shown promising accuracy in laboratory settings, it faces challenges in real-world applications. Factors such as lighting conditions, noise, and individual differences can impact the accuracy of emotion recognition. Therefore, further improvements and optimization of algorithms are necessary to enhance the applicability and accuracy of the equipment in practical scenarios. Second, the economic significance of emotion recognition equipment should be considered. As the demand for emotion recognition technology grows, the economic implications of these devices become more apparent. In the education sector, for instance, emotion recognition equipment can help teachers gain a better understanding of students' emotional states, enabling personalized adjustments to teaching strategies and enhancing learning outcomes. Nevertheless, with the proliferation of devices and their widening applications, it becomes imperative to consider cost-effectiveness factors such as equipment costs, maintenance expenses, and data privacy and security concerns. In conclusion, it is important to focus on the applicability and economic significance of video-based emotion recognition equipment. By continuously improving technology, addressing practical challenges, and weighing cost-effectiveness factors, we can harness the potential of emotion recognition equipment and achieve broader applications across various domains.
Use of AI tools declaration
The authors declare that they have not used Artificial Intelligence (AI) tools in the creation of this article.
Acknowledgments
This work was supported by the Humanities and Social Science Fund of Ministry of Education (22YJA880091), CN.
Conflict of interest
The authors declare there is no conflict of interest.