
The focus of this paper was to explore the stability issues associated with delayed neural networks (DNNs). We introduced a novel approach that departs from existing methods, which use quadratic functions to determine the negative definiteness of the derivative $\dot{V}(t)$ of the Lyapunov-Krasovskii functional (LKF). Instead, we proposed a new method that uses the conditions for a positive definite quadratic function to establish the positive definiteness of LKFs. Based on this approach, we constructed a novel relaxed LKF that contains delay information. In addition, some combinations of inequalities were extended and used to reduce the conservatism of the results obtained. The criteria for achieving delay-dependent asymptotic stability were subsequently presented in the framework of linear matrix inequalities (LMIs). Finally, a numerical example confirmed the effectiveness of the theoretical result.
Citation: Guoyi Li, Jun Wang, Kaibo Shi, Yiqian Tang. Some novel results for DNNs via relaxed Lyapunov functionals[J]. Mathematical Modelling and Control, 2024, 4(1): 110-118. doi: 10.3934/mmc.2024010
The volume of video data has significantly increased in recent years, which provides great amounts of data for video behavior recognition [1]. However, the complexity of human posture, variations in view and background interference affect the recognition accuracy. Collection of skeletal data has become simpler due to continued improvement of depth cameras and human posture prediction algorithms. In addition, skeleton information has certain robustness to background, illumination and occlusion, and thus it is widely used in human behavior recognition studies.
Currently, there are two general categories of behavior recognition methods: traditional hand-crafted methods [2] and methods based on deep learning [3]. Traditional hand-crafted methods manually extract the spatial and temporal features that represent human behavior [4], and mainly rely on machine learning models such as the Support Vector Machine (SVM) and probabilistic graphical models for recognition. Zhang et al. [5] developed a motion-based model known as Motion Context using image representation techniques, and they established a human behavior template for behavior matching. Niebles et al. [6] proposed an unsupervised learning method for human action categories by extracting space-time interest points of behavioral changes. Wang et al. [7] designed a video model based on dense trajectories and motion boundary descriptors to capture the local motion information of the video. The above methods are robust to occlusion and illumination, but they cannot handle view variations, and they suffer from high computational complexity and low speed.
Early skeleton-based behavior recognition methods used traditional techniques to manually extract features from skeletal data in order to model the behavior of the human body. Vemulapalli et al. [8] developed a skeletal model that explicitly simulated the 3D geometric relationships between various body parts using rotations and translations in 3D space. Hussein et al. [9] used a covariance matrix for skeleton joint locations over time as a discriminative descriptor for a sequence to recognize human action. Ofli et al. [10] represented human actions by automatically selecting a few skeletal joints that were deemed to be the most informative based on highly interpretable measures such as the mean or variance of joint angle trajectories. Xia et al. [11] utilized histograms of 3D joint locations as the representations of postures for human action recognition. Methods based on hand-crafted skeleton features no longer meet the requirements of high precision. Deep learning models can extract features automatically and avoid the complexity and inconsistency of hand-crafted features. Thus, behavior recognition methods based on deep learning are an important development trend.
At present, skeleton-based behavior recognition methods that use deep learning generally construct a skeleton graph with joints as vertices and bones as edges. A Convolutional Neural Network (CNN) [12,13,14,15] is then used to extract the spatial and temporal features of behavior. These methods can extract the spatial features of adjacent joints and the temporal features of the same joints in adjacent frames. However, they mainly optimize the spatial graph and disregard the temporal graph. In the time dimension, they only capture the correlation between the same joints in adjacent frames, without considering the relationships between adjacent joints in adjacent frames.
To address these problems, we propose a novel video behavior recognition method based on Actional-Structural Graph Convolution and a Temporal Extension Module, which optimizes the spatial and temporal features simultaneously. Inspired by the Spatio-Temporal Graph Convolution Neural Network (STGCN), the Actional-Structural Graph Convolution [16] is used to extract relevant features of distant joints in the spatial dimension, which optimizes the spatial graph. Then, the Temporal Extension Module [17] is applied to extract features in the temporal dimension. It processes not only the same joints between frames but also multiple adjacent joints between frames, which yields richer temporal features. Furthermore, an attention mechanism [18] is introduced to emphasize the more important joints, frames and channels and to remove redundant feature information, thus further improving the performance of our method.
The contributions of this paper are summarized as follows:
ⅰ. A video behavior recognition method is proposed based on Actional-Structural Graph Convolution and a Temporal Extension Module, which simultaneously optimizes the spatial and temporal features.
ⅱ. In the spatial dimension, Actional-Structural Graph Convolution is composed of two networks: action graph convolution and structural graph convolution. Among them, the action graph convolution extracts rich spatial features by capturing the correlations between distant joint features, whereas the structural graph convolution extends the existing skeleton graphs to obtain the spatial features of multiple adjacent joints.
ⅲ. In the time dimension, the Temporal Extension Module is introduced. Conventional methods only capture the features of the same joint in adjacent frames. In contrast, our method acquires the joint features of both the same position and adjacent positions in adjacent frames, which expands the temporal graph and extracts richer temporal features.
ⅳ. Extensive experiments are carried out on two standard behavior recognition datasets to evaluate the effectiveness and feasibility of the proposed method. The experimental results show that our method achieves higher accuracy than the existing behavior recognition methods it is compared with.
The complexity of human posture, occlusion and illumination affects behavior recognition results. With the application of depth cameras and the progress of posture estimation algorithms, skeletal data has been proved to be an effective source of behavior information. Moreover, skeletal data minimizes the effects of irrelevant factors such as occlusion, illumination and human clothing on behavior recognition in RGB images. Therefore, some scholars are committed to combining deep learning methods with skeletal data to improve behavior recognition. Generally, there are three kinds of neural networks that are used for skeleton-based behavior recognition: Convolution Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and Graph Convolution Neural Networks (GCNs).
Skeleton-based behavior recognition methods using CNNs mainly convert three-dimensional skeletal data into pseudo images for processing. Kim and Reiter [19] used a class of models known as Temporal Convolutional Neural Networks (TCNs) to explicitly learn readily interpretable spatio-temporal representations for 3D human action recognition. Ke et al. [20] introduced a method for 3D action recognition with skeleton sequences, which transformed each skeleton sequence into three clips consisting of several frames for spatial temporal feature learning using deep neural networks. Liu et al. [21] presented an enhanced skeleton visualization method for view-invariant human action recognition based on CNNs. Li et al. [22] designed a multi-scale dilated convolutional neural network for the classification of skeleton images. Hu et al. [23] surveyed the behavior recognition algorithms based on deep learning from recent years. These CNN-based methods are relatively simple, and some lightweight CNN models are fast and computationally inexpensive. However, the models have low accuracy, and they cannot effectively filter the background noise in the data.
Skeleton-based behavior recognition methods using RNNs extract sequential information to solve sequence problems. An RNN has a memory module that can exploit temporal information, and thus the vertices of the network can be effectively connected. Liu et al. [24] extended RNNs in the spatial domain and the temporal domain to better analyze the hidden sources of action-related information within human skeleton sequences in both domains simultaneously. Liu et al. [25] proposed the Global Context-Aware Attention LSTM (GCA-LSTM) for 3D action recognition, which focused on the informative joints in the action sequence with the assistance of global contextual information. Wang et al. [26] presented a video architecture termed the Temporal Difference Network (TDN), which mainly captured multi-scale temporal information for effective action recognition. Liu et al. [27] merged the results of the motion history images input into VGG-16 and the RGB image input into the Faster R-CNN algorithm for human abnormal behavior recognition. Si et al. [28] introduced a model with spatial reasoning and temporal stack learning (SR-TSL) for skeleton-based action recognition, which consisted of a spatial reasoning network (SRN) and a temporal stack learning network (TSLN). The methods based on RNNs can effectively maintain context information and achieve high recognition accuracy. However, they suffer from exploding and vanishing gradients, and they cannot eliminate the redundant information in joint data.
A graph is composed of vertices and edges. The feature information of each vertex in the graph is independent, and any two vertices may be related, so the data structure is irregular. Because graph data does not have translation invariance, it is challenging to apply GCNs to skeletal data. Yang et al. [29] proposed a channel adaptive merging module specific to the human skeleton graph, which can adaptively and efficiently merge the vertices from the same part of the skeleton graph. Chen et al. [30] presented a multi-scale spatial graph convolution (MS-GC) module and a multi-scale temporal graph convolution (MT-GC) module to enhance the receptive field of the model in spatial and temporal dimensions. Ding et al. [31] put forward a temporal segment graph convolutional network (TS-GCN) for skeleton-based action recognition. Shi et al. [32] proposed a two-stream adaptive graph convolutional network (2s-AGCN) for skeleton-based action recognition. Zhang et al. [33] designed a simple but effective semantics-guided neural network (SGN) for skeleton-based action recognition, which explicitly introduced the high-level semantics of joints into the network to enhance the feature representation capability. Si et al. [34] proposed an Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition based on skeleton data, which can capture discriminative features in spatial configuration and temporal dynamics. Miao et al. [35] proposed a graph convolutional operator referred to as central difference graph convolution (CDGC) for skeleton-based action recognition, which aggregated node information and gradient information similar to a vanilla graph convolutional operation. Chen et al. [36] presented a Channel-wise Topology Refinement Graph Convolution (CTR-GC) to dynamically learn different topologies and effectively aggregate joint features in different channels for skeleton-based action recognition.
GCNs have been widely used for skeleton-based behavior recognition due to their excellent modeling ability for non-Euclidean data. Because skeletal data is embedded in the form of graphs, rather than vectors or images, RNNs and CNNs have poor representation of skeletal data compared with GCNs. As the graph convolution is a local operation, it can only utilize the short-range joint dependencies and short-term trajectory, and it can fail to directly model the distant joints' relations and long-range temporal information that are essential in distinguishing various behaviors. Furthermore, the fixed network would cause many redundancies in the representation of behaviors and deteriorate the performance. Moreover, the redundant features may hinder the model from focusing on significant features. To mitigate the above issues, this paper puts forward a novel method to extract the behavior features from different joints, frames and channels, which is essential for skeleton-based behavior recognition in videos.
Recently, skeleton-based behavior recognition has made great progress, but many problems still remain unsolved. For example, the representations of skeleton sequences captured by most of the previous methods lack spatial structure information and detailed temporal dynamics features. Consequently, we utilize Actional-Structural Graph Convolution to get rich spatial features by capturing the correlations of distant joint features and extend the existing skeleton graph to obtain the spatial features of multiple adjacent joints. Then, the Temporal Extension Module (TEM) is used for gaining the joint features of the same position and adjacent positions in adjacent frames. Meanwhile, we apply three attention mechanisms to improve the performance of video behavior recognition.
The architecture of our method is shown in Figure 1. The proposed method takes the skeletal data as the input of the whole network. Our basic network structure is composed of nine basic units, each of which is successively composed of a spatial graph convolution module, an attention mechanism module and a temporal extension module. In order to learn spatial and temporal features more effectively from important joints, frames and channels, we integrate the above three modules into a network structure.
Specifically, we apply nine layers of integrated modules in total. Taking one layer network for an example, the skeletal data is first input into the spatial graph convolution, consisting of action graph convolution and structural graph convolution. After that, the features are processed through a batch normalization (BN) layer and ReLU activation function. Then, the attention mechanism is adopted to focus on important frames, joints and channels for realizing the optimization of features, which includes three parts: spatial attention, temporal attention and channel attention. Finally, the TEM can further extract the temporal features. Like the spatial graph convolution, this module also processes the features through the BN and ReLU activation function. Behind the whole backbone network, the features go through the Global Ave-Pooling layer, and the behavior recognition results are obtained through the Softmax layer.
The spatial graph convolution used in this paper improves on spatio-temporal graph convolution. A graph G = (V, E) is constructed from a skeletal sequence, where V is the set of all vertices (joints) and E is the set of edges, which consists of two parts: action connections and structural connections. Action graph convolution and structural graph convolution operate on these two kinds of connections, respectively. In this way, the features of both distant joints and adjacent joints can be extracted to enrich the features of the spatial dimension.
The encoder-decoder structure [37] of the Neural Relational Inference (NRI) model is used to realize message passing between distant joints, as shown in Figure 2. The encoder takes the form of a GNN with multiple rounds of node-to-edge (v → e) and edge-to-node (e → v) message passing to obtain the relevance between distant joints. Based on the connections between distant joints obtained by the encoder, the decoder runs multiple GNNs in parallel to extract the features of joints. The features obtained through these action connections are then convolved with a convolution kernel; this operation is called action graph convolution, and it realizes message passing between distant joints.
The function of the encoder is to acquire the message passing between joints according to the skeleton graph, which is calculated by Eqs (1) to (3). Eq (1) represents the connection features between joints, Eq (2) represents the aggregation of joint features, and Eq (3) represents the probability of joint connection.
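$$Q_{i,j}^{(k+1)} = f_e^{(k)}\left(f_v^{(k)}\left(p_i^{(k)}\right) \oplus f_v^{(k)}\left(p_j^{(k)}\right)\right) \tag{1}$$
$$p_i^{(k+1)} = F\left(Q_{i,:}^{(k+1)}\right) \oplus p_i^{(k)} \tag{2}$$
$$A_{i,j,:} = \mathrm{softmax}\left(\frac{Q_{i,j}^{(K)} + r}{\tau}\right) \in \mathbb{R}^{C} \tag{3}$$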
where $f_e(\cdot)$ denotes the multilayer perceptron acting on edges, $f_v(\cdot)$ denotes the multilayer perceptron acting on joints, $p_i$ denotes the feature of the $i$-th joint, $\oplus$ is vector concatenation, $k$ is the iteration index (with $K$ iterations in total), and $Q_{i,j}$ is the connection feature of joint $i$ and joint $j$. $F(\cdot)$ denotes the aggregation operation, $r$ is a random vector, and $\tau$ controls the discretization of the probability.
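The encoder in Eqs (1) to (3) can be illustrated with a small NumPy sketch. It is not the authors' implementation: the helper `make_mlp`, the single random linear layer standing in for $f_v$ and $f_e$, the mean used for the aggregation $F$, and the Gumbel noise used for the random vector $r$ are all assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_mlp(in_dim, out_dim):
    """Hypothetical stand-in for f_v / f_e: one random linear layer plus ReLU."""
    W = rng.standard_normal((in_dim, out_dim)) * 0.1
    return lambda x: np.maximum(x @ W, 0.0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilise before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def encode(p, num_rounds=2, hidden=8, num_types=3, tau=0.5):
    """p: (n_joints, d) joint features -> A: (n_joints, n_joints, num_types)."""
    n = p.shape[0]
    for k in range(num_rounds):
        last = k == num_rounds - 1
        f_v = make_mlp(p.shape[1], hidden)
        f_e = make_mlp(2 * hidden, num_types if last else hidden)
        h = f_v(p)                                     # (n, hidden)
        h_i = np.repeat(h[:, None, :], n, axis=1)      # entry [i, j] holds joint i
        h_j = np.repeat(h[None, :, :], n, axis=0)      # entry [i, j] holds joint j
        Q = f_e(np.concatenate([h_i, h_j], axis=-1))   # Eq (1): node-to-edge
        if not last:
            # Eq (2): edge-to-node; F is assumed to be a mean over j
            p = np.concatenate([Q.mean(axis=1), p], axis=-1)
    r = rng.gumbel(size=Q.shape)                       # assumed noise model for r
    return softmax((Q + r) / tau)                      # Eq (3)

A = encode(rng.standard_normal((5, 3)))
print(A.shape)  # one probability vector over connection types per joint pair
```

A smaller `tau` pushes each probability vector closer to a one-hot choice, which is the sense in which $\tau$ controls discretization.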
The connection features between joints are aggregated and then spliced with the original features to gain the features after message passing. The function of the decoder is to extract joint features according to the connection probability between joints obtained by the encoder. The decoder is calculated by Eqs (4) and (5):
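$$Q_{i,j}^{t} = \sum_{c=1}^{C} A_{i,j,c}\, f_e^{(c)}\left(f_v^{(c)}\left(x_i^{t}\right) \oplus f_v^{(c)}\left(x_j^{t}\right)\right) \tag{4}$$
$$p_i^{t} = F\left(Q_{i,:}^{t}\right) \oplus p_i^{t} \tag{5}$$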
where $C$ represents the number of connections between joints, $A_{i,j,c}$ denotes the connection probability between joints obtained by the encoder, and $t$ denotes the $t$-th frame. The encoder-decoder operation can capture the dependence between distant joints.
The natural connection matrix used by existing methods is extended to a high-order form. Existing methods only focus on the directly adjacent joint and ignore the feature relationships with other nearby joints. In contrast, the information of multiple adjacent joints can be obtained through structural graph convolution, which performs the convolution operation over the joint features gathered through these structural connections. The calculation of structural graph convolution is shown in Eq (6):
$$X_{out} = \sum_{l=1}^{L} \sum_{p \in P} M^{(l)} \circ A_1^{(p,l)} X_{in} W^{(p,l)}, \tag{6}$$
where $A_1 = D^{-1}A$, $A$ is the adjacency matrix, $D$ is the degree matrix of $A$ used for normalization, $X_{in}$ represents the input features of the joints, $l$ represents the number of adjacent joints considered (the order of the connection), $M^{(l)}$ denotes the trainable edge weights, $W^{(p,l)}$ denotes the trainable weights that capture the importance of features, and $\circ$ is the Hadamard (element-wise) product.
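A minimal NumPy sketch of Eq (6) is given below. It rests on simplifying assumptions that are not stated in the text: only a single partition $p$ is used, the order-$l$ matrix $A_1^{(p,l)}$ is taken to be the $l$-th power of $A_1$, and the adjacency matrix includes self-loops so that every degree is non-zero. The function names are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """A_1 = D^{-1} A, with D the diagonal degree matrix of A."""
    degree = A.sum(axis=1)
    return A / degree[:, None]

def structural_graph_conv(X_in, A, weights, masks):
    """Sum over orders l = 1..L of (M^(l) o A_1^l) X_in W^(l).

    X_in:    (V, C_in) joint features
    A:       (V, V) adjacency with self-loops
    weights: list of L matrices, each (C_in, C_out)
    masks:   list of L edge-weight matrices, each (V, V)
    """
    A1 = normalize_adjacency(A)
    A_power = np.eye(A.shape[0])
    X_out = 0.0
    for W, M in zip(weights, masks):
        A_power = A_power @ A1                    # A_1 raised to the current order
        X_out = X_out + (M * A_power) @ X_in @ W  # Hadamard mask, then aggregate
    return X_out

# Toy chain of four joints: 0 - 1 - 2 - 3, with self-loops.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]], dtype=float)
X = np.arange(8, dtype=float).reshape(4, 2)
out = structural_graph_conv(X, A, [np.eye(2)] * 2, [np.ones((4, 4))] * 2)
print(out.shape)
```

With two orders, joint 0 already receives information from joint 2, which is two hops away; this is the effect that the higher-order extension is meant to provide.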
Furthermore, action graphs and structural graphs are combined to extract spatial features. Because action connections only consider the coordinates of joints, while structural connections only consider how joints are connected in the physical structure, their contributions do not affect each other and can be combined linearly. Through message passing between distant joints and between multiple adjacent joints, Actional-Structural Graph Convolution enriches the spatial features and improves the performance of the model.
Most of the existing methods concentrate upon optimizing the spatial graph but ignore the optimization of the temporal graph. Temporal modeling is still a challenge for behavior recognition in videos. The traditional temporal graph is shown in Figure 3(a). The blue spots in the figure indicate the joints, the black lines represent the connections of joints in the spatial dimension, and the green lines show the connection in the time dimension. It can be seen that the traditional temporal graphs only concatenate the same joints of adjacent frames in the time dimension, without the information of adjacent joints in adjacent frames. They generally use the simple convolution network, which cannot provide abundant temporal features.
Hence, we introduce the Temporal Extension Module, as shown in Figure 3(b). The green lines in the figure represent the connections of the same joints in adjacent frames, and the red lines represent the connections of adjacent joints in adjacent frames. In the time dimension, we expand the sampling range to achieve message passing of adjacent joints in adjacent frames. The calculation of TEM is shown in Eqs (7) and (8):
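$$f_{out}\left(v_i^{t}\right) = \sum_{v_j^{(t-1)} \in B_T\left(v_i^{t}\right)} \frac{1}{Z_i^{t}\left(v_j^{(t-1)}\right)}\, f_{in}\left(v_j^{(t-1)}\right) \cdot w\left(l_i^{(t-1)}\left(v_j^{(t-1)}\right)\right) \tag{7}$$
$$B_T\left(v_i^{t}\right) = \left\{ v_j^{(t-1)} \,\middle|\, d\left(v_j^{(t-1)}, v_i^{(t-1)}\right) \leqslant D_T \right\} \tag{8}$$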
where $l_i^{(t-1)}\left(v_j^{(t-1)}\right)$ denotes the label mapping of node $i$ relative to node $j$ in frame $t-1$, $f_{in}\left(v_j^{(t-1)}\right)$ represents the input feature of node $j$ in frame $t-1$, and $w(\cdot)$ is the weight vector. $B_T\left(v_i^{t}\right)$ is the sampling range, $d\left(v_j^{(t-1)}, v_i^{(t-1)}\right)$ is the length of the shortest path from node $v_i$ to node $v_j$, and $D_T$ is the maximum path length for inter-frame sampling. When $D_T = 1$, only the nodes directly adjacent to the root node (one hop away) are sampled in addition to the root node itself.
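The sampling range in Eq (8) is a shortest-path neighborhood on the skeleton graph of the previous frame, so it can be computed with a breadth-first search. The sketch below uses a hypothetical five-joint chain rather than the real skeleton layouts, and the function name `sampling_range` is chosen only for this example.

```python
from collections import deque

def sampling_range(adjacency, i, max_dist):
    """Joints j whose shortest-path distance d(j, i) is at most max_dist (D_T)."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the allowed path length
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return sorted(dist)

# Hypothetical toy skeleton: a chain 0 - 1 - 2 - 3 - 4.
toy_skeleton = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}

print(sampling_range(toy_skeleton, 2, 0))  # [2]: only the same joint
print(sampling_range(toy_skeleton, 2, 1))  # [1, 2, 3]: plus one-hop neighbors
```

With `max_dist=0` the set reduces to the same joint in the previous frame, which corresponds to the traditional temporal graph described above.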
The sample set is divided into three subsets, as shown in Figure 4. The black cross in the figure shows the center of gravity of the whole skeleton. The same joint as the previous frame is the root node, which is marked as 0. The adjacency node whose distance to the center of gravity is closer than the root node represents the centripetal motion feature, which is marked as 1. The adjacency node whose distance to the center of gravity is farther than the root node represents the centrifugal motion feature, which is marked as 2. Extended sampling range is conducive to extracting discriminative temporal features for improving the accuracy of behavior recognition.
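The three-way labeling described above can be sketched as follows. The center of gravity is assumed here to be the mean of all joint coordinates, and a neighbor at exactly the same distance as the root is assigned label 2; the text specifies neither detail, so both are assumptions of this example.

```python
import math

def center_of_gravity(coords):
    """Mean of all joint coordinates (assumed definition of the center)."""
    dims = len(next(iter(coords.values())))
    n = len(coords)
    return tuple(sum(c[d] for c in coords.values()) / n for d in range(dims))

def partition_label(root, neighbor, coords, center):
    """0: root joint, 1: centripetal neighbor, 2: centrifugal neighbor."""
    if neighbor == root:
        return 0
    d_root = math.dist(coords[root], center)
    d_neighbor = math.dist(coords[neighbor], center)
    return 1 if d_neighbor < d_root else 2  # ties fall into label 2 (assumption)

# Hypothetical 2D joint positions on a line.
coords = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0), 3: (5.0, 0.0)}
center = center_of_gravity(coords)  # (2.0, 0.0)

print([partition_label(1, j, coords, center) for j in (0, 1, 2)])  # [2, 0, 1]
```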
To explore the internal relationships of the data and highlight the important features, three types of attention mechanisms are applied as attention mechanism modules for skeleton-based behavior recognition, as shown in Figure 5. This module includes three parts: spatial attention, temporal attention and channel attention.
The spatial attention module assigns different weights according to the participation of each joint and extracts more important joint information. The calculation is shown in Eq (9):
$$M_s = \sigma\left(g_s\left(\mathrm{AvgPool}\left(f_{in}\right)\right)\right) \tag{9}$$
where $f_{in}$ represents the input features, $\mathrm{AvgPool}(\cdot)$ is the average pooling of features, $g_s$ is a one-dimensional convolution operation, $\sigma$ denotes the Sigmoid activation function, and $M_s$ is the attention map. The input features are multiplied by $M_s$ in residual form to refine the joint features, and the weights of different joints are adjusted adaptively as the network is trained. Similar to the spatial attention, the calculations of temporal attention and channel attention are shown in Eqs (10) and (11):
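$$M_t = \sigma\left(g_t\left(\mathrm{AvgPool}\left(f_{in}\right)\right)\right) \tag{10}$$
$$M_c = \sigma\left(W_2\left(\delta\left(W_1\left(\mathrm{AvgPool}\left(f_{in}\right)\right)\right)\right)\right) \tag{11}$$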
Our attention mechanism is a typical mixed attention mechanism: it first infers attention maps along the spatial, temporal and channel dimensions, and then weights the input feature map by these attention maps to complete the adaptive feature optimization.
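Eqs (9) to (11) can be sketched in NumPy as below. Several details are assumptions made only for illustration: the feature tensor has shape (C, T, V) for channels, frames and joints; the 1-D convolutions $g_s$ and $g_t$ collapse all channels into one score per position; $\delta$ is taken to be ReLU; and the three maps are applied one after another in residual form. None of the function names come from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def axis_attention(f_in, kernel, pool_axis):
    """Eqs (9)/(10): average-pool one axis away, 1-D convolution, sigmoid.

    f_in: (C, T, V); kernel: (C, K) with K odd.
    pool_axis=1 pools over frames and yields M_s with shape (V,);
    pool_axis=2 pools over joints and yields M_t with shape (T,).
    """
    pooled = f_in.mean(axis=pool_axis)                 # (C, V) or (C, T)
    k = kernel.shape[1]
    padded = np.pad(pooled, ((0, 0), (k // 2, k // 2)))  # zero padding keeps length
    scores = np.array([(padded[:, i:i + k] * kernel).sum()
                       for i in range(pooled.shape[1])])
    return sigmoid(scores)

def channel_attention(f_in, W1, W2):
    """Eq (11): global average pooling followed by two linear maps."""
    pooled = f_in.mean(axis=(1, 2))                    # (C,)
    hidden = np.maximum(W1 @ pooled, 0.0)              # delta assumed to be ReLU
    return sigmoid(W2 @ hidden)                        # (C,)

def refine(f_in, M_s, M_t, M_c):
    """Apply each map in residual form: f * M + f."""
    out = f_in * M_s[None, None, :] + f_in
    out = out * M_t[None, :, None] + out
    out = out * M_c[:, None, None] + out
    return out

rng = np.random.default_rng(0)
f = rng.standard_normal((4, 6, 5))                     # C=4, T=6, V=5
M_s = axis_attention(f, rng.standard_normal((4, 3)), pool_axis=1)
M_t = axis_attention(f, rng.standard_normal((4, 3)), pool_axis=2)
M_c = channel_attention(f, rng.standard_normal((2, 4)), rng.standard_normal((4, 2)))
print(refine(f, M_s, M_t, M_c).shape)
```

Because every attention value lies strictly between 0 and 1, the residual form scales each feature by a factor between 1 and 2 per map rather than suppressing it to zero.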
We implement the experiments with the PyTorch library. Our basic network structure is composed of nine basic units, each of which consists, in order, of a spatial graph convolution module, an attention mechanism module and a temporal extension module. The feature dimensions of the nine units are 64, 64, 64, 128, 128, 128, 256, 256 and 256, in sequence, and the convolution strides are 1, 1, 1, 2, 1, 1, 2, 1 and 1, respectively. The network is optimized with stochastic gradient descent (SGD), and the initial momentum is set to 0.9.
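To make the nine-unit configuration concrete, the sketch below tracks how the feature shape would change from unit to unit. It assumes that the strides act along the temporal axis and that padding preserves the length up to the stride (ceiling division); the input length of 300 frames is a hypothetical value chosen for the example.

```python
channels = [64, 64, 64, 128, 128, 128, 256, 256, 256]
strides = [1, 1, 1, 2, 1, 1, 2, 1, 1]

def temporal_lengths(num_frames, strides):
    """Temporal length after each unit, assuming length-preserving padding."""
    lengths = []
    t = num_frames
    for s in strides:
        t = -(-t // s)  # ceiling division by the stride
        lengths.append(t)
    return lengths

for unit, (c, t) in enumerate(zip(channels, temporal_lengths(300, strides)), start=1):
    print(f"unit {unit}: {c} channels, {t} frames")
```

Under these assumptions, the two stride-2 units halve the temporal length at the same points where the channel width doubles.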
During training, we compare the output of the Softmax classifier with the ground-truth label and update the parameters by error back propagation. The probability of back propagation is set to 0.5. Cross entropy loss is used as the loss function. The configuration of the computer is as follows: the CPU is an Intel Core i7-9700K, and the graphics card is an NVIDIA GeForce GTX 1080Ti. This shows that our method can complete the experiments on a single consumer-grade graphics card, which reduces computing costs and requirements.
In this section, we conduct experiments to evaluate the individual contributions of the Actional-Structural Graph Convolution, the attention mechanisms and the Temporal Extension Module. Furthermore, the proposed method is compared with state-of-the-art behavior recognition methods on two standard datasets: NTU-RGB+D and Kinetics.
The behavior recognition datasets NTU-RGB+D [38] and Kinetics [39] were used to train and test our method. NTU-RGB+D contains 60 different action classes, including daily, mutual and health-related actions. The dataset consists of 56,880 RGB+D video samples, captured from 40 different human subjects, using a Microsoft Kinect v2 sensor. Four major data modalities are provided by this sensor: depth maps, 3D joint information, RGB frames and IR sequences. In this paper, we only use the 3D joint information. Joint information consists of 3-dimensional locations of 25 major body joints. The configuration of body joints is illustrated in Figure 6.
Two evaluation protocols are used on NTU-RGB+D: cross-subject (CS) and cross-view (CV). In cross-subject evaluation, the 40 subjects are split into a training group and a testing group of 20 subjects each, and the training and testing sets contain 40,320 and 16,560 samples, respectively. In cross-view evaluation, all the samples from camera 1 are used for testing, and the samples from cameras 2 and 3 are used for training; the training and testing sets contain 37,920 and 18,960 samples, respectively. Sample frames of the NTU-RGB+D dataset are illustrated in Figure 7.
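As a quick consistency check on the figures quoted above, both protocols should partition the same 56,880 samples. The snippet below only re-uses the numbers from the text.

```python
total = 56_880
splits = {
    "cross-subject": (40_320, 16_560),
    "cross-view": (37_920, 18_960),
}

for name, (train, test) in splits.items():
    assert train + test == total  # both protocols cover the full dataset
    print(f"{name}: {test / total:.3f} of the samples are held out for testing")
```

The cross-view test share is exactly one third, which matches holding out one of the three cameras.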
Kinetics is a large-scale, high-quality dataset. It has 400 human action classes, with 400–1150 clips for each action, each from a unique video. Each clip lasts around 10 s. The current version includes 306,245 videos. The actions are human focused and cover a broad range of classes, including human-object interactions as well as human-human interactions. The dataset supplies only the original video clips. In our experiment, the public OpenPose toolbox is used to extract the positions of 18 joints in each frame, and the joints are labeled from 0 to 17, as shown in Figure 8.
The training set and testing set of the Kinetics dataset contain 240,000 video clips and 20,000 video clips, respectively. On the testing set, top-1 and top-5 accuracy are used to evaluate the performance of the method. Top-1 takes the action with the highest classification probability as the prediction result, whereas top-5 considers the five actions with the highest classification probabilities; the prediction is counted as correct when the ground truth matches any one of these five. Example classes from the Kinetics dataset are shown in Figure 9.
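The top-1 and top-5 criteria are both instances of top-k accuracy, which can be sketched in plain Python as follows. The scores and labels are made-up toy values with three classes, so k is set to 1 and 2 instead of 1 and 5.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    correct = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        correct += label in ranked[:k]
    return correct / len(labels)

# Toy classification scores for three samples and three classes.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 0]

print(top_k_accuracy(scores, labels, k=1))  # only the first sample is correct
print(top_k_accuracy(scores, labels, k=2))  # the second sample is now also correct
```

Top-k accuracy can never decrease as k grows, which is why the top-5 figures reported on Kinetics are always at least as high as the top-1 figures.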
Table 1 compares our method with recent behavior recognition methods under the CS and CV protocols of the NTU-RGB+D dataset. [8] extracted features manually, [38,40] adopted methods based on RNNs, [41,42,43,44] employed methods based on CNNs, and [16,28,45,46,47] used methods based on GCNs. It can be seen from Table 1 that our method achieves the highest accuracy under both protocols (87.5% on CS and 94.7% on CV). This is because our method not only captures message passing between distant joints in the same frame and between adjacent joints in different frames but also extracts the more important joint, frame and channel information, which helps to improve the performance of behavior recognition.
Method | CS | CV |
Lie Group [8] | 50.1% | 52.8% |
H-RNN [40] | 59.1% | 64.0% |
Deep LSTM [38] | 60.7% | 67.3% |
PA-LSTM [38] | 62.9% | 70.3% |
Two-Stream 3DCNN [41] | 66.8% | 72.6% |
Visualize CNN [42] | 76.0% | 82.6% |
CNN + Motion + Trans [44] | 83.2% | 89.3% |
3scale ResNet152 [44] | 85.0% | 92.3% |
STGCN [45] | 81.5% | 88.3% |
SDGCN [46] | 84.0% | 91.4% |
SR-TSL [28] | 84.8% | 92.4% |
AS-GCN [16] | 86.8% | 94.2% |
RA-GCNv2 [47] | 87.3% | 93.6% |
SAR-NAS [48] | 86.4% | 94.3% |
TS-SAN [49] | 87.2% | 92.7% |
MANs [50] | 82.7% | 93.2% |
Our method | 87.5% | 94.7% |
In addition, the comparative experiments on the Kinetics dataset are shown in Table 2. [51] is based on hand-crafted feature extraction, [16,45] used methods based on GCNs, and [19] used a method based on CNNs. From Table 2, we can conclude that our method achieves higher top-1 and top-5 accuracy (35.7% and 57.8%) than the existing methods, which verifies the effectiveness of the proposed behavior recognition method based on Actional-Structural Graph Convolution and TEM.
Method | top-1 | top-5 |
Feature Enc [51] | 14.9% | 25.8% |
Deep LSTM [38] | 16.4% | 35.3% |
TCN [19] | 20.3% | 40.0% |
STGCN [45] | 30.7% | 52.8% |
AS-GCN [16] | 34.8% | 56.5% |
Our method | 35.7% | 57.8% |
Ablation experiments were carried out on the NTU-RGB+D dataset. In these experiments, the contributions of the spatial graph convolution, the attention mechanism and the temporal extension module in our network architecture were evaluated separately.
The effectiveness of action graph convolution and structural graph convolution is validated first. We begin by comparing the method with action graph convolution against the behavior recognition method based on STGCN. The comparison results are shown in Table 3. The accuracy of the method with action graph convolution is higher under both standards, which means that capturing the dependence between distant joints benefits behavior recognition.
Method | CS | CV |
STGCN | 81.5% | 88.3% |
STGCN + Action | 83.2% | 90.3% |
Second, we verify the effectiveness of structural graph convolution, as shown in Table 4. The structural graph convolution introduced in this paper takes the STGCN as the baseline framework. In this experiment, the order of the adjacent joints captured on the spatial graph is set from 1 to 5. When the order is 1, the structural graph convolution is the same as the STGCN. Table 4 shows that the recognition accuracy improves as the order increases from 1 to 4, which indicates that capturing the message passing of multiple adjacent joints is conducive to behavior recognition. However, when the order is 5, the accuracy begins to decrease, because at higher orders the method can no longer capture local features well.
Method | CS | CV |
STGCN | 81.5% | 88.3% |
STGCN + 1-order Structure | 81.5% | 88.3% |
STGCN + 2-order Structure | 82.2% | 89.1% |
STGCN + 3-order Structure | 83.4% | 89.6% |
STGCN + 4-order Structure | 84.2% | 90.2% |
STGCN + 5-order Structure | 83.5% | 90.1% |
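To make the notion of order concrete, the following sketch lists the joints that lie within a given number of hops of a joint on a skeleton graph. It is an illustration only: the six-joint chain, the function `neighbors_within` and the use of breadth-first search are assumptions, not the structural graph convolution itself, which this section does not specify in detail.

```python
from collections import deque
from typing import Dict, List, Set


def neighbors_within(adj: Dict[int, List[int]], start: int, order: int) -> Set[int]:
    """Joints reachable from `start` in at most `order` hops (excluding `start`)."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        joint, dist = frontier.popleft()
        if dist == order:
            continue  # do not expand beyond the requested order
        for nxt in adj[joint]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen - {start}


# Hypothetical 6-joint chain 0-1-2-3-4-5 (for example, along one limb).
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

for order in range(1, 6):
    print(order, sorted(neighbors_within(chain, 0, order)))
```

At order 1 the neighborhood is just the directly connected joint, which matches the plain skeleton graph used by the STGCN. At order 5 the neighborhood of joint 0 already covers the whole toy chain, which is consistent with the loss of local detail observed at the highest order.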
Finally, we test the effect of combining action graph convolution and structural graph convolution, as shown in Table 5. Table 5 shows that the model performs best when the order of structural connection is 4. Therefore, the spatial graph convolution in this paper uses action graph convolution together with 4-order structural graph convolution, named STGCN-AS in the following.
Method | CS | CV |
STGCN | 81.5% | 88.3% |
STGCN + 1-order Structure + Action | 83.2% | 90.3% |
STGCN + 2-order Structure + Action | 83.7% | 91.2% |
STGCN + 3-order Structure + Action | 84.4% | 92.3% |
STGCN + 4-order Structure + Action (STGCN-AS) | 86.1% | 93.2% |
STGCN + 5-order Structure + Action | 84.2% | 92.0% |
The contribution of each of the three attention mechanisms is evaluated within the STGCN-AS framework, as shown in Table 6. The experimental results show that all three attention mechanisms improve the performance of the method, and that the best results are achieved when the mixed attention mechanism is added. That is, collecting the features from important joints, frames and channels is beneficial for behavior recognition.
Method | CS | CV |
STGCN-AS | 86.1% | 93.2% |
STGCN-AS + Spatial Attention | 86.4% | 93.3% |
STGCN-AS + Temporal Attention | 86.2% | 93.2% |
STGCN-AS + Channel Attention | 86.3% | 93.4% |
STGCN-AS + Mixed Attention | 86.5% | 93.5% |
The temporal extension module was then verified on top of STGCN-AS, as shown in Table 7. The results show that adding the temporal extension module enhances the performance of the method, because this module expands the range of feature extraction in the time dimension and enriches the temporal features.
Method | CS | CV |
STGCN-AS | 86.1% | 93.2% |
STGCN-AS + TEM | 86.7% | 93.8% |
The above experiments demonstrate the usefulness of the three modules proposed in this paper. The overall model arranges them in the order of spatial graph convolution, attention mechanism and temporal extension module. The experimental results are shown in Table 8. Table 8 shows that optimizing the spatio-temporal graph and paying attention to important joints, frames and channels are both beneficial for behavior recognition. Adding a single module improves the accuracy only slightly, whereas the overall model improves on the STGCN-AS baseline by 1.4 percentage points under the CS standard and 1.5 percentage points under the CV standard.
Method | CS | CV |
STGCN-AS | 86.1% | 93.2% |
STGCN-AS + Attention | 86.5% | 93.5% |
STGCN-AS + TEM | 86.7% | 93.8% |
STGCN-AS + Attention + TEM | 87.5% | 94.7% |
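The gains quoted for the overall model can be checked directly from the table above. The short sketch below recomputes each configuration's gain over the STGCN-AS baseline; the dictionary layout and variable names are illustrative assumptions, and the accuracies are copied from the table.

```python
# Accuracies (%) copied from the overall ablation table, as (CS, CV).
results = {
    "STGCN-AS": (86.1, 93.2),
    "STGCN-AS + Attention": (86.5, 93.5),
    "STGCN-AS + TEM": (86.7, 93.8),
    "STGCN-AS + Attention + TEM": (87.5, 94.7),
}

base_cs, base_cv = results["STGCN-AS"]

# Gain of each configuration over the baseline, in percentage points.
gains = {
    name: (round(cs - base_cs, 1), round(cv - base_cv, 1))
    for name, (cs, cv) in results.items()
    if name != "STGCN-AS"
}

for name, (d_cs, d_cv) in gains.items():
    print(f"{name}: CS +{d_cs:.1f}, CV +{d_cv:.1f}")
```

The combined model gains 1.4 (CS) and 1.5 (CV) percentage points, which is more than the sum of the individual gains from attention and the TEM (1.0 and 0.9, respectively).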
In summary, skeleton-based action recognition is an important task that requires an adequate understanding of the movement characteristics of a human action from the given skeleton sequence. The proposed method, based on STGCN-AS and a temporal extension module, extracts rich features from different joints, frames and channels, which is significant for video behavior recognition. Its effectiveness is verified by the comparative experiments on two datasets, and the contribution of each of the three modules is demonstrated by a series of ablation experiments.
The GCN has shown promising performance for behavior recognition due to its strengths in modeling the dependencies and dynamics in sequential data. Therefore, we propose a novel method based on STGCN-AS and a temporal extension module to extract rich features from different joints, frames and channels. STGCN-AS obtains rich spatial features by capturing the correlations between distant joint features, and the existing skeleton graphs are extended to obtain the spatial features of multiple adjacent joints. The TEM then gathers the joint features of the same position and of adjacent positions in adjacent frames. Meanwhile, three attention mechanisms are applied to improve the performance of video behavior recognition. Although our method has achieved good recognition accuracy on the NTU-RGB+D and Kinetics datasets, it still has deficiencies: facial expressions, gestures and other features are not involved. In the future, we will combine gesture recognition with behavior recognition to further improve the performance of our method, and we will study the features of multi-person interaction.
This work was supported by the National Natural Science Foundation of China (61907007) and the Fund of Jilin Provincial Science and Technology Department (20210201077GX).
The authors declare no conflict of interest.
[1] Z. Tan, J. Chen, Q. Kang, M. Zhou, A. Abdullah, S. Khaled, Dynamic embedding projection-gated convolutional neural networks for text classification, IEEE Trans. Neural Networks Learn. Syst., 33 (2022), 973–982. https://doi.org/10.1109/TNNLS.2020.3036192
[2] W. Niu, C. Ma, X. Sun, M. Li, Z. Gao, A brain network analysis-based double way deep neural network for emotion recognition, IEEE Trans. Neural Syst. Rehabil. Eng., 31 (2023), 917–925. https://doi.org/10.1109/TNSRE.2023.3236434
[3] J. C. G. Diaz, H. Zhao, Y. Zhu, P. Samuel, H. Sebastian, Recurrent neural network equalization for wireline communication systems, IEEE Trans. Circuits Syst. II, 69 (2022), 2116–2120. https://doi.org/10.1109/TCSII.2022.3152051
[4] X. Li, R. Guo, J. Lu, T. Chen, X. Qian, Causality-driven graph neural network for early diagnosis of pancreatic cancer in non-contrast computerized tomography, IEEE Trans. Med. Imag., 42 (2023), 1656–1667. https://doi.org/10.1109/TMI.2023.3236162
[5] F. Fang, Y. Liu, J. H. Park, Y. Liu, Outlier-resistant nonfragile control of T-S fuzzy neural networks with reaction-diffusion terms and its application in image secure communication, IEEE Trans. Fuzzy Syst., 31 (2023), 2929–2942. https://doi.org/10.1109/TFUZZ.2023.3239732
[6] Z. Zhang, J. Liu, G. Liu, J. Wang, J. Zhang, Robustness verification of swish neural networks embedded in autonomous driving systems, IEEE Trans. Comput. Soc. Syst., 10 (2023), 2041–2050. https://doi.org/10.1109/TCSS.2022.3179659
[7] S. Zhou, H. Xu, G. Zhang, T. Ma, Y. Yang, Leveraging deep convolutional neural networks pre-trained on autonomous driving data for vehicle detection from roadside LiDAR data, IEEE Trans. Intell. Transp. Syst., 23 (2022), 22367–22377. https://doi.org/10.1109/TITS.2022.3183889
[8] Y. Bai, T. Chaolu, S. Bilige, The application of improved physics-informed neural network (IPINN) method in finance, Nonlinear Dyn., 107 (2022), 3655–3667. https://doi.org/10.1007/s11071-021-07146-z
[9] G. Rajchakit, R. Sriraman, Robust passivity and stability analysis of uncertain complex-valued impulsive neural networks with time-varying delays, Neural Process. Lett., 33 (2021), 581–606. https://doi.org/10.1007/s11063-020-10401-w
[10] A. Pratap, R. Raja, R. P. Agarwal, J. Alzabut, M. Niezabitowski, H. Evren, Further results on asymptotic and finite-time stability analysis of fractional-order time-delayed genetic regulatory networks, Neurocomputing, 475 (2022), 26–37. https://doi.org/10.1016/j.neucom.2021.11.088
[11] G. Rajchakit, R. Sriraman, N. Boonsatit, P. Hammachukiattikul, C. P. Lim, P. Agarwal, Global exponential stability of Clifford-valued neural networks with time-varying delays and impulsive effects, Adv. Differ. Equations, 208 (2021), 26–37. https://doi.org/10.1186/s13662-021-03367-z
[12] H. Lin, H. Zeng, X. Zhang, W. Wang, Stability analysis for delayed neural networks via a generalized reciprocally convex inequality, IEEE Trans. Neural Networks Learn. Syst., 34 (2023), 7191–7499. https://doi.org/10.1109/TNNLS.2022.3144032
[13] Z. Zhang, X. Zhang, T. Yu, Global exponential stability of neutral-type Cohen-Grossberg neural networks with multiple time-varying neutral and discrete delays, Neurocomputing, 490 (2022), 124–131. https://doi.org/10.1016/j.neucom.2022.03.068
[14] H. Wang, Y. He, C. Zhang, Type-dependent average dwell time method and its application to delayed neural networks with large delays, IEEE Trans. Neural Networks Learn. Syst., 35 (2024), 2875–2880. https://doi.org/10.1109/TNNLS.2022.3184712
[15] Z. Sheng, C. Lin, B. Chen, Q. Wang, Asymmetric Lyapunov-Krasovskii functional method on stability of time-delay systems, Int. J. Robust Nonlinear Control, 31 (2021), 2847–2854. https://doi.org/10.1002/rnc.5417
[16] L. Guo, S. Huang, L. Wu, Novel delay-partitioning approaches to stability analysis for uncertain Lur'e systems with time-varying delays, J. Franklin Inst., 358 (2021), 3884–3900. https://doi.org/10.1016/j.jfranklin.2021.02.030
[17] J. H. Kim, Further improvement of Jensen inequality and application to stability of time-delayed systems, Automatica, 64 (2016), 3884–3900. https://doi.org/10.1016/j.automatica.2015.08.025
[18] J. Chen, X. Zhang, J. H. Park, S. Xu, Improved stability criteria for delayed neural networks using a quadratic function negative-definiteness approach, IEEE Trans. Neural Networks Learn. Syst., 33 (2020), 1348–1354. https://doi.org/10.1109/TNNLS.2020.3042307
[19] G. Kong, L. Guo, Stability analysis of delayed neural networks based on improved quadratic function condition, Neurocomputing, 524 (2023), 158–166. https://doi.org/10.1016/j.neucom.2022.12.012
[20] Z. Zhai, H. Yan, S. Chen, C. Chen, H. Zeng, Novel stability analysis methods for generalized neural networks with interval time-varying delay, Inf. Sci., 635 (2023), 208–220. https://doi.org/10.1016/j.ins.2023.03.041
[21] T. Lee, J. Park, M. Park, O. Kwon, H. Jung, On stability criteria for neural networks with time-varying delay using Wirtinger-based multiple integral inequality, J. Franklin Inst., 352 (2015), 5627–5645. https://doi.org/10.1016/j.jfranklin.2015.08.024
[22] X. Zhang, Q. Han, X. Ge, The construction of augmented Lyapunov-Krasovskii functionals and the estimation of their derivatives in stability analysis of time-delay systems: a survey, Int. J. Syst. Sci., 53 (2022), 2480–2495. https://doi.org/10.1080/00207721.2021.2006356
[23] L. V. Hien, H. Trinh, Refined Jensen-based inequality approach to stability analysis of time-delay systems, IET Control Theory Appl., 9 (2015), 2188–2194. https://doi.org/10.1049/iet-cta.2014.0962
[24] F. Yang, J. He, L. Li, Matrix quadratic convex combination for stability of linear systems with time-varying delay via new augmented Lyapunov functional, 2016 12th World Congress on Intelligent Control and Automation, 2016, 1866–1870. https://doi.org/10.1109/WCICA.2016.7578791
[25] C. Zhang, Y. He, L. Jiang, M. Wu, Stability analysis for delayed neural networks considering both conservativeness and complexity, IEEE Trans. Neural Networks Learn. Syst., 27 (2016), 1486–1501. https://doi.org/10.1109/TNNLS.2015.2449898
[26] S. Ding, Z. Wang, Y. Wu, H. Zhang, Stability criterion for delayed neural networks via Wirtinger-based multiple integral inequality, Neurocomputing, 214 (2016), 53–60. https://doi.org/10.1016/j.neucom.2016.04.058
[27] B. Yang, J. Wang, X. Liu, Improved delay-dependent stability criteria for generalized neural networks with time-varying delays, Inf. Sci., 214 (2017), 299–312. https://doi.org/10.1016/j.ins.2017.08.072
[28] B. Yang, J. Wang, J. Wang, Stability analysis of delayed neural networks via a new integral inequality, Neural Networks, 88 (2017), 49–57. https://doi.org/10.1016/j.neunet.2017.01.008
[29] C. Hua, Y. Wang, S. Wu, Stability analysis of neural networks with time-varying delay using a new augmented Lyapunov-Krasovskii functional, Neurocomputing, 332 (2019), 1–9. https://doi.org/10.1016/j.neucom.2018.08.044