Research article

Research on gesture recognition algorithm based on MME-P3D


  • A Multiscale-Motion Embedding Pseudo-3D (MME-P3D) gesture recognition algorithm has been proposed to tackle the issues of excessive parameters and high computational complexity encountered by existing gesture recognition algorithms deployed in mobile and embedded devices. The algorithm initially takes into account the characteristics of gesture motion information, integrating the channel attention (CE) mechanism into the pseudo-3D (P3D) module, thereby constructing a P3D-C feature extraction network that can efficiently extract spatio-temporal feature information while reducing the complexity of the algorithmic model. To further enhance the understanding and learning of the global gesture movement's dynamic information, a Multiscale Motion Embedding (MME) mechanism is subsequently designed. The experimental findings reveal that the MME-P3D model achieves recognition accuracies reaching up to 91.12% and 83.06% on the self-constructed conference gesture dataset and the publicly available Chalearn 2013 dataset, respectively. In comparison with the conventional 3D convolutional neural network, the MME-P3D model demonstrates a significant advantage in terms of parameter count and computational requirements, which are reduced by as much as 82% and 83%, respectively. This effectively addresses the limitations of the original algorithms, making them more suitable for deployment on embedded and mobile devices and providing a more effective means for the practical application of hand gesture recognition technology.

    Citation: Hongmei Jin, Ning He, Boyu Liu, Zhanli Li. Research on gesture recognition algorithm based on MME-P3D[J]. Mathematical Biosciences and Engineering, 2024, 21(3): 3594-3617. doi: 10.3934/mbe.2024158




    With the continuous advancement of computer science, human-computer interaction and communication with various intelligent devices have become integral aspects of daily life [1]. Gesture language, as a unique form of communication, has garnered widespread attention due to its natural and intuitive characteristics. Although it may not be as convenient as spoken communication, gestures can still accurately convey users' emotional information. As an indispensable key technology for future human-computer interaction, vision-based gesture recognition technology has emerged as a current research focus. However, the specificity, diversity, and polysemy of gestures themselves, coupled with the complexity of the human hand structure and limitations in computer vision technology, have made vision-based gesture recognition a challenging research domain that has attracted numerous researchers to dedicate their efforts to it [2].

    Gesture recognition techniques can be broadly classified into two primary categories: wearable sensor-based and computer vision-based. Initially, gesture recognition methods relied on electromagnetic gloves and other wired devices directly connected to the computer. In this approach, hand information was transmitted to the computer recognition system for further processing. Xue et al. [3] employed Cyber Glove data gloves in conjunction with a hybrid method to identify ten distinct types of gestures. Although this method demonstrated a high degree of accuracy in gesture detection, its practical applicability is constrained due to factors such as the costly nature of data gloves and the cumbersome wearing process [4]. In 2020, Zhang et al. [5] developed a flexible wearable data glove for acquiring human gesture data and employed a radial basis function neural network for gesture capture and recognition, achieving 88.73% recognition accuracy. In contrast, vision-based gesture recognition technology has gradually reduced its dependence on hardware devices since its development began in the 1990s. Dardas et al. [6] addressed the challenge of gesture tracking and recognition in complex scenarios by extracting hand keypoints using SIFT features and SVM classifiers and training a model to recognize ten different gestures. However, this method requires keypoint extraction prior to recognition, which is less efficient to execute and often necessitates the design of some effective feature extraction schemes to enhance gesture recognition performance.

    The progression of computer hardware and software has facilitated the extensive application of deep learning [7], which has also opened up new avenues of exploration in the field of gesture recognition. A prior scholarly investigation by Barros et al. [8] proposed a multi-channel convolutional neural network model for real-time gesture recognition that enhanced the classification features with a cubic convolutional kernel. Gnanapriya et al. [9] proposed an enhanced two-stage integrated model combining U-NET and convolutional neural networks for gesture segmentation and recognition. With the progression of technological tools, research on gesture recognition methods has also made substantial strides. Miao et al. [10] employed the ResC3D convolutional neural network in addressing dynamic gesture recognition, merging the advantages of the residual network and the 3D convolutional neural network [11]. This approach can extract spatio-temporal features while learning deep information. In the ChaLearn LAP of 2017 [12], it achieved favorable results in the multimodal isolated gesture recognition challenge. Wang et al. [13] employed gesture contour features extracted using the slope difference distribution (SDD) method for recognition. Initially, the hand contour was extracted, followed by the calculation of the peaks and valleys of the hand contour through the SDD algorithm for model matching recognition. Gao et al. [14] improved the 2D hand pose estimation based on the OpenPose method, developed a fast 3D hand pose estimation approach, and utilized a weighted fusion method to combine RGB, depth, and 3D skeleton data of the gesture. Finally, they employed the 3DCNN+ConvLSTM framework to recognize and classify the combined dynamic gesture data, effectively enhancing the recognition performance. However, the algorithmic model of this method is large and unsuitable for deployment on mobile and embedded devices. Li et al. [15] utilized millimeter waves for gesture recognition and devised a data enhancement framework to compute the correlation between signal and gesture changes. They also segmented the gesture image to enhance computational efficiency. The method extracts spatio-temporal information from dynamic windows for gesture recognition, further advancing the development of gesture recognition technology. Currently, behavior recognition techniques rooted in skeleton data have achieved commendable recognition outcomes [16,17,18,19,20], as they provide granular details concerning the positioning of human joints and movement trajectories. This attribute is particularly instrumental for the accurate recognition of intricate and continuous actions. However, it should be noted that the domain of gesture recognition is inherently more circumscribed compared to the broader scope of behavior recognition. Consequently, temporal sequences extracted using information from the skeleton network tend to exhibit a lesser degree of variation. Therefore, directly transplanting these methods onto gesture recognition tasks may encounter certain inherent limitations.

    Since convolutional neural networks (CNNs) have demonstrated remarkable accuracy in the domain of image classification tasks, a growing number of researchers have ventured into investigating their application to video understanding, particularly within the realm of gesture recognition. Although both motion recognition and image classification are fundamentally classification problems, they present numerous challenges and intricacies when dealing with sequential video frames due to the distinct natures of video data and the differing types of feature information that must be extracted. In action recognition contexts, it is imperative not only to consider spatial feature details within each video frame—such as hand position, hand morphology, and environmental features—but also temporal dynamics between frames, including the kinematic trend of the hand movement. This necessitates a holistic approach that captures both spatial and temporal aspects effectively.

    In this study, we propose the MME-P3D algorithm. First, a P3D-C network is designed for end-to-end gesture recognition. This network combines the channel attention mechanism (CE) with the P3D network to model the channel relationships of the input features and obtain the channel-wise weight distribution of the features. By strengthening useful channel features and suppressing irrelevant ones, it enhances the feature extraction capability of the P3D-C network. Subsequently, the Multiscale Motion Embedding (MME) mechanism, a pooling-based motion attention module, is integrated into the P3D-C network. Explicit motion features are constructed by computing the feature differences between two adjacent frames, focusing on and extracting the temporal feature information throughout the entire gesture movement process. This enables the algorithmic model to better understand and learn the dynamic information during gesture movement, significantly improving the accuracy and efficiency of gesture recognition. The primary contributions of the MME-P3D-based gesture recognition algorithm are briefly summarized as follows:

    1) We designed a feature extraction network, P3D-C, tailored to the characteristics of gesture motion information. The network employs a pseudo-3D convolution structure to simulate 3×3×3 convolution for spatio-temporal feature extraction, effectively reducing the number of parameters. Concurrently, we integrated the channel attention (CE) mechanism into the P3D convolution to further enhance the feature extraction capability of the P3D convolution block.

    2) We developed a multi-scale motion attention mechanism, MME, that constructs explicit motion features by computing feature differences between adjacent frames. This significantly reduces the number of parameters and computation required by the model while substantially improving the performance and efficiency of gesture recognition.

    The remaining sections of the article are organized as follows: The first part presents a review of the state-of-the-art research related to gesture recognition. The second part briefly introduces the lightweight technology of convolutional neural network. The third part provides a detailed description of the MME-P3D gesture recognition algorithm. The fourth part showcases and analyzes the results of comparative experiments. Lastly, the fifth part summarizes the algorithm.

    Gesture recognition is essentially a form of image classification that necessitates two pivotal stages: feature extraction and subsequent classification. In the initial phase, the extraction process entails discerning critical attributes that distinctly characterize a gesture, such as contours, textures, and colors, from the input visual data. The derived features are then subjected to classification in order to differentiate between various types of gestures. The CNNs have emerged as a pivotal technology for achieving both efficient and accurate gesture recognition in this domain.

    With the pervasive use of mobile and embedded devices, deploying CNNs on edge devices holds substantial practical significance and relevance. However, the relentless pursuit of heightened recognition precision and performance has led to increasingly deep network models, escalating complexity, and surging numbers of parameters and computational demands. Consequently, the inference speed of these models decreases, and they occupy substantial memory and computational resources. Given the inherently constrained computational and storage capacities typical of mobile and embedded systems, deploying these resource-intensive models proves challenging, limiting their applicability and impeding widespread adoption. Thus, striking an optimal balance among accuracy, inference speed, and model size becomes imperative. This has rendered the adaptation of convolutional neural network structures a pressing research topic in the academic community. In recent years, numerous research endeavors have focused on reducing the number of parameters and operations in models by optimizing 3D convolutional structures. Xu et al. [21] proposed an online lightweight two-stage framework for accurate detection and classification of dynamic gestures from a single RGB camera on raw video streams in real scenarios, which addresses the challenge of fast and accurate recognition of gestures in real systems. Qiu et al. [22] introduced a pseudo-3D convolutional network that employs a pseudo-3D convolutional structure to simulate 3D convolutional operations, effectively addressing the issue of oversized network models caused by traditional 3D convolution. This improvement enhances the efficiency and performance of classification and recognition tasks, with a multitude of experimental results verifying the validity and feasibility of the pseudo-3D convolutional structure. Moreover, deep learning models such as R(2+1)D [23] and S3D [24] have also been extensively analyzed in numerous experiments. These studies demonstrate that it is feasible to decompose the 3D convolution operation into a 2D convolution in the spatial dimension and a 1D convolution in the temporal dimension, thereby combining spatial feature information with temporal feature information. This approach significantly reduces the number of parameters and computational complexity, improving the algorithmic model's efficiency and enhancing the network's robustness. However, while converting the convolution operation can effectively reduce the number of parameters and improve computational efficiency, due to the limitations of kernel size, such an optimized neural network can only extract encoded short-term motion feature information within a small, fixed-length time-domain window. As a result, it is difficult to obtain the complete time-series motion feature information for the entire action process, which may decrease gesture recognition accuracy to some extent.

    The MME-P3D gesture recognition algorithm is mainly composed of the P3D-C convolution blocks and the MME attention mechanism, combined with a classification task for training and optimization. To reduce the number of parameters and the computation of the algorithm model, we employ a P3D convolution kernel to simulate 3D convolution for extracting spatio-temporal features of gesture actions. Subsequently, global spatio-temporal information is modeled through multi-scale channel attention to strengthen valid information while suppressing invalid information, thereby enhancing the feature extraction capability of the algorithmic network. Furthermore, we integrate the MME as an auxiliary module, which constructs explicit motion features by calculating feature differences between adjacent frames. This aids the algorithmic model in better understanding and learning dynamic information during gesture movement. The overall architecture of this network model is depicted in Figure 1.

    Figure 1.  Overall framework diagram of MME-P3D gesture action recognition network.

    In this paper, a feature extraction network (P3D-C network) is constructed using P3D convolution. The structure of the network is schematically depicted in Figure 2 and comprises four P3D convolution blocks.

    Figure 2.  P3D-C Network structure diagram.

    Firstly, spatio-temporal features are extracted from the input feature by the P3D convolutional layer, where H, W, T, and C denote the height, width, temporal depth, and the number of channels of the feature map, respectively. In the P3D convolutional block, the spatio-temporal features are extracted by a pseudo-3D convolutional structure (consisting of a 1×1×3 convolutional layer and a 3×3×1 convolutional layer) that simulates a 3×3×3 convolution, thereby reducing the number of parameters. The number of parameters of a 3D convolutional layer is given by $(k_h \times k_w \times k_t \times n_{ic} + 1) \times n_{oc}$, where $k_h$, $k_w$, and $k_t$ are the sizes of the 3D convolutional kernel along the height, width, and time dimensions, $n_{ic}$ is the number of channels of the input feature map, and $n_{oc}$ is the number of 3D convolutional kernels. Secondly, the channel attention (CE) mechanism is incorporated into the P3D convolutional block; it models the channel relationships of the input features, obtains the channel-wise weight distribution of the features, strengthens the useful channel features, and suppresses the irrelevant ones, so as to enhance the feature extraction capability of the P3D convolutional block. Finally, the output features of the P3D convolution block are obtained by fusing the output features of the 1×1×1 convolution layer and the output features of the CE module using a feature fusion layer. The structure of the channel attention is schematically depicted in Figure 3.
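    To make the parameter saving concrete, the following minimal Python sketch applies the parameter formula above to compare a standard 3×3×3 convolution with its pseudo-3D decomposition. The channel counts (64 input and 64 output channels) are illustrative assumptions, not values taken from the paper.

```python
# Parameters of one 3D convolutional layer: (k_h * k_w * k_t * n_ic + 1) * n_oc
def conv3d_params(k_h, k_w, k_t, n_ic, n_oc):
    return (k_h * k_w * k_t * n_ic + 1) * n_oc

n_ic = n_oc = 64  # illustrative channel counts (not taken from the paper)

full_3d = conv3d_params(3, 3, 3, n_ic, n_oc)        # 3x3x3 kernel: 110,656 params
pseudo_3d = (conv3d_params(3, 3, 1, n_ic, n_oc)     # 3x3x1 spatial layer: 36,928 params
             + conv3d_params(1, 1, 3, n_oc, n_oc))  # 1x1x3 temporal layer: 12,352 params

print(f"3x3x3 convolution:         {full_3d:,} parameters")
print(f"pseudo-3D (3x3x1 + 1x1x3): {pseudo_3d:,} parameters")
print(f"parameter reduction:       {1 - pseudo_3d / full_3d:.1%}")   # roughly 55% fewer
```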

    Figure 3.  Channel attention (CE) structure diagram.

    In the P3D-C network, N, T, C, H, and W denote the batch size, the number of segments, the number of channels, and the height and width of the input image, respectively. Firstly, spatial average pooling is performed on the given input $x \in \mathbb{R}^{N \times T \times C \times H \times W}$ to aggregate the spatial information of the input features, yielding the tensor $F \in \mathbb{R}^{N \times T \times C \times 1 \times 1}$. Secondly, a 1×1 convolution kernel $K_1$ is used to compress the number of channels to 1/16 of the original, producing the tensor $F_r \in \mathbb{R}^{N \times T \times C/r \times 1 \times 1}$; $F_r$ has a receptive field over the global spatio-temporal features, which improves the extraction and learning of feature information. Then, $F_r$ is reshaped to $\mathbb{R}^{N \times C/r \times T \times 1 \times 1}$ and processed by a 1D temporal convolution $K_2$ with kernel size 3 to obtain $F_{temp}$. In this way, temporal reasoning is performed on the information of the same channel at different time steps so as to perceive the temporal information within each channel, after which the tensor is reshaped back to its original layout. A 1×1 convolution kernel $K_3$ and the Sigmoid activation function are then used to obtain the channel excitation matrix $M \in \mathbb{R}^{N \times T \times C \times 1 \times 1}$. Finally, the two vectors are added to obtain the output result. The calculation process can be expressed in Eqs (3.1)–(3.6):

    $F = \dfrac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} x[:,:,:,i,j]$ (3.1)

    where $x$ denotes the input feature map tensor with dimensions $(N, T, C, H, W)$, $H$ and $W$ denote the height and width of the feature map, respectively, and $F$ is the output value after spatial global average pooling.

    $F_r = K_1 * F$ (3.2)
    $F_{temp} = K_2 * F_r$ (3.3)
    $F_0 = K_3 * F_{temp}$ (3.4)

    where $K_1$, $K_2$, and $K_3$ denote the transformation kernels; $F_r$ denotes the feature vector after the transformation by $K_1$; $F_{temp}$ denotes the feature vector after the temporal convolution $K_2$; $K_3$ performs a weighting operation on $F_{temp}$; and $F_0$ denotes the final feature vector generated after processing by $K_3$.

    $M = \sigma(F_0)$ (3.5)
    $out = F + M \otimes F$ (3.6)

    where $out$ is the feature output from the CE module, $\sigma$ denotes the Sigmoid function operation, and $\otimes$ denotes the feature fusion operation.
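    The following is a minimal NumPy sketch of the CE computation in Eqs (3.1)–(3.6). The randomly initialized kernels, the dropping of the singleton spatial dimensions, and the reading of Eq (3.6) as weighting the pooled features $F$ by the excitation matrix $M$ are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_attention(x, r=16, rng=np.random.default_rng(0)):
    """Channel excitation (CE) over x of shape (N, T, C, H, W); C must be divisible by r."""
    N, T, C, H, W = x.shape
    # Eq (3.1): spatial global average pooling -> F, shape (N, T, C)
    F = x.mean(axis=(3, 4))
    # Eq (3.2): 1x1 convolution K1 squeezes the channels C -> C/r (a per-frame linear map here)
    K1 = rng.standard_normal((C, C // r)) * 0.01
    Fr = F @ K1                                    # (N, T, C/r)
    # Eq (3.3): temporal convolution K2 with kernel size 3 along the T axis (zero-padded)
    K2 = rng.standard_normal(3) * 0.01
    padded = np.pad(Fr, ((0, 0), (1, 1), (0, 0)))
    Ftemp = sum(K2[i] * padded[:, i:i + T, :] for i in range(3))
    # Eq (3.4): 1x1 convolution K3 restores the channel dimension C/r -> C
    K3 = rng.standard_normal((C // r, C)) * 0.01
    F0 = Ftemp @ K3                                # (N, T, C)
    # Eq (3.5): sigmoid gives the channel excitation matrix M
    M = sigmoid(F0)
    # Eq (3.6): residual fusion, read here as weighting the pooled features F by M
    return F + M * F

out = ce_attention(np.random.rand(2, 8, 32, 14, 14))
print(out.shape)   # (2, 8, 32)
```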

    In the 3D convolution process, target features in output features are derived from input features and the convolution kernel through a local inner product operation within the receptive field. Consequently, 3D convolution only considers local information in input features during feature extraction. However, when processing video frame sequences, target features may depend not only on local feature information in input features but also on other spatio-temporal feature information, such as motion features. To address this, we designed a multi-scale motion attention (MME) mechanism. Explicit motion features are constructed by calculating feature differences between adjacent frames. This approach significantly reduces the number of parameters and computational effort of the model compared to the optical flow method [25], which calculates luminance differences between neighboring pixels in a two-frame image.

    The schematic structure of the motion attention mechanism is shown in Figure 4. By incorporating the multi-scale motion attention mechanism, our method can better capture and utilize spatio-temporal feature information in video frame sequences without adding excessive computational burdens, thereby enhancing the accuracy and efficiency of gesture recognition.

    Figure 4.  Multiscale motion attention (MME) structure diagram.

    Firstly, the feature $x \in \mathbb{R}^{N \times T \times C \times H \times W}$ is subjected to a 1×1 2D convolution for dimensionality reduction, resulting in frame-level features $P(t), P(t+1), P(t+2), \ldots \in \mathbb{R}^{N \times C/16 \times H \times W}$. Subsequently, the difference map is computed by grouping two adjacent frames of the image. The computation process is shown in Eq (3.7):

    $d_T = \mathrm{Conv}_{3 \times 3}(X_{T+1}) - X_T$ (3.7)

    where $X_T$ denotes the input feature map at time step $T$, and $X_{T+1}$ denotes the input feature map at time step $T+1$, i.e., the frame of features immediately following $X_T$. $\mathrm{Conv}_{3 \times 3}$ denotes a 3×3 convolution operation.

    Then, the $t-1$ difference maps are stacked along the time dimension. Due to the influence of shooting conditions, moving objects may exhibit positional offsets between two adjacent frames. If the spatial receptive field is small, direct element-wise subtraction at corresponding positions of the feature maps can result in feature semantic mismatches and produce misclassification issues. To address this, a 3×3 spatial convolution is used to fuse neighborhood features before the subtraction. Subsequently, the difference maps are concatenated along the time dimension. Since $t$ frames of motion images can only generate $t-1$ difference frames, to ensure data integrity, the $t$-th frame is zero-padded, yielding a complete $D_T$. Next, $D_T$ is input into the multiscale pooling (MP) layer, whose structure is schematically shown in Figure 5. The downsampling operation performed by the multiscale pooling layer produces $D_T^m \in \mathbb{R}^{N \times T \times C/16 \times 1 \times 1}$. The motion attention weight coefficients $X_T^m \in \mathbb{R}^{N \times T \times C \times 1 \times 1}$ are then generated through the Softmax activation function. After the motion attention weight coefficients are obtained, they are element-wise multiplied, with a residual link, with the input feature $x \in \mathbb{R}^{N \times T \times C \times H \times W}$, ultimately yielding the output feature $F \in \mathbb{R}^{N \times T \times C \times 1 \times 1}$ of the MME.

    Figure 5.  Multiscale Pooling (MP) layer structure diagram.

    The multi-scale pooling layer consists of max pooling layers with kernel sizes of 2, 4, and 6, respectively. This structure compresses the features at multiple scales and extracts pooled features of different scales, enabling the network to learn feature information at different granularities. The MME mechanism efficiently captures key information about action changes by comparing feature differences between consecutive video frames. First, feature dimensionality reduction is performed, and the difference maps between neighboring frames are then computed and concatenated along the time dimension to ensure data integrity. To obtain comprehensive time-series motion features, MME does not only consider a single pair of adjacent frames but also applies a multi-scale pooling operation across the entire video sequence to analyze multiple pairs of consecutive frames, so as to extract action information at different granularities. Finally, the local temporal features are integrated to construct a complete time-series motion feature representation, so that the model can simultaneously focus on the significant action change points and on the continuity and coherence of the overall action process, and can effectively analyze and understand the motion pattern of the whole time series. Meanwhile, the multi-scale pooling layer reduces the size of the feature maps in the MME, which lowers the amount of computation and the parameter count of the model. A minimal sketch of this computation is given below.
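    The following NumPy sketch illustrates the MME pipeline described above: channel reduction, frame-difference maps as in Eq (3.7), zero-padding of the final time step, multiscale pooling, and softmax attention weights broadcast back onto the input with a residual link. The random projection kernels, the simple 3×3 mean filter standing in for the learned 3×3 convolution, the averaging of the three pooling scales, and the broadcasting of the weights over the full input are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mme_attention(x, r=16, rng=np.random.default_rng(0)):
    """Multiscale motion embedding over x of shape (N, T, C, H, W)."""
    N, T, C, H, W = x.shape
    # 1x1 2D convolution for channel reduction C -> C/r (modeled as a per-frame linear map)
    Wr = rng.standard_normal((C, C // r)) * 0.01
    p = np.einsum('ntchw,cd->ntdhw', x, Wr)            # (N, T, C/r, H, W)

    # Eq (3.7): d_T = Conv3x3(X_{T+1}) - X_T, with the 3x3 conv approximated by a mean filter
    def smooth3x3(frame):                              # frame: (N, C/r, H, W)
        pad = np.pad(frame, ((0, 0), (0, 0), (1, 1), (1, 1)), mode='edge')
        return sum(pad[..., i:i + H, j:j + W] for i in range(3) for j in range(3)) / 9.0

    diffs = [smooth3x3(p[:, t + 1]) - p[:, t] for t in range(T - 1)]
    diffs.append(np.zeros_like(diffs[0]))              # zero-pad the t-th (last) frame
    d = np.stack(diffs, axis=1)                        # (N, T, C/r, H, W)

    # Multiscale pooling: max pooling with kernel sizes 2, 4 and 6, then spatial averaging
    def maxpool(feat, k):
        Hk, Wk = (feat.shape[-2] // k) * k, (feat.shape[-1] // k) * k
        f = feat[..., :Hk, :Wk].reshape(*feat.shape[:-2], Hk // k, k, Wk // k, k)
        return f.max(axis=(-3, -1))
    pooled = sum(maxpool(d, k).mean(axis=(-2, -1)) for k in (2, 4, 6)) / 3   # (N, T, C/r)

    # Expand back to C channels and turn the pooled motion cues into attention weights
    We = rng.standard_normal((C // r, C)) * 0.01
    weights = softmax(pooled @ We, axis=2)             # (N, T, C)

    # Residual link: element-wise reweighting of the input feature
    return x + x * weights[..., None, None]

out = mme_attention(np.random.rand(2, 8, 32, 24, 24))
print(out.shape)   # (2, 8, 32, 24, 24)
```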

    To validate the effectiveness and feasibility of the MME-P3D model in dynamic gesture recognition tasks and to further assess its complexity, we conducted experiments on both a self-constructed dataset (S-MGD) and a public dataset (ChaLearn 2013). The hardware environment for this experiment includes an NVIDIA Tesla V100 16 GB graphics card, an Intel(R) Xeon(R) Gold 5218R 10-core CPU, and 29 GB of DDR4 RAM. The software platform consists of the Ubuntu 20.04 LTS operating system, Python version 3.7.10, TensorFlow version 2.27.0-GPU, CUDA version 10.1.105, and cuDNN version 7.6.4.

    The Chalearn 2013 dataset is a large-scale, multimodal dynamic gesture dataset consisting of 20 Italian Sign Language gestures performed by 27 participants. The specific types of gestures can be observed in Figure 6. To facilitate analysis and evaluation, the dataset is divided into three subsets: the training set, the validation set, and the test set, with a distribution ratio of 7:2:1. Each gesture sample in the dataset includes color data, depth data, mask data, and skeletal joint point data. For our study, which focuses on vision-based gesture recognition, we exclusively utilized the color (RGB) data from the Chalearn 2013 dataset. In this particular experiment, we randomly selected six categories of sign language gesture movements from the training set's color data, extracting 200 movements for each gesture as experimental samples. These samples are then compiled and utilized for subsequent model training and performance evaluation.

    Figure 6.  Chalearn partial gesture samples in the dataset.

    Addressing the challenges in the field of dynamic gesture recognition and the limitations of existing publicly available datasets, we constructed a small-scale meeting gesture dataset, S-MGD (Self-established Meeting Gesture Dataset), from practical application scenarios. The dataset simulates real meeting scenarios for filming and encompasses five common gestures for controlling presentation software: capture, clicking, rotate, translation, and zoom.

    1) Data collection and labelling

    The S-MGD dataset employs a monocular RGB camera to record gesture instances, with the gestures of ten distinct demonstrators being captured across five action categories under varying background conditions, distances, and angles. Prior to capturing each demonstrator's gestures, the filmmaker meticulously examined the recording environment, shooting angle, and quality of sample acquisition. Following this thorough inspection, the acquisition of gesture data samples commenced. To streamline the process, enhance efficiency, and facilitate operator handling, demonstrators were instructed to perform the designated gesture actions continuously and uniformly for 90 seconds in accordance with the acquisition personnel's guidelines. They were required to pause for 5 seconds before transitioning to different gesture types, thereby easing annotation and cleaning tasks in subsequent stages. Each demonstrator was tasked with recording gestures across six scenes, encompassing three diverse light intensity levels and two alternative background environments.

    After the collection of gesture movement samples, the annotation process began. The essential step involved identifying the complete action clips of each gesture from every recorded video and labeling them with their corresponding gesture class. Any gestures that did not align with the predefined set of five presentation-control categories were systematically labeled as the 'no background' gesture category. The comprehensive data specification for the S-MGD dataset is presented in Table 1.

    Table 1.  Overview of the S-MGD dataset.
    Item                               Data specification
    Modalities                         RGB
    Total number of videos             2071
    Total number of frames             64,317
    Number of classes                  5
    Number of actors                   10
    Avg. duration of videos            18
    Avg. number of videos per class    364

    To maintain consistency with the Chalearn 2013 dataset, we have standardized the image dimensions to 112×112 pixels. When the original image scale is smaller than the cropping scale, a bilinear interpolation up-sampling technique is employed for enhancement, thereby improving image clarity and ensuring effective gesture recognition classification. A selection of gesture samples is depicted in Figure 7.

    Figure 7.  Some gesture samples in the conference gesture dataset S-MGD.

    Furthermore, the data length distribution of S-MGD predominantly concentrates within the range of 5–40 frames, accounting for 80.8% of the entire gesture dataset, manifesting a distinct concentration pattern. Unlike publicly available gesture datasets that typically encompass a limited number of samples with extensive variability in sample lengths, S-MGD uniquely focuses on the variability and diversity of gesture presentation speeds through the lens of sample lengths—a dimension that has been underappreciated and underemphasized in other public gesture datasets. Not only does S-MGD exhibit an overall high degree of sample length variability, but it also showcases rich variability across different gesture categories. For instance, rotation gestures primarily span between 16 and 32 frames in length, constituting 70.9% of the total samples, with only a sparse number exceeding 32 frames. Conversely, zoom gestures mostly fall within the range of 32–48 frames, representing 61.4% of the aggregate samples, with very few instances surpassing 48 frames in duration. The temporal duration of gestures varies significantly among categories, and correspondingly, so does the sample length variability within the dataset, which aligns more closely with real-world scenarios.

    During the experiment, we processed the Chalearn 2013 dataset to maintain consistent row and column scales with the 20bn-Jester dataset (normalized to 112×112 px). When the image scale was smaller than the cropping scale, we utilized bilinear interpolation up-sampling to enhance image clarity and ensure effective gesture recognition classification. To evaluate model performance, we randomly shuffled all samples and divided the training and test sets of the Chalearn 2013 and S-MGD datasets in a 4:1 ratio. This data partitioning strategy ensures independence between the training and test sets, rendering our experimental results more reliable.
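    As an illustration, the frame resizing and the 4:1 split described above could be implemented as in the following sketch; the OpenCV dependency, the function names, and the placeholder sample list are assumptions made for illustration only.

```python
import random
import cv2  # assumes OpenCV is available

def preprocess_frame(frame):
    """Resize a frame to 112x112; bilinear interpolation also handles up-sampling."""
    return cv2.resize(frame, (112, 112), interpolation=cv2.INTER_LINEAR)

def split_samples(samples, train_ratio=0.8, seed=42):
    """Shuffle and split the sample list into training and test sets at a 4:1 ratio."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]

# Hypothetical usage: `all_clips` would hold (frame_sequence, label) pairs
# train_set, test_set = split_samples(all_clips)
```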

    This section first assesses the parallel computing capability of the MME-P3D algorithmic model and compares it with the RNN-Model network [26] and the C3D [27] network in a comparative experiment. Among these, the RNN-Model is a neural network with a recurrent structure that updates its parameters by minimizing the difference between predicted and true results. The C3D network employs 3D convolutional kernels for convolution operations, effectively capturing spatio-temporal information in videos. The network structure of MME-P3D will not be discussed further in this context. In our experiments, we set the batch size to 8 and measured the parallel computing performance of each model using the inference time per batch.
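    A simple way to measure the per-batch inference time used in this comparison is sketched below; the warm-up loop, the run count, and the hypothetical `mme_p3d_model` handle are assumptions, not details taken from the paper.

```python
import time

def time_per_batch(model, batch, warmup=3, runs=10):
    """Average inference time in milliseconds for a single batch."""
    for _ in range(warmup):                  # warm-up passes to exclude one-off setup cost
        model(batch, training=False)
    start = time.perf_counter()
    for _ in range(runs):
        model(batch, training=False)
    return (time.perf_counter() - start) / runs * 1000.0

# Hypothetical usage with a batch of 8 clips (16 frames, 112x112 RGB):
#   batch = tf.random.uniform([8, 16, 112, 112, 3])
#   print(f"{time_per_batch(mme_p3d_model, batch):.1f} ms per batch")
```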

    The results of comparing the running speeds of the three algorithmic models in the same experimental environment are presented in Figure 8. Due to the serialized computation of the recurrent neural network model RNN-Model, i.e., each moment's computation requires waiting for the result of the previous moment's computation, its parallel computing capability is poor, necessitating 372 ms to complete training per batch. In contrast, the C3D network model requires a training time of 409 ms, which is even less efficient due to the fact that the C3D network uses three-dimensional convolution, resulting in a larger number of parameters in the model. The MME-P3D algorithm model proposed in this paper only requires 169 ms to complete the operation, with a time cost of approximately 42% of the C3D network and 45% of the RNN model, representing a significant advantage in terms of operational speed. This improvement is mainly attributed to MME-P3D's adoption of P3D convolution to simulate 3D convolution for spatio-temporal feature extraction and learning, significantly reducing the model's complexity. Experimental results demonstrate that the MME-P3D network framework possesses good parallel computing ability, which is crucial for enhancing the real-time performance and practicality of gesture recognition tasks.

    Figure 8.  Model running speed comparison.

    To investigate the effect of the number of P3D blocks (N) on the number of parameters, the number of computations (FLOPs), and the gesture recognition accuracy of the MME-P3D model, we conducted ablation experiments using the conference gesture dataset S-MGD. In these experiments, we set different numbers of blocks N (N = 2, 4, 6, and 8) to explore the relationship between model size and accuracy.

    Table 2 and Figure 9 depict the relationship between the number of blocks and the number of model parameters and computations. For all experiments, we initialized weights using the He normal distribution method and trained them using the stochastic gradient descent (SGD) optimization algorithm [28] with a momentum parameter of 0.9. We set the initial learning rate parameter to 0.001 and performed a total of 30 training epochs.

    Table 2.  The effect of the number of blocks on model size.
    Model      Number of blocks    Parameter quantity/M    FLOPs/G
    MME-P3D    N = 2               28.67                   61.05
               N = 4               33.93                   73.67
               N = 6               39.19                   86.29
               N = 8               44.45                   98.91
    Figure 9.  Change of test loss value based on S-MGD dataset.

    From the data in Table 2, it can be observed that both the number of parameters and the computational requirements of the MME-P3D model increase as the number of blocks increases. Specifically, when the block count is raised from N = 2 to N = 4, the parameter count of the MME-P3D model grows from 28.67 to 33.93 M, a rise of approximately 5.26 M, while the computational load (FLOPs) escalates from 61.05 to 73.76 G, an increment of about 12.71 G. It is worth noting that despite variations in the number of block modules, the changes in the parameter count and computational demand of the MME-P3D model remain relatively modest. The model exhibits only a small change in these aspects because it uses a pseudo-3D convolutional kernel instead of a conventional 3D convolutional kernel, which leads to fewer parameters per block. Therefore, even if the number of blocks expands, this does not have a disproportionately large effect on the overall size of the MME-P3D network.

    The experimental results demonstrate that by adjusting the quantity of P3D blocks, one can effectively manage the parameter count and computational requirements of the MME-P3D model, thereby optimizing its complexity and efficiency while maintaining recognition performance.

    The relationship between the number of P3D blocks and the recognition accuracy of the MME-P3D model in the S-MGD dataset is depicted in Figure 9. As the number of blocks increases, the accuracy of the model initially rises and then falls, exhibiting an inverted V-shaped trend. When N = 4, the accuracy of the model reaches its maximum value of 85.25%. Consequently, in this paper's design, we set the number of P3D blocks for the MME-P3D network model to 4.

    To further validate the role of the motion attention module MME in the algorithmic model, we conducted performance analyses on both networks: MME-P3D and P3D-only (i.e., without the multiscale motion attention mechanism MME). The variation in test accuracy and loss values of these two models on the S-MGD dataset is shown in Figures 10 and 11.

    Figure 10.  Changes in test accuracy based on the S-MGD dataset.
    Figure 11.  Changes in test loss values based on the S-MGD dataset.

    In the early stages of training, both models exhibit faster optimization, with a significant increase in accuracy and a substantial decrease in loss values. After approximately 12 epochs, the convergence of the models begins to slow down, and changes in accuracy and loss values level off. At this stage, the test accuracy of the P3D network essentially saturates and no longer changes significantly, while the test accuracy of the MME-P3D network still experiences a small increase, reaching a steady state after about 25 epochs, which is notably better than that of the P3D network. These results suggest that the multi-scale motion attention MME can more effectively extract relevant features of gesture movement, thus positively impacting the overall recognition performance.

    From the above experimental analysis, it is evident that the multi-scale motion attention MME effectively captures and models the motion feature information of hand gestures throughout the entire movement, significantly improving the model's attention to effective motion information features. This enhancement further boosts the neural network's overall ability to identify the dynamics of hand gestures. By employing the P3D-C network instead of a traditional 3D convolutional network, the training parameters and the number of operations can be reduced, thereby increasing the model's running speed. During the design process of the MME-P3D framework, the multi-scale motion attention MME and the P3D convolution complement each other, jointly achieving network parameter compression and improved gesture recognition performance.

    To validate the proposed MME-P3D algorithm model, we conducted comparative experiments on the self-built dataset S-MGD and the public dataset Chalearn 2013. We compare our model with the traditional C3D network, MobileNet [29] as a representative of separable convolutional networks, I3D [30] as a representative of short-duration 3D networks, 3DResnet [31] as a representative of residual networks, the online lightweight framework of Ref. [21], and the MME-P3D network model proposed in this paper. During the experiment, we initialize weights using the He normal distribution method and adopt the stochastic gradient descent optimization algorithm with a momentum parameter of 0.9. We set the initial learning rate parameter to 0.001 and update it through a cosine decay function. We employ the cross-entropy loss function and conduct the experiment for a total of 30 iteration cycles. Relevant training parameters such as batch size, learning rate, number of iterations, and weight decay are kept consistent across the different experimental methods; this configuration is summarized in the sketch below.
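    A minimal tf.keras sketch of this training configuration (He-normal initialization, SGD with momentum 0.9, an initial learning rate of 0.001 with cosine decay, cross-entropy loss, 30 epochs) is given below; the steps-per-epoch value and the `build_mme_p3d()` model constructor are placeholders, not details from the paper.

```python
import tensorflow as tf

EPOCHS = 30
STEPS_PER_EPOCH = 100   # placeholder: depends on dataset size and batch size

# Cosine decay of the learning rate from the initial value of 0.001
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3, decay_steps=EPOCHS * STEPS_PER_EPOCH)

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)
loss_fn = tf.keras.losses.CategoricalCrossentropy()
initializer = tf.keras.initializers.HeNormal()   # He normal weight initialization

# Hypothetical usage, assuming `build_mme_p3d()` returns a keras.Model:
# model = build_mme_p3d(kernel_initializer=initializer)
# model.compile(optimizer=optimizer, loss=loss_fn, metrics=["accuracy"])
# model.fit(train_ds, epochs=EPOCHS, validation_data=val_ds)
```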

    The C3D network is an optimized 3D neural network that employs the concept of transfer learning to introduce VGG network parameters into a 3D convolutional network, enabling simultaneous learning of temporal and spatial features. The MobileNet network utilizes separable convolution to construct a lightweight deep neural network, achieving a balance between accuracy and network scale for excellent recognition performance. The I3D network, as a representative of long-short-term 3D networks, enhances the algorithm model's ability to extract spatio-temporal features by expanding convolution kernels. Meanwhile, the 3DResnet network leverages 3D convolutional kernels and residual structures to model and recognize hand action information. Ref. [21] introduces a motion detection network, MotionNet, to determine the presence or absence of gestures in the original video stream. MotionNet serves as a representative of the spatio-temporal network architectures that utilize spatio-temporal convolution to capture information about the movement changes in the video, with the aim of improving the performance of the action recognition task.

    During testing, we segment a video clip into sequence frames and input them into the trained network model. Through forward propagation, the probability scores of the gesture action categories are output, and the category with the highest probability score is selected as the prediction result. The recognition accuracy is then computed as shown in Eq (4.1):

    $q = \dfrac{t_s}{t_s + f_s}$ (4.1)

    where $t_s$ represents the number of gesture action samples correctly recognised by the model in this paper, $f_s$ represents the number of gesture action samples incorrectly recognised by the MME-P3D model, and $q$ represents the probability of successful gesture recognition, i.e., the accuracy rate of the model.
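    A small sketch of this evaluation step follows: the predicted class is the argmax of the network's output probability scores, and the accuracy is the ratio of correctly recognised samples to all samples, as in Eq (4.1). The variable names are illustrative.

```python
import numpy as np

def predict_and_score(prob_scores, labels):
    """prob_scores: (num_samples, num_classes) softmax outputs; labels: (num_samples,)."""
    predictions = prob_scores.argmax(axis=1)   # highest probability score wins
    ts = int((predictions == labels).sum())    # correctly recognised samples
    fs = len(labels) - ts                      # incorrectly recognised samples
    return ts / (ts + fs)                      # Eq (4.1): recognition accuracy q

# q = predict_and_score(model_outputs, ground_truth_labels)
```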

    In this study, we employ the LOSO (Leave-One-Subject-Out) cross-validation method for our experiments. In this process, we compare the proposed MME-P3D gesture recognition algorithm with existing mainstream gesture recognition algorithms, encompassing both manual feature description-based methods and deep learning approaches. Figures 12–15 respectively display the comparison curves of loss value and accuracy of the various gesture recognition algorithms during training on the S-MGD and Chalearn datasets, providing an intuitive evaluation of the different algorithms' performance in dynamic gesture recognition tasks.

    Figure 12.  Loss value map on S-MGD dataset.
    Figure 13.  Accuracy variation map on S-MGD dataset.
    Figure 14.  Loss value map on Chalearn dataset.
    Figure 15.  Accuracy variation map on Chalearn dataset.

    From Figures 12–15, it is evident that the curves of the proposed MME-P3D algorithm are similar to those of existing mainstream gesture recognition algorithms during the first few rounds of training. However, due to the use of the P3D module in MME-P3D, the number of parameters to be learned is relatively small, which reduces computation and parameter count. Consequently, the MME-P3D model demonstrates a significant improvement in accuracy from round 5, while the other models do not show substantial improvements until round 9. Additionally, the MME motion attention module in the MME-P3D model extracts motion features at multiple scales in gesture actions, causing the loss value to decrease rapidly after 5 epochs. The MME-P3D model also exhibits a smaller loss value during training compared to the other gesture recognition algorithms, indicating its stronger feature extraction capability for motion information during training. Figure 16 shows two sets of MME-P3D prediction results, for the zoom and translation gestures, on video images captured in real time from the test set samples of the S-MGD dataset.

    Figure 16.  Selected results of predictive gesture recognition on video images captured in real time from the test set samples in the S-MGD dataset.

    According to the findings presented in Table 3, a comparison of accuracy was conducted between the MME-P3D network and other gesture recognition networks using the S-MGD dataset. To ensure the validity and accuracy of the experimental results, we compared each method under its optimal parameter settings, with the aim of demonstrating the performance of each method under its optimal operating conditions. Among them, the C3D network, because it needs to capture action changes and motion features through a large spatio-temporal receptive field, was set to 32 input frames in the experiment to achieve the best performance. In contrast to 3D convolutional networks such as I3D and 3DResnet, the MME-P3D algorithm model, which employs pseudo-3D convolution as its framework, demonstrates evident advantages in computational efficiency and parameter reduction. Specifically, the MME-P3D algorithm model achieves up to an 82% and 83% reduction in parameters and calculations, respectively, significantly enhancing operational efficiency. Furthermore, with regard to recognition accuracy, the MME-P3D algorithm, incorporating the multi-scale motion attention (MME), effectively extracts action features from gesture motions at various scales, facilitating improved identification and extraction of feature information in gestures. Consequently, the MME-P3D algorithm attains an accuracy improvement of 2.91% and 5.8% compared to the lightweight two-stage framework of Ref. [21] and the lightweight MobileNet network, respectively. It is noteworthy that the lower accuracy observed for the 3DResnet network can be attributed to its large number of model parameters, coupled with the relatively small size of the S-MGD dataset, leading to model underfitting.

    Table 3.  Accuracy comparison results on the S-MGD dataset.
    Methods               Input frame number    Resolution    Accuracy (%)    Parameter quantity/M    FLOPs/G
    C3D                   32                    112×112       93.37           189.11                  237.68
    MobileNet             16                    112×112       85.24           53.72                   88.04
    I3D                   16                    112×112       90.65           110.52                  132.55
    3DResnet50            16                    112×112       82.27           123.87                  455.23
    Method of Ref. [21]   16                    112×112       88.21           42.33                   79.56
    MME-P3D               16                    112×112       91.12           33.93                   76.37
    Note: Bold font is the best value for each column.

    To further validate the effectiveness of the proposed algorithm, a series of comparative experiments was conducted on the publicly available Chalearn dataset, with the results presented in Table 4. On the Chalearn dataset, the 3DResnet algorithm exhibits a recognition accuracy that is 2.65% higher than on the S-MGD dataset. Notably, the MME-P3D network model introduced in this paper achieves a recognition accuracy of 83.06%, which is 3.58% and 8.83% higher than the lightweight two-stage framework of Ref. [21] and the lightweight MobileNet model, respectively. Furthermore, compared to the 3D convolutional network I3D, the MME-P3D network structure demonstrates an improvement in recognition accuracy of 1.9%. This improvement can be attributed to the fact that the I3D model only utilizes information from the two dimensions of space and time, whereas the MME-P3D network not only incorporates this information but also integrates the motion features of the gesture; consequently, the MME-P3D network makes higher utilization of the available information, resulting in superior recognition performance. Although C3D outperforms MME-P3D in terms of accuracy, mobile and embedded devices demand limited resources, high real-time performance, and low energy consumption. With its significantly reduced number of parameters and computational requirements, together with the novel design of the attention mechanism, MME-P3D achieves higher efficiency and lower resource consumption while maintaining competitive recognition accuracy, which makes it practical and valuable for these application scenarios.

    Table 4.  Accuracy comparison results on the Chalearn 2013 dataset.
    Methods               Input frame number    Resolution    Accuracy (%)    Parameter quantity/M    FLOPs/G
    C3D                   32                    112×112       85.99           189.11                  237.68
    MobileNet             16                    112×112       74.23           53.72                   88.04
    I3D                   16                    112×112       81.16           110.52                  132.55
    3DResnet50            16                    112×112       84.92           123.87                  455.23
    Method of Ref. [21]   16                    112×112       79.48           42.33                   79.56
    MME-P3D               16                    112×112       83.06           33.93                   76.37
    Note: Bold font is the best value for each column.

    This paper introduces the MME-P3D gesture recognition algorithm as a solution to address the challenges posed by complex dynamic gesture recognition network models, extensive parameter counts, and computational requirements. The paper proposes an enhancement to the network's feature extraction capability by incorporating channel attention CE with the P3D network. Additionally, to overcome the limitation of P3D convolution in capturing motion information within a limited time window, the authors introduce MME as a supplementary component to facilitate better understanding and learning of dynamic information during gesture motion. Experimental findings indicate that the MME-P3D model achieves a recognition accuracy of 91.12% on the S-MGD dataset of conference gestures, while the recognition accuracy on the Chalearn 2013 gesture dataset is 83.06%. Furthermore, the proposed model exhibits noticeable advantages in terms of parameter count and computational requirements, with reductions of up to 82% and 83%, respectively, when compared to the 3D convolutional neural network. Despite these reductions, the accuracy of our algorithm remains consistent with other dynamic gesture recognition methods and does not significantly lag behind. The research demonstrates that the proposed algorithm not only reduces the parameter count and computational burden of the model but also ensures accurate gesture recognition. This characteristic addresses the limitations of the original algorithm and renders MME-P3D more suitable for deployment on embedded and mobile devices.

    The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

    This work was partly supported by the Natural Science Basis Research Plan in Shaanxi Province of China under Grant 2023-JC-YB-517, and the Open Project Program of State Key Laboratory of Virtual Reality Technology and Systems, Beihang University under Grant VR-LAB2023B08.

    All of the authors declare that there is no conflict of interest regarding the publication of this article and would like to thank the anonymous referees for their valuable comments and suggestions.



    [1] Y. Zhang, J. Wang, X. Wang, H. Jing, Z. Sun, Y. Cai, Static hand gesture recognition method based on the vision transformer, Multimedia Tools Appl., 82 (2023), 1–20. https://doi.org/10.1007/s11042-023-14732-3 doi: 10.1007/s11042-023-14732-3
    [2] T. Zhang, Application of AI-based real-time gesture recognition and embedded system in the design of English major teaching, Wireless Netw., 2021 (2021), 1–13. https://doi.org/10.1007/s11276-021-02693-0 doi: 10.1007/s11276-021-02693-0
    [3] Y. Xue, Y. Yu, K. Yin, P. Li, S. Xie, Z. Ju, Human in-hand motion recognition based on multi-modal perception information fusion, IEEE Sens. J., 22 (2022), 6793–6805. https://doi.org/10.1109/JSEN.2022.3148992 doi: 10.1109/JSEN.2022.3148992
    [4] M. S. Amin, S. T. H. Rizvi, M. M. Hossain, A comparative review on applications of different sensors for sign language recognition, J. Imaging, 8 (2022), 1–48. https://doi.org/10.3390/jimaging8040098 doi: 10.3390/jimaging8040098
    [5] Y. Zhang, Y. Huang, X. Sun, Y. Zhao, X. Guo, P. Liu, et al., Static and dynamic human arm/hand gesture capturing and recognition via multiinformation fusion of flexible strain sensors, IEEE Sens. J., 20 (2020), 6450–6459. https://doi.org/10.1109/JSEN.2020.2965580 doi: 10.1109/JSEN.2020.2965580
    [6] N. H. Dardas, N. D. Georganas, Real-time hand gesture detection and recognition using bag-of-features and support vector machine techniques, IEEE Trans. Instrum. Meas., 60 (2011), 3592–3607. https://doi.org/10.1109/TIM.2011.2161140 doi: 10.1109/TIM.2011.2161140
    [7] B. Qiang, Y. Zhai, M. Zhou, X. Yang, B. Peng, Y. Wang, et al., SqueezeNet and fusion network-based accurate fast fully convolutional network for hand detection and gesture recognition, IEEE Access, 9 (2021), 77661–77674. https://doi.org/10.1109/ACCESS.2021.3079337 doi: 10.1109/ACCESS.2021.3079337
    [8] P. Barros, S. Magg, C. Weber, S. Wermter, A multichannel convolutional neural network for hand posture recognition, in Artificial Neural Networks and Machine Learning–ICANN 2014: 24th International Conference on Artificial Neural Networks, Hamburg, Germany, September 15–19, 2014. Proceedings 24, (2014), 403–410. https://doi.org/10.1007/978-3-319-11179-7_51
    [9] S. Gnanapriya, K. Rahimunnisa, A hybrid deep learning model for real time hand gestures recognition, Intell. Autom. Soft Comput., 36 (2023), 763–767. https://doi.org/10.32604/iasc.2023.032832 doi: 10.32604/iasc.2023.032832
    [10] Q. Miao, Y. Li, W. Ouyang, Z. Ma, X. Xu, W. Shi, et al., Multimodal gesture recognition based on the resc3d network, in 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), (2017), 3047–3055. https://doi.org/10.1109/ICCVW.2017.360
    [11] D. Tran, L. Bourdev, R. Fergus, L. Torresani, M. Paluri, Learning spatiotemporal features with 3D convolutional networks, in 2015 IEEE International Conference on Computer Vision (ICCV), (2015), 4489–4497. https://doi.org/10.1109/TIP.2021.3092828
    [12] J. Wan, S. Escalera, G. Anbarjafari, H. Jair Escalante, X. Baró, I. Guyon, et al., Results and analysis of chalearn lap multi-modal isolated and continuous gesture recognition, and real versus fake expressed emotions challenges, in Proceedings of the IEEE International Conference on Computer Vision Workshops, (2017), 3189–3197. https://doi.org/10.1109/ICCVW.2017.377
    [13] Z. Z. Wang, Automatic and robust hand gesture recognition by SDD features based model matching, Appl. Intell., 52 (2022), 11288–11299. https://doi.org/10.1007/s10489-021-02933-y doi: 10.1007/s10489-021-02933-y
    [14] Q. Gao, Y. Chen, Z. Ju, Y. Liang, Dynamic hand gesture recognition based on 3D hand pose estimation for human–robot interaction, IEEE Sens. J., 18 (2021), 17421–17430. https://doi.org/10.1109/JSEN.2021.3059685 doi: 10.1109/JSEN.2021.3059685
    [15] Y. Li, D. Zhang, J. Chen, J. Wan, D. Zhang, Y. Hu, et al., Towards domain-independent and real-time gesture recognition using mmwave signal, IEEE Trans. Mob. Comput., 22 (2022), 7355–7369. https://doi.org/10.1109/TMC.2022.3207570 doi: 10.1109/TMC.2022.3207570
    [16] M. Wang, X. Li, S. Chen, X. Zhang, L. Ma, Y. Zhang, Learning representations by contrastive spatio-temporal clustering for skeleton-based action recognition, IEEE Trans. Multimedia, 2023 (2023), 1–14. https://doi.org/10.1109/TMM.2023.3307933 doi: 10.1109/TMM.2023.3307933
    [17] C. Pang, X. Gao, Z. Chen, L. Lyu, Self-adaptive graph with nonlocal attention network for skeleton-based action recognition, IEEE Trans. Neural Networks Learn. Syst., 2023 (2023), 1–13. https://doi.org/10.1109/TNNLS.2023.3298950 doi: 10.1109/TNNLS.2023.3298950
    [18] P. Geng, X. Lu, C. Hu, H. Liu, L. Lyu, Focusing fine-grained action by self-attention-enhanced graph neural networks with contrastive learning, IEEE Trans. Circuits Syst. Video Technol., 33 (2023), 4754–4768. https://doi.org/10.1109/TCSVT.2023.3248782 doi: 10.1109/TCSVT.2023.3248782
    [19] W. Song, T. Chu, S. Li, N. Li, A. Hao, H. Qin, Joints-centered spatial-temporal features fused skeleton convolution network for action recognition, IEEE Trans. Multimedia, 2023 (2023), 1–15. https://doi.org/10.1109/TMM.2023.3324835 doi: 10.1109/TMM.2023.3324835
    [20] C. Pang, X. Lu, L. Lyu, Skeleton-based action recognition through contrasting two-stream spatial-temporal networks, IEEE Trans. Multimedia, 25 (2023), 8699–8711. https://doi.org/10.1109/TMM.2023.3239751 doi: 10.1109/TMM.2023.3239751
    [21] C. Xu, X. Wu, M. Wang, F. Qiu, Y. Liu, J. Ren, Improving dynamic gesture recognition in untrimmed videos by an online lightweight framework and a new gesture dataset ZJUGesture, Neurocomputing, 523 (2023), 58–68. https://doi.org/10.1016/j.neucom.2022.12.022 doi: 10.1016/j.neucom.2022.12.022
    [22] Z. Qiu, T. Yao, T. Mei, Learning spatio-temporal representation with pseudo-3D residual networks, in Proceedings of the IEEE International Conference on Computer Vision, (2017), 5533–5541. https://doi.org/10.1109/ICCV.2017.590
    [23] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, A closer look at spatiotemporal convolutions for action recognition, in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2018), 6450–6459. https://doi.org/10.1109/CVPR.2018.00675
    [24] S. Xie, C. Sun, J. Huang, Z. Tu, K. Murphy, Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification, in Proceedings of the European Conference on Computer Vision (ECCV), (2018), 318–335. https://doi.org/10.1007/978-3-030-01267-0_19
    [25] Z. Huang, X. Shi, C. Zhang, Q. Wang, K. C. Cheung, Flowformer: A transformer architecture for optical flow, in European Conference on Computer Vision, (2022), 668–685. https://doi.org/10.1007/978-3-031-19790-1_40
    [26] J. Wang, X. Li, J. Li, Q. Sun, H. Wang, NGCU: A new RNN model for time-series data prediction, Big Data Res., 27 (2022), 100296. https://doi.org/10.1016/j.bdr.2021.100296 doi: 10.1016/j.bdr.2021.100296
    [27] T. Huynh-The, C. H. Hua, N. A. Tu, D. S. Kim, Learning 3D spatiotemporal gait feature by convolutional network for person identification, Neurocomputing, 397 (2020), 192–202. https://doi.org/10.1016/j.neucom.2020.02.048 doi: 10.1016/j.neucom.2020.02.048
    [28] B. Zhou, C. Han, T. Guo, Convergence of stochastic gradient descent in deep neural network, Acta Math. Appl. Sin., 37 (2021), 126–136. https://doi.org/10.1007/s10255-021-0991-2 doi: 10.1007/s10255-021-0991-2
    [29] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, et al., Mobilenets: Efficient convolutional neural networks for mobile vision applications, preprint, arXiv: 1704.04861. https://doi.org/10.48550/arXiv.1704.04861
    [30] J. Carreira, A. Zisserman, Quo vadis, action recognition? a new model and the kinetics dataset, preprint, arXiv: 1705.07750. https://doi.org/10.48550/arXiv.1705.07750
    [31] H. Kataoka, T. Wakamiya, K. Hara, Y. Satoh, Would mega-scale datasets further enhance spatiotemporal 3D CNNs, preprint, arXiv: 2004.04968. https://doi.org/10.48550/arXiv.2004.04968
© 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)