Skeleton-based action recognition is an important but challenging task in video understanding and human-computer interaction. However, existing methods suffer from two deficiencies. On the one hand, most methods rely on manually designed convolution kernels, which cannot capture the spatio-temporal joint dependencies of complex regions. On the other hand, some methods simply apply the self-attention mechanism while ignoring its theoretical explanation. In this paper, we propose a unified spatio-temporal graph convolutional network with a self-attention mechanism (SA-GCN) for low-quality motion video data captured from a fixed viewing angle. SA-GCN extracts features efficiently by learning the weights between joints at different scales. Specifically, the proposed self-attention mechanism is trained end to end with a mapping strategy for different nodes; it not only characterizes the multi-scale dependencies among joints, but also integrates the structural features of the graph with the ability to learn fused features. Moreover, the attention mechanism proposed in this paper can, to some extent, be explained theoretically through GCN, a connection that most existing models do not consider. Extensive experiments on two widely used datasets, NTU-60 RGB+D and NTU-120 RGB+D, demonstrate that SA-GCN significantly outperforms a series of mainstream approaches in terms of accuracy.
Citation: Min Li, Ke Chen, Yunqing Bai, Jihong Pei, Skeleton action recognition via graph convolutional network with self-attention module, Electronic Research Archive, 32 (2024), 2848–2864. https://doi.org/10.3934/era.2024129
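The abstract does not give SA-GCN's exact formulation, but the general mechanism it describes, a graph convolution over skeleton joints whose aggregation weights combine the fixed skeleton graph with a learned self-attention map, is well established in ST-GCN-style layers. The sketch below is a minimal illustration of that general pattern, not the authors' implementation: the class name, tensor shapes, time-pooled attention, and the identity placeholder adjacency are all illustrative assumptions.

```python
# A minimal sketch (assumed, not the paper's SA-GCN) of a graph convolution
# over skeleton joints whose aggregation weights are the sum of a fixed
# normalized adjacency and a learned joint-to-joint self-attention map.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttentionGraphConv(nn.Module):
    def __init__(self, in_channels, out_channels, embed_dim=16):
        super().__init__()
        # 1x1 convolutions produce per-joint query/key embeddings and values.
        self.query = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.key = nn.Conv2d(in_channels, embed_dim, kernel_size=1)
        self.value = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        # Learnable scalar balancing attention against the fixed skeleton graph.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, adj):
        # x:   (N, C, T, V) = batch, channels, frames, joints
        # adj: (V, V) normalized skeleton adjacency, e.g., D^-1 (A + I)
        q = self.query(x).mean(dim=2)                # (N, E, V): pool over time
        k = self.key(x).mean(dim=2)                  # (N, E, V)
        scores = torch.einsum('nev,neu->nvu', q, k)  # (N, V, V) joint-joint scores
        attn = F.softmax(scores / q.shape[1] ** 0.5, dim=-1)
        # Combine the fixed graph with the data-dependent attention map.
        a = adj.unsqueeze(0) + self.alpha * attn     # (N, V, V)
        h = self.value(x)                            # (N, C_out, T, V)
        # Aggregate each joint's features over its (soft) neighborhood.
        return torch.einsum('nctv,nvu->nctu', h, a)


# Example with NTU-style skeletons: 25 joints, 3-channel coordinates.
x = torch.randn(8, 3, 64, 25)    # batch of 64-frame clips
adj = torch.eye(25)              # placeholder; a real graph encodes the bones
layer = SelfAttentionGraphConv(in_channels=3, out_channels=64)
print(layer(x, adj).shape)       # torch.Size([8, 64, 64, 25])
```

Note the design choice that makes the abstract's "theoretically explained by GCN" claim plausible in general: the softmax attention map, like a row-normalized adjacency D^-1(A + I) in a GCN layer, is a row-stochastic joint-to-joint weighting matrix, so attention can be read as a learned, data-dependent adjacency.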