Three-dimensional (3D) human pose estimation is a key technology in many computer vision tasks. Regressing a 3D pose from 2D images is challenging, especially in natural scenes. Recovering a 3D pose from a monocular image is an ill-posed problem in itself; moreover, most existing datasets were captured in laboratory environments, so models trained on them generalize poorly to in-the-wild data. In this work, we improve 3D pose estimation performance by introducing an attention mechanism and a calibration network. The attention module captures channel-wise dependencies, enhancing the model's ability to reason about depth. The multi-scale pose calibration network adaptively learns body structure and motion characteristics and thereby rectifies the estimated poses. We evaluated our model quantitatively on the Human3.6M dataset, and the experimental results show that the proposed method achieves higher accuracy. To assess generalization to in-the-wild applications, we also report qualitative results on the natural-scene Leeds Sports Pose dataset; the visualizations show that our estimates are more plausible than those of the baseline model.
Citation: Longkui Jiang, Yuru Wang, Xinhe Ji. Calibrated deep attention model for 3D pose estimation in the wild[J]. Electronic Research Archive, 2023, 31(3): 1556-1569. doi: 10.3934/era.2023079
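As a rough illustration of the channel-wise attention idea described above, the sketch below implements a squeeze-and-excitation-style gate in NumPy: global average pooling summarizes each channel, a small bottleneck maps the summary to per-channel importance weights, and the feature map is rescaled accordingly. The weight shapes, reduction ratio, and random initialization are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(features, w1, w2):
    """Squeeze-and-excitation-style channel attention (illustrative sketch).

    features: (C, H, W) feature map
    w1: (C // r, C) bottleneck reduction weights (hypothetical)
    w2: (C, C // r) bottleneck expansion weights (hypothetical)
    Returns the feature map with each channel rescaled by a learned gate.
    """
    # Squeeze: global average pooling over the spatial dimensions -> (C,)
    z = features.mean(axis=(1, 2))
    # Excitation: bottleneck MLP (ReLU then sigmoid) -> per-channel gates in (0, 1)
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Rescale: weight each channel by its gate to emphasize informative channels
    return features * s[:, None, None]

# Toy usage with random weights
rng = np.random.default_rng(0)
C, H, W, r = 8, 4, 4, 2
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C))
w2 = rng.standard_normal((C, C // r))
y = channel_attention(x, w1, w2)
print(y.shape)
```

Because the sigmoid gates lie strictly in (0, 1), the module can only attenuate channels relative to one another; it never changes the feature map's shape, which makes it easy to drop into an existing backbone.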