Research article Special Issues

Calibrated deep attention model for 3D pose estimation in the wild

  • Received: 28 October 2022; Revised: 23 December 2022; Accepted: 29 December 2022; Published: 18 January 2023
  • Three-dimensional human pose estimation is a key technology in many computer vision tasks. Regressing a 3D pose from 2D images is challenging, especially in natural scenes. Recovering a 3D pose from a monocular image is itself an ill-posed problem; moreover, most existing datasets were captured in laboratory environments, so models trained on them do not generalize well to in-the-wild data. In this work, we improve 3D pose estimation performance by introducing an attention mechanism and a calibration network. The attention model captures channel-wise dependence, which enhances the model's ability to analyze depth. The multi-scale pose calibration network adaptively learns body structure and motion characteristics and uses them to rectify the estimation results. We tested our model on the Human3.6M dataset for quantitative evaluation, and the experimental results show that the proposed method achieves higher accuracy. To test generalization to in-the-wild applications, we also report qualitative results on the natural-scene Leeds Sports Pose dataset; the visualizations show that the estimated poses are more plausible than those of the baseline model.
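
    The abstract states that the attention model captures channel-wise dependence but does not describe the module itself. As a rough illustration only, the sketch below assumes one common design: global average pooling per channel, a 1D convolution across neighbouring channels, and a sigmoid gate. The function name `channel_attention`, the nested-list feature layout, and the odd-length `kernel` argument are hypothetical choices made for this example and may differ from the authors' network.

    ```python
    import math


    def channel_attention(features, kernel):
        """Reweight the channels of a C x H x W feature map (nested lists).

        Assumed design (not taken from the paper): global average pooling,
        a 1D convolution across neighbouring channels, then a sigmoid gate.
        `kernel` is assumed to have odd length.
        """
        num_channels = len(features)

        # Squeeze: one average value per channel.
        pooled = [
            sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in features
        ]

        # 1D convolution over the channel axis with zero padding.
        pad = len(kernel) // 2
        padded = [0.0] * pad + pooled + [0.0] * pad
        mixed = [
            sum(w * padded[c + j] for j, w in enumerate(kernel))
            for c in range(num_channels)
        ]

        # Sigmoid maps each mixed score to a gate in (0, 1).
        gates = [1.0 / (1.0 + math.exp(-m)) for m in mixed]

        # Excite: scale every value in a channel by that channel's gate.
        scaled = [
            [[gates[c] * v for v in row] for row in features[c]]
            for c in range(num_channels)
        ]
        return scaled, gates


    # Two channels of size 1 x 2; the identity-like kernel keeps channels separate.
    feature_map = [[[2.0, 2.0]], [[0.0, 0.0]]]
    scaled_map, channel_gates = channel_attention(feature_map, [0.0, 1.0, 0.0])
    ```

    In this toy input the channel with the larger mean activation receives the larger gate, which is the sense in which such a module emphasizes informative channels. In a trained network the kernel weights would be learned rather than fixed by hand.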

    Citation: Longkui Jiang, Yuru Wang, Xinhe Ji. Calibrated deep attention model for 3D pose estimation in the wild[J]. Electronic Research Archive, 2023, 31(3): 1556-1569. doi: 10.3934/era.2023079

  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)