Research article | Special Issue

AD-DETR: DETR with asymmetrical relation and decoupled attention in crowded scenes


  • Received: 12 April 2023; Revised: 23 May 2023; Accepted: 15 June 2023; Published: 26 June 2023
  • Pedestrian detection in crowded scenes is widely used in computer vision, but it still faces two difficulties: 1) eliminating repeated predictions (multiple predictions corresponding to the same object), and 2) false and missed detections caused by the high occlusion rate of the scene and the small visible area of the detected pedestrians. This paper presents a detection framework based on DETR (detection transformer) to address these problems; the model is called AD-DETR (asymmetrical relation detection transformer). We find that the symmetry in the DETR framework causes predictions to be updated synchronously, which produces duplicate predictions. We therefore propose an asymmetric relation fusion mechanism in which each query asymmetrically fuses the relative relations of the surrounding predictions, so that the model learns to suppress duplicates. We further propose a decoupled cross-attention head that lets the model learn to restrict the range of attention and focus more on visible regions and on regions that contribute more to confidence; this reduces the noise introduced by occluded objects and thus lowers the false detection rate. Within the proposed asymmetric relation module, we also introduce a way to encode the relative relation between sets of attention points, which further improves the baseline. Without additional annotations, combined with deformable DETR using Res50 as the backbone, our method achieves an average precision of 92.6%, an MR$ ^{-2} $ of 40.0% and a Jaccard index of 84.4% on the challenging CrowdHuman dataset, exceeding previous methods such as Iter-E2EDet (progressive end-to-end object detection) and MIP (one proposal, multiple predictions). Experiments show that our method significantly improves the performance of query-based models in crowded scenes and is highly robust.
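The abstract only outlines the mechanism, so the following is a minimal, hypothetical PyTorch sketch of the asymmetric relation idea described above: each decoder query fuses relation information only from queries ranked above it (here, by predicted confidence), so two overlapping predictions are no longer updated symmetrically. The module name, the confidence-based ordering and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AsymmetricRelationFusion(nn.Module):
    """Illustrative sketch: each query attends only to queries ranked above it (plus itself)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # queries: (B, N, D) decoder query embeddings
        # scores:  (B, N)    predicted confidences used to impose an ordering
        rank = scores.argsort(dim=-1, descending=True).argsort(dim=-1)  # 0 = most confident
        # block[b, i, j] is True where query i must NOT attend to query j:
        # block every j ranked at or below i (except i itself), so relation
        # information flows only "downward", i.e., asymmetrically.
        block = rank.unsqueeze(-1) <= rank.unsqueeze(-2)                 # (B, N, N)
        eye = torch.eye(queries.size(1), dtype=torch.bool, device=queries.device)
        block = block & ~eye                                             # always allow self
        block = block.repeat_interleave(self.attn.num_heads, dim=0)     # (B*H, N, N)
        fused, _ = self.attn(queries, queries, queries, attn_mask=block)
        return self.norm(queries + fused)


# Toy usage: 100 queries of dimension 256 for a batch of 2 images.
queries = torch.randn(2, 100, 256)
scores = torch.rand(2, 100)
out = AsymmetricRelationFusion(256)(queries, scores)
print(out.shape)  # torch.Size([2, 100, 256])
```

The key point of the sketch is the non-symmetric attention mask: if query i can see query j, then query j cannot see query i, which breaks the tie that otherwise keeps two duplicate predictions updated in lockstep.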

    Citation: Yueming Huang, Guowu Yuan. AD-DETR: DETR with asymmetrical relation and decoupled attention in crowded scenes[J]. Mathematical Biosciences and Engineering, 2023, 20(8): 14158-14179. doi: 10.3934/mbe.2023633

References


    [1] J. Redmon, S. Divvala, R. Girshick, A. Farhadi, You only look once: Unified, real-time object detection, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2016), 779–788. https://doi.org/10.1109/CVPR.2016.91
    [2] C. Fu, W. Liu, A. Ranga, A. Tyagi, A. C. Berg, DSSD: Deconvolutional single shot detector, preprint, arXiv: 1701.06659.
    [3] J. Redmon, A. Farhadi, YOLOv3: An incremental improvement, preprint, arXiv: 1804.02767.
    [4] X. Chu, A. Zheng, X. Zhang, J. Sun, Detection in crowded scenes: One proposal, multiple predictions, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 12211–12220. https://doi.org/10.1109/CVPR42600.2020.01223
    [5] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, S. J. Belongie, Feature pyramid networks for object detection, in 2017 IEEE Conference on Computer Vision and Pattern Recognition, (2017), 936–944. https://doi.org/10.1109/CVPR.2017.106
    [6] S. Liu, D. Huang, Y. Wang, Adaptive NMS: Refining pedestrian detection in a crowd, in IEEE Conference on Computer Vision and Pattern Recognition, (2019), 6459–6468. https://doi.org/10.1109/CVPR.2019.00662
    [7] N. Bodla, B. Singh, R. Chellappa, L. S. Davis, Soft-NMS: Improving object detection with one line of code, in IEEE International Conference on Computer Vision, (2017), 5562–5570. https://doi.org/10.1109/ICCV.2017.593
    [8] S. Liu, F. Li, H. Zhang, X. Yang, X. Qi, H. Su, et al., DAB-DETR: dynamic anchor boxes are better queries for DETR, in The Tenth International Conference on Learning Representations, 2022.
    [9] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, J. Dai, Deformable DETR: deformable transformers for end-to-end object detection, preprint, arXiv: 2010.04159.
    [10] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, L. Zhang, DN-DETR: accelerate DETR training by introducing query denoising, in IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), 13609–13617. https://doi.org/10.1109/CVPR52688.2022.01325
    [11] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, S. Zagoruyko, End-to-end object detection with transformers, in Computer Vision – ECCV 2020: 16th European Conference, (2020), 213–229. https://doi.org/10.1007/978-3-030-58452-8_13
    [12] P. Sun, R. Zhang, Y. Jiang, T. Kong, C. Xu, W. Zhan, et al., Sparse R-CNN: end-to-end object detection with learnable proposals, in IEEE Conference on Computer Vision and Pattern Recognition, (2021), 14454–14463. https://doi.org/10.1109/CVPR46437.2021.01422
    [13] H. Zhang, F. Li, S. Liu, L. Zhang, H. Su, J. Zhu, et al., DINO: DETR with improved denoising anchor boxes for end-to-end object detection, preprint, arXiv: 2203.03605.
    [14] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, et al., Microsoft COCO: common objects in context, in Computer Vision - ECCV 2014 - 13th European Conference, (2014), 740–755. https://doi.org/10.1007/978-3-319-10602-1_48
    [15] M. Everingham, S. M. Eslami, L. V. Gool, C. Williams, J. Winn, A. Zisserman, The PASCAL visual object classes challenge: A retrospective, Int. J. Comput. Vis., 111 (2015), 98–136. https://doi.org/10.1007/s11263-014-0733-5
    [16] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, IEEE Trans. Pattern Anal. Mach. Intell., 39 (2017), 1137–1149. https://doi.org/10.1109/TPAMI.2016.2577031
    [17] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, et al., CrowdHuman: A benchmark for detecting human in a crowd, preprint, arXiv: 1805.00123.
    [18] C. Zhou, J. Yuan, Bi-box regression for pedestrian detection and occlusion estimation, in Computer Vision - ECCV 2018 - 15th European Conference, (2018), 138–154. https://doi.org/10.1007/978-3-030-01246-5_9
    [19] X. Huang, Z. Ge, Z. Jie, O. Yoshie, NMS by representative region: Towards crowded pedestrian detection by proposal pairing, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 10747–10756. https://doi.org/10.1109/CVPR42600.2020.01076
    [20] S. Zhang, J. Yang, B. Schiele, Occluded pedestrian detection through guided attention in cnns, in 2018 IEEE Conference on Computer Vision and Pattern Recognition, (2018), 6995–7003. https://doi.org/10.1109/CVPR.2018.00731
    [21] Y. Pang, J. Xie, M. H. Khan, R. M. Anwer, F. S. Khan, L. Shao, Mask-guided attention network for occluded pedestrian detection, in 2019 IEEE/CVF International Conference on Computer Vision, (2019), 4966–4974. https://doi.org/10.1109/ICCV.2019.00507
    [22] M. Lin, C. Li, X. Bu, M. Sun, C. Lin, J. Yan, et al., DETR for crowd pedestrian detection, preprint, arXiv: 2012.06785.
    [23] A. Zheng, Y. Zhang, X. Zhang, X. Qi, J. Sun, Progressive end-to-end object detection in crowded scenes, in IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2022), 847–856. https://doi.org/10.1109/CVPR52688.2022.00093
    [24] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, et al., Attention is all you need, in Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, (2017), 5998–6008.
    [25] P. Dollár, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An evaluation of the state of the art, IEEE Trans. Pattern Anal. Mach. Intell., 34 (2012), 743–761. https://doi.org/10.1109/TPAMI.2011.155
    [26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, et al., SSD: single shot multibox detector, in Computer Vision - ECCV 2016 - 14th European Conference, (2016), 21–37. https://doi.org/10.1007/978-3-319-46448-0_2
    [27] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in 2016 IEEE Conference on Computer Vision and Pattern Recognition, (2016), 770–778. https://doi.org/10.1109/CVPR.2016.90
    [28] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, et al., ImageNet large scale visual recognition challenge, Int. J. Comput. Vis., 115 (2015), 211–252. https://doi.org/10.1007/s11263-015-0816-y
    [29] T. Lin, P. Goyal, R. B. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in IEEE International Conference on Computer Vision, (2017), 2999–3007. https://doi.org/10.1109/ICCV.2017.324
    [30] S. Zhang, C. Chi, Y. Yao, Z. Lei, S. Z. Li, Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection, in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, (2020), 9756–9765. https://doi.org/10.1109/CVPR42600.2020.00978
    [31] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, et al., Swin transformer: Hierarchical vision transformer using shifted windows, in 2021 IEEE/CVF International Conference on Computer Vision, (2021), 9992–10002. https://doi.org/10.1109/ICCV48922.2021.00986
  • © 2023 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
