Research article

Video-based person re-identification with complementary local and global features using a graph transformer

  • Received: 08 March 2024; Revised: 16 June 2024; Accepted: 05 July 2024; Published: 23 July 2024
  • In recent years, significant progress has been made in video-based person re-identification (Re-ID). The key challenge in video person Re-ID lies in constructing discriminative and robust person feature representations. Methods based on local regions use spatial and temporal attention to extract representative local features, but prior approaches often overlook the correlations between local regions. To exploit the relationships among different local regions, we propose a novel video person Re-ID representation learning approach based on a graph transformer, which facilitates contextual interactions between related region features. Specifically, we construct a local relation graph whose nodes represent local regions and model their intrinsic relationships. The graph adopts a transformer architecture for feature propagation, iteratively refining each region feature with information from adjacent nodes to obtain part-level feature representations. To learn compact and discriminative representations, we further propose a global feature learning branch based on a vision transformer, which captures the relationships between different frames in a sequence. In addition, we design a dual-branch interaction network based on multi-head fusion attention to integrate the frame-level features extracted by the local and global branches. Finally, the concatenated global and local features, after interaction, are used for testing. We evaluated the proposed method on three datasets: iLIDS-VID, MARS, and DukeMTMC-VideoReID. Experimental results demonstrate competitive performance, validating the effectiveness of the proposed approach.

    Citation: Hai Lu, Enbo Luo, Yong Feng, Yifan Wang. Video-based person re-identification with complementary local and global features using a graph transformer[J]. Mathematical Biosciences and Engineering, 2024, 21(7): 6694-6709. doi: 10.3934/mbe.2024293
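
    To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of the dual-branch idea: a transformer-style encoder propagates information among local region nodes (standing in for the paper's graph transformer over a fully connected region graph), and a multi-head attention module lets global frame features attend to the refined local features before the two are concatenated. This is an illustration under our own assumptions, not the authors' implementation; all names and sizes (`LocalRelationGraph`, `FusionAttention`, `EMBED_DIM`, `NUM_REGIONS`, etc.) are hypothetical.

    ```python
    import torch
    import torch.nn as nn

    # Assumed sizes for illustration only.
    EMBED_DIM, NUM_HEADS, NUM_REGIONS = 256, 4, 6

    class LocalRelationGraph(nn.Module):
        """Local branch sketch: region nodes exchange messages through a
        transformer encoder, i.e., self-attention over a fully connected
        graph of local regions."""
        def __init__(self):
            super().__init__()
            layer = nn.TransformerEncoderLayer(
                d_model=EMBED_DIM, nhead=NUM_HEADS, batch_first=True)
            self.propagate = nn.TransformerEncoder(layer, num_layers=2)

        def forward(self, region_feats):        # (N, NUM_REGIONS, EMBED_DIM)
            return self.propagate(region_feats)  # refined region features

    class FusionAttention(nn.Module):
        """Dual-branch interaction sketch: global frame features (queries)
        attend to the refined local features (keys/values)."""
        def __init__(self):
            super().__init__()
            self.attn = nn.MultiheadAttention(
                EMBED_DIM, NUM_HEADS, batch_first=True)

        def forward(self, global_feats, local_feats):
            fused, _ = self.attn(global_feats, local_feats, local_feats)
            return fused

    # Toy usage on random stand-in features: a batch of 2 four-frame tracklets.
    B, T = 2, 4
    regions = torch.randn(B * T, NUM_REGIONS, EMBED_DIM)  # per-frame region feats
    local = LocalRelationGraph()(regions).mean(dim=1)     # pool regions per frame
    local = local.view(B, T, EMBED_DIM)
    global_ = torch.randn(B, T, EMBED_DIM)                # e.g., ViT frame embeddings
    fused = FusionAttention()(global_, local)             # (B, T, EMBED_DIM)
    video_feat = torch.cat([global_.mean(1), fused.mean(1)], dim=-1)
    print(video_feat.shape)                               # torch.Size([2, 512])
    ```

    In the paper, the region and frame features would come from learned backbones over the video frames rather than random tensors; the final concatenation mirrors the abstract's statement that the interacted global and local features are concatenated for testing.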

  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)