Human structure modeling for video-based person re-identification without body-part labels

Haotian Chen; Jianyuan Guo; Chao Zhang; Zhouchen Lin

doi:10.3934/bdia.2025015

Big Data and Information Analytics

2025, Volume 9, 328-349. doi: 10.3934/bdia.2025015

Research article

1. School of Intelligence Science and Technology, Peking University, No. 5 Yiheyuan Road, Haidian District, Beijing 100871, China
2. North King Information Technology Co., Ltd., 7th Floor, Qingzheng Building, No. 25 Xisanhuan North Road, Haidian District, Beijing 100089, China
3. City University of Hong Kong, 83 Tat Chee Avenue, Kowloon Tong, Hong Kong 999077, China
4. State Key Lab of General AI, School of Intelligence Science and Technology, Peking University, No. 5 Yiheyuan Road, Haidian District, Beijing 100871, China
5. Institute for Artificial Intelligence, Peking University, No. 5 Yiheyuan Road, Haidian District, Beijing 100871, China

Received: 16 November 2025 Revised: 02 December 2025 Accepted: 03 December 2025 Published: 04 December 2025

Citation: Haotian Chen, Jianyuan Guo, Chao Zhang, Zhouchen Lin. Human structure modeling for video-based person re-identification without body-part labels[J]. Big Data and Information Analytics, 2025, 9: 328-349. doi: 10.3934/bdia.2025015

Abstract

Observing that most frames in video-based person re-identification (VReID) capture human figures and that continuous movements intrinsically separate the foreground from the background, this paper demonstrates that it is unnecessary to model human structures with extra key-point estimators that are elaborately pre-trained on costly body-part labels. Specifically, we propose a novel human structure modeling (HTML) module that generates discriminative part-level features from the foreground for VReID without using explicitly annotated part labels. Under the guidance of a simple humanoid topology, HTML extracts coarse body shapes (body proportions) by mapping image patches to both the topological body parts and the background. To compensate for the lack of part-label supervision, a regularization loss is designed for training the HTML module. Furthermore, a spatial-temporal part mixer and a spatial-temporal patch mixer are introduced to make the module's output more discriminative and reliable. Extensive experiments show that our approach achieves competitive performance with a favorable accuracy-efficiency trade-off across multiple benchmarks.
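
The patch-to-part mapping described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that each patch is softly assigned to one of several body-part slots or a background slot by a softmax over dot products with prototype vectors, and that part-level features are assignment-weighted averages of patch features. The names `assign_patches`, `pool_part_features`, and `prototypes` are hypothetical, and the regularization loss and the two mixers are not modeled here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def assign_patches(patches, prototypes):
    """Softly assign each patch to the prototype slots.

    Assumption: dot-product similarity followed by a softmax. The last
    prototype is treated as the background slot.
    """
    assignments = []
    for patch in patches:
        logits = [sum(a * b for a, b in zip(patch, proto)) for proto in prototypes]
        assignments.append(softmax(logits))
    return assignments


def pool_part_features(patches, assignments, num_parts):
    """Assignment-weighted average of patch features for each body part.

    Only the first `num_parts` slots are pooled; the background slot is skipped.
    """
    dim = len(patches[0])
    parts = []
    for k in range(num_parts):
        weight = sum(a[k] for a in assignments)
        feature = [
            sum(a[k] * p[d] for a, p in zip(assignments, patches)) / (weight + 1e-8)
            for d in range(dim)
        ]
        parts.append(feature)
    return parts


# Toy example: three 2-D patch features, two part slots plus one background slot.
patches = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
prototypes = [[4.0, 0.0], [0.0, 4.0], [-4.0, -4.0]]  # hypothetical values
assignments = assign_patches(patches, prototypes)
parts = pool_part_features(patches, assignments, num_parts=2)
```

In this toy setup the third patch is mostly assigned to the background slot, so it contributes little to either part feature, which mirrors the idea of drawing part-level features from the foreground only.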

© 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)