Observing that most frames in video-based person re-identification (VReID) capture human figures and that continuous movement intrinsically separates the foreground from the background, this paper demonstrates that it is unnecessary to model human structures with extra key-point estimators that are elaborately pre-trained on costly body-part labels. Specifically, we propose a novel human structure modeling (HTML) module that generates discriminative part-level features from the foreground for VReID without using specially annotated part labels. Under the guidance of a simple humanoid topology, HTML extracts coarse body shapes (body proportions) by mapping image patches to both the topological body parts and the background. To compensate for the lack of part-label supervision, a regularization loss is designed for training the HTML module. Furthermore, a spatial-temporal part mixer and a spatial-temporal patch mixer are introduced to make the module's output more discriminative and reliable. Extensive experiments show that our approach achieves competitive performance with a favorable accuracy-efficiency trade-off across multiple benchmarks.
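To make the idea of mapping image patches to body parts and the background concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each patch is softly assigned to one of K part prototypes plus one background prototype through a softmax over scaled dot-product similarities, and that part-level features are obtained by assignment-weighted averaging. The function names (`assign_patches`, `pool_parts`), the prototype matrix, the choice of six parts, and the feature sizes are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def assign_patches(patches, prototypes):
    """Softly assign patches to body parts and the background.

    patches:    (N, D) patch features of one frame.
    prototypes: (K + 1, D) hypothetical part prototypes; the last row
                stands for the background (an assumption of this sketch).
    Returns an (N, K + 1) matrix whose rows sum to 1.
    """
    logits = patches @ prototypes.T / np.sqrt(patches.shape[1])
    return softmax(logits, axis=1)

def pool_parts(patches, assign):
    """Average patch features per body part, ignoring the background column."""
    weights = assign[:, :-1]                                  # (N, K)
    weights = weights / (weights.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ patches                                # (K, D)

rng = np.random.default_rng(0)
patches = rng.normal(size=(128, 64))     # e.g. a 16 x 8 patch grid, 64-dim features
prototypes = rng.normal(size=(7, 64))    # 6 hypothetical parts + 1 background
assign = assign_patches(patches, prototypes)
parts = pool_parts(patches, assign)
print(assign.shape, parts.shape)
```

In this sketch the background column absorbs patches that resemble no part, so they contribute little to the pooled part features. How the actual module builds its assignment, and how the regularization loss shapes it, is not specified in the abstract.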
Citation: Haotian Chen, Jianyuan Guo, Chao Zhang, Zhouchen Lin. Human structure modeling for video-based person re-identification without body-part labels[J]. Big Data and Information Analytics, 2025, 9: 328-349. doi: 10.3934/bdia.2025015