Research article

Human structure modeling for video-based person re-identification without body-part labels

  • Published: 04 December 2025
  • With the observation that most frames in video-based person re-identification (VReID) capture human figures and that continuous movements intrinsically separate the foreground from the background, this paper demonstrates that it is unnecessary to model human structures with extra key-point estimators that are elaborately pre-trained on costly body-part labels. Specifically, we propose a novel human structure modeling (HTML) module to generate discriminative part-level features from the foreground for VReID without using dedicated part-label annotations. Under the guidance of a simple humanoid topology, HTML extracts coarse body shapes (body proportions) by mapping image patches to both the topological body parts and the background. To compensate for the lack of part-label supervision, a regularization loss is designed for training the HTML module. Furthermore, a spatial-temporal part mixer and a spatial-temporal patch mixer are introduced to make the module's output more discriminative and reliable. Extensive experiments show that our approach achieves competitive performance with a favorable accuracy-efficiency trade-off across multiple benchmarks.
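The idea of mapping image patches to body parts plus a background slot can be illustrated with a small sketch. This is not the authors' implementation: the softmax-over-similarity assignment, the prototype vectors, and the function names `assign_patches` and `pool_parts` are all assumptions made for illustration, and the regularization loss and the two mixers are not modeled here.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def assign_patches(patches: List[Vector], prototypes: List[Vector]) -> List[Vector]:
    """Soft-assign each patch to K body parts plus one background slot.

    Assumption: the last prototype stands for the background, and
    similarity is a plain dot product (hypothetical choice).
    """
    return [softmax([dot(p, q) for q in prototypes]) for p in patches]


def pool_parts(patches: List[Vector], assignment: List[Vector]) -> List[Vector]:
    """Weighted-average patch features per body part, skipping the background column."""
    num_parts = len(assignment[0]) - 1
    dim = len(patches[0])
    parts = []
    for k in range(num_parts):
        weights = [row[k] for row in assignment]
        total = sum(weights) + 1e-8  # avoid division by zero
        parts.append(
            [sum(w * p[d] for w, p in zip(weights, patches)) / total for d in range(dim)]
        )
    return parts


if __name__ == "__main__":
    # Three toy 2-D patch features and three prototypes (two parts + background).
    patches = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
    prototypes = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
    assignment = assign_patches(patches, prototypes)
    print(pool_parts(patches, assignment))
```

In this toy setup each patch receives a probability distribution over the two parts and the background, and part-level features are pooled only from the part columns, so patches that look like background contribute less to any part.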

    Citation: Haotian Chen, Jianyuan Guo, Chao Zhang, Zhouchen Lin. Human structure modeling for video-based person re-identification without body-part labels[J]. Big Data and Information Analytics, 2025, 9: 328-349. doi: 10.3934/bdia.2025015



  • © 2025 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
