Research article Special Issues

Multi-attribute scientific documents retrieval and ranking model based on GBDT and LR


  • Received: 26 October 2021 Revised: 13 January 2022 Accepted: 26 January 2022 Published: 10 February 2022
  • Scientific documents contain many mathematical expressions, along with text that carries mathematical semantics. Retrieval based on mathematical expressions alone, or on text alone, rarely meets retrieval needs; the real difficulty lies in effectively integrating mathematical expressions with their related textual features. This study therefore proposes a multi-attribute scientific document retrieval and ranking model based on GBDT (gradient boosting decision tree) and LR (logistic regression) that integrates the expressions and text contained in scientific documents. First, similarities are calculated for five attributes: mathematical expression symbols, mathematical expression sub-forms, mathematical expression context, scientific document keywords, and the frequency of mathematical expressions. Next, the GBDT model discretizes and recombines the five attributes. Finally, the recombined features are fed into the LR model to obtain the final retrieval and ranking results. The experiments were carried out on the NTCIR dataset. The average MAP@20 for scientific document recall was 81.92%, and the average nDCG@20 for scientific document ranking was 86.05%.
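
The GBDT-then-LR stage described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes scikit-learn, binary relevance labels, and the common "leaf index as categorical feature" reading of "discretize and reorganize". The attribute names, the function names `fit_gbdt_lr` and `rank_documents`, and all hyperparameters are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column names for the five attribute similarities in the abstract.
ATTRIBUTES = ["symbol_sim", "subform_sim", "context_sim", "keyword_sim", "frequency_sim"]


def fit_gbdt_lr(X, y, n_trees=20, max_depth=3, seed=0):
    """Fit a GBDT, one-hot encode the leaf each sample reaches, then fit LR."""
    gbdt = GradientBoostingClassifier(
        n_estimators=n_trees, max_depth=max_depth, random_state=seed
    )
    gbdt.fit(X, y)
    # apply() returns (n_samples, n_trees, 1) for binary labels; drop the last axis.
    leaves = gbdt.apply(X)[:, :, 0]
    # Each tree's leaf index becomes one categorical (discretized) feature.
    encoder = OneHotEncoder(handle_unknown="ignore")
    lr = LogisticRegression(max_iter=1000)
    lr.fit(encoder.fit_transform(leaves), y)
    return gbdt, encoder, lr


def rank_documents(model, X):
    """Score candidates with LR on the GBDT leaf features; sort best first."""
    gbdt, encoder, lr = model
    leaves = gbdt.apply(X)[:, :, 0]
    scores = lr.predict_proba(encoder.transform(leaves))[:, 1]
    order = np.argsort(-scores)  # document indices by descending score
    return order, scores


if __name__ == "__main__":
    # Synthetic stand-in data: 200 query-document pairs, five similarities in [0, 1].
    rng = np.random.default_rng(0)
    X = rng.random((200, len(ATTRIBUTES)))
    y = (X.mean(axis=1) > 0.5).astype(int)  # toy relevance labels
    model = fit_gbdt_lr(X, y)
    order, scores = rank_documents(model, X[:10])
    print(order, scores[order])
```

For brevity, the sketch fits both stages on the same samples. A common precaution is to train the GBDT and the LR on separate splits so that the LR does not overfit to leaf assignments the trees have already memorized.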

    Citation: Xuedong Tian, Jiameng Wang, Yu Wen, Hongyan Ma. Multi-attribute scientific documents retrieval and ranking model based on GBDT and LR[J]. Mathematical Biosciences and Engineering, 2022, 19(4): 3748-3766. doi: 10.3934/mbe.2022172



  • © 2022 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
