Research article Special Issues

Cross-platform binary code similarity detection based on NMT and graph embedding

  • Received: 12 March 2021 Accepted: 10 May 2021 Published: 25 May 2021
  • Cross-platform binary code similarity detection determines whether a pair of binary functions from different platforms are similar, and it plays an important role in many areas. Traditional methods compute similarity by intersecting platform-independent characteristic strands or by matching control flow graphs (CFGs), and they fall short in efficiency and scalability. Existing deep-learning-based methods improve efficiency but have low accuracy and still rely on manually constructed features. To address these problems, this manuscript proposes a cross-platform binary code similarity detection method based on neural machine translation (NMT) and graph embedding. We train an NMT model and a graph embedding model to automatically extract two parts of the semantics of binary code and represent them as a high-dimensional vector, called an embedding. The similarity of two binary functions can then be measured by the distance between their corresponding embeddings. We implement a prototype named SimInspector. Our comparative experiments show that SimInspector outperforms the state-of-the-art approach, Gemini, by about 6% in similarity detection accuracy while maintaining good efficiency.
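
    To illustrate the last step of the abstract, comparing two functions through their embeddings, here is a minimal sketch. The abstract does not say which distance SimInspector uses, so cosine similarity is an assumption here, and the names `cosine_similarity` and `is_similar` and the `threshold` value are hypothetical, not taken from the paper.

    ```python
    import math


    def cosine_similarity(a, b):
        """Return the cosine similarity of two equal-length embedding vectors."""
        if len(a) != len(b):
            raise ValueError("embeddings must have the same dimension")
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0.0 or norm_b == 0.0:
            raise ValueError("embeddings must be non-zero vectors")
        return dot / (norm_a * norm_b)


    def is_similar(a, b, threshold=0.8):
        """Decide similarity with a hypothetical threshold (an assumption, not from the paper)."""
        return cosine_similarity(a, b) >= threshold
    ```

    In practice the two vectors would be the embeddings produced by the trained models for a pair of binary functions, and the threshold would be tuned on validation data.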

    Citation: Xiaodong Zhu, Liehui Jiang, Zeng Chen. Cross-platform binary code similarity detection based on NMT and graph embedding[J]. Mathematical Biosciences and Engineering, 2021, 18(4): 4528-4551. doi: 10.3934/mbe.2021230

  • © 2021 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)