Source code is the heart of a software system; it holds a wealth of knowledge that can be tapped to build intelligent software systems and to enable software reuse. This work explores how to make use of the patterns hidden in software development processes and artifacts. The module described here is part of a planned smart requirements management system, which will comprise multiple modules intended to make the software requirements management phase more secure against vulnerabilities. Some of the critical challenges facing the software development community are discussed. The background of machine learning approaches and their application to software development practices is explored. Prior work on modeling source code and on approaches to understanding vulnerabilities in software systems is reviewed. Program representation is examined to establish the principles needed to understand the subject. Possibilities for source code modeling are then explored in greater depth, and machine learning best practices are discussed in line with software source code modeling.
Citation: Raghavendra Rao Althar, Abdulrahman Alahmadi, Debabrata Samanta, Mohammad Zubair Khan, Ahmed H. Alahmadi. Mathematical foundations based statistical modeling of software source code for software system evolution. Mathematical Biosciences and Engineering, 2022, 19(4): 3701-3719. doi: 10.3934/mbe.2022170