Research article

Mutation prediction in the SARS-CoV-2 genome using attention-based neural machine translation


  • Received: 31 January 2024 Revised: 05 April 2024 Accepted: 30 April 2024 Published: 20 May 2024
  • Severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) has been evolving rapidly after causing havoc worldwide in 2020. Since then, it has been very hard to contain the virus owing to its frequently mutating nature. Changes in its genome lead to viral evolution, rendering it more resistant to existing vaccines and drugs. Predicting viral mutations beforehand will help in gearing up against more infectious and virulent versions of the virus in turn decreasing the damage caused by them. In this paper, we have proposed different NMT (neural machine translation) architectures based on RNNs (recurrent neural networks) to predict mutations in the SARS-CoV-2-selected non-structural proteins (NSP), i.e., NSP1, NSP3, NSP5, NSP8, NSP9, NSP13, and NSP15. First, we created and pre-processed the pairs of sequences from two languages using k-means clustering and nearest neighbors for training a neural translation machine. We also provided insights for training NMTs on long biological sequences. In addition, we evaluated and benchmarked our models to demonstrate their efficiency and reliability.

    Citation: Darrak Moin Quddusi, Sandesh Athni Hiremath, Naim Bajcinca. Mutation prediction in the SARS-CoV-2 genome using attention-based neural machine translation[J]. Mathematical Biosciences and Engineering, 2024, 21(5): 5996-6018. doi: 10.3934/mbe.2024264

    Related Papers:

  • Severe acute respiratory syndrome coronavirus 2 (SARS‑CoV‑2) has been evolving rapidly after causing havoc worldwide in 2020. Since then, it has been very hard to contain the virus owing to its frequently mutating nature. Changes in its genome lead to viral evolution, rendering it more resistant to existing vaccines and drugs. Predicting viral mutations beforehand will help in gearing up against more infectious and virulent versions of the virus in turn decreasing the damage caused by them. In this paper, we have proposed different NMT (neural machine translation) architectures based on RNNs (recurrent neural networks) to predict mutations in the SARS-CoV-2-selected non-structural proteins (NSP), i.e., NSP1, NSP3, NSP5, NSP8, NSP9, NSP13, and NSP15. First, we created and pre-processed the pairs of sequences from two languages using k-means clustering and nearest neighbors for training a neural translation machine. We also provided insights for training NMTs on long biological sequences. In addition, we evaluated and benchmarked our models to demonstrate their efficiency and reliability.



    加载中


    [1] World Health Organization, WHO Coronavirus (COVID-19) Dashboard, 2023. Available from: https://covid19.who.int.
    [2] R. Lu, X. Zhao, J. Li, P. Niu, B. Yang, H. Wu, et al., Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus origins and receptor binding, The Lancet, 395 (2020), 565–574. https://doi.org/10.1016/S0140-6736(20)30251-8 doi: 10.1016/S0140-6736(20)30251-8
    [3] A. Naqvi, K. Fatima, T. Mohammad, U. Fatima, I. Singh, A. Singh, et al., Insights into SARS-CoV-2 genome, structure, evolution, pathogenesis and therapies: structural genomics approach, Biochim. Biophys. Acta, Mol. Basis Dis., 1866 (2020), 165878. https://doi.org/10.1016/j.bbadis.2020.165878 doi: 10.1016/j.bbadis.2020.165878
    [4] R. Sanjuán, M. Nebot, N. Chirico, L. Mansky, R. Belshaw, Viral mutation rates, J. Virol., 84 (2010), 9733–9748. https://doi.org/10.1128/jvi.00694-10 doi: 10.1128/jvi.00694-10
    [5] S. Duffy, Why are RNA virus mutation rates so damn high, PLoS Biol., 16 (2018), e3000003. https://doi.org/10.1371/journal.pbio.3000003 doi: 10.1371/journal.pbio.3000003
    [6] R. Carrasco-Hernandez, R. Jácome, Y. L. Vidal, S. P. de León, Are RNA viruses candidate agents for the next global pandemic? A review, ILAR J., 58 (2017), 343–358. https://doi.org/10.1093/ilar/ilx026 doi: 10.1093/ilar/ilx026
    [7] E. Cilia, S. Teso, S. Ammendola, T. Lenaerts, A. Passerini, Predicting virus mutations through statistical relational learning, BMC Bioinf., 15 (2014), 309. https://doi.org/10.1186/1471-2105-15-309 doi: 10.1186/1471-2105-15-309
    [8] R. Yin, E. Luusua, J. Dabrowski, Y. Zhang, C. Kwoh, Tempel: time-series mutation prediction of influenza A viruses via attention-based recurrent neural networks, Bioinformatics, 36 (2020), 2697–2704. https://doi.org/10.1093/bioinformatics/btaa050 doi: 10.1093/bioinformatics/btaa050
    [9] G. Wu, S. Yan, Prediction of mutations engineered by randomness in H5N1 neuraminidases from influenza A virus, Amino Acids, 34 (2008), 81–90. https://doi.org/10.1007/s00726-007-0579-z doi: 10.1007/s00726-007-0579-z
    [10] M. Salama, A. Hassanien, A. Mostafa, The prediction of virus mutation using neural networks and rough set techniques, EURASIP J. Bioinf. Syst. Biol., 2016 (2016), 10. https://doi.org/10.1186/s13637-016-0042-0 doi: 10.1186/s13637-016-0042-0
    [11] B. Hie, E. Zhong, B. Berger, B. Bryson, Learning the language of viral evolution and escape, Science, 371 (2021), 284–288. https://doi.org/10.1126/science.abd7331 doi: 10.1126/science.abd7331
    [12] N. Thadani, S. Gurev, P. Notin, N. Youssef, N. Rollins, D. Ritter, et al., Learning from prepandemic data to forecast viral escape, Nature, 622 (2023), 818–825. https://doi.org/10.1038/s41586-023-06617-0 doi: 10.1038/s41586-023-06617-0
    [13] K. Beguir, M. Skwark, Y. Fu, T. Pierrot, N. Carranza, A. Laterre, et al., Early computational detection of potential high-risk SARS-CoV-2 variants, Comput. Biol. Med., 155 (2023), 106618. https://doi.org/10.1016/j.compbiomed.2023.106618 doi: 10.1016/j.compbiomed.2023.106618
    [14] B. Zhou, H. Zhou, X. Zhang, X. Xu, Y. Chai, Z. Zheng, et al., TEMPO: a transformer-based mutation prediction framework for SARS-CoV-2 evolution, Comput. Biol. Med., 152 (2023), 106264. https://doi.org/10.1016/j.compbiomed.2022.106264 doi: 10.1016/j.compbiomed.2022.106264
    [15] W. Han, N. Chen, X. Xu, A. Sahil, J. Zhou, Z. Li, et al., Predicting the antigenic evolution of SARS-COV-2 with deep learning, Nat. Commun., 14 (2023), 3478. https://doi.org/10.1038/s41467-023-39199-6 doi: 10.1038/s41467-023-39199-6
    [16] M. Zvyagin, A. Brace, K. Hippe, Y. Deng, B. Zhang, C. Bohorquez, et al., GenSLMs: genome-scale language models reveal SARS-CoV-2 evolutionary dynamics, Int. J. High Perform. Comput. Appl., 37 (2023), 683–705. https://doi.org/10.1177/10943420231201154 doi: 10.1177/10943420231201154
    [17] B. Hie, K. Yang, P. Kim, Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins, Cell Syst., 13 (2022), 274–285. https://doi.org/10.1016/j.cels.2022.01.003 doi: 10.1016/j.cels.2022.01.003
    [18] Z. Lin, H. Akin, R. Rao, B. Hie, Z. Zhu, W. Lu, et al., Evolutionary-scale prediction of atomic-level protein structure with a language model, Science, 379 (2023), 1123–1130. https://doi.org/10.1126/science.ade2574 doi: 10.1126/science.ade2574
    [19] A. Elnaggar, M. Heinzinger, C. Dallago, G. Rehawi, Y. Wang, L. Jones, et al., Prottrans: toward understanding the language of life through self-supervised learning, IEEE Trans. Pattern Anal. Mach. Intell., 44 (2021), 7112–7127. https://doi.org/10.1109/TPAMI.2021.3095381 doi: 10.1109/TPAMI.2021.3095381
    [20] Y. Ji, Z. Zhou, H. Liu, R. Davuluri, DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome, Bioinformatics, 37 (2021), 2112–2120. https://doi.org/10.1093/bioinformatics/btab083 doi: 10.1093/bioinformatics/btab083
    [21] H. Dalla-Torre, L. Gonzalez, J. Mendoza-Revilla, N. Carranza, A. Grzywaczewski, F. Oteri, et al., The nucleotide transformer: building and evaluating robust foundation models for human genomics, preprint, 2023. https://doi.org/10.1101/2023.01.11.523679
    [22] P. Pushkar, C. Ananth, P. Nagrath, J. Al-Amri, Vividha, A. Nayyar, Mutation prediction for coronaviruses using genome sequence and recurrent neural networks, CMC-Comput. Mater., 73 (2022), 1601–1619. https://doi.org/10.32604/cmc.2022.026205 doi: 10.32604/cmc.2022.026205
    [23] T. Mohamed, S. Sayed, A. Salah, E. Houssein, Long short-term memory neural networks for RNA viruses mutations prediction, Math. Probl. Eng., 2021 (2021), 9980347. https://doi.org/10.1155/2021/9980347 doi: 10.1155/2021/9980347
    [24] S. Tasnim, K. Talukder, A. Asfi, Next mutation prediction of SARS-COV-2 spike protein using encoder-decoder based long short term memory (LSTM) method, Khulna Univ. Stud., 2022 (2022), 803–816. https://doi.org/10.53808/KUS.2022.ICSTEM4IR.0142-se doi: 10.53808/KUS.2022.ICSTEM4IR.0142-se
    [25] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, preprint, arXiv: 1409.0473.
    [26] M. Luong, H. Pham, C. Manning, Effective approaches to attention-based neural machine translation, preprint, arXiv: 1508.04025.
    [27] I. Sutskever, J. Martens, G. Hinton, Generating text with recurrent neural networks, in Proceedings of the 28th International Conference on Machine Learning (ICML-11), (2011), 1017–1024.
    [28] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, preprint, arXiv: 1412.3555.
    [29] K. Schubert, E. Karousis, A. Jomaa, A. Scaiola, B. Echeverria, L. Gurzeler, et al., SARS-CoV-2 NSP1 binds the ribosomal mRNA channel to inhibit translation, Nat. Struct. Mol. Biol., 27 (2020), 959–966. https://doi.org/10.1038/s41594-020-0511-8 doi: 10.1038/s41594-020-0511-8
    [30] B. Qin, Z. Li, K. Tang, T. Wang, Y. Xie, S. Aumonier, et al., Identification of the SARS-unique domain of SARS-CoV-2 as an antiviral target, Nat. Commun., 14 (2023), 3999. https://doi.org/10.1038/s41467-023-39709-6 doi: 10.1038/s41467-023-39709-6
    [31] Y. Zheng, J. Deng, L. Han, M. Zhuang, Y. Xu, J. Zhang, et al., SARS-CoV-2 NSP5 and N protein counteract the RIG-I signaling pathway by suppressing the formation of stress granules, Signal Transduction Targeted Ther., 7 (2022), 22. https://doi.org/10.1038/s41392-022-00878-3 doi: 10.1038/s41392-022-00878-3
    [32] S. Reshamwala, V. Likhite, M. Degani, S. Deb, S. Noronha, Mutations in SARS-CoV-2 NSP7 and NSP8 proteins and their predicted impact on replication/transcription complex structure, J. Med. Virol., 93 (2021), 4616–4619. https://doi.org/10.1002/jmv.26791 doi: 10.1002/jmv.26791
    [33] G. Yeo, J. Xiang, J. Mueller, E. Luo, B. Yee, D. Schafer, et al., Discovery and functional interrogation of SARS-CoV-2 protein-RNA interactions, preprint, 2022. https://doi.org/10.21203/rs.3.rs-1394331/v1
    [34] C. Vazquez, S. Swanson, S. Negatu, M. Dittmar, J. Miller, H. Ramage, et al., SARS-CoV-2 viral proteins NSP1 and NSP13 inhibit interferon activation through distinct mechanisms, PLoS One, 16 (2021), e0253089. https://doi.org/10.1371/journal.pone.0253089 doi: 10.1371/journal.pone.0253089
    [35] M. Pillon, M. Frazier, L. Dillard, J. Williams, S. Kocaman, J. Krahn, et al., Cryo-EM structures of the SARS-CoV-2 endoribonuclease Nsp15 reveal insight into nuclease specificity and dynamics, Nat. Commun., 12 (2021), 636. https://doi.org/10.1038/s41467-020-20608-z doi: 10.1038/s41467-020-20608-z
    [36] S. Khare, C. Gurry, L. Freitas, M. Schultz, G. Bach, A. Diallo, et al., Perspectives: GISAID's role in pandemic response, China CDC Weekly, 3 (2021), 1049–1051. https://doi.org/10.46234/ccdcw2021.255 doi: 10.46234/ccdcw2021.255
    [37] L. Clark, T. Green, C. Petit, Structure of nonstructural protein 1 from SARS-CoV-2, J. Virol., 95 (2021), 4. https://doi.org/10.1128/jvi.02019-20 doi: 10.1128/jvi.02019-20
    [38] V. Srinivasan, H. Brognaro, P. Prabhu, E. Souza, S. Günther, P. Reinke, et al., Antiviral activity of natural phenolic compounds in complex at an allosteric site of SARS-CoV-2 papain-like protease, Commun. Biol., 5 (2022), 805. https://doi.org/10.1038/s42003-022-03737-7 doi: 10.1038/s42003-022-03737-7
    [39] A. Ebrahim, B. Riley, D. Kumaran, B. Andi, M. Fuchs, S. McSweeney, et al., The temperature-dependent conformational ensemble of SARS-CoV-2 main protease (Mpro), IUCrJ, 9 (2022), 682–694. https://doi.org/10.1107/S2052252522007497 doi: 10.1107/S2052252522007497
    [40] M. Biswal, S. Diggs, D. Xu, N. Khudaverdyan, J. Lu, J. Fang, et al., Two conserved oligomer interfaces of NSP7 and NSP8 underpin the dynamic assembly of SARS-CoV-2 RdRP, Nucleic Acids Res., 49 (2021), 5956–5966. https://doi.org/10.1093/nar/gkab370 doi: 10.1093/nar/gkab370
    [41] C. Zhang, Y. Chen, L. Li, Y. Yang, J. He, C. Chen, et al., Structural basis for the multimerization of nonstructural protein NSP9 from SARS-CoV-2, Mol. Biomed., 1 (2020), 5. https://doi.org/10.1186/s43556-020-00005-0 doi: 10.1186/s43556-020-00005-0
    [42] J. Chen, Q. Wang, B. Malone, E. Llewellyn, Y. Pechersky, K. Maruthi, et al., Ensemble cryo-EM reveals conformational states of the NSP13 helicase in the SARS-CoV-2 helicase replication–transcription complex, Nat. Struct. Mol. Biol., 29 (2022), 250–260. https://doi.org/10.1038/s41594-022-00734-6 doi: 10.1038/s41594-022-00734-6
    [43] Y. Kim, R. Jedrzejczak, N. Maltseva, M. Wilamowski, M. Endres, A. Godzik, et al., Crystal structure of NSP15 endoribonuclease NendoU from SARS-CoV-2, Protein Sci., 29 (2020), 1596–1605. https://doi.org/10.1002/pro.3873 doi: 10.1002/pro.3873
    [44] K. Papineni, S. Roukos, T. Ward, W. Zhu, Bleu: a method for automatic evaluation of machine translation, in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, (2002), 311–318. https://doi.org/10.3115/1073083.1073135
  • mbe-21-05-264-Supplementary.zip
  • Reader Comments
  • © 2024 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(377) PDF downloads(50) Cited by(0)

Article outline

Figures and Tables

Figures(6)  /  Tables(4)

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog