DNA-protein binding is crucial for the normal development and function of organisms. Accurately identifying DNA-protein binding sites matters because it supports disease prevention and the development of new approaches to disease treatment. In this study, we introduce an accurate and robust identifier of DNA-protein binding residues. To represent each protein, we combine its evolutionary information, encoded as a position-specific scoring matrix (PSSM), with the spatial information carried by its secondary structure, which enriches the overall feature set. The method first uses a Bi-directional Long Short-Term Memory (BiLSTM) network together with a Transformer encoder to jointly extract the interdependencies among residues in the protein sequence. Convolutional operations are then applied to the resulting feature matrix to capture local features of the residues. Experimental results on the benchmark dataset show that our method is competitive with contemporary classifiers. Specifically, on the PDNA-41 dataset our method achieved a Matthews correlation coefficient (MCC) of 0.349, a specificity (SP) of 96.50%, a sensitivity (SN) of 44.03% and an accuracy (ACC) of 94.59%.
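The four reported measures are all derived from the same per-residue confusion matrix. As a minimal sketch of the standard definitions (not the authors' evaluation code), the snippet below computes them from counts of true/false positives and negatives. The function name `binding_site_metrics` and the example counts are hypothetical and chosen only for illustration; they are not results from the paper.

```python
import math


def _ratio(numerator, denominator):
    """Divide safely, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def binding_site_metrics(tp, fp, tn, fn):
    """Return SN, SP, ACC and MCC for a binary residue classifier.

    tp/fn count binding residues predicted correctly/incorrectly;
    tn/fp count non-binding residues predicted correctly/incorrectly.
    """
    sn = _ratio(tp, tp + fn)                     # sensitivity (recall on binding residues)
    sp = _ratio(tn, tn + fp)                     # specificity (recall on non-binding residues)
    acc = _ratio(tp + tn, tp + fp + tn + fn)     # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = _ratio(tp * tn - fp * fn, denom)       # Matthews correlation coefficient
    return {"SN": sn, "SP": sp, "ACC": acc, "MCC": mcc}


# Hypothetical, imbalanced example: 100 binding and 900 non-binding residues.
metrics = binding_site_metrics(tp=40, fp=30, tn=870, fn=60)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because binding residues are typically a small minority, ACC and SP can be high even when SN is modest, which is why MCC is usually read alongside them.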
Citation: Haipeng Zhao, Baozhong Zhu, Tengsheng Jiang, Zhiming Cui, Hongjie Wu. Identification of DNA-protein binding residues through integration of Transformer encoder and Bi-directional Long Short-Term Memory[J]. Mathematical Biosciences and Engineering, 2024, 21(1): 170-185. doi: 10.3934/mbe.2024008