With the development of next-generation protein sequencing technologies, sequence assembly algorithms have become a key technology in the de novo sequencing process. Existing methods can assemble an unknown single protein chain. However, for monoclonal antibodies, which comprise light and heavy chains, assembly remains an unsolved problem. To address this, we propose a new assembly method, DBAS, which integrates the quality scores and sequence alignment scores of de novo sequenced peptides into a weighted de Bruijn graph to assemble the final protein sequences. The method is used to assemble sequences from two antibody datasets containing mixed light and heavy chains. The results show that DBAS can assemble long antibody sequences for both mixed light and heavy chains and single chains. In addition, DBAS is able to distinguish the light and heavy chains using BLAST sequence alignment. The algorithm performs well in both target sequence coverage and contig assembly accuracy.
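The general idea of a score-weighted de Bruijn graph can be illustrated with a minimal sketch. This is not the DBAS implementation: the function names, the toy peptides and scores, the choice of k = 4, and the greedy traversal are all assumptions made for illustration, and the alignment-score weighting and BLAST-based chain separation described in the abstract are omitted. Each peptide contributes its confidence score to every k-mer edge it contains, and a contig is extended by following the heaviest unused edge.

```python
from collections import defaultdict


def build_weighted_graph(peptides, k):
    """Build a weighted de Bruijn graph from (sequence, score) pairs.

    Nodes are (k-1)-mers. Each k-mer adds an edge prefix -> suffix whose
    weight accumulates the score of every peptide that contains it.
    """
    graph = defaultdict(lambda: defaultdict(float))
    for seq, score in peptides:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]][kmer[1:]] += score
    return graph


def greedy_contig(graph, start):
    """Extend a contig from `start`, always taking the heaviest unused edge."""
    contig, node, used = start, start, set()
    while node in graph:
        candidates = [
            (weight, nxt)
            for nxt, weight in graph[node].items()
            if (node, nxt) not in used
        ]
        if not candidates:
            break
        _, nxt = max(candidates)
        used.add((node, nxt))  # prevents endless loops on cyclic graphs
        contig += nxt[-1]
        node = nxt
    return contig


# Hypothetical de novo peptides with confidence scores (illustrative only).
peptides = [
    ("DIQMTQ", 0.90),
    ("QMTQSP", 0.80),
    ("TQSPSS", 0.95),
    ("QMTQAP", 0.20),  # low-confidence read that creates a branch at "MTQ"
]

graph = build_weighted_graph(peptides, k=4)
contig = greedy_contig(graph, "DIQ")
print(contig)
```

At the branch after `MTQ`, the edge supported by the higher-scoring peptide wins, so the low-confidence variant ending in `AP` is not followed. A practical assembler would also need seed selection, error correction, and handling of repeats, none of which this sketch attempts.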
Citation: Yi Lu, Cheng Ge, Biao Cai, Qing Xu, Ren Kong, Shan Chang. Antibody sequences assembly method based on weighted de Bruijn graph[J]. Mathematical Biosciences and Engineering, 2023, 20(4): 6174-6190. doi: 10.3934/mbe.2023266