Ribonucleic acid (RNA) structure alignment is an important problem in computational biology to identify structural similarity of RNAs. Obtaining an efficient method for this problem is challenging due to the high computational time for the optimal solution and the low accuracy of a heuristic solution. In this paper, an efficient algorithm is proposed based on a mathematical model called longest arc-preserving common subsequence. The proposed algorithm uses a heuristic technique and high-performance computing to optimize the solution of RNA structure alignment, both in terms of the running time and the accuracy of the output. Extensive experimental studies on a multicore system are conducted to show the effectiveness of the proposed algorithm on two types of data. The first is simulated data that consists of 450 comparisons of RNA structures, while the second is real biological data that consists of 357 comparisons of RNA structures. The results show that the proposed algorithm outperforms the best-known heuristic algorithm in terms of execution time, with a percentage improvement of 71% and increasing the length of the output, i.e., accuracy, by approximately 45% in all studied cases. Finally, future approaches are discussed.
Citation: Hazem M. Bahig, Mohamed A.G. Hazber, Tarek G. Kenawy. Optimized RNA structure alignment algorithm based on longest arc-preserving common subsequence[J]. AIMS Mathematics, 2024, 9(5): 11212-11227. doi: 10.3934/math.2024550
Ribonucleic acid (RNA) structure alignment is an important problem in computational biology to identify structural similarity of RNAs. Obtaining an efficient method for this problem is challenging due to the high computational time for the optimal solution and the low accuracy of a heuristic solution. In this paper, an efficient algorithm is proposed based on a mathematical model called longest arc-preserving common subsequence. The proposed algorithm uses a heuristic technique and high-performance computing to optimize the solution of RNA structure alignment, both in terms of the running time and the accuracy of the output. Extensive experimental studies on a multicore system are conducted to show the effectiveness of the proposed algorithm on two types of data. The first is simulated data that consists of 450 comparisons of RNA structures, while the second is real biological data that consists of 357 comparisons of RNA structures. The results show that the proposed algorithm outperforms the best-known heuristic algorithm in terms of execution time, with a percentage improvement of 71% and increasing the length of the output, i.e., accuracy, by approximately 45% in all studied cases. Finally, future approaches are discussed.
[1] | D. Jereva, P. Alov, I. Tsakovska, M. Angelova, V. Atanassova, P. Vassilev, et al., Application of intercriteria analysis to assess the performance of scoring functions in molecular docking software packages, Mathematics, 10 (2022), 2549. https://doi.org/10.3390/math10152549 doi: 10.3390/math10152549 |
[2] | M. M. Abbas, M. Abouelhoda, H. M. Bahig, A hybrid method for the exact planted (l, d) motif finding problem and its parallelization, BMC Bioinformatics, 13 (2012), S10. https://doi.org/10.1186/1471-2105-13-S17-S10 doi: 10.1186/1471-2105-13-S17-S10 |
[3] | M. M. Abbass, H. M. Bahig, An efficient algorithm to identify DNA motifs, Math. Comput. Sci., 7 (2013), 387–399. https://doi.org/10.1007/s11786-013-0165-6 doi: 10.1007/s11786-013-0165-6 |
[4] | T. G. Kenawy, M. H. Abdel-Rahman, H. M. Bahig, A fast longest crossing-plain preserving common subsequence algorithm, Int. J. Inf. Technol., 14 (2022), 3019–3029. https://doi.org/10.1007/s41870-022-01038-0 doi: 10.1007/s41870-022-01038-0 |
[5] | M. M. Abbas, H. M. Bahig, M. Abouelhoda, M. M. Mohie-Eldin, Parallelizing exact motif finding algorithms on multi-core, J. Supercomput., 69 (2014), 814–826. https://doi.org/10.1007/s11227-014-1180-3 doi: 10.1007/s11227-014-1180-3 |
[6] | C. Blum, M. J. Blesa, Hybrid techniques based on solving reduced problem instances for a longest common subsequence problem, Appl. Soft Comput., 62 (2018), 15–28. https://doi.org/10.1016/j.asoc.2017.10.005 doi: 10.1016/j.asoc.2017.10.005 |
[7] | M. S. Islam, M. R. Islam, A hybrid framework based on genetic algorithm and simulated annealing for RNA structure prediction with pseudoknots, J. King Saud Univ. Comput. Inform. Sci., 34 (2022), 912–922. https://doi.org/10.1016/j.jksuci.2020.03.005 doi: 10.1016/j.jksuci.2020.03.005 |
[8] | T. J. X. Li, C. M. Reidys, On the loop homology of a certain complex of RNA structures, Mathematics, 9 (2021), 1749. https://doi.org/10.3390/math9151749 doi: 10.3390/math9151749 |
[9] | J. Fallmann, S. Will, J. Engelhardt, B. Grüning, R. Backofen, P. F. Stadler, Recent advances in RNA folding, J. Biotechnol., 261 (2017), 97–104. https://doi.org/10.1016/j.jbiotec.2017.07.007 doi: 10.1016/j.jbiotec.2017.07.007 |
[10] | K. Zhang, D. Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput., 18 (1989), 1245–1262. https://doi.org/10.1137/0218082 doi: 10.1137/0218082 |
[11] | M. Quadrini, L. Tesei, E. Merelli, An algebraic language for RNA pseudoknots comparison, BMC Bioinformatics, 20 (2019), 16. https://doi.org/10.1186/s12859-019-2689-5. doi: 10.1186/s12859-019-2689-5 |
[12] | F. Wang, T. Akutsu, T. Mori, Comparison of pseudoknotted RNA secondary structures by topological centroid identification and tree edit distance. J. Comput. Biol., 27 (2020), 1443–1451. https://doi.org/10.1089/cmb.2019.0512 doi: 10.1089/cmb.2019.0512 |
[13] | P. A. Evans, Algorithms and complexity for annotated sequence analysis, Ph. D Thesis, Canada: University of Victoria, 1999. |
[14] | L. Yang, Y. Liu, X. Hu, P. Wang, X. Li, J. Wu, Graph-based analysis of RNA secondary structure similarity comparison, Complexity, 2021 (2021), 8841822. https://doi.org/10.1155/2021/8841822 doi: 10.1155/2021/8841822 |
[15] | J. Guo, Exact algorithms for the longest common subsequence problem for arc annotated sequences, Master's Thesis, Universitat Tubingen, 2002 |
[16] | G. Lin, Z. Z. Chen, T. Jiang, J. Wen, The longest common subsequence problem for sequences with nested arc annotations, J. Comput. Syst. Sci., 65 (2002), 465–480. https://doi.org/10.1016/S0022-0000(02)00004-1 doi: 10.1016/S0022-0000(02)00004-1 |
[17] | T. Jiang, G. Lin, B. Ma, K. Zhang, The longest common subsequence problem for arc-annotated sequences, J. Discrete Algorithms, 2 (2004), 257–270. https://doi.org/10.1016/S1570-8667(03)00080-7 doi: 10.1016/S1570-8667(03)00080-7 |
[18] | T. F. Smith, M. S. Waterman, Identification of common molecular subsequences, J. Mol. Biol., 147 (1981), 195–197. https://doi.org/10.1016/0022-2836(81)90087-5 doi: 10.1016/0022-2836(81)90087-5 |
[19] | C. Blum, M. Djukanovic, A. Santini, H. Jiang, C. M. Li, F. Manyà, et al., Solving longest common subsequence problems via a transformation to the maximum clique problem, Comput. Oper. Res., 125 (2021), 105089. https://doi.org/10.1016/j.cor.2020.105089 doi: 10.1016/j.cor.2020.105089 |
[20] | J. Gramm, J. Guo, R. Niedermeier, Pattern matching for arc-annotated sequences, In: Foundations of software technology and theoretical computer science, Berlin, Heidelberg: Springer, 2002. https://doi.org/10.1007/3-540-36206-1_17 |
[21] | IBM, CPLEX Optimization Studio V12.8.0, Available from: https://www.ibm.com/support/pages/cplex-optimization-studio-v128. |
[22] | G. Blelloch. Prefix sums and their applications. In: Synthesis of parallel algorithms, 1990. Available from: http://shelf2.library.cmu.edu/Tech/23445461. |
[23] | H. Bahig, K. A. Fathy, An improved parallel prefix sums algorithm, Parallel Processing Lett., 32 (2022), 2250008. https://doi.org/10.1142/S0129626422500086 doi: 10.1142/S0129626422500086 |
[24] | R. Shikder, P. Thulasiraman, P. Irani, P. Hu, An OpenMP-based tool for finding longest common subsequence in bioinformatics, BMC Res. Notes, 12 (2019), 220. https://doi.org/10.1186/s13104-019-4256-6 doi: 10.1186/s13104-019-4256-6 |
[25] | M. Crochemore, C. S. Iliopoulos, Y. J. Pinzon, J. F. Reid, A fast and practical bit-vector algorithm for the longest common subsequence problem, Inform. Processing Lett., 80 (2001), 279–285. https://doi.org/10.1016/S0020-0190(01)00182-X doi: 10.1016/S0020-0190(01)00182-X |
[26] | M. Andronescu, V. Bereg, H. H. Hoos, A. Condon, RNA STRAND: The RNA secondary structure and statistical analysis database, BMC Bioinformatics, 9 (2008), 340. https://doi.org/10.1186/1471-2105-9-340 doi: 10.1186/1471-2105-9-340 |
[27] | CRW2: Comparative RNA Web-2. Available from: https://crw2-comparative-rna-web.org/. |
[28] | R. F. Woolson, Wilcoxon signed‐rank test, Wiley encyclopedia of clinical trials, 2008. https://doi.org/10.1002/9780471462422.eoct979 |
math-09-05-550-supplementary.xlsx |