Supertree methods are tree reconstruction techniques that combine several smaller gene trees (possibly on different sets of species) to build a larger species tree. The question of interest is whether the reconstructed supertree converges to the true species tree as the number of gene trees increases (that is, the consistency of supertree methods). In this paper, we are particularly interested in the convergence rate of the maximum likelihood supertree. Previous studies on the maximum likelihood supertree approach often formulate the question of interest as a discrete problem and focus on reconstructing the correct topology of the species tree. Aiming to reconstruct both the topology and the branch lengths of the species tree, we propose an analytic approach for analyzing the convergence of the maximum likelihood supertree method. Specifically, we consider each tree as one point of a metric space and prove that the distance between the maximum likelihood supertree and the species tree converges to zero at a polynomial rate under some mild conditions. We further verify these conditions for the popular exponential error model of gene trees.
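For readers who want the setting in symbols, the following is a schematic sketch rather than the paper's exact statement. The notation ($T^{*}$ for the true species tree, $T_i$ for the $i$-th gene tree on taxon subset $X_i$, $T|_{X_i}$ for a tree restricted to $X_i$, $d$ for the tree metric, and the constants $\alpha_i$, $\beta_i$, $C$, $\gamma$) is assumed here for illustration. The form of the exponential error model follows the way it is commonly written in the maximum likelihood supertree literature. The precise conditions and rate are those given in the paper itself.

```latex
% Exponential error model (assumed form): the probability of observing
% gene tree T_i decays exponentially with its distance to the species
% tree T restricted to the taxa X_i of that gene tree.
\[
  \mathbb{P}(T_i \mid T) \;=\; \alpha_i \exp\!\bigl(-\beta_i\, d(T_i,\, T|_{X_i})\bigr),
  \qquad \beta_i > 0 .
\]
% Assumption: if the normalizing constant \alpha_i is treated as not
% depending on T, maximizing the likelihood over n independent gene trees
% amounts to minimizing a weighted sum of distances.
\[
  \widehat{T}_n \;\in\; \arg\min_{T} \sum_{i=1}^{n} \beta_i\, d(T_i,\, T|_{X_i}) .
\]
% Schematic reading of "converges to zero at a polynomial rate":
% for some constants C > 0 and \gamma > 0,
\[
  d(\widehat{T}_n,\, T^{*}) \;\le\; C\, n^{-\gamma}
  \quad \text{with high probability as } n \to \infty .
\]
```

Because $d$ is a metric on trees with branch lengths, a bound of this kind controls both the topology and the branch lengths of the estimate, which is the distinction the abstract draws from purely topological consistency results.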
Citation: Vu Dinh, Lam Si Tung Ho. Convergence of maximum likelihood supertree reconstruction. AIMS Mathematics, 2021, 6(8): 8854-8867. doi: 10.3934/math.2021513