Research article Special Issues

Minimal Phylogenetic Supertrees and Local Consensus Trees

  • The problem of constructing a minimally resolved phylogenetic supertree (i.e., a rootedtree having the smallest possible number of internal nodes) that contains all of the rooted triplets froma consistent set R is known to be NP-hard. In this article, we prove that constructing a phylogenetictree consistent with R that contains the minimum number of additional rooted triplets is also NP-hard,and develop exact, exponential-time algorithms for both problems. The new algorithms are applied toconstruct two variants of the local consensus tree; for any set S of phylogenetic trees over some leaflabel set L, this gives a minimal phylogenetic tree over L that contains every rooted triplet present in alltrees in S, where “minimal” means either having the smallest possible number of internal nodes or thesmallest possible number of rooted triplets. (The second variant generalizes the RV-II tree, introducedby Kannan et al. in 1998.) We also measure the running times and memory usage in practice of thenew algorithms for various inputs. Finally, we use our implementations to experimentally investigatethe non-optimality of Aho et al.’s well-known BUILD algorithm from 1981 when applied to the localconsensus tree problems considered here.

    Citation: Jesper Jansson, Ramesh Rajaby, Wing-Kin Sung. Minimal Phylogenetic Supertrees and Local Consensus Trees[J]. AIMS Medical Science, 2018, 5(2): 181-203. doi: 10.3934/medsci.2018.2.181

    Related Papers:

    [1] Gabriele Fici, Alessio Langiu, Giosuè Lo Bosco, Riccardo Rizzo . Bacteria classification using minimal absent words. AIMS Medical Science, 2018, 5(1): 23-32. doi: 10.3934/medsci.2018.1.23
    [2] Fotios Kounelis, Christos Makris . Space Efficient Data Structures for N-gram Retrieval. AIMS Medical Science, 2017, 4(4): 426-440. doi: 10.3934/medsci.2017.4.426
    [3] Herbert F. Jelinek, Jemal H. Abawajy, Andrei V. Kelarev, Morshed U. Chowdhury, Andrew Stranieri . Decision trees and multi-level ensemble classifiers for neurological diagnostics. AIMS Medical Science, 2014, 1(1): 1-12. doi: 10.3934/medsci.2014.1.1
    [4] Magaisha Edward Kyomo, Nelson Mpumi, Elingarami Sauli, Salum J Lidenge . Efficiency of honey–grape blend in reducing radiation-induced mucositis in locally advanced head and neck squamous cell carcinoma. AIMS Medical Science, 2025, 12(1): 90-104. doi: 10.3934/medsci.2025007
    [5] Segun Akinola, Reddy Leelakrishna, Vijayakumar Varadarajan . Enhancing cardiovascular disease prediction: A hybrid machine learning approach integrating oversampling and adaptive boosting techniques. AIMS Medical Science, 2024, 11(2): 58-71. doi: 10.3934/medsci.2024005
    [6] Claudia Chaufan, Laurie Manwell, Camila Heredia, Jennifer McDonald . COVID-19 vaccines and autoimmune disorders: A scoping review protocol. AIMS Medical Science, 2023, 10(4): 318-328. doi: 10.3934/medsci.2023025
    [7] Segun Akinola . Advancing healthcare with AI: designing frameworks for diagnostics, personalized treatment, and enhanced efficiency. AIMS Medical Science, 2024, 11(3): 248-264. doi: 10.3934/medsci.2024019
    [8] Benjamin P Jones, Srdjan Saso, Timothy Bracewell-Milnes, Jen Barcroft, Jane Borley, Teodor Goroszeniuk, Kostas Lathouras, Joseph Yazbek, J Richard Smith . Laparoscopic uterosacral nerve block: A fertility preserving option in chronic pelvic pain. AIMS Medical Science, 2019, 6(4): 260-267. doi: 10.3934/medsci.2019.4.260
    [9] Tamás Micsik, Göran Elmberger, Anders Mikael Bergquist, László Fónyad . Experiences with an International Digital Slide Based Telepathology System for Routine Sign-out between Sweden and Hungary. AIMS Medical Science, 2015, 2(2): 79-89. doi: 10.3934/medsci.2015.2.79
    [10] Elżbieta Marczak, Maria Szarras-Czapnik, Małgorzata Wójcik, Agata Zygmunt-Górska, Jerzy Starzyk, Karolina Czyżowska, Anna Szymańska, Katarzyna Gołąb-Jenerał, Agnieszka Zachurzok, Elżbieta Moszczyńska . Hypopituitarism—A rare manifestation in Joubert syndrome: about 4 cases. AIMS Medical Science, 2024, 11(3): 318-329. doi: 10.3934/medsci.2024022
  • The problem of constructing a minimally resolved phylogenetic supertree (i.e., a rootedtree having the smallest possible number of internal nodes) that contains all of the rooted triplets froma consistent set R is known to be NP-hard. In this article, we prove that constructing a phylogenetictree consistent with R that contains the minimum number of additional rooted triplets is also NP-hard,and develop exact, exponential-time algorithms for both problems. The new algorithms are applied toconstruct two variants of the local consensus tree; for any set S of phylogenetic trees over some leaflabel set L, this gives a minimal phylogenetic tree over L that contains every rooted triplet present in alltrees in S, where “minimal” means either having the smallest possible number of internal nodes or thesmallest possible number of rooted triplets. (The second variant generalizes the RV-II tree, introducedby Kannan et al. in 1998.) We also measure the running times and memory usage in practice of thenew algorithms for various inputs. Finally, we use our implementations to experimentally investigatethe non-optimality of Aho et al.’s well-known BUILD algorithm from 1981 when applied to the localconsensus tree problems considered here.



    1. Introduction

    Phylogenetic trees are used to describe evolutionary relationships between species [12]. Numerous methods for reconstructing and comparing phylogenetic trees have been developed, fine-tuned to different applications and different types of input data [12,30]. The supertree approach is a relatively new divide-and-conquer-based technique for reconstructing phylogenetic trees that may be useful when dealing with very large datasets [5]. The general idea behind it is to first infer a set of highly accurate trees for overlapping subsets of the species (e.g., using a computationally expensive method such as maximum likelihood [10,12]) and then combine all the trees into one tree according to some well-defined rule. An example of a famous phylogenetic supertree for more than 4500 species can be found in [6]; see also [5,16] for references to many other supertrees in the biological literature. One class of supertree methods consists of the BUILD algorithm [2] and its various extensions [11,14,15,19,25,26,27,28,31] for combining a set of rooted triplets (binary phylogenetic trees with three leaves each), e.g., inferred by the method in [10].

    A consensus tree [1,8,22] can be regarded as the special case of a phylogenetic supertree where all the trees that are to be combined have the same leaf label set. Such inputs arise when a collection of alternative datasets, each covering all the species, is available, or when applying bootstrapping or different tree reconstruction algorithms to the same basic dataset [12]. A consensus tree can also measure the similarity between two identically leaf-labeled trees or identify parts of trees that are similar. Many different types of consensus trees, whose formal definitions of how to handle conflicts differ, have been proposed in the last 45 years. See the surveys in [8], Chapter 30 in [12], and Chapter 8.4 in [30] for more details about different consensus trees and their advantages and disadvantages, and [20,21] for some recent algorithmic results.

    In situations where more than one phylogenetic tree can explain some given experimental data equally well, it is natural to select a “minimal” tree that supports the data while making as few extra statements about the evolutionary history as possible. A minimally resolved phylogenetic supertree [18] is a supertree that is consistent with all of the input and that has the minimum number of internal nodes. By minimizing the number of internal nodes, the risk of creating false groupings called “spurious novel clades” [5] is reduced. Furthermore, such a tree gives a simpler overview of the data than a tree with many internal nodes and can in general be stored in less memory. This makes it easier for scientists to exchange information. Another way to define “minimal” above, giving what we call a minimally rooted-triplet-inducing phylogenetic supertree, instead requires that the supertree contains the minimum number of rooted triplets. This interpretation of minimal was previously considered in the definition of the RV-II local consensus tree in [22].

    There are a few misunderstandings about minimal phylogenetic supertrees in the literature. The goal of this article is to correct these issues, to further develop the underlying mathematical framework, and to design new supertree algorithms that can also be applied to the construction of consensus trees. The algorithms presented here have been implemented and are publicly available.


    1.1. Problem Definitions

    A rooted phylogenetic tree is a rooted, unordered, leaf-labeled tree in which all leaf labels are different and every internal node has at least two children. For example, T1 and T2 in figure 1 are two rooted phylogenetic trees. In this article, rooted phylogenetic trees are referred to as “trees” and every leaf in a tree is identified with its unique label.

    Let T be a tree. The set of all nodes in T, the set of internal nodes in T, and the set of leaves in T are denoted by V(T), μ(T), and Λ(T), respectively. For any u,vV(T), if u is a descendant of v and uv then we write uv. The lowest common ancestor of u and v, or lca(u,v) for short, is the node w such that both u and v are descendants of w and wx holds for every other node x which is an ancestor of both u and v.

    A rooted triplet is a binary tree with exactly three leaves. We use the notation xy|z to refer to the rooted triplet with leaf label set {x, y, z} such that lca(x, y) ≺ lca(x, z) = lca(y, z). Let T be a tree. For any x, y, z∈Λ(T), if lca(x, y) ≺ lca(x, z) = lca(y, z) holds in T then the rooted triplet xy|z and T are said to be consistent with each other. For example, ab|c is consistent with T1 but not with T2 in figure 1. Observe that for any {x, y, z} ⊆ Λ(T), exactly zero or one of the three rooted triplets xy|z, xz|y, and yz|x is consistent with T. The set of all rooted triplets that are consistent with T is denoted by r(T). For any set of rooted triplets, if r(T) then and T are consistent with each other. Finally, a set of rooted triplets is consistent if there exists a tree that is consistent with .

    Figure 1. An example. Let 𝒮 = {T1, T2} as above with Λ(T1) = Λ(T2) = {a, b, c, d, e, f, g}. Then r(T1) ∩ r(T2) = {ab|e,ab|f,ab|g,cd|e,cd|f,cd|g,ef|a,ef|b,ef|c,ef|d,ef|g} and T2 is an optimal solution to MinRLC. On the other hand, |r(T1)|=15 while |r(T2)|=23, so T2 cannot be an optimal solution to MinILC.

    Next, we give the definitions of the minimally resolved phylogenetic supertree consistent with rooted triplets problem (MinRS) (studied in [18]) and the minimally rooted-triplet-inducing phylogenetic supertree consistent with rooted triplets problem (MinIS). In both problems, the input is a consistent set of rooted triplets*, and the output is a tree T satisfying Λ(T)=tΛ(t) and r(T). The objectives are to minimize the value of |μ(T)| (for MinRS) and to minimize the value of |r(T)| (for MinIS), respectively.

    In the minimally resolved local consensus tree problem (MinRLC) and the minimally rooted-triplet-inducing local consensus tree problem (MinILC) (introduced in [22] for the special case k = 2), the input is a set 𝒮 = {T1, T2, …, Tk} of trees with Λ(T1) = Λ(T2) = … = Λ(Tk) = L for some leaf label set L, and the output is a tree T satisfying Λ(T) = L and i=1kr(Ti)r(T). The objectives in MinRLC and MinILC are, respectively, to minimize the value of |μ(T)| and to minimize the value of |r(T)|.

    Note that MinRLC and MinILC admit polynomial-time reductions to MinRS and MinIS, respectively, by setting =i=1kr(Ti).

    See figure 1 for a simple example showing that MinRLC and MinILC are indeed different problems, and consequently, that MinRS is different from MinIS. From here on, we will use Newick notation to describe trees compactly. E.g., in figure 1, we have T1 = ((a, b), (c, d), (e, f), g); and T2 = ((a, b, c, d), (e, f), g). (For details about Newick notation, the reader is referred to http://evolution.genetics.washington.edu/phylip/newicktree.html.)

    Throughout the article, the size of the input to MinRS/MinIS is expressed in terms of k = || and n = |L|, where L=tΛ(t). For MinRLC/MinILC, k = |𝒮| and n = |L|, where L = Λ(T1) = Λ(T2) = … = Λ(Tk).


    1.2. Previous Work

    Here, we give an overview of some relevant results from the literature.

    BUILD: Aho et al. [2] presented a polynomial-time algorithm called BUILD for determining if an input set of rooted triplets is consistent, and if so, constructing a tree T with Λ(T)=tΛ(t) and r(T). (When the input is not consistent, one can for example look for a tree T that maximizes |r(T) ∩ |; cf. Section 2 in [9] for a survey on this problem variant.) BUILD is summarized in Section 2.1 below. Henzinger et al. [16] gave a faster implementation of BUILD, and substituting the data structure for dynamic graph connectivity used in the proof of Theorem 1 in [16] by the one in [32] yields a time complexity of min{O(n+klog2nloglogn),O(k+n2logn)}, where k = || and n=|tΛ(t)|.

    Importantly, BUILD does not solve MinRS and MinIS. This was first observed by Bryant [[7], Section 2.5.2], who gave the following counterexample: = {bc|a,bd|a,ef|a,eg|a}. Given as input, BUILD constructs the tree TB = (a,(b,c,d),(e,f,g)), which has three internal nodes and 24 rooted triplets. In contrast, the optimal solution to both MinRS and MinIS is the tree TO = (a,(b,c,d,e,f,g)), which has two internal nodes and 15 rooted triplets. As pointed out in [18], the claim by Henzinger et al. in [16] that their Algorithm A' always constructs a minimal tree is therefore false. In another highly cited article, Ng and Wormald [25] presented an extension of BUILD named OneTree to so-called fans; however, Note 2 in Section 4 of [25] incorrectly states that OneTree outputs a tree with the minimum number of nodes. Finally, the authors of the textbook [17] seem to have been unaware of Bryant's example, as p. 302 of [17] says it is not known if the tree output by BUILD always contains the minimum number of rooted triplets.

    MinRS: For MinRS, the following strong negative result is known: MinRS cannot be approximated within n1−ε for any constant ε > 0 in polynomial time, unless P = NP [18]. An algorithm named AllMinTrees in [26] outputs all minor-minimal trees consistent with , where a tree T is minor-minimal if it is consistent with and it is not possible to obtain a tree consistent with by contracting any edges of T, and this algorithm can be used to solve MinRS. However, it runs in ((n2)n/2) time [18], which is self-exponential in n/2. Some special cases of MinRS can be solved in polynomial time; e.g., if the output tree has at most three internal nodes or if it is a caterpillar (a tree in which every node has at most one child that is an internal node) [18]. Also, for any positive integer p, if every node in the output tree has at most p children which are internal nodes then MinRS can be solved in pO(n) time [18].

    MinIS: To determine the computational complexity of MinIS was listed as an open problem in Section 6 in [18]. As far as we know, it has not been studied previously.

    MinRLC, MinILC, and local consensus trees: Finding a tree T that satisfies i=1kr(Ti)r(T) is trivial since one can just select T = T1, but imposing additional conditions makes the output more informative and meaningful. MinRLC and MinILC provide two natural ways to do it. The MinILC problem originates from Kannan et al. [22], who gave several alternative definitions of a “local consensus tree”. They called a tree T an RV-II (“relaxed version II”) tree of two trees T1 and T2 with identical leaf label sets if r(T1) ∩ r(T2) ⊆ r(T) and |r(T)| is minimized. Thus, an RV-II tree is a solution to MinILC when k = 2. In Section 5.4 of [22], the authors suggested that applying BUILD to the set r(T1) ∩ r(T2) always produces an RV-II tree, but this is not correct. A counterexample, analogous to the one for MinRS and MinIS above, is obtained by letting T1 and T2 be the two trees TB and TO, which gives r(T1) ∩ r(T2) = {bc|a, bd|a, cd|a, ef|a, eg|a, fg|a}. Then, BUILD's output is TB but this cannot be a solution to MinRLC or MinILC because TO has fewer internal nodes and fewer rooted triplets than TB. This shows that one cannot solve MinRLC/MinILC by taking =i=1kr(Ti) and applying BUILD directly.

    In the consensus tree survey by Bryant [8], “the local consensus tree” is defined as the output of BUILD when given i=1kr(Ti) as input. The algorithm in Section 5.4.1 of [22] constructs such a tree in O(n2) time for the case k = 2, while the O(kn3)-time algorithm in Theorem 7 in [16] by Henzinger et al. can be used for unbounded k. The advantages of Bryant's BUILD-based local consensus tree are that it is unique and can be computed efficiently. The disadvantages are that it does not minimize the number of nodes or induced rooted triplets and that it is defined in terms of the output of an algorithm and not axiomatically.


    1.3. New Results and Organization of the Article

    Section 2 reviews the BUILD algorithm from [2] and an enlightening result by Semple in [26] that characterizes all trees consistent with the input . Based on Semple's characterization, Section 3 gives an O*((1+3)n)=O(2.733n)-time algorithm for MinRS and an O(kn3 + 2.733n)-time algorithm for MinRLC. Section 4 then describes an O*(4n)-time algorithm for MinIS and an O(kn3 + 4n · poly(n))-time algorithm for MinILC. All four problems are NP-hard; MinRS was shown to be NP-hard in [18], and we complement this result by establishing the NP-hardness of MinRLC in Section 5 and the NP-hardness of MinIS and MinILC in Section 6. See Table 1 for a summary of the theoretical results.

    Table 1. Overview of the computational complexity of the four studied problems.
    Problem Positive result Negative result
    MinRS O(2.733n) time NP-hard
    (Theorem 1, Section 3) (Theorem 3.3 in [18])
    MinRLC O(kn3 + 2.733n) time NP-hard
    (Corollary 1, Section 3) (Theorem 3, Section 5)
    MinIS O*(4n) time NP-hard
    (Theorem 2, Section 4) (Corollary 4, Section 6)
    MinILC O(kn3 + 4n · poly(n)) time NP-hard
    (Corollary 2, Section 3) (Theorem 4, Section 6)
     | Show Table
    DownLoad: CSV

    Next, Section 7 presents a publicly available implementation of our algorithms and experimental results demonstrating how the running times and memory usage increase as the inputs grow larger. Equipped with these implementations, in Section 8 we then investigate a fundamental question: Is Bryant's BUILD-based local consensus tree (see Section 1.2), which can be computed in polynomial time, a good approximation to MinRLC and MinILC? Our experiments indicate that although the former is not optimal most of the time even for small randomly generated inputs (e.g., for k = 4 and n = 12), the ratio between the number of internal nodes or the number of rooted triplets in the BUILD-based local consensus tree and in an optimal solution is not far from 1 on average for small inputs.


    2. Preliminaries


    2.1. Aho et al.'s BUILD Algorithm [2]

    The BUILD algorithm [2] is a recursive, top-down algorithm that takes as input a set of rooted triplets and a leaf label set L such that tΛ(t)L and outputs a tree T with Λ(T) = L that is consistent with all of the rooted triplets in , if such a tree exists; otherwise, it outputs fail. The time complexity of BUILD is polynomial (q.v., Section 1.2).

    A summary of how BUILD works is given here. It first partitions the leaf label set L into blocks according to the information contained in . More precisely, BUILD constructs an auxiliary graph, defined as the undirected graph 𝒢(L) = (L, E) where for any x,yL, the edge {x, y} belongs to E if and only if contains at least one rooted triplet of the form xy|z with zL. It then computes the connected components in 𝒢(L) and assigns the leaf labels in each connected component to one block. (Henceforth, the set of leaf labels belonging to any connected component C in 𝒢(L) is denoted by Λ(C), and for every L′ ⊆ L, we define | L′ = {t : Λ(t) ⊆ L′}.) Next, for each block Λ(C), BUILD builds a tree TC by calling itself recursively using |Λ(C) together with Λ(C) as input. Finally, BUILD returns a tree consisting of a newly created root node whose children are the roots of all the recursively constructed TC-trees. The recursion's base case is when the leaf label set consists of one element x, in which case the algorithm just returns a tree with a single leaf labeled by x. If any auxiliary graph 𝒢(L′) constructed during BUILD's execution has only one connected component and |L′| > 1 holds then the algorithm terminates and outputs fail. See, e.g., [2] for the correctness proof and further details.

    Returning to the example in Section 1.2 where we had = {bc|a, bd|a, ef|a, eg|a} and L = {a, b, c, d, e, f, g}, the blocks in the auxiliary graph 𝒢(L) are {a}, {b, c, d}, and {e, f, g}. The auxiliary graphs on the successive recursive levels contain no edges, so BUILD outputs the tree (a, (b, c, d), (e, f, g)).


    2.2. Semple's Characterization

    In [26], Semple clarified the relationship between the auxiliary graph 𝒢(L) used in the BUILD algorithm and the trees consistent with . For any tree T, define π(T) as the partition of Λ(T) whose parts are the leaves in the different subtrees attached to the root of T. As an example, π(T1) = {{a, b}, {c, d}, {e, f}, {g}} in figure 1. With this notation, Semple's characterization can be expressed as:

    Lemma 1. (Corollary 3.3 in [26]) Let T be any tree that is consistent with ℜ. For each connected component C in 𝒢(L), Λ(C) ⊆ B for some Bπ(T).

    Lemma 1 implies that if T is any tree consistent with then the partition π(T) can be obtained by performing zero or more mergings of 𝒢(L)'s connected components. Thus, every tree consistent with can be recovered by trying all possible mergings of the connected components in 𝒢(L) at each recursion level.

    We remark that Lemma 1 is very useful. For example, it can be employed to prove the non-uniqueness of solutions to MinRLC and MinILC (and hence, MinRS and MinIS). To illustrate, consider the following instance: 𝒮 = {T1, T2, T3} with T1 = ((1,2,3,4,5,6),(7,8),(9,10),11), T2 = ((1,2,3,4,5,6,7,8),(9,10),11), and T3 = ((1,2,3,4,5,6,9,10),(7,8),11). The connected components in 𝒢(L), where =i=13r(Ti), consist of {1,2,3,4,5,6}, {7,8}, {9,10}, and {11}. By Lemma 1, we only need to check a few possible candidate trees (corresponding to the different ways of merging these connected components), and we find that each of T1, T2, and T3 is an optimal solution to MinILC since |r(T1)| = |r(T2)| = |r(T3)| = 93. Furthermore, each of T2 and T3 is an optimal solution to MinRLC.


    3. Exponential-Time Algorithms for MinRS and MinRLC

    This section presents an exact O(2.733n)-time algorithm for MinRS. As a consequence, MinRLC can be solved in O(kn3 + 2.733n) time.

    The main idea is to use Lemma 1 together with dynamic programming. For every L′ ⊆ L, let opt(L′) be the number of internal nodes in an optimal solution to MinRS for |L′. Clearly, if |L′| = 1 then opt(L′) = 0. To compute opt(L′) when |L′| ≥ 2, observe that if T′ is any optimal solution for |L′ then T′ consists of a root node whose children are the roots of the optimal solutions for |P1, |P2, …, |Pt, where {P1, P2, …, Pt} is equal to π(T′). The partition π(T′) can be found by enumerating the partitions of L′ and using dynamic programming to identify the best one; by Lemma 1, only partitions corresponding to the different ways of merging connected components in the auxiliary graph 𝒢(L′) need to be considered.

    The details are explained next. Suppose L′ ⊆ L is given. Let 𝒞L′ be the set of connected components in 𝒢(L′). For every subset 𝒟𝒞L′, define Merge(𝒟) as the set of all leaf labels belonging to components in 𝒟, i.e., Merge(D)=QDΛ(Q). Also define DP(𝒟) for every 𝒟𝒞L′ to be the minimum value of Σ𝒳𝒬opt(Merge(𝒳)) taken over all possible true partitions 𝒬 of 𝒟, where we say that a partition 𝒬 of a set X is a true partition of X if |X| ≥ 2 and 𝒬 ≠ {X} (i.e., if |𝒬| > 1), or if |X| = |𝒬| = 1. Then:

    Lemma 2. For every L′ ⊆ L with |L′| ≥ 2, it holds that opt(L′) = DP(𝒞L′) + 1.

    Proof. Let T′ be any optimal tree for |L′. The children of the root of T′ are the roots of the optimal solutions for |P1, |P2, …, |Pt, where each Pi equals Merge(𝒟) for some 𝒟𝒞L′ because of Lemma 1. By definition, DP(𝒞L′) is the minimum value of Σ𝒳𝒫opt(𝒳) over all true partitions 𝒫 of L′ such that each 𝒳𝒫 equals Merge(𝒟) for some 𝒟𝒞L′. Together with the common root node, this gives opt(L′) = DP(𝒞L′) + 1.

    Lemma 3. For every 𝒟 ⊆ 𝒞L′ with |𝒟| ≥ 2, DP(D)=minXD{opt(Merge(X))+min{DP(D\X), opt(Merge(D\X))}}.

    Proof. DP(𝒟) = min{Σ𝒳𝒬opt(Merge(𝒳)): 𝒬 is a true partition of 𝒟} = min{opt(Merge(𝒳)) + min{DP(𝒟\𝒳), opt(Merge(𝒟\𝒳))}: 𝒳𝒬, 𝒬 is a true partition of 𝒟} = min{opt(Merge(𝒳)) + min{DP(𝒟\𝒳), opt(Merge(𝒟\𝒳))} : ∅ ≠ 𝒳𝒟}.

    Lemmas 2 and 3 suggest the following strategy: Compute opt(L′) for all subsets L′ of L in order of increasing cardinality by evaluating the formula in Lemma 2, while using dynamic programming to compute and store the relevant DP-values. To do this, for each L′, we first construct 𝒢(L′) in polynomial time. We then enumerate all subsets 𝒟 of 𝒞L′ in a loop having |𝒞L′| iterations in which iteration j uses Lemma 3 to compute all DP(𝒟)-values where |𝒟|=j. Each application of Lemma 3 takes O*(2|𝒟|) time, so this takes a total of O*(j=2|CL|(|CL|j)2j)=O*((2+1)|CL|)=O*(3|CL|) time for each L′ by binomial expansion. To obtain opt(L), we iterate over all subsets L′ of L of cardinality i = 2, 3, …, n; iteration i computes opt(L′) for each L′ with |L′| = i in O*(3|CL|) time as just described. The total running time becomes O*(i=2n(ni)3i)=O*((3+1)n)=O*(4n).

    To reduce the time complexity, we will reduce the number of applications of Lemma 3 in the main loop that computes opt(L′) for any L′ ⊆ L. We rely on the following simple observation, which essentially tells us that the singleton components of 𝒢(L′) can be ignored.

    Lemma 4. Let 𝒰 be the set of singleton components in 𝒢(L′). DP(𝒞L′) = DP(𝒞L′\𝒰).

    Proof. Consider any x𝒰. By the construction of 𝒢(L′) and 𝒰, there are no rooted triplets of the form xy|z for any y,zL′ in the set |L′. Hence, there exists a minimally resolved supertree consistent with |L′ in which x is attached directly to the root. The lemma follows.

    The resulting algorithm, called MinRS_exact, is summarized in figure 2.

    Figure 2. Algorithm MinRS_exact.

    Theorem 1. Algorithm MinRS_exact solves MinRS in O*((1+3)n) time.

    Proof. First note that 𝒞L′\𝒰 contains no singleton components. Therefore, the number of connected components in 𝒞L′\𝒰 is at most |L|2, i.e., |CL||U||L|2. Now, when computing opt(L′) for any subset L′ of L, the number of applications of the formula in Lemma 3 is reduced since there no are subsets 𝒟 of cardinality larger than |𝒞L′| − |𝒰|. More precisely, the time for each L′ is reduced to O*(j=2|CL||U|(|CL||U|j)2j)=O*((2+1)|CL||U|)=O*(3|CL||U|)=O*(3|L|). Finally, replacing O*(3|CL|) by O*(3|L|) in the analysis of computing opt(L) above gives a total time complexity of O*(i=2n(ni)3i)=O*((3+1)n).

    Remark 1: The algorithm as presented here returns opt(L). An optimal tree with this number of internal nodes can be obtained by standard traceback techniques.

    Remark 2: For each L′ ⊆ L, the algorithm needs to store the value of opt(L′) and there are Ω(2n) such subsets. Therefore, the space complexity of the algorithm is also exponential in n.

    Corollary 1. MinRLC can be solved in O(kn3+(1+3)n·poly(n)) time.

    Proof. First construct =i=1kr(Ti) in O(kn3) time, e.g., by preprocessing each Ti in O(n) time so that any query of the form lca(x, y) in Ti with x, yL can be answered in O(1) time [4] and then, for every L′L with |L′| = 3, doing 3k lca-queries to see if L′ induces the same rooted triplet in all of the k trees. Next, run MinRS_exact on .

    In Section 7, we shall refer to the algorithm in Corollary 1 as MinRLC_exact.


    4. Exponential-Time Algorithms for MinIS and MinILC

    We now describe an O*(4n)-time algorithm for MinIS based on the technique from Section 3. Applying it to MinILC yields an O(kn3+4n·poly(n))-time algorithm for the latter.

    Lemma 1 guarantees that every valid solution to MinIS can be discovered by trying all ways of merging connected components in the auxiliary graphs 𝒢(L′). As in Section 3, we use dynamic programming to compute and store optimal values to subproblems but make the following modifications. First of all, redefine opt so that opt(L′) for every L′L means the value of |r(L′)| for an optimal solution T′ to MinIS for |L′. Secondly, redefine DP(𝒟) for every 𝒟𝒞L′ to mean the minimum value of XQ(opt(Merge(X))+(|Merge(X)|2)·|L\Merge(X)|), taken over all possible true partitions 𝒬 of 𝒟. With the new definitions of opt and DP, the analogues of Lemmas 2 and 3 become:

    Lemma 5. For every L′ ⊆ L with |L′| ≥ 2, it holds that opt(L′) = DP(𝒞L′).

    Proof. DP(𝒞L′) counts the minimum number of rooted triplets in a tree consistent with |L′ among all partitions 𝒬 of 𝒞L′. Hence, opt(L′) = DP(𝒞L′).

    Lemma 6 For every 𝒟 ⊆ 𝒞L′ with |𝒟′| ≥ 2, DP(D)=minXD{opt(Merge(X))+(|Merge(X)|2)·|L\Merge(X)|+min{DP(D\X), opt(Merge(D\X))+(|Merge(D\X)|2)·|L\Merge(D\X)|}}.

    Proof. DP(D)=min{XQ(opt(Merge(X))+(|Merge(X)|2)·|L\Merge(X)|):Q is a true partition of D}=min{opt(Merge(X))+(|Merge(X)|2)·|L\Merge(X)|+min{DP(D\X),opt(Merge(D\X))+(|Merge(D\X)|2)·|L\Merge(D\X)|}:XQ,Q is a true partition of D}=min{opt(Merge(X))+(|Merge(X)|2)·|L\Merge(X)|+min{DP(D\X),opt(Merge(D\X))+(|Merge(D\X)|2)·|L\Merge(D\X)|}:XD}.

    The new algorithm, called MinIS_exact, is obtained by modifying Algorithm MinRS_exact as follows:

    •   Change the last part of Step 5 so that it assigns DP({X}):=opt(Λ(X))+(|Λ(X)|2)·|L\Λ(X)|, according to the new definition of DP.

    •   Change Step 8 so that it computes DP(𝒟) using Lemma 6 instead of Lemma 3.

    •   Change Step 11 so that it assigns opt(L′) := DP(𝒞L′), in accordance with Lemma 5.

    •   Change Step 4 so that it always sets 𝒰 to .

    The reason why we force 𝒰 = ∅ is that we do not have an analogue of Lemma 4 for MinIS that would allow us to ignore the singleton components. The algorithm therefore spends O*(j=2|CL|(|CL|j)2j)=O*((2+1)|CL|)=O*(3|CL|) time for each L′, just like the slower version of Algorithm MinRS_exact in Section 3, and the total running time is O*(i=2n(ni)3i)=O*((3+1)n)=O*(4n).

    Theorem 2. Algorithm MinIS_exact solves MinIS in O*(4n) time.

    Corollary 2. MinILC can be solved in O(kn3 + 4n · poly(n)) time.

    Section 7 refers to the algorithm in Corollary 2 as MinILC_exact.


    5. NP-Hardness of MinRLC

    Section 3 in [18] proved that MinRS is NP-hard. It follows from the proof in [18] that MinRS remains NP-hard even if restricted to a particular special case which we now describe.

    Suppose that L0 = {v1, v2, …, vq} is a set of elements. Define L0′ = {v1′, v1″, v2′, v2″, …, vq′, vq″}, and for any integers i, j with 1 ≤ i < jq, define (vi, vj) as the set of four rooted triplets {vi′vi″|vj′, vi′vi″|vj″, vj′vj″|vi′, vj′vj″|vi″} over L0′. For any set S, let (S2) denote the set of all subsets of S of cardinality 2. According to Section 3 in [18], MinRS is NP-hard even if restricted to instances where has the form ={vi,vj}Z(vi,vj) for some set L0 and some Z(L02).

    Theorem 3. MinRLC is NP-hard.

    Proof. We reduce from the above variant of MinRS. Let be any given instance of the problem. Let P be the set of pairs of indices that form rooted triplets in , i.e., P = {{i, j} : vi′vi″|vj′, vi′vi″|vj″, vj′vj″|vi′, vj′vj″|vi″}, and let Q=({1,2,...,q}2)\P.

    Define a tree T0 = ((v1′, v1″), (v2′, v2″), …, (vq′, vq″)); and for every f = {x, y} ∈ Q, define a tree Tf by taking a copy of T0 and merging the two subtrees (vx′, vx″) and (vy′, vy″) so that Tf = ((vx′, vx″, vy′, vy″), (v1′, v1″), (v2′, v2″), …, (vn′, vn″)). Let 𝒮 = {T0} ∪ {Tf : fQ}. Note that =TiSr(Ti). This is because for any {x, y} ∈ P, the four rooted triplets vx′vx″|vy′, vx′vx″|vy″, vy′vy″|vx′, vy′vy″|vx″ appear in as well as in r(Ti) for every Ti𝒮. On the other hand, for any {x, y} ∈ Q, vx′vx″|vy′, vx′vx″|vy″, vy′vy″|vx′, vy′vy″|vx″ do not appear in or in r(T{x,y}). Thus, there exists a tree T with TiSr(Ti)r(T) having x internal nodes if and only if there exists a tree T′ with r(T′) having x internal nodes.


    6. NP-Hardness of MinILC and MinIS

    To prove the NP-hardness of MinILC, we give a polynomial-time reduction from the Maximum Clique problem, which is NP-hard [13]. Maximum Clique takes as input an undirected graph G = (V, E) and asks for a largest clique in G, where XV is a clique in G if every two vertices belonging to X are adjacent in G.

    The reduction is as follows. Let n = |V| and write V = {1, 2, …, n}. Create a set L of leaf labels such that L = {vi, vi′ : iV} ∪ {z, w1, w2, …, wn2}. Let T be the tree (z, (w1, w2, …, wn2), (v1, v1′), (v2, v2′), …, (vn, vn′)); with Λ(T) = L. For any nonempty subset X = {i1, i2, …, ip} ⊆ V, define TX as the tree with Λ(TX) = L obtained by taking a copy of T, deleting the subtrees (vi, vi′) for all iX, and replacing the subtree (w1, w2, …, wn2) by ((w1,w2,,wn2,vi1,vi2,,vip), vi1,vi2,,vip). See figure 3 for an illustration. Finally, let S={T}{T{i}:iV}{T{i,j}:{i,j}E}.

    Figure 3. The tree T defined in the reduction from Maximum Clique and a tree T{1,2}.

    Section 6.1 first states some general properties satisfied by the trees defined above. Then, Section 6.2 establishes some specific properties satisfied by any local consensus tree of 𝒮. After that, we will prove that for any XV, X is a maximum clique in G if and only if TX is a local consensus tree of 𝒮 with the smallest possible number of rooted triplets, giving the main result of this section.


    6.1. General Properties

    The following additional notation is used. For any tree H, Child(H) is the set of children of the root of H. For any uV(H), the subtree of H induced by u and all proper descendants of u is called the subtree of H rooted at u and is denoted by Hu. For any L′ ⊆ Λ(H), H|L′ is the tree obtained from H by deleting all nodes with no descendants in L′ and their incident edges, and then contracting every edge between a node having one child and its child. Finally, for every positive integer n, define a function fn(k)=n5k12n4+4kn36k23k32n2+(4k24k1)n7k37k22. We immediately have:

    Lemma 7. For any tree H, |r(H)|=uChild(H)(|r(Hu)|+(|Λ(Hu)|2)·(|Λ(H)||Λ(Hu)|)).

    Lemma 8. Let X be any subset of the given V. Write k = |X|. Then |r(TX)| = fn(k).

    Proof. By Lemma 7, the number of rooted triplets consistent with TX is (n2+k2)·k+(n2+2k2)·(2n2k+1)+(nk)·(n2+2n1). Expanding this expression yields the formula.

    Corollary 3. For any fixed n ≥ 8, fn(k) is strictly decreasing as k increases.

    Proof. By Lemma 8, fn(k+1)fn(k)=12n4+4n312k+32n2+8kn72(3k2+k). Since n ≥ 8, 12n4+4n30 holds. Also, 12k+32n>8k for n ≥ 8 and 72(3k2+k)0. The corollary follows.

    Lemma 9. Consider any uChild(H) in a tree H. Suppose that Λ(Hu) = αβ for some α, β ≠ ∅ with αβ = ∅. Let H′ be the tree obtained from H by deleting Hu and its parent edge and attaching the roots of H|α and H|β as children of the root of H. If |α|+|β|2|Λ(H)|3 then |r(H′)| < |r(H)|.

    Proof. Define m = |Λ(H)|. Lemma 7 gives |r(H)||r(H)|=|r(Hu)|+(|α|+|β|2)·(m|α||β|)|r(H|α)|(|α|2)·(m|α|)|r(H|β)|(|β|2)·(m|β|). Noting that |r(Hu)||r(H|α)|+|r(H|β)|, we have |r(H)||r(H)|(|α|+|β|2)·(m|α||β|)(|α|2)·(m|α|)(|β|2)·(m|β|)=|α|·|β|·(m+132·(|α|+|β|))|α|·|β|·(m+1m)=|α|·|β|>0.


    6.2. Properties of a Local Consensus Tree of 𝒮

    By the definition of 𝒮, we have the next lemma.

    Lemma 10. The set TiSr(Ti) consists of the following rooted triplets:

    •   wiwj|z for all 1 ≤ i < jn2 and vivi′|z for all 1 ≤ in;

    •   wiwj|vkfor all 1 ≤ i < jn2 and 1 ≤ kn;

    •   vivi|vj, vivi|vj, vjvj|vi, and vjvj|vi for all 1 ≤ i < jn with {i, j} ∉ E.

    Let T be any local consensus tree of 𝒮, i.e., any tree T such that Λ(T) = L and TiSr(Ti)r(T). According to Lemma 10, r(T) contains vivi|z for all 1 ≤ in, so the two leaves vi and vi′ must belong to the same subtree attached to the root of T for all 1 ≤ in. Similarly, all leaves in {w1,w2,,wn2} must belong to one subtree attached to the root of T. The core of T, denoted by γT, is the subtree of T rooted at the node lca(w1,w2,,wn2). The path from the root of T to the parent of γT is called the core path of T. For any node uV(T), if u is a child of the core path of T that does not belong to the core path itself and ulca(w1,w2,,wn2), the subtree of T rooted at u is called a secondary subtree of T. Note that the secondary subtrees of T are disjoint. Define CT = {i : vi ∈ Λ(γT)}.

    Lemma 11. Let T be a local consensus tree of 𝒮. T has the following properties:

    1. The core γT does not contain the leaf vi′ for any 1 ≤ in.

    1. CT forms a clique in G.

    1. For any i ∈ {1, 2, …, n}, if CT ∪ {i} is not a clique in G then vi and vi′ belong to the same secondary subtree of T.

    Proof.

    1. Suppose vi′ ∈ Λ(γT). Let wa, wb be any two leaves such that lca({wa,wb})=lca({w1,w2,,wn2}). Then, wawb|vir(T), contradicting Lemma 10.

    1. Consider any i, jCT with ij. By point 1., vivj|vi′ ∈ r(T), so vivi|vjr(T). According to Lemma 10, {i, j} ∉ E does not hold, which means that {i, j} ∈ E.

    1. Since CT ∪ {i} is not a clique, there exists some jCT where {i, j} ∉ E. By Lemma 10, vivi|vjr(T). Thus, vi and vi are in the same subtree attached to the core path.

    Observe that Lemma 11.1 implies Λ(γT)={w1,w2,,wn2}{vp|pCT}. Moreover, by Lemma 11.3, for any i ∈ {1, 2, …, n}, if vi and vi′ belong to subtrees attached to different nodes on the core path then CT ∪ {i} is a clique in G.

    Lemma 12. Let n ≥ 10 and let T be a local consensus tree of 𝒮. T can be transformed into a local consensus tree of 𝒮 of the form TX for some XV where X is a clique in G and |r(TX)| ≤ |r(T)|.

    Proof. We describe a sequence of transformations that can be applied to T without increasing the number of rooted triplets consistent with it. After each transformation, the resulting tree still contains all of the rooted triplets listed in Lemma 10, so it is still a local consensus tree of 𝒮.

    First, consider the leaf z in T. Let P be the path from the root of T to z, and let ρ1, ρ2, …, ρe (where possibly e = 0) be the subtrees of T attached to P whose roots do not belong to P themselves. Let T1 be the tree formed by removing P and attaching z and the roots of ρ1, ρ2, …, ρe as children of the root. See figure 4. Then |r(T1)| ≤ |r(T)|, and T1 has the property that z is a child of the root of T1.

    Figure 4. Illustrating the first part of the proof of Lemma 12. Any local consensus tree T of 𝒮 can be transformed into a tree of the form shown in T1 without losing any of the rooted triplets specified in Lemma 10. P is the path in T from the root to the leaf z, to which the subtrees ρ1, ρ2, …, ρe are attached. For each i ∈ {1, 2, …, n}, the two leaves vi and vi′ belong to the same ρj-subtree by Lemma 10. Also, all leaves in {w1,w2,,wn2} are in a single ρj-subtree.

    Secondly, transform T1 to T2 by contracting the core γT1, i.e., by replacing γT1 by a single node to which all leaves in Λ(γT1) are directly attached. See figure 5. Clearly, |r(T2)| ≤ |r(T1)|.

    Figure 5. Transforming T1 to T2.

    Thirdly, suppose that for some s ∈ {1, 2, …, n}, it holds that sCT2 while CT2{s} is a clique in G. Let T3 be the tree formed by removing the leaf vs from its location in T2 and attaching it to the root of γT2, and moving the leaf vs′ so that it becomes a sibling of the root of γT2. (If, as a result, any node has only one child left then contract its outgoing edge.) See figure 6. There are two types of rooted triplets involving vs: (i) xvs|y and (ii) xy|vs. For (i), T3 has at most (n2 + n − 1) · 2n more rooted triplets than T2 of this form because there are at most n2 + n − 1 choices of x by Lemma 11.1 and at most 2n choices of y. For (ii), there are at least (n22) rooted triplets in T2 but not in T3, corresponding to pairs of the form (wi, wj), and at most (2n2) such rooted triplets in T3 that are not present in T2. Similarly, there are two types of rooted triplets involving vs′: (iii) xvs′|y and (iv) xy|vs′. As above, T3 has at most (n2 + n − 1) · 2n more rooted triplets than T2 of the form (iii) and at most (2n2) more rooted triplets of the form (iv). Hence, by transforming T2 to T3, the number of rooted triplets is reduced by at least (n22)2·(2n2)2·(n2+n1)·(2n), which is larger than 0 when n ≥ 10. We repeat this step until T3 has no leaf vs such that sCT3 and CT3{s} is a clique. This gives |r(T3)| ≤ |r(T2)|.

    Figure 6. Transforming T2 to T3.

    Next, transform T3 to a tree T4 in which every secondary subtree contains at most two leaves and, furthermore, the leaves in any secondary subtree with precisely two leaves are of the form {vq, vq′} where CT4{q} is not a clique in G. To do this, consider any secondary subtree σ of T3. By the definition of T3 in the previous paragraph, CT3{q} is not a clique in G for any vq ∈ Λ(σ). Recall from Lemma 11.3 that any two leaves of the form vq and vq′ must belong to the same secondary subtree. While | Λ(σ)| > 2, extract any pair of leaves {vq, vq′} from σ and create a new secondary subtree with the leaves {vq, vq′} attached to the core path as a sibling of σ. (As above, if any node has only one child left after these operations then contract its outgoing edge.) See figure 7. Every secondary subtree σ satisfies |Λ(σ)|2n2n232|Λ(H)|3, where H is the subtree rooted at the parent of the root of σ, so we get |r(T4)| ≤ |r(T3)| by Lemma 9.

    Figure 7. Transforming T3 to T4.

    Lastly, transform T4 to a tree T5 whose core path consists of a single edge (u0, u1), where u0 is the root of T5, as follows. Attach every secondary subtree of T4 having two leaves as a child of u0, and attach every secondary subtree of T4 having one leaf as a child of u1. Attach z as a child of u0 and the core γT4 as a child of u1. See figure 8. Note that this will not destroy any of the rooted triplets in Lemma 10 and that |r(T5)| ≤ |r(T4)|.

    Figure 8. Transforming T4 to T5.

    By the definition of TX, T5 is equal to TX if we select X=CT5. Finally, CT5 is a clique in G by Lemma 11.2.

    Lemma 13. Let n ≥ 10. XV is a maximum clique in G if and only if TX is a local consensus tree of 𝒮 that minimizes the number of rooted triplets.

    Proof. (→) For the purpose of obtaining a contradiction, suppose there exists a local consensus tree T′ of 𝒮 with |r(T′)| < |r(TX)|. Apply Lemma 12 to T′ to get a tree TQ that is also a local consensus tree of 𝒮 with |r(TQ)| ≤ |r(T′)| and where Q is a clique in G. Then |Q| > |X| by Lemma 8 and Corollary 3, which is impossible.

    (←) Suppose X′ is a larger clique in G than X. Lemma 8 and Corollary 3 imply |r(TX′)| < |r(TX)|, contradicting that TX is a local consensus tree of 𝒮 minimizing the number of rooted triplets.

    Now, assuming without loss of generality that n ≥ 10 in the reduction from Maximum Clique above, Lemma 13 gives:

    Theorem 4. MinILC is NP-hard.

    Finally, by the reduction from MinILC to MinIS mentioned in Section 1.1:

    Corollary 4. MinIS is NP-hard.


    7. Implementations

    We have implemented the algorithms from Sections 3 and 4 above in C++. The source code can be downloaded from:

    https://github.com/Mesh89/FACT2

    (This package also contains an algorithm for computing another type of consensus tree called the frequency difference consensus tree, described in detail in [20].) Subsets, as well as components, were represented as bitsets and implemented as unsigned 32-bit integers. This means that the current implementation can deal with up to 32 leaves; it can be extended by using 64-bit integers or a proper bitset, in which case the memory usage will increase accordingly.

    Since the computational complexity of the new algorithms is exponential, we do not expect them to be efficient for large inputs. Nevertheless, it would be useful to know how far they can be pushed. We therefore conducted several experiments to test their running times and memory usage for various inputs. Sections 7.1 and 7.2 below describe the experiments and the results, respectively. In short, the algorithms can be considered “practical” for n ≤ 16.


    7.1. Experimental Setup

    The experiments were done on a computer equipped with an Intel i7-2600 processor (3.4 GHz) and 8 Gb of memory, running Ubuntu 16.04. The source code was compiled using g++, version 5.4.0. Running times were measured using the time command, and the memory usage was measured using Valgrind [24] and its heap profiling tool Massif.

    To generate simulated inputs, we used two different approaches:

    •   Approach 1 aimed at generating a set of unrelated trees. For each specified value of (k, n), we repeated the following procedure k times to obtain a set of k non-binary trees with leaf label set {1,2,…,n}: First, generate a binary tree in the uniform model.§ Then, for each internal node, contract it with probability 0.2.

    •   Approach 2 aimed at generating a set of related trees. For each specified value of (k,n), we first generated a single master tree T0 using the method described in Approach 1. We then created a set of k non-binary trees by doing the following k times: Take a copy of T0 and for each non-root node v, with probability 0.1 move the subtree rooted at v by selecting a node u uniformly at random that is not a descendant of v and letting v become a child of u instead (nodes that end up with a single child are contracted to preserve the property that every internal node has at least two children).

    In the experiments, for various values of (k, n), we generated 50 independent inputs at random following Approach 1, and measured the average running times of Algorithms MinRLC_exact and MinILC_exact for these inputs. The average memory usage was also measured, but taken over 20 runs only because of the slowdown caused by Valgrind. The experiments were repeated for inputs generated according to Approach 2.

    To test the algorithms on real datasets, we also downloaded a set of trees from the 10kTrees website [3], Primates, for varying k and n. In order to obtain trees with a specified value of n, we manually selected n species from the Colobinae subcategory. The 10kTrees website produces a number of trees between 10 and 10,000 so we tested the algorithms on a more limited range of values of k here (i.e., k ≤ 10,000). To obtain instances with k = 8, we generated 10 trees and subsequently removed two of them selected at random.

    The results are reported in the next subsection.


    7.2. Experimental Results

    The plots in figures 9 and 10 show how the average running times (in seconds) and the average memory usage (in Mb) of Algorithms MinRLC_exact and MinILC_exact increase for fixed k and increasing n, and for fixed n and increasing k. In figure 9, inputs were generated according to Approach 1 (“unrelated trees”), while in figure 10, they were generated according to Approach 2 (“related trees”). As a fixed value of k, we selected k = 8 because using a larger number of trees resulted in i=1kr(Ti) being almost empty most of the time, leading to a trivial scenario. As a fixed value of n, we selected n = 16 because it fell safely within the practical limit of our algorithms yet it was large enough to yield informative results. In general, MinRLC_exact is more efficient than MinILC_exact.

    Figure 9. Plots of the average running times of MinRLC_exact and MinILC_exact in seconds (left) and their memory usage (right) as a function of increasing n (with fixed k = 8) and increasing k (with fixed n = 16) on unrelated trees.

    Figure 10. Plots of the average running times of MinRLC_exact and MinILC_exact in seconds (left) and their memory usage (right) as a function of increasing n (with fixed k = 8) and increasing k (with fixed n = 16) on related trees.

    Figure 11. Plots of the average running times of MinRLC_exact and MinILC_exact in seconds as a function of increasing n (with fixed k = 8) and increasing k (with fixed n = 16).

    The behavior of MinRLC_exact is as one might expect. Its average running time is essentially independent of k, with a slight increase when k gets huge due to the added cost of parsing the input. Its average memory usage also appears to be linear in the size of the input. On the other hand, as n increases, the average running time follows a typical exponential-growth pattern.

    The performance of MinILC_exact is more curious and worthy of some additional comments. Firstly, figures 9 (a), (b) and figures 10 (a), (b) show that its running time becomes an issue earlier than its memory usage does. (In fact, in figures 9 (b) and 10 (b), we had to end the experiments at n = 16 because n = 8 required hours to finish a single run.) Secondly, observe MinILC_exacts's behavior in figure 9 (c) and figure 10 (c). In both cases, the average running time increases rapidly as more trees are added and stabilizes after a certain value. However, the average running time grows much faster for unrelated trees than related trees. In contrast, MinRLC_exact seems to do better on unrelated trees than related trees; compare, e.g., figure 9 (a) to figure 10 (a). Our hypothesis is that this is due to the different impact that the singleton components have on the two algorithms: MinRLC_exact is allowed to ignore singleton components but MinILC_exact cannot. When there are few shared rooted triplets, most components in the auxiliary graph will be singleton components, meaning that MinRLC_exact will perform very well while MinILC_exact will perform poorly as it has to consider a larger number of connected components. In other words, MinRLC_exact prefers inputs consisting of dissimilar trees whereas MinILC_exact prefers inputs where the trees are similar. This hypothesis is reinforced by the results for the trees from the 10kTrees website; these trees are expected to be very similar as they are highly correlated [3]. Figure 11 shows that in this case, the running times of MinILC_exact and MinRLC_exact are roughly the same.


    8. An Experimental Investigation of the Non-Minimality of BUILD

    In this section, we use the implementations from Section 7 to experimentally investigate whether applying the BUILD algorithm [2] to the set i=1kr(Ti), i.e., computing Bryant's BUILD-based local consensus tree [8], yields close approximations to MinRLC and MinILC in practice. This would be significant because it would provide a good polynomial-time heuristic for MinRLC and MinILC, which are NP-hard according to Theorems 3 and 4.

    We generated sets of “related trees” of various sizes using Approach 2 in Section 7.1, and for each such input set, we extracted the shared rooted triplets and fed them to BUILD. We then counted the number of internal nodes and rooted triplets in BUILD's output as well as in the optimal solutions to MinRLC and MinILC, obtained by running our algorithms MinRLC_exact and MinILC_exact.

    Tables 2 and 3 report the fraction of times that the solution reported by BUILD was optimal for MinRLC and MinILC, respectively, over 50 runs. Unfortunately, even for small values of n, BUILD often returns a non-optimal solution. For example, when k = 4 and n = 12, the BUILD-based solution was suboptimal compared to MinILC in more than half of the cases. Next, Tables 4 and 5 show how wrong BUILD usually is, which may be more relevant. On the positive side, our experiments indicate that the ratio between the number of internal nodes or the number of rooted triplets in the BUILD-based local consensus tree and in an optimal solution is fairly close to 1 on average; in the example above with k = 4 and n = 12, the BUILD-based solution was only 8% worse than the optimal solution to MinILC on average. However, the errors seem to increase with n and it may be the case that BUILD becomes less and less optimal as n grows larger, thus motivating the need for either faster optimal algorithms or better polynomial-time heuristics.

    Table 2. The suboptimality of the BUILD-based local consensus tree as a solution to MinRLC. For different values of k (rows) and n (columns), the table shows the number of times that the BUILD-based local consensus tree was an optimal solution to MinRLC divided by the number of cases tested.
    n
    6 8 10 12 14 16 18 20 22 24
    k 2 1.00 1.00 1.00 0.96 0.94 0.96 0.92 0.94 0.84 0.80
    4 1.00 1.00 0.98 0.96 0.90 0.94 0.86 0.74 0.72 0.68
    8 1.00 0.98 0.92 0.94 0.92 0.68 0.58 0.56 0.52 0.32
    16 1.00 1.00 0.98 0.90 0.74 0.92 0.68 0.30 0.46 0.30
     | Show Table
    DownLoad: CSV
    Table 3. The fraction of times that the BUILD-based local consensus tree was an optimal solution to MinILC, analogous to Table 2.
    n
    6 8 10 12 14 16
    k 2 0.94 0.88 0.74 0.72 0.50 0.62
    4 0.88 0.78 0.72 0.46 0.34 0.32
    8 0.94 0.78 0.78 0.72 0.56 0.58
    16 1.00 0.96 0.74 0.72 0.78 0.56
     | Show Table
    DownLoad: CSV
    Table 4. The average ratio between the number of internal nodes in the BUILD-based local consensus tree and in an optimal solution to MinRLC.
    n
    6 8 10 12 14 16 18 20 22 24
    k 2 1.00 1.00 1.00 1.01 1.01 1.00 1.01 1.00 1.01 1.01
    4 1.00 1.00 1.00 1.01 1.02 1.01 1.02 1.03 1.03 1.03
    8 1.00 1.01 1.02 1.03 1.02 1.07 1.09 1.10 1.10 1.16
    16 1.00 1.00 1.01 1.03 1.08 1.03 1.15 1.24 1.19 1.32
     | Show Table
    DownLoad: CSV
    Table 5. The average ratio between the number of rooted triplets in the BUILD-based local consensus tree and in an optimal solution to MinILC, analogous to Table 4.
    n
    6 8 10 12 14 16
    k 2 1.00 1.01 1.02 1.01 1.04 1.02
    4 1.01 1.02 1.04 1.08 1.08 1.10
    8 1.01 1.04 1.03 1.07 1.09 1.13
    16 1.00 1.01 1.08 1.10 1.10 1.22
     | Show Table
    DownLoad: CSV

    9. Concluding Remarks

    The main open problem is to obtain faster exponential-time algorithms than the ones presented here. In particular, can MinRS be solved in O*(2n) time?

    According to the experiments in Section 7, the implemented algorithms have different bottlenecks. For MinRLC, the memory usage of our algorithm becomes problematic before the running time does, hence requiring a more memory-efficient solution, but for MinILC, developing a faster solution is more critical.

    Another task is to extend the algorithms in this article to unrooted phylogenetic trees. This would be interesting because many existing methods for inferring phylogenetic trees produce unrooted trees [12]. The unrooted case appears to be much harder than the rooted case, as the basic problem of determining the consistency of an input set of rooted triplets is solvable in polynomial time (see Section 1.2), while the corresponding problem for unrooted quartets (unrooted, distinctly leaf-labeled trees with exactly four leaves each and in which every internal node has at least three neighbors) is already NP-hard [29].


    Acknowledgements

    The authors would like to thank Andrzej Lingas and Richard S. Lemence for some inspiring discussions. J.J. was partially funded by The Hakubi Project at Kyoto University and KAKENHI grant number 26330014.


    Conflict of interest

    All authors declare no conflicts of interest in this paper.

    *

    This article assumes without loss of generality that the input to MinRS/MinIS is consistent. The reason is that given an arbitrary , one can check whether is consistent or not in polynomial time using the BUILD algorithm [2] described below.

    The notation O*(f(n)) means O(f(n) · poly(n)).

    This technique was actually used even earlier than [26]; the SUPERB algorithm in [11] outputs all binary trees consistent with by considering all ways of merging the connected components of 𝒢(L) into exactly two connected components at each recursion level.

    §

    The uniform model [23] starts with a tree consisting of a single leaf labeled by 1, and then for i from 2 to n, inserts a leaf labeled by i by selecting an existing edge f in the tree uniformly at random (here, the root is regarded to be attached to an imaginary parent node via an edge that belongs to the tree as well), breaking f into two edges by inserting a new internal node x into f, and creating an edge from x to a leaf labeled by i.


    [1] Adams III EN (1972) Consensus techniques and the comparison of taxonomic trees. Systematic Zoology 21: 390–397. doi: 10.2307/2412432
    [2] Aho AV, Sagiv Y, Szymanski TG, et al. (1981) Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput 10: 405–421.
    [3] Arnold C, Matthews LJ, Nunn CL (2010) The 10kTrees website: A new online resource for primate phylogeny. Evolutionary Anthropology 19: 114–118. doi: 10.1002/evan.20251
    [4] Bender MA, Farach-Colton M (2000) The LCA problem revisited. In Proceedings of the 4thLatin American Symposium on Theoretical Informatics (LATIN 2000), volume 1776 of LNCS, pages 88–94. Springer-Verlag.
    [5] Bininda-Emonds ORP (2004) The evolution of supertrees. TRENDS Ecol Evolution 19: 315–322. doi: 10.1016/j.tree.2004.03.015
    [6] Bininda-Emonds ORP, Cardillo M, Jones KE, et al. (2007) The delayed rise of present-day mammals. Nature 446: 507–512. doi: 10.1038/nature05634
    [7] Bryant D (1997) Building Trees, Hunting for Trees, and Comparing Trees: Theory and Methods in Phylogenetic Analysis. PhD thesis, University of Canterbury, Christchurch, New Zealand.
    [8] Bryant D (2003) A classification of consensus methods for phylogenetics. In Janowitz MF, Lapointe FJ, McMorris FR, et al., editors, Bioconsensus, volume 61 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 163–184. American Mathematical Society.
    [9] Byrka J, Guillemot S, Jansson J (2010) New results on optimizing rooted triplets consistency. Discrete Applied Mathematics 158: 1136–1147. doi: 10.1016/j.dam.2010.03.004
    [10] Chor B, Hendy M, Penny D (2007) Analytic solutions for three taxon ML trees with variable rates across sites. Discrete Applied Mathematics 155: 750–758. doi: 10.1016/j.dam.2005.05.043
    [11] Constantinescu M, Sankoff D (1995) An efficient algorithm for supertrees. J Classification 12: 101–112. doi: 10.1007/BF01202270
    [12] Felsenstein J (2004) Inferring Phylogenies. Sinauer Associates, Inc., Sunderland, Massachusetts.
    [13] Garey M, Johnson D (1979) Computers and Intractability – A Guide to the Theory of NPCompleteness. Freeman WH and Company, New York.
    [14] Gąsieniec L, Jansson J, Lingas A, et al. (1999) On the complexity of constructing evolutionary trees. J Combinatorial Optimization 3: 183–197. doi: 10.1023/A:1009833626004
    [15] He YJ, Huynh TND, Jansson J, et al. (2006) Inferring phylogenetic relationships avoiding forbidden rooted triplets. J Bioinformatics Comput Bio 4: 59–74. doi: 10.1142/S0219720006001709
    [16] Henzinger MR, King V, Warnow T. (1999) Constructing a tree from homeomorphic subtrees, with applications to computational evolutionary biology. Algorithmica 24: 1–13. doi: 10.1007/PL00009268
    [17] Huson DH, Rupp R, Scornavacca C (2010) Phylogenetic Networks: Concepts, Algorithms and Applications. Cambridge University Press, Cambridge, U.K.
    [18] Jansson J, Lemence RS, Lingas A (2012) The complexity of inferring a minimally resolved phylogenetic supertree. SIAM J Comput 41: 272–291. doi: 10.1137/100811489
    [19] Jansson J, Lingas A, Rajaby R, et al. (2017) Determining the consistency of resolved triplets and fan triplets. In Proceedings of the 21st Annual International Conference on Research in Computational Molecular Biology (RECOMB 2017), volume 10229 of LNCS, pages 82–98. Springer-Verlag.
    [20] Jansson J, Rajaby R, Shen C, et al. (to appear). Algorithms for the majority rule (+) consensus tree and the frequency difference consensus tree. IEEE/ACM Transactions Computational Bio Bioinformatics.
    [21] Jansson J, Shen C, Sung WK (2016) Improved algorithms for constructing consensus trees. J ACM 63.
    [22] Kannan S,Warnow T, Yooseph S (1998) Computing the local consensus of trees. SIAM J Computing 27: 1695–1724. doi: 10.1137/S0097539795287642
    [23] McKenzie A, Steel M (2000) Distributions of cherries for two models of trees. Mathematical Biosciences 164: 81–92. doi: 10.1016/S0025-5564(99)00060-7
    [24] Nethercote N, Seward J (2007) Valgrind: a framework for heavyweight dynamic binary instrumentation. In Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation (PLDI 2007), pages 89–100. ACM.
    [25] Ng MP, Wormald NC (1996) Reconstruction of rooted trees from subtrees. Discrete Applied Mathematics 69: 19–31. doi: 10.1016/0166-218X(95)00074-2
    [26] Semple C (2003) Reconstructing minimal rooted trees. Discrete Applied Mathematics 127: 489–503. doi: 10.1016/S0166-218X(02)00250-0
    [27] Semple C, Daniel P, Hordijk W, et al. (2004) Supertree algorithms for ancestral divergence dates and nested taxa. Bioinformatics 20: 2355–2360. doi: 10.1093/bioinformatics/bth246
    [28] Snir S, Rao S (2006) Using Max Cut to enhance rooted trees consistency. IEEE/ACM Transactions Comput Bio Bioinformatics 3: 323–333. doi: 10.1109/TCBB.2006.58
    [29] Steel M (1992) The complexity of reconstructing trees from qualitative characters and subtrees. J Classification 9: 91–116. doi: 10.1007/BF02618470
    [30] Sung WK (2010) Algorithms in Bioinformatics: A Practical Introduction. Chapman & Hall/CRC, Boca Raton, Florida.
    [31] Willson SJ. (2004) Constructing rooted supertrees using distances. Bulletin Mathematical Bio 66: 1755–1783. doi: 10.1016/j.bulm.2004.04.006
    [32] Wulff-Nilsen C (2013) Faster deterministic fully-dynamic graph connectivity. In Proceedings of the 24th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2013), pages 1757–1769. SIAM.
  • This article has been cited by:

    1. Jesper Jansson, Konstantinos Mampentzidis, T.P. Sandhya, Building a Small and Informative Phylogenetic Supertree, 2023, 08905401, 105082, 10.1016/j.ic.2023.105082
  • Reader Comments
  • © 2018 the Author(s), licensee AIMS Press. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0)
通讯作者: 陈斌, bchen63@163.com
  • 1. 

    沈阳化工大学材料科学与工程学院 沈阳 110142

  1. 本站搜索
  2. 百度学术搜索
  3. 万方数据库搜索
  4. CNKI搜索

Metrics

Article views(4907) PDF downloads(797) Cited by(1)

Article outline

Figures and Tables

Figures(11)  /  Tables(5)

Other Articles By Authors

/

DownLoad:  Full-Size Img  PowerPoint
Return
Return

Catalog