Testing the Incongruence of Phylogenetic Trees

Lecture by Tricia Abe

BIOL 606 Session, University of Alberta, February 3, 1999.

Rapporteur: Chris B. Cameron

Tricia Abe began her lecture by asking, "How should different data sets be analyzed in order to reveal the true species phylogeny for the group?" A brief chronology of the framework for describing biological diversity began with Linnaeus and included the birth and growth of cladistics, beginning with Willi Hennig in 1950. Not until the later 20th century were methods of analyzing data refined to include both separate and combined data analysis. The interpretation of congruence in phylogenetic data is one of the most recent areas of cladistic analysis to come to light, because systematists now have access to large, independent data sets (thanks largely to advances in molecular biology) with which to compare alternative tree topologies.

Arguably, the grandest assumption in cladistic analysis was first proposed by William of Ockham, a medieval scholastic philosopher. 'Ockham's razor' is the principle that where several hypotheses are possible, the simplest is chosen. In a cladistic framework this principle might be restated as follows: the correct phylogenetic tree topology is the one that requires the fewest evolutionary steps. The problem that then arises is what to do if a data set yields more than one most parsimonious tree. Two possible routes around this dilemma are, first, to form a consensus tree and, second, to present the incongruent trees separately.
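The idea of "fewest evolutionary steps" can be made concrete with a small sketch. The code below counts state changes on a rooted binary tree using Fitch's small-parsimony procedure, which is one standard way to score a topology; the lecture report does not say which algorithm or software was discussed, so this choice is an assumption. The four taxa and the character matrix are hypothetical and serve only as an illustration.

```python
def fitch_steps(tree, states):
    """Minimum number of state changes for one character on a rooted binary tree.

    tree:   nested 2-tuples with taxon names (str) at the tips.
    states: dict mapping taxon name -> observed state for this character.
    """
    def visit(node):
        if isinstance(node, str):              # a tip: its state set is what was observed
            return {states[node]}, 0
        left_set, left_steps = visit(node[0])
        right_set, right_steps = visit(node[1])
        shared = left_set & right_set
        if shared:                             # children agree: no extra step needed
            return shared, left_steps + right_steps
        return left_set | right_set, left_steps + right_steps + 1

    return visit(tree)[1]


def tree_length(tree, matrix):
    """Sum the Fitch step counts over every character (column) in the matrix."""
    n_chars = len(next(iter(matrix.values())))
    return sum(
        fitch_steps(tree, {taxon: row[i] for taxon, row in matrix.items()})
        for i in range(n_chars)
    )


# Hypothetical data: one string per taxon, one binary state per character.
matrix = {"A": "0011", "B": "0010", "C": "1100", "D": "1101"}

tree_ab_cd = (("A", "B"), ("C", "D"))
tree_ac_bd = (("A", "C"), ("B", "D"))
tree_ad_bc = (("A", "D"), ("B", "C"))

lengths = {
    "AB|CD": tree_length(tree_ab_cd, matrix),
    "AC|BD": tree_length(tree_ac_bd, matrix),
    "AD|BC": tree_length(tree_ad_bc, matrix),
}
```

For this toy matrix the three topologies need 5, 8 and 7 steps respectively, so parsimony prefers the first. When two topologies tie for the minimum, the choice between a consensus tree and separate presentation described above arises.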

Before listing some strengths and weaknesses of combined versus separate analysis, Tricia indicated some of the reasons why incongruent phylogenies arise when comparing morphological and molecular data sets. Sampling error, differing rates of change among data sets, and differing phylogenetic histories (resulting from horizontal gene transfer or lineage sorting) were cited as the most common causes of incongruence. The arguments over forming a consensus tree (either from more than one most parsimonious tree in a single data set, or from different tree topologies from independent data sets) versus presenting separate trees point to the split between the conservative and the liberal approaches to cladistics. Some of the strengths of combining data are that it eliminates the problem of conflicting tree topologies, that no decisions have to be made about partitioning data, that 'weak' signals will rise above the 'noise', and that it maximizes the information in a single topology, giving credence to its explanatory power. Counterarguments to combining data are that heterogeneity among data sets will increase 'noise' and reduce the phylogenetic signal, and that characters of different quality (where is R. Pirsig when you need him?) and different quantity are being treated together. If there is a unifying theme about how to analyze phylogenetic data today, it is that data sets should be tested for incongruence before they are combined.

The virtue of separate analysis is that it pinpoints interesting trends between data sets (taxon by taxon, character by character) rather than producing ambiguous and sometimes misleading results. The lecturer acknowledged that there is only one true tree, and therefore all but one (and perhaps all) partitioned data sets are artificial groupings. In fact, a consensus tree of separate analyses can conflict with the tree obtained when the data sets are combined and reanalyzed. How, then, can a systematist test the accuracy of phylogenetic methods?

Recent papers that test the phylogenetic accuracy of different methods (i.e., combined versus separate analysis) first require a well-formed, widely accepted tree for comparison. Given the paucity of such trees, Tricia bravely used a case study (Graham et al., 1998) in which the 'true' tree was not known. In this paper a morphological data set and three sources of data from the chloroplast genome were used to reconstruct the phylogenetic history of 24 plant taxa (family Pontederiaceae). The morphological data set (with comparatively few characters) had a highly resolved strict consensus tree, but its bootstrap support was lower than that of any of the molecular trees. When all the data sets were combined, the resulting strict consensus tree was poorer in resolution than the combined molecular tree, and its average bootstrap value was slightly higher. How do we explain this discrepancy? Graham et al. (1998) concluded that the morphological data set is 'noisy' because it is small and because it is more susceptible to convergence (a particularly annoying problem with plants).
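The "resolution" of a strict consensus tree can be illustrated with a short sketch. A strict consensus keeps only those clades that appear in every input tree, so the fewer clades survive, the poorer the resolution. The code below is a minimal illustration under that definition; the trees are hypothetical four-taxon examples, not the Pontederiaceae trees of Graham et al. (1998).

```python
def clades(tree):
    """Collect every non-trivial clade (a frozenset of tip names) in a nested-tuple tree."""
    found = set()

    def visit(node):
        if isinstance(node, str):              # a tip contributes only itself
            return frozenset([node])
        tips = frozenset().union(*(visit(child) for child in node))
        found.add(tips)
        return tips

    all_tips = visit(tree)
    found.discard(all_tips)                    # the root clade is common to every tree
    return found


def strict_consensus_clades(trees):
    """Keep only the clades that are present in every input tree."""
    return set.intersection(*(clades(tree) for tree in trees))


# Hypothetical rooted trees on the same four taxa.
tree_1 = ((("A", "B"), "C"), "D")
tree_2 = ((("A", "B"), "D"), "C")
tree_3 = ((("A", "C"), "B"), "D")

agree_on_ab = strict_consensus_clades([tree_1, tree_2])
fully_collapsed = strict_consensus_clades([tree_1, tree_2, tree_3])
```

Here `agree_on_ab` retains only the clade {A, B}, while `fully_collapsed` is empty: adding the third, conflicting tree reduces the consensus to an unresolved polytomy.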

Tricia Abe concluded that inferring phylogenetic histories is an incomplete science, that the "compatibility" of data is largely dependent on the sources of incongruence between data sets, and, finally, that interpreting different data sets still relies on background knowledge.

References: please see this list provided by Tricia Abe

 

Discussion

Discussant: Keith Jackson

Keith Jackson began the discussion by presenting an overhead that summarized the four types of incongruence (from Brower et al., 1996). The discussion immediately exploded, with many participants insisting that Type 3 incongruence is based on homoplasy discovered in separate analysis of partitioned data sets, which is different from "mosaic incongruence", an example of Type 2 incongruence. Jackson's approach to starting a discussion was both novel and effective.

Graham pointed out that the problem with using 'true' tree topologies is that those phylogenies can have long branch lengths and result from cladogenesis a long time ago, prompting Jackson to ask, "What is a true phylogenetic history?"

Graham remarked that the truth can be a hypothesis or an inference, depending on your philosophy. He believes that statistics (e.g., bootstrapping) can be used to test the robustness and accuracy of the 'truth'. Strobeck disagreed, remarking that bootstrapping tests the robustness of the data, but unless the data are 'right' it won't support the true hypothesis. Graham countered that the systematist first has to assume that the data reflect the true phylogeny, and then the data can be tested with bootstrapping. Both parties agreed: 'garbage in, garbage out'.
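The exchange above can be illustrated with a minimal sketch of nonparametric bootstrapping under parsimony. Characters (columns) are resampled with replacement, the shortest of a fixed set of candidate topologies is recorded for each replicate, and support is the fraction of replicates won. Several simplifications are assumptions made for brevity: the candidate trees are supplied by hand rather than searched for, ties go to the first candidate listed, and the data matrices are hypothetical. Note that the procedure only measures how consistently the data favour a tree, which is Strobeck's point.

```python
import random


def fitch_steps(tree, states):
    """Fitch step count for one character on a rooted binary nested-tuple tree."""
    def visit(node):
        if isinstance(node, str):
            return {states[node]}, 0
        (left_set, left_steps), (right_set, right_steps) = visit(node[0]), visit(node[1])
        shared = left_set & right_set
        if shared:
            return shared, left_steps + right_steps
        return left_set | right_set, left_steps + right_steps + 1

    return visit(tree)[1]


def tree_length(tree, matrix, columns):
    """Total steps over the given (possibly repeated) column indices."""
    return sum(
        fitch_steps(tree, {taxon: row[i] for taxon, row in matrix.items()})
        for i in columns
    )


def bootstrap_support(candidates, matrix, replicates=200, seed=0):
    """Fraction of bootstrap replicates in which each candidate tree is shortest."""
    rng = random.Random(seed)
    n_chars = len(next(iter(matrix.values())))
    wins = dict.fromkeys(candidates, 0)
    for _ in range(replicates):
        # Resample characters with replacement, keeping the matrix width unchanged.
        columns = [rng.randrange(n_chars) for _ in range(n_chars)]
        best = min(candidates, key=lambda name: tree_length(candidates[name], matrix, columns))
        wins[best] += 1                        # ties go to the first candidate listed
    return {name: count / replicates for name, count in wins.items()}


candidates = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

# Hypothetical matrices: one with no conflict, one with a conflicting character.
clean = {"A": "000", "B": "000", "C": "111", "D": "111"}
noisy = {"A": "0011", "B": "0010", "C": "1100", "D": "1101"}

clean_support = bootstrap_support(candidates, clean)
noisy_support = bootstrap_support(candidates, noisy)
```

With the conflict-free matrix every replicate favours AB|CD, so its support is 1.0. If those characters were themselves misleading, the support would be just as high, which is the 'garbage in, garbage out' caveat on which both parties agreed.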

Spence asked, "If one data set gives a polytomy, whereas two other data sets result in resolved trees, is it appropriate to combine the data or should they be analyzed separately?" Given such a scenario, Polzhein would request additional independent data. Cameron argued that if the two resolved data sets agree, then they should be combined; if, however, the two resolved data sets disagree, combining them will result in a polytomy.

Wilson asked Graham, "Are the three chloroplast genes (in Graham et al., 1998) independent?" Graham replied that there is very little evidence for jumping genes and lineage sorting in the chloroplast genome of plants. Additionally, his data set showed no signs of region duplication. Wilson asked if there were any problems associated with paternal versus maternal chloroplast genomes. Graham argued that only maternal chloroplast genomes are passed on to the F1 generation in the Pontederiaceae, and that maternal versus paternal inheritance of genetic information is a problem with all genes.

Cameron shared a scenario in which an amino acid sequence agrees with the tree topology from a morphological data set, but the base pair sequence from which the amino acid sequence was obtained is incongruent with the morphological data set. He then asked, "Do you combine the amino acid and morphological data sets?" Graham responded that the data sets can be combined if the data are clean, but if the data are messy and prone to 'noise', then combining them will overwhelm the phylogenetic signal.

Graham asked if the Brower et al. (1996) conclusions are valid (that nuclear gene topology is superior to mtDNA gene topology) and McIntyre argued that their conclusion was correct given their data.

The group was asked if they agreed with Brower et al. (1996), who concluded that explaining incongruence was not a task for systematists. The group was unified in their disagreement with this statement. Palmer expressed especially strong disapproval.

At the watering hole... Keith Jackson had the idea that the rift we see in phylogenetic analysis was widened by the editors of Systematic Biology when they insisted that authors submitting a molecular sequence-based topology must discuss their results with respect to morphology.