BIOL 606 Home

Phylogenetic Analysis of Heterogeneous Data Sets

Lecture by Yazdan Keivany (March 9, 1998)

Rapporteur: Greg Dueck

Two techniques for analyzing heterogeneous data sets are the consensus and the combined methods. In the consensus method, characters from multiple data sets are analyzed independently and interact only through a consensus of the trees derived from the separate data sets. In the combined method, characters from multiple data sets are merged into a single data set and analyzed together. Data sets are heterogeneous if the characters within one data set are statistically more similar to each other than to characters in another data set. Consensus among phylogenetic trees can be found using several methods, including Strict, Semi-strict, Adams, and Majority-rule. Two strategies used to search a data set for the best possible tree(s) are maximum parsimony (MP), which finds the shortest tree(s), and maximum likelihood (ML), which finds the tree(s) under which the observed data are most probable. Unlike MP, ML does not assume rates of change along different lineages to be the same. Bootstrapping analyzes data sets built by randomly resampling characters with replacement, and the bootstrap confidence for a clade is the proportion of trees from the bootstrap analysis that support that clade.
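
The consensus and bootstrap-confidence ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each tree is represented as a set of its non-trivial clades, each clade is a frozenset of taxon names, and the three example trees are invented for the sketch rather than taken from the lecture. The function names are hypothetical.

```python
from collections import Counter

def clade_counts(trees):
    """Count how many trees contain each clade.

    Each tree is a set of clades; each clade is a frozenset of taxon names.
    """
    return Counter(clade for tree in trees for clade in tree)

def clade_support(trees):
    """Proportion of trees that contain each clade (bootstrap-style support)."""
    return {clade: n / len(trees) for clade, n in clade_counts(trees).items()}

def strict_consensus(trees):
    """Clades found in every tree."""
    return {clade for clade, n in clade_counts(trees).items() if n == len(trees)}

def majority_rule_consensus(trees):
    """Clades found in more than half of the trees."""
    return {clade for clade, n in clade_counts(trees).items() if 2 * n > len(trees)}

# Invented example: three trees on taxa A-E (hypothetical data).
trees = [
    {frozenset("AB"), frozenset("ABC"), frozenset("DE")},
    {frozenset("AB"), frozenset("ABC"), frozenset("DE")},
    {frozenset("AB"), frozenset("ABD"), frozenset("CE")},
]

support = clade_support(trees)
strict = strict_consensus(trees)
majority = majority_rule_consensus(trees)
```

If the three trees were bootstrap replicates, `support` would give the bootstrap confidence for each clade; if they came from three separate data sets, `strict` and `majority` would be two of the consensus summaries named above.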

Several tests for heterogeneity have been developed. In general, if there is no overlap in bootstrap trees, the data sets are heterogeneous. By comparing the Wagner tree for each data set to the tree from the combined data, the degree to which a tree departs from a reference tree can be measured. Assessing the fit of each data set to the tree computed from the combined data can also test for heterogeneity. The degree of support for a particular clade within a data set can also be measured with the topology-dependent permutation tail probability (T-PTP) test.

Two consensus methods for heterogeneous data sets that have been formalized are the Traditional method and the Prior Congruent method. In the Traditional method, partitioned data of different types are obtained and analyzed separately. A consensus tree is then constructed from the trees obtained from the independent analyses. In the Prior Congruent method, all available data are combined and then partitioned into different data sets, which are analyzed separately. As in the Traditional method, a consensus tree is constructed from the trees found in the analysis of the separate data sets. Examples of data partitioning include molecular vs. morphological, coding vs. noncoding regions of the genome, larval vs. adult morphology, and cranial vs. postcranial anatomy.

An advantage of the consensus method is that detection of heterogeneous data sets can highlight conflicts caused by natural selection, hybridization, and different evolutionary rates. Another advantage is that characters in different data sets are more likely to be independent of each other than are characters within a data set. Additionally, some data sets have not been effectively combined by any other technique. There are several disadvantages to the consensus method, however: conflicts among data sets, difficulty in judging how to place characters into data sets, a loss of descriptive and explanatory power, difficulty in choosing an appropriate consensus method, and the limitations of smaller sample size.

The combined method avoids some problems inherent to the consensus method: it has greater descriptive and explanatory power, and it is relatively impartial because it does not require a potentially subjective scheme of data partitioning. There is also a philosophical justification for the combined method based on the idea of total evidence. The combined method, however, may obscure some patterns of congruent and discordant characters, and may give large data sets a disproportionate influence over the combined data set. In addition, Bull et al. (1993) showed that combining good and bad character sets (e.g., slowly vs. rapidly evolving genes, respectively) reduces the rate of recovering the correct phylogeny. Bull et al. also demonstrated that a combined data set's ability to recover the correct phylogeny decreases as characters are added that provide inconsistent phylogenies. Chippindale & Wiens (1994) found that the probability of estimating the correct phylogeny from a combined data set may be improved by weighting rapidly evolving characters less than slowly evolving ones. Chippindale & Wiens also surveyed studies in which trees were made from both separate and combined analyses of data sets, and found that about half of the trees from combined analyses resembled a tree made by at least one of the separate analyses. Cao et al. (1994) provided further support for the combined method in their study of proteins encoded by mtDNA, where they found separate analyses gave several different ML trees, while a combined analysis gave one correct tree. The correct tree produced by their combined method was found by searching for trees with the greatest fit combined over all data sets analyzed separately.
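
One way to read "greatest fit combined over all data sets" is to score every candidate tree on each data set separately and then add the scores. The Python sketch below assumes that reading and assumes the fit is a log-likelihood. The data set names, tree labels, and numbers are invented for illustration and are not values from Cao et al. (1994); the function name is hypothetical.

```python
def best_tree_by_total_fit(loglik):
    """Pick the tree with the highest log-likelihood summed over data sets.

    loglik maps a data set name to {tree label: log-likelihood}.
    Only trees scored on every data set are compared.
    """
    shared = set.intersection(*(set(scores) for scores in loglik.values()))
    totals = {tree: sum(scores[tree] for scores in loglik.values())
              for tree in shared}
    best = max(totals, key=totals.get)
    return best, totals

# Invented log-likelihoods for three candidate trees on three data sets.
loglik = {
    "protein_1": {"T1": -1000.0, "T2": -1002.0, "T3": -1010.0},
    "protein_2": {"T1": -803.0, "T2": -800.0, "T3": -801.0},
    "protein_3": {"T1": -1201.0, "T2": -1200.5, "T3": -1195.0},
}

# Best tree for each data set analyzed on its own.
separate_best = {name: max(scores, key=scores.get)
                 for name, scores in loglik.items()}

# Best tree when the fit is summed over all data sets.
combined_best, totals = best_tree_by_total_fit(loglik)
```

In this made-up case each data set prefers a different tree when analyzed alone, while the summed scores single out one tree, which mirrors the pattern described above.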

Conflicting trees are often produced by analysis of heterogeneous data sets. When sampling error is the source of conflict, a solution is to combine data sets. Occasionally, different stochastic processes, such as different rates of evolution in characters, can create conflict among trees. In this case, the previously mentioned method used by Cao et al. (1994) may resolve the conflict. Finally, lineage sorting and horizontal transfer can produce conflicting trees. If the resulting trees are reticulating, they can be analyzed using Continuous Track Analysis, Map Analysis, or Reticulate Evolution Detection. If disagreements are confined to a few taxa, the Taxon Excision Method (de Queiroz et al. 1995) can be used to excise taxa involved in the conflict, and if disagreements are spread over the whole tree, the Pairwise Outlier Excision Method can be used to eliminate data for pairs of outlying taxa (de Queiroz et al. 1995).
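
The idea behind excising conflicting taxa can be illustrated with a small Python sketch. It assumes trees are stored as sets of non-trivial clades (frozensets of taxon names) and it simply prunes a taxon from trees that have already been inferred; it does not re-run the analysis without that taxon, and it does not show how de Queiroz et al. (1995) identify which taxa to excise. The taxa, trees, and function name are hypothetical.

```python
def excise(tree, all_taxa, excluded):
    """Remove excluded taxa from every clade of a tree.

    Clades left with fewer than two taxa, or containing every remaining
    taxon, carry no grouping information and are dropped.
    """
    remaining = all_taxa - excluded
    pruned = set()
    for clade in tree:
        kept = clade - excluded
        if 2 <= len(kept) < len(remaining):
            pruned.add(frozenset(kept))
    return pruned

all_taxa = frozenset("ABCDH")

# Two invented trees that disagree only in the placement of taxon H.
tree_1 = {frozenset("AB"), frozenset("DH"), frozenset("CDH")}
tree_2 = {frozenset("AH"), frozenset("ABH"), frozenset("CD")}

conflict_before = tree_1 != tree_2

excluded = frozenset("H")
pruned_1 = excise(tree_1, all_taxa, excluded)
pruned_2 = excise(tree_2, all_taxa, excluded)
conflict_after = pruned_1 != pruned_2
```

Here the two trees conflict while H is present and agree once H is removed, which is the situation in which disagreement is confined to a few taxa.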

Bull JJ, Huelsenbeck JP, Cunningham CW, Swofford DL, Waddell PJ. 1993. Syst Biol 42:384-97
Cao Y, Adachi J, Janke A, Paabo S, Hasegawa M. 1994. J Mol Evol 39:519-27
Chippindale PT, Wiens JJ. 1994. Syst Biol 43:278-87
de Queiroz A, Donoghue MJ, Kim J. 1995. Annu Rev Ecol Syst 26:657-81


Based on de Queiroz et al. 1995. Separate versus Combined Analysis of Phylogenetic Evidence. Annu Rev Ecol Syst 26:657-81

Discussants: Corey Davis and Gavin Hanke

Rapporteur: Greg Dueck

A basic assumption in using the consensus method is that there are different classes of characters. Is this assumption justified? Molecular data are commonly analyzed independently from other data classes. Classes such as larval vs. adult life cycle stages may seem different at first, but problems arise when defining such classes. At a finer scale of analysis, the classes themselves may become irrelevant.

In the lecture, it was shown that combining quickly and slowly evolving molecular data in the same analysis creates inaccurate and inconsistent phylogenies. Are there parallels to this problem in non-molecular data? There are parallels (i.e., quickly and slowly evolving characters of any kind); however, such characters are probably not as easy to define in non-molecular data. When it is possible to make such a definition using morphological data, it probably cannot be done a priori as often as when using molecular data.

How many classes of data are worthwhile studying in a phylogenetic analysis? New classes of data should probably be added until some consensus is developed for any given node of a tree. However, trees based on different classes of data are not necessarily good. An example is the trees of echinoderms, which are consistent when based on adult or fossil characters, but misleading when based on the more convergent larval characters.

What are some characteristics of a "good" outgroup? Using distantly related outgroups introduces heterogeneity to a data set, so it is best to use as closely related an outgroup as possible. Also, one outgroup is not enough. Outgroups should be added until the ingroups stabilize; three outgroups is a good minimum. Additional outgroups should also be used to break up long branches between other outgroups.

"True" trees used in models may not be valuable or realistic in real phylogenetic problems. While viruses have been used to produce "known" trees, in general such trees are not knowable. A good data set gives true trees, so the real question is how to decide what is a good or a bad data set. Some characters, such as those that evolve quickly or those that often reverse, are known to provide bad data sets. The problem is complicated by very complex trees that attempt to resolve both deep and shallow branches, because slowly evolving characters will resolve deep roots but not shallow ones. Despite de Queiroz et al.'s (1995) suggestion that slowly evolving characters are always better, a balance of quickly and slowly evolving characters may be more appropriate in such studies. On the other hand, it may not be possible to accurately resolve deep and shallow roots in the same study. Maybe trees of deep and shallow roots should be resolved in separate studies and subsequently integrated.

In any type of combined analysis, character weighting is an important factor. A method used to determine the sensitivity of a data set to character weighting is to find which weighting schemes create different topologies. There is no reason to assume equal weighting is any more correct than differential weighting. It is often forgotten that weight is given to characters that are used while none is given to those that are not used.
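
The sensitivity check described above can be sketched with Fitch parsimony on four taxa, where only three unrooted topologies exist. Everything specific here is an assumption made for illustration: the 0/1 character matrix is invented (two characters treated as "slow" that support A+B, three treated as "fast" that support A+C), the weighting schemes are arbitrary, and the function names are hypothetical.

```python
def fitch(tree, states):
    """Fitch pass for one character on a binary tree given as nested tuples.

    Returns (possible ancestral states, number of steps).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch(left, states)
    right_set, right_steps = fitch(right, states)
    common = left_set & right_set
    if common:
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1

def weighted_length(tree, matrix, weights):
    """Sum of per-character Fitch steps, each multiplied by its weight."""
    total = 0
    for i, weight in enumerate(weights):
        states = {taxon: seq[i] for taxon, seq in matrix.items()}
        total += weight * fitch(tree, states)[1]
    return total

def best_topologies(trees, matrix, weights):
    """Names of the topologies with the minimum weighted length."""
    lengths = {name: weighted_length(tree, matrix, weights)
               for name, tree in trees.items()}
    shortest = min(lengths.values())
    return sorted(name for name, n in lengths.items() if n == shortest)

# The three possible topologies for four taxa.
trees = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}

# Invented matrix: characters 1-2 ("slow") support AB|CD, 3-5 ("fast") AC|BD.
matrix = {"A": "00000", "B": "00111", "C": "11000", "D": "11111"}

schemes = {
    "equal": [1, 1, 1, 1, 1],
    "slow_doubled": [2, 2, 1, 1, 1],
}

results = {name: best_topologies(trees, matrix, weights)
           for name, weights in schemes.items()}
```

With this matrix the preferred topology changes between the two schemes, so the data set would be judged sensitive to character weighting.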

Differential weighting of third codon positions seems to be based on a circular argument. How has this weighting been established? Studies on bacteria and on cell lines have found different rates of transitions and transversions among the three codon positions, with particularly high rates of change at the third codon position. Many studies compare DNA sequences among taxa to determine the number of transitions versus transversions over time. Such an argument seemingly requires a priori knowledge of the evolutionary time frame or the "true" evolutionary tree for the taxa of interest. This a priori knowledge is then used to decide how to weight molecular characters, which are then used to re-establish the same, or part of the same evolutionary tree.
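
Counting transitions and transversions at each codon position can be sketched as below. This is a minimal Python illustration, assuming two aligned, in-frame coding sequences whose first base is a first codon position. The sequences are invented, the function name is hypothetical, and the tally only counts observed pairwise differences; it makes no correction for multiple changes at the same site.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def ts_tv_by_codon_position(seq1, seq2):
    """Tally [transitions, transversions] at codon positions 1, 2 and 3.

    Sites that are identical, gapped, or ambiguous are skipped.
    """
    counts = {1: [0, 0], 2: [0, 0], 3: [0, 0]}
    for i, (a, b) in enumerate(zip(seq1.upper(), seq2.upper())):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        position = i % 3 + 1
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            counts[position][0] += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            counts[position][1] += 1  # purine<->pyrimidine
    return counts

# Invented sequences: codons ATG GCA AAA vs. ATG ACG AAC.
counts = ts_tv_by_codon_position("ATGGCAAAA", "ATGACGAAC")
```

For these two sequences the differences fall at one first position (a transition) and two third positions (one transition, one transversion).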

de Queiroz et al. (1995) suggest the removal of taxa that produce conflicting trees because of introgression, lineage sorting, etc. What if the phylogenies of the "conflicting" taxa are those you want to study? Basic assumptions about dichotomous trees are not met with hybridization. Occasionally, hybrid species are placed nowhere near their parent species on an evolutionary tree. Chromosomal data may be very useful in resolving problems of hybridization; however, they are quickly being replaced by other types of molecular data.

When analyzing any data set for evolutionary relationships, an important consideration is the robustness of the data set. It is instructive to test whether a data set can withstand several types of analyses.
