Alternative treeing algorithms

OLD Audio recording

Video recording (.mov format, 0.7Gbytes)
Video recording (480p .mp4 format, 0.1Gbytes)
Video recording (1280p .mp4 format, 0.6Gbytes

 

What's the point?

  1. To be able to describe the basic processes of different treeing algorithms.

Fitch-Margoliash - an alternative distance-matrix treeing method

Another useful method for generating trees from distance matrices is that of Fitch and Margoliash - commonly called Fitch. This algorithm starts with two of the sequences, separated by a line equal to the length of the evolutionary distance between them:

Ab tree

Then the next sequence is added to the tree such that the distances between A, B and C are approximately equal to the evolutionary distances:


Notice that the fit isn't perfect. If we could determine the evolutionary distances exactly, they would fit the tree exactly, but since we have to estimate these distances, the numbers are fit to the tree as closely as possible using averaging or least-squares best fit.

The next step is to add the next sequence, again re-adjusting the tree to fit the distances as well as possible:


And at last we can add the final sequence and readjust the branch lengths one last time using least-squares:

Neighbor-joining and Fitch are both least-squares distance-matrix methods, but a big difference is that neighbor-joining separates the determination of the tree structure from solving branch legnths, whereas Fitch solves them together, but does so by adding terminal branches (sequences) one at a time.

Parsimony

This is actually a collection of related methods based on the same premise - Occam's Razor. In other words, the tree that requires the smallest number of sequence changes to generate the sequences in the alignment is the most likely tree. No distance matrix is calculated, instead trees are searched and each ancestral sequence calculated, then the number of "mutations" required are added up. Testing every possible tree is not usually possible, so a variety of search algorithms are used to examine only the most likely trees. Likewise, there are a variety of ways of counting (scoring) sequence changes.

For example:

                                +-----GGGAUCCC
                   +--GGGACCCC--+
                   |            +-----GGGACCCC
       --GGGRCCCC--+
                   |            +-----GGGGCCCC
                   +--GGGGCCCC--+
                                +-----GGGGGCCC
       


Each of the terminal sequences are the ones you start out with - the aligorithm then tries to find the tree and predict the internal node sequences that could lead to these sequences with the least number of changes.

Parsimony methods are usually slower than distance-matrix methods (especially neighbor-joining, which is very fast) but much, much faster than the Maximum Likelihood methods described below. Parsimony uses more of the information in an alignment, since it doesn't redure the data to a distance matrix, but in my experience works best with relatively closely-related sequences, and is not usually used for rRNA sequences.

Maximum-likelihood

This method turns the tree-construction process on it's head, starting with a cluster analysis to generate a "guide" tree, from which a very complete substitution model is estimated. The algorithm then goes back and calculates the likihood of any particular tree by summing the probabilities of all of thepossible intermediates required to get to the observed sequences. Rather than try to calculate this for all possible trees (this number grows astronomically with the number os sequences), a heuristic search is used starting with the guide tree. Sound complicated? It is, and ML tree construction is by far the most computationally intensive of the methods in common use. But it's generally also the best, in the sense that the trees are more consistant and robust. The limitation is that fewer and shorter sequences can be analyzed by ML. A tree that might take a few seconds by neighbor-joining might take a few minutes by parsimony or Fitch, but a few hours or a couple of days, by ML. This is serious - it means that you can't usually "play" with trees, testing various changes in the data or treeing parameters & seeing the result right away.

Bayesian inference

Bayesian inference a relatively new approach to tree construction. This approach starts with a random tree structure, random branch lengths, and random substitution parameters for an alignment, and the probabilty of the tree being generated from the alignment with these parameters is scored. Obviously the initial score is likely to be very poor. Then a random change is made in this tree (branch order, branch length, or substitutionmodel parameter) and the result is re-scored. Then a choice is made whether to accept the change or not; this choice is partially random, but the greater the improvement in tree score, the more likely it is to be accepted (and visa versa). If the change is accepted, the process is repeated starting with this new tree; if the change is rejected, the process is repeated starting with the old tree. After many, many cycles of this process, the algorithm settles in to a collection of trees that are nearly optimal. Various tricks are used to keep the algorithm from getting stuck in local scoring minimum zones. Notice that this method solves both the tree AND the substitution model simultaneously.