Bootstrapping


What's the point?

  1. To be able to describe the need addressed by bootstrap analysis
  2. To be able to describe the process of bootstrapping
  3. To be able to interpret bootstrap values on trees

Bootstrapping is a method for evaluating the reliability of a tree. Unfortunately, it is mathematically impossible to extract confidence intervals for nodes in a tree using the standard treeing methods; in other words, you cannot determine how sure you should be of the placement of each node in a tree (commonly referred to as branching order). This is not true of branch lengths, for which confidence intervals can be readily calculated. The standard method of evaluating the reliability of a tree's branching order is bootstrapping.

In a bootstrap analysis, the columns of a sequence alignment are randomly sampled, with replacement, to create a new, jumbled alignment of the same length. Typically 100 or 1000 such alignments are generated from the initial alignment, and a tree is generated from each. The reliability of a particular branching arrangement in a tree is judged by how frequently that branch appears among the resulting trees.

The random sampling starts with the input alignment. For a 'regular' tree, the similarity matrix is generated by comparing the sequences with each other pair-wise, tallying up similarity at each position in the alignment. Each position in the alignment is therefore counted once and only once. In a bootstrap sample, the similarity matrix is instead generated by comparing randomly selected positions in the alignment, so that some positions are counted more than once and some are not counted at all. For example, the starting alignment:

         position  1 2 3 4 5 6 7 8 9
       sequence 1  g g u u c g c c u
       sequence 2  g c u u u g g c u
       sequence 3  g c u u u - g c u
       

... would be randomly sampled, ending up with a bootstrapped alignment, from which a similarity matrix and tree would be generated:

         position  9 2 5 8 3 9 2 1 9
       sequence 1  u g c c u u g g u
       sequence 2  u c u c u u c g u
       sequence 3  u c u c u u c g u
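
The column sampling above can be sketched in a few lines of Python. This is only an illustration, not the code used by any particular treeing program: the function names (resample_columns, bootstrap_columns, similarity) are made up for this example, and scoring a gap as an ordinary mismatch is a simplifying assumption.

```python
import random

# The example alignment from the text, one string per sequence.
alignment = {
    "sequence 1": "gguucgccu",
    "sequence 2": "gcuuuggcu",
    "sequence 3": "gcuuu-gcu",
}

def resample_columns(aln, columns):
    """Build a new alignment from the given 0-based column indices."""
    return {name: "".join(seq[i] for i in columns) for name, seq in aln.items()}

def bootstrap_columns(length, rng):
    """Draw `length` column indices at random, with replacement."""
    return [rng.randrange(length) for _ in range(length)]

def similarity(a, b):
    """Fraction of aligned positions at which two sequences are identical.

    Assumption: a gap ('-') is treated like any other character, so a gap
    opposite a base counts as a mismatch.
    """
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)

def similarity_matrix(aln):
    """Pair-wise similarities, keyed by (name, name)."""
    names = list(aln)
    return {(p, q): similarity(aln[p], aln[q]) for p in names for q in names}

# Reproduce the sample shown above (positions 9 2 5 8 3 9 2 1 9, 1-based).
shown = [p - 1 for p in [9, 2, 5, 8, 3, 9, 2, 1, 9]]
replicate = resample_columns(alignment, shown)

# Draw a fresh random replicate; the seed only makes the run repeatable.
rng = random.Random(0)
random_replicate = resample_columns(alignment, bootstrap_columns(9, rng))

for name, seq in replicate.items():
    print(name, " ".join(seq))
```

Every sequence is sampled with the same list of column indices, so the columns stay intact; only which columns are used, and how often, changes from one replicate to the next.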

Notice that some positions in the alignment are included multiple times (9 and 2) and some are not included at all (4, 6, 7). With realistically large alignments, such randomly sampled alignments still yield good trees if the branching arrangements are well supported by the sequence data. So in a bootstrap analysis, a large number of trees are generated from random samples of the alignment, and the number of these trees that agree with each branch of the reference tree (generated the usual way) is shown on the tree. Often, the same type of analysis is performed using more than one method of tree construction.
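
The tallying step can be sketched as follows. This is a simplified illustration under stated assumptions: trees are written as nested Python tuples, a "branch" is identified by the set of sequences below it (a rooted clade, where real programs usually compare splits of unrooted trees), and the four replicate trees are invented for the example rather than computed from data. The names clades and bootstrap_support are hypothetical.

```python
from collections import Counter

def clades(tree):
    """Return the clades of a nested-tuple tree as frozensets of leaf names.

    The clade containing every leaf (the root) is left out, because it is
    present in every tree and carries no information.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    found.discard(leaves(tree))
    return found

def bootstrap_support(reference, replicates):
    """Percentage of replicate trees containing each clade of the reference tree."""
    counts = Counter()
    for tree in replicates:
        counts.update(clades(tree))
    total = len(replicates)
    return {clade: 100.0 * counts[clade] / total for clade in clades(reference)}

# Reference tree built the usual way (hypothetical example).
reference = ((("A", "B"), "C"), "D")

# Trees built from bootstrapped alignments (invented here for illustration;
# a real analysis would typically use 100 or 1000 of them).
replicates = [
    ((("A", "B"), "C"), "D"),
    ((("A", "B"), "D"), "C"),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]

support = bootstrap_support(reference, replicates)
for clade in sorted(support, key=sorted):
    print(",".join(sorted(clade)), f"{support[clade]:.0f}%")
# prints:
# A,B 75%
# A,B,C 50%
```

Only branches present in the reference tree are scored; groupings that appear in replicate trees but not in the reference tree (such as A with C here) are counted but never reported.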

The evaluation of bootstrap scores is subjective, but generally branches that show up in at least 50% of the trees generated from bootstrapped datasets are considered to be reliable.

Here is an example of how this is usually shown on a tree:

[Figure: bootstrapped tree]
A phylogenetic tree with bootstrap values shown on each branch,
from Barnes et al., 1994, PNAS 91:1609-1613