Phylogenetic analysis - Creating phylogenetic trees

OLD Audio recording

Video recording (.mov format, 2.2Gbytes)
Video recording (480p .mp4 format, 0.3Gbytes)
Video recording (1280p .mp4 format, 1.8Gbytes


What's the point?

  1. To be able to generate a phylgenetic tree from a sequence alignment using the Neighbor-joining method
  2. to be able to generate a similarity matrix from an alignment
  3. To be able to generate a distance matrix from a similarity matrix
  4. To be able to generate the structure of a tree using the neighbor-joining method from a distance matrix
  5. to be able to determine branch lengths within a tree using the distance matrix
  6. What is a "root"?

There is a tree construction problem set that tests these skills.

The most common type of phylogenetic analysis is tree construction. A tree is nothing more than a graph representing the similarity relationships between the sequences in an alignment. This is why we’ll be going through this in such detail, to show that trees aren’t rocket surgery, they come from straightforward mathematical transformations of sequence data.

There are several methods for building trees. In this module, we'll cover the 'neighbor-joining' method in some detail as an example, because it is conceptually straightforward and commonly used. In the next Chapter, we’ll briefly cover some other approaches.

Tree construction : the neighbor-joining method

Tree construction starts with an alignment. Neighbor-joining is a distance matrix method, meaning that the alignment is first reduced into a table of evolutionary distances, a distance matrix. The distance matrix cannot be generated directly from the alignment, however; before this, the alignment is reduced to a table of observed similarity, the similarity matrix. The distance matrix is calculated from the similarity matrix, and then the tree is generated from the distance matrix.

Generating a similarity matrix

The similarity matrix is just a table of fractional similarities; for example, in this alignment of 6 sequences with 20 positions:


Just count the fraction of identical bases in every pair of sequences in the alignment.

               |X|||||||||X|||||||| -> 18/20 identical = 0.90 similar

The similarity values for all pairs of sequences are calculated in the same way and assembled into a table:

       Similarity matrix:

           A     B     C     D     E     F
       A  ----  ----  ----  ----  ----  ----
       B  0.90  ----  ----  ----  ----  ----
       C  0.70  0.70  ----  ----  ----  ----
       D  0.75  0.75  0.85  ----  ----  ----
       E  0.40  0.40  0.40  0.45  ----  ----
       F  0.45  0.45  0.45  0.50  0.75  ----

In this example, sequences A and B are 0.90 (= 90%) similar, A and C are 0.70 similar, B and C are 0.70 similar, &c, &c. Note that values on the diagonal (A:A, B:B, &c) don’t need to be calculated, they will always be 1. Likewise, there is no reason to calculate both above and below the diagonal - the value for A:B would be the same as B:A, &c.

Converting a similarity matrix into an evolutionary distance matrix

Next is the estimation of evolutionary distances from their sequence similarity. You might think that distance would just be 1- similarity, and you’d be right if it weren’t for the fact that the number of differences you count between any two sequences misses some of the changes that might have occurred between them. More than one evolutionary change at a single position (e.g. A -> G -> U) counts as only one difference between two sequences, and in the case of reversion counts as no change at all (e.g. A -> G -> A). As a result, the observed similarity between two sequences underestimates the evolutionary distance that separates them.

One common way to correct evolutionary distances for this underestimation is the Jukes & Cantor method. The Jukes and Cantor converts observed sequence similarity to estimated evolutionary distance using the following equation:

Evolutionary distance = -3/4 ln(1-4(1-similarity)/3)

Jukes & Cantor curve
The Jukes & Cantor equation graphed out
Cheezy graph by Jim Brown

You can see that similarity and distance are very closely related initially (e.g. 0.90 similarity ≈ 0.10 distance), but levels off to 0.25 similarity, where evolutionary distance is infinite. This makes sense; for two sequence that are very similar, the probably frequency of more than one change at a single site is low, requiring only a small correction, whereas two sequences that have changed beyond all recognition (infinite evolutionary distance) will still be be approximately 25% similar just because there are only 4 bases and so about 1 out of 4 will match entirely by chance.

To convert a similarity matrix to a distance matrix, just convert each value in the similarity matrix to evolutionary distance using either the graph (by hand) or the equation. In our example:

       Similarity matrix:                   ---> Distance matrix:

           A     B     C     D     E     F          A     B     C     D     E     F
       A  ----  ----  ----  ----  ----  ----      ----  ----  ----  ----  ----  ----
       B  0.90  ----  ----  ----  ----  ----      0.11  ----  ----  ----  ----  ----
       C  0.70  0.70  ----  ----  ----  ----      0.38  0.38  ----  ----  ----  ----
       D  0.75  0.75  0.85  ----  ----  ----      0.30  0.30  0.17  ----  ----  ----
       E  0.40  0.40  0.40  0.45  ----  ----      1.21  1.21  1.21  1.00  ----  ----
       F  0.45  0.45  0.45  0.50  0.75  ----      1.00  1.00  1.00  0.82  0.30  ----

Generating a tree from a distance matrix

In the neighbor-joining method, the structure of the tree is determined first, then the branch lengths are fit to this skeleton.

Solving the tree structure

The tree starts out with a single internal node at a branch out to each sequence; an n-pointed star, where n is the number of sequences in the alignment. The pair of sequences with the smallest evolutionary distance separating them are joined onto a single branch (i.e. the neighbors are joined, thus the name of the method), and then the process is repeated after merging these two sequences in the distance matrix by averaging their distances from each other sequence in the matrix.

Using our distance matrix, the tree starts out like this (remember that we're sorting out the structure of the tree, not the branch lengths, yet).

NJ tree 1
Neighbor-joining tree, step 1
Cheezy hand drawing by Jim Brown

The closest neighbors in the distance matrix are A and B (0.11 evolutionary distance), so these branches are joined:

NJ tree 2
Neighbor-joining tree, step 2
Cheezy hand drawing by Jim Brown

The distances to A and B from each of the other sequences are then averaged to reduce the distance matrix:

       Starting matrix:                           ---> Reduced matrix:
           A     B     C     D     E     F 
       A  ----  ----  ----  ----  ----  ----             A/B    C     D     E     F
       B  0.11  ----  ----  ----  ----  ----      A/B   ----  ----  ----  ----  ----
       C  0.38  0.38  ----  ----  ----  ----       C    0.38  ----  ----  ----  ----
       D  0.30  0.30  0.17  ----  ----  ----       D    0.30  0.17  ----  ----  ----
       E  1.21  1.21  1.21  1.00  ----  ----       E    1.21  1.21  1.00  ----  ----
       F  1.00  1.00  1.00  0.82  0.30  ----       F    1.00  1.00  0.82  0.30  ----

In this case, the averages are trivial - the average of 0.38 and 0.38 is, of course, 0.38, &c. This is not usually the case.

Then the process is repeated. The closest neighbors in the reduced matrix is D with C:

NJ step 3
Neighbor-joining tree, step 3
Cheezy hand drawing by Jim Brown

Once again the original distance matrix is reduced by averaging. But be sure to average from the original distance matrix, not the previously reduced matrix:

       Starting matrix:                            ---> Reduced matrix:
            A     B     C     D     E     F 
       A  ----  ----  ----  ----  ----  ----             A/B   C/D    E     F
       B  0.11  ----  ----  ----  ----  ----      A/B   ----  ----  ----  ----
       C  0.38  0.38  ----  ----  ----  ----      C/D   0.34  ----  ----  ----
       D  0.30  0.30  0.17  ----  ----  ----       E    1.21  1.11  ----  ----
       E  1.21  1.21  1.21  1.00  ----  ----       F    1.00  0.91  0.30  ----
       F  1.00  1.00  1.00  0.82  0.30  ----

Note that value listed for A/B with C/D (0.34) is the average between four values in the original matrix: A to C (0.38), A to D (0.30), B to C (0.38) and B to D (0.30).

Once again, the smallest number in the matrix represents the nearest neighbors, in this case E with F (0.30), so join these two branches:

NJ step 4
Neighbor-joining tree, step 4
Cheezy hand drawing by Jim Brown

Each node on this tree has three and only three branches connecting to it; all of the nodes are completely resolved. This means you’re finished determining the structure of the tree! If there were more sequences, you'd reduce the matrix (joining E with F), and repeat the process until all of the nodes were resolved.

Determining branch length

The next step is to determine the lengths of the branches on this tree. Basically, what is done is to go through each node & figure out where along the branch it is by figuring out the average difference in length along each of two branches. By choosing various sets of three sequences in a tree, you can sort out the branch lengths just like a puzzle.

In our example, the distance between A and B is 0.11, and so the length of the two branches connecting them must add up to 0.11. But where along this branch is the node? If you look at the distance from any other sequence (C, D, or F) to A and to B, the distance is always the same. This means that the node must be midway between A and B; each branch, then, has a length of 0.055. For example, with D used as reference:

NJ step 5
Neighbor-joining tree, step 5
Cheezy hand drawing by Jim Brown

C and D are also simple neighbors, so we can easily solve these two connecting branches as well. The distance between C and D is 0.17. However, the distance to either C or D from elsewhere in the tree differs; this means that the node connecting C and D is not equidistant between them. In fact, this difference in distance between C and D varies a bit depending on which reference you use: 0.08 from A, 0.08 from B, 0.21 from E, and 0.18 from F. These differences result from the fact that we can only estimate evolutionary distance. On average, however, it is 0.14 further to C than to D, which is the value we’ll use. This means that the two branches connecting C and D (which add up to a length of 0.17) must be approximately 0.155 and 0.015, respectively:

NJ step 6
Neighbor-joining tree, step 6
Cheezy hand drawing by Jim Brown

The distance between E and F can be calculated the same way. The distance between E and F is 0.30, and on average it is 0.20 further to E than to F from anywhere else in the tree (the average of 0.21, 0.21, 0.021, and 0.18). This means that the branch leading to E is 0.25 and the branch leading to F is 0.05:

NJ step 7
Neighbor-joining tree, step 7
Cheezy hand drawing by Jim Brown

The same process can be used to sort out the lengths of the internal branches as well, but it requires you to know that the branch lengths are additive and you already know the length of some of the segments. For example, to sort out the lengths of the branches connecting A/B with C/D, the average distance between A/B and C/D is 0.34 (look this up in your last reduced matrix). Subtract from this the average lengths of the branches leading to A and B (0.055) and to C and D (0.155 and 0.015 = 0.085) to get the lengths of the connecting branch, 0.20.

Nj step 8
Neighbor-joining tree, step 8/9
Cheezy hand drawing by Jim Brown

But where along this length of 0.20 is the node connecting to E/F? The average distance between E/F and A/B is 1.11, the average distance between E/F and D/C is 1.01, so the node connecting these is 0.1 closer to D/C than A/B. But C/D is already 0.03 larger on average than A/B, so the difference along the branch connecting A/B with C/D must be 0.13. This means that the branch to A/B must be 0.165 and the branch to C/D must be 0.035:

NJ step 9
Neighbor-joining tree, step 10
Cheezy hand drawing by Jim Brown

All that remains is to determine the length of the branch connecting E/F with A/B/C/D. The average distance between E and A, B, C, and D (1.21, 1.21, 1.21 and 1.00) is 1.158. The average distance between F and A, B, C, and D is 0.955, and so the average distance between E/F and A/B/C/D is 1.057. Subtract from this the average distance between E and F and the node connecting them (0.15) and the average distance between A, B, C, D, and the node connecting all of these (0.17), and the remainder is the length of the branch connecting E/F with A/B/C/D; 0.737.

NJ step 11
Neighbor-joining tree, step 11
Cheezy hand drawing by Jim Brown

Giving you a final tree that looks like:

fina ltree
Neighbor-joining final tree
Cheezy hand drawing by Jim Brown

… or like this, with the branch lengths drawn to scale:

NJ final tree, to scale
Neighbor-joining final tree, drawn to scale
Cheezy hand drawing by Jim Brown

Notice that in the final tree, the evolutionary distance between any two sequences is approximately equal to the sum of the length of the line segments joining those two sequences - in other words, the tree is additive. This should be true on any quantitative phylogenetic tree.

The neighbor-joining method is very fast, and so can be used on trees with much larger numbers of sequences than other methods, which we’ll discuss in the next module.

Rooting the tree with an outgroup

Whenever possible, include an outgroup sequence in the analysis; an outgroup is a sequence that is known to be outside of the group you're interested in treeing. Only by including an outgroup can you locate the root of a tree. For example, if you were building trees from mammalian sequences, you might include the sequence from a reptile as an outgroup. Outgroups provide the root to the rest of the tree - although no tree generated by these methods has a real root, if you know from other information that one (or a set) of the sequences is unrelated to the rest, wherever that branch connects to the rest of the tree defines the root (common ancestor) of that portion of the tree. In our example tree above, sequences A - D might be mammalian whereas E and F might be a reptilian sequence. If the tree included only mammalian sequences, it would be impossible to know where the root is, but the inclusion of an outgroup provides that information.

Molecular phylogenetic trees depict the relationships between gene sequences

The last "step" in the tree-building process (or perhaps the first step of interpretation) is the leap of faith that the tree depicting the relationships between the sequence similarities in an alignment also depicts the evolutionary history of the organisms the sequences came from. You can go wrong for many reasons at this step.

The common reason is the choice of sequence for use in the analysis - step 1 of the whole process. What would happen if, for example, you made a tree of mammals using globin genes, but use alpha globin sequences from some species and beta globin genes from others? What might happen if, in an analysis of plants, you used a nuclear ssu-rRNAs sequences from some plants, chloroplast ssu-rRNAs from others, and mitochondrial sequences from still others? What if, in an analysis of bacterial species, you used a gene like penicillin-resistance, that clearly moves readily from species to species? The trees you would build with these sequences might be perfectly valid trees, accurately representing the relationships between the genes used, but would not represent the relationships between the organisms!

So, if a tree just fundamentally doesn't look right, check the alignment first, but then think about the sequences used & how a tree of these sequences might not reflect the relationships between the 'host' organisms.