Tree Building Problem set key


Videos of tree constructions

Question 3

Question 4

Question 6 of the 2014 midterm exam #1

 


1. Construct a similarity matrix from this alignment:

   Seq A   G A U C U U U G G A U C
   Seq B   A A U C U C U G G A U U
   Seq C   C A U C U U U - G A U G
   Seq D   A A U C U U U G G A U U
   Seq E   C A U C U C U G A A U G

            Seq A  Seq B  Seq C  Seq D  Seq E
     Seq A    XXX    XXX    XXX    XXX    XXX
     Seq B   0.75    XXX    XXX    XXX    XXX
     Seq C   0.75   0.67    XXX    XXX    XXX
     Seq D   0.83   0.92   0.75    XXX    XXX
     Seq E   0.67   0.75   0.75   0.67    XXX
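
If you'd rather not count matches by eye, here is a minimal Python sketch of the same calculation. It assumes that a gap paired with a base counts as a mismatch, which is what reproduces the numbers in the matrix above; the names `alignment` and `similarity` are just illustrative.

```python
from itertools import combinations

# Alignment from question 1 ('-' marks a gap).
alignment = {
    "A": "GAUCUUUGGAUC",
    "B": "AAUCUCUGGAUU",
    "C": "CAUCUUU-GAUG",
    "D": "AAUCUUUGGAUU",
    "E": "CAUCUCUGAAUG",
}

def similarity(s1, s2):
    """Fraction of aligned columns in which the two sequences are identical.

    Assumption: a gap opposite a base is scored as a mismatch. (A column
    with a gap in both sequences would score as a match here; that case
    does not occur in this alignment.)
    """
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b for a, b in zip(s1, s2))
    return matches / len(s1)

# One entry per pair of sequences, rounded to two decimals like the matrix.
sim = {
    (x, y): round(similarity(alignment[x], alignment[y]), 2)
    for x, y in combinations(alignment, 2)
}

for pair, value in sim.items():
    print(pair, value)
```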

2. Given the Jukes & Cantor equation, or the conversion graph in the notes, convert the similarity matrix from question 1 to a distance matrix.



             Seq A  Seq B  Seq C  Seq D  Seq E
     Seq A    XXX    XXX    XXX    XXX    XXX
     Seq B   0.27    XXX    XXX    XXX    XXX
     Seq C   0.27   0.50    XXX    XXX    XXX
     Seq D   0.21   0.08   0.27    XXX    XXX
     Seq E   0.50   0.27   0.27   0.50    XXX
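
For reference, the standard Jukes & Cantor equation is d = -(3/4) ln(1 - (4/3)p), where p is the fractional dissimilarity (1 - similarity). Here is a small Python sketch of it; the function name is just illustrative. Values read off the conversion graph by eye can differ somewhat from what the equation returns, so don't expect the output to match the matrix above to the second decimal.

```python
import math

def jukes_cantor(similarity):
    """Jukes & Cantor distance from a fractional similarity (0 to 1).

    d = -(3/4) * ln(1 - (4/3) * p), with p = 1 - similarity.
    """
    p = 1.0 - similarity
    arg = 1.0 - (4.0 / 3.0) * p
    if arg <= 0:
        # The correction is undefined once p reaches 0.75.
        raise ValueError("similarity too low for the Jukes & Cantor correction")
    return -0.75 * math.log(arg)

# The four similarity values that occur in the question 1 matrix.
for s in (0.92, 0.83, 0.75, 0.67):
    print(s, round(jukes_cantor(s), 2))
```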

3. Given this distance matrix, draw a tree relating these sequences and label the lengths of the branches of the tree:

           Seq A   Seq B  Seq C  Seq D
   Seq A     -      -      -      -
   Seq B    0.2     -      -      -
   Seq C    0.6    0.6     -      -
   Seq D    0.9    0.9    0.9     -

This is an easy one - only one joining to do (A and B, the shortest distance in the matrix), so no math required for the tree structure, and the branch lengths are all perfect & even:

[Figure: tree for question 3]

4. Given this distance matrix, draw the structure of a tree relating the sequences using the neighbor-joining method, and label the lengths of the branches:

           Seq A   Seq B   Seq C   Seq D   Seq E   Seq F
   Seq A     -       -       -       -       -       -
   Seq B    0.2      -       -       -       -       -
   Seq C    0.7     0.7      -       -       -       -
   Seq D    0.6     0.6     0.5      -       -       -
   Seq E    0.7     0.7     0.8     0.7      -       -
   Seq F    0.6     0.6     0.7     0.6     0.3      -

A quick look at the matrix shows that the smallest distance is between A and B (0.2), so let's join these in the starting tree:



... then reduce the matrix. The averages are easy - the distances from C, D, E, or F to A and B are all the same.
       
             SeqA/B  Seq C   Seq D   Seq E   Seq F
     Seq A/B   -       -       -       -       -
     Seq C    0.7      -       -       -       -
     Seq D    0.6     0.5      -       -       -
     Seq E    0.7     0.8     0.7      -       -
     Seq F    0.6     0.7     0.6     0.3      -

The smallest distance in this reduced matrix is between E and F (0.3), and so let's join these branches:


... then reduce the matrix accordingly (remember always to go back to the original distance matrix to do the averages):

           SeqA/B  Seq C   Seq D   SeqE/F
   Seq A/B   -       -       -       -
   Seq C    0.7      -       -       -
   Seq D    0.6     0.5      -       -
   Seq E/F  0.65    0.75    0.65     -

The smallest distance in this matrix is C-D (0.5), so we'll join these:


And we're done with the structure of the tree.
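
The join-and-reduce steps above can be sketched in Python as follows. This mirrors the simplified procedure used in this key (join the smallest average distance, always averaging from the original matrix); it is not an implementation of the full neighbor-joining criterion found in tree-building software. The function names are illustrative.

```python
from itertools import combinations

# Original distance matrix from question 4.
original = {
    ("A", "B"): 0.2,
    ("A", "C"): 0.7, ("B", "C"): 0.7,
    ("A", "D"): 0.6, ("B", "D"): 0.6, ("C", "D"): 0.5,
    ("A", "E"): 0.7, ("B", "E"): 0.7, ("C", "E"): 0.8, ("D", "E"): 0.7,
    ("A", "F"): 0.6, ("B", "F"): 0.6, ("C", "F"): 0.7, ("D", "F"): 0.6,
    ("E", "F"): 0.3,
}

def dist(a, b):
    """Look up a distance regardless of the order of the two names."""
    return original[(a, b)] if (a, b) in original else original[(b, a)]

def cluster_distance(c1, c2):
    """Average of all ORIGINAL tip-to-tip distances between two clusters."""
    values = [dist(a, b) for a in c1 for b in c2]
    return sum(values) / len(values)

def join_order(tips):
    """Join the two closest clusters repeatedly until only three remain."""
    clusters = [(t,) for t in tips]
    joins = []
    while len(clusters) > 3:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        joins.append((c1, c2, round(cluster_distance(c1, c2), 3)))
        # Replace the two joined clusters with their union.
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return joins, clusters

joins, remaining = join_order("ABCDEF")
for c1, c2, d in joins:
    print("join", "/".join(c1), "with", "/".join(c2), "at", d)
```

Stopping at three clusters matches the hand procedure: once three groups are left, every internal node already has three branches.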

To solve the branch lengths, let's do them in the order we joined them, which is often the most straightforward approach.

Starting then with A and B, the total distance between them is 0.2 (look this up on the original distance matrix). Notice that from anywhere in the tree to A or to B is always the same: C to A = 0.7, C to B is also 0.7, D to either A or B is 0.6, E to either A or B is 0.7, and F to either A or B is 0.6. This means that the two branches leading to A and B from their common node are the same length, and so must be 0.2 divided by 2 = 0.1.

Moving to E and F, the total distance between them is 0.3. Notice that the distance from anywhere else in the tree is always 0.1 greater to E than to F (e.g. A to F is 0.6, A to E is 0.7). This means that the line connecting E and F is 0.1 longer on the E side, or 0.1 to F and 0.2 to E.

For C and D, the distance from anywhere else in the tree to C is always 0.1 greater than to D, so the 0.5 distance between them must be 0.1 more toward C than D, or 0.3 to C and 0.2 to D.
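
The tip-branch reasoning for the three pairs can be sketched in Python like this. The helper `split_pair` is an illustrative name: it takes the total distance between two joined tips and shifts it by the average amount by which the rest of the tree is farther from one tip than the other.

```python
# Original distance matrix from question 4.
original = {
    ("A", "B"): 0.2,
    ("A", "C"): 0.7, ("B", "C"): 0.7,
    ("A", "D"): 0.6, ("B", "D"): 0.6, ("C", "D"): 0.5,
    ("A", "E"): 0.7, ("B", "E"): 0.7, ("C", "E"): 0.8, ("D", "E"): 0.7,
    ("A", "F"): 0.6, ("B", "F"): 0.6, ("C", "F"): 0.7, ("D", "F"): 0.6,
    ("E", "F"): 0.3,
}

def dist(a, b):
    """Look up a distance regardless of the order of the two names."""
    return original.get((a, b), original.get((b, a)))

def split_pair(i, j, others):
    """Return (branch to i, branch to j) from the node joining i and j."""
    total = dist(i, j)
    # How much farther is each outside sequence from i than from j?
    diffs = [dist(k, i) - dist(k, j) for k in others]
    delta = sum(diffs) / len(diffs)
    return (total + delta) / 2, (total - delta) / 2

tips = "ABCDEF"
for i, j in (("A", "B"), ("E", "F"), ("C", "D")):
    others = [t for t in tips if t not in (i, j)]
    to_i, to_j = split_pair(i, j, others)
    print(f"{i}: {to_i:.2f}   {j}: {to_j:.2f}")
```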

This finishes the tip branches, but what about the internal branches? If you go from A to F, this distance is 0.6. But 0.1 of this is the distance from the E/F node to F, and 0.1 is the distance from the A/B node to A, so the total length of the remaining internal branches is 0.4. Likewise, the length of the internal branches between A/B and C/D is 0.3, and the internal branch length from C/D to E/F is 0.3.

So, to bring these lengths together, the internal branch leading to A/B must be 0.2, the one leading to C/D must be 0.1, and the one leading to E/F must be 0.2:

You're done!
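
If you want to check the internal branch lengths, the three path lengths between the A/B, C/D and E/F nodes (0.3, 0.4 and 0.3) give three equations in three unknowns. Here is a short Python sketch of the solution; `three_point` is an illustrative name.

```python
def three_point(d_12, d_13, d_23):
    """Lengths of three branches meeting at one central node.

    d_12, d_13 and d_23 are the path lengths between the far ends of
    branches 1, 2 and 3. Since d_12 = b1 + b2, d_13 = b1 + b3 and
    d_23 = b2 + b3, each branch follows by adding two paths and
    subtracting the third.
    """
    b1 = (d_12 + d_13 - d_23) / 2
    b2 = (d_12 + d_23 - d_13) / 2
    b3 = (d_13 + d_23 - d_12) / 2
    return b1, b2, b3

# Paths between the internal nodes in question 4:
# A/B to C/D = 0.3, A/B to E/F = 0.4, C/D to E/F = 0.3
to_ab, to_cd, to_ef = three_point(0.3, 0.4, 0.3)
print(round(to_ab, 2), round(to_cd, 2), round(to_ef, 2))
```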

5. Generate a tree from this alignment using the neighbor-joining method, with approximate branch lengths, from these X RNA sequences.

   Thiobacillus ferrooxidans  GAAUUCCCGGGAG-GGGCCAGGCGACCCCCGAAUUCCCGG
   Escherichia coli           GAAUUCCCGGAAGCAGACCAGACAGUCGCCGAAUUCCCGG
   Serratia marcescens        GAAUUCCCGGAAGUAGACCAGACAGUCACCGAAUUCCCGG
   Chromatium vinosum         GAAUUCCCGGGAG-GGGCCAGACAGUCCCUGAAUUCCCG-

From this alignment, the following similarity matrix can be calculated:

                  T.ferr   E.coli   S.marc   C.vin
T.ferrooxidans      --       --       --       --
E.coli            0.775      --       --       --
S.marcescens      0.775    0.950      --       --
C.vinosum         0.875    0.850    0.850      --

The Jukes & Cantor conversion results in the following distance matrix (notice that at these high levels of similarity, the estimated distances plotted by eye are very close or identical to the 'dissimilarities' of the sequences):

                  T.ferr   E.coli   S.marc   C.vin
T.ferrooxidans      --       --       --       --
E.coli            0.225      --       --       --
S.marcescens      0.225    0.050      --       --
C.vinosum         0.125    0.150    0.150      --

If you use the equation you'll get slightly different numbers, but if you're willing to round these off as you go, sorting out the branch lengths is still easy.

Sorting out the structure of this tree using neighbor-joining is trivial - with only 4 sequences, there's only one joining to do, so no need to condense the matrix even once. The smallest distance is between E. coli and S. marcescens, so they are joined on a branch.

The next step is to sift out the distances. Notice that the distance from T. ferrooxidans to E. coli is the same as the distance from T. ferrooxidans to S. marcescens. The distances from C. vinosum to E. coli and to S. marcescens are also identical. This means that the branches from the node to E. coli and to S. marcescens are the same length. Since the total length of these two branches is 0.050, each branch has a length of 0.025.

The distance from either E. coli or S. marcescens to T. ferrooxidans is 0.225. The distance from E. coli or S. marcescens to C. vinosum is 0.150. This means the branch from the shared node to T. ferrooxidans is 0.075 longer (0.225-0.150=0.075) than the branch to C. vinosum. Because the distance from T. ferrooxidans to C. vinosum is 0.125, these branches must be 0.100 and 0.025, respectively.

The last remaining branch to figure out, then, is the branch connecting the internal nodes. The distance from E. coli to T. ferrooxidans is 0.225. Subtract from this the two other branch lengths along this path (the branch from the node to E. coli (= 0.025) and the branch from the other node to T. ferrooxidans (= 0.100)), and the remainder is the length of this internal branch: 0.100. You get the same answer if you do this for the path from E. coli to C. vinosum, S. marcescens to T. ferrooxidans, or S. marcescens to C. vinosum.
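
Here is a short Python sketch of that last check: it takes the four tip branch lengths worked out above and computes the internal branch along each of the four paths that cross it. The abbreviations Ec, Sm, Tf and Cv are just illustrative labels for the four species.

```python
# Tip branch lengths worked out above.
tip = {"Ec": 0.025, "Sm": 0.025, "Tf": 0.100, "Cv": 0.025}

# Distances for the four paths that cross the internal branch.
cross = {
    ("Ec", "Tf"): 0.225,
    ("Ec", "Cv"): 0.150,
    ("Sm", "Tf"): 0.225,
    ("Sm", "Cv"): 0.150,
}

# Internal branch = path length minus the two tip branches on that path.
internal = {
    pair: round(d - tip[pair[0]] - tip[pair[1]], 3)
    for pair, d in cross.items()
}

for pair, length in internal.items():
    print(pair, length)
```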

[Figure: neighbor-joining tree for question 5]


Bonus tree-construction example

This is a really hard one!

Let's do another neighbor-joining tree construction, starting with this alignment:

  Seq A  G C C A U G C C G A C G A U U G G U C C
  Seq B  G C C A U G C C A A C G A U U A G U C C
  Seq C  G C C A U G C U A G C G G U U G G U U C
  Seq D  G C C A U G C U A A C G A C U G G U U C
  Seq E  G G C A C G C U A A U G A U U G G U C C
  Seq F  G C G G C G C U A A C G A U U G A U C C

This is the similarity matrix calculated from this alignment:

          A     B     C     D     E     F
     A   --    --    --    --    --    --
     B  0.90   --    --    --    --    --
     C  0.75  0.75   --    --    --    --
     D  0.80  0.80  0.85   --    --    --
     E  0.75  0.75  0.70  0.75   --    --
     F  0.70  0.70  0.65  0.70  0.75   --

... and if you feed each of these numbers into the Jukes and Cantor equation, you get this distance matrix:

          A     B     C     D     E     F
     A   --    --    --    --    --    --
     B  0.11   --    --    --    --    --
     C  0.30  0.30   --    --    --    --
     D  0.23  0.23  0.17   --    --    --
     E  0.30  0.30  0.38  0.30   --    --
     F  0.38  0.38  0.47  0.38  0.30   -- 

The smallest distance is 0.11, for A-B, so let's join these neighbors:

... and then reduce the distance matrix by averaging all distances to A and B:

         A/B     C     D     E     F
    A/B  ---    --    --    --    --
     C  0.30    --    --    --    --
     D  0.23   0.17   --    --    --
     E  0.30   0.38  0.30   --    --
     F  0.38   0.47  0.38  0.30   -- 

This is easy, of course - the average of 0.30 and 0.30 is 0.30, the average between 0.23 and 0.23 is 0.23, &c, &c.

The next step is to find the smallest number in this reduced matrix: 0.17 for C-D. So let's join C and D as neighbors and re-reduce the matrix:

          A/B    C/D    E     F
    A/B   ---    --    --    --
    C/D 0.265    --    --    --
     E   0.30   0.34   --    --
     F   0.38  0.425  0.30   --

Remember that to get the average numbers, you need to average the numbers from the original distance matrix, not any of the reduced matrices. So to get the AB/CD number, you average A/C (0.30), A/D (0.23), B/C (0.30), and B/D (0.23) to get 0.265. If you average AB/C (0.30) and AB/D (0.23) from the reduced matrix you get the same number in this case, but usually this is a more complex average and the numbers won't come out the same.

The smallest number in this reduced distance matrix is 0.265 for A/B to C/D, so join these two branches on the tree:

All of the internal nodes are resolved (they all have 3 branches each), so you're finished as far as the tree structure is concerned. I've labeled the internal nodes w, x, y and z for future reference: w joins A and B, x joins C and D, z joins E and F, and y is the central node that connects w, x and z. You'll need a couple of numbers from the further reduced matrix for the branch length calculations, so let's go ahead and do it now:

          AB/CD    E      F
    AB/CD   ---    ---    ---
      E     0.32   ---    ---
      F    0.403   0.30   ---

Remember that AB/CD to E is the average of A/E, B/E, C/E and D/E from the original distance matrix, and likewise for AB/CD to F.
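
The averages in the reduced matrices can be reproduced with a few lines of Python. This is a sketch that always averages from the original distance matrix, as described above; `cluster_distance` is an illustrative name, and a cluster is written as a string of sequence letters.

```python
# Original Jukes & Cantor distance matrix for the bonus example.
original = {
    ("A", "B"): 0.11,
    ("A", "C"): 0.30, ("B", "C"): 0.30,
    ("A", "D"): 0.23, ("B", "D"): 0.23, ("C", "D"): 0.17,
    ("A", "E"): 0.30, ("B", "E"): 0.30, ("C", "E"): 0.38, ("D", "E"): 0.30,
    ("A", "F"): 0.38, ("B", "F"): 0.38, ("C", "F"): 0.47, ("D", "F"): 0.38,
    ("E", "F"): 0.30,
}

def dist(a, b):
    """Look up a distance regardless of the order of the two names."""
    return original.get((a, b), original.get((b, a)))

def cluster_distance(c1, c2):
    """Average every original distance between a member of c1 and one of c2."""
    values = [dist(a, b) for a in c1 for b in c2]
    return sum(values) / len(values)

print("AB/CD   :", cluster_distance("AB", "CD"))
print("AB/CD-E :", cluster_distance("ABCD", "E"))
print("AB/CD-F :", cluster_distance("ABCD", "F"))
```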

OK, now let's calculate the branch lengths. One way to sort out the lengths easily is to do them in the same order that the branches were sorted out.

Branches w/A and w/B (w is the common node between A and B):

We know the total length A/B (think of this as A/w/B) = 0.11 from the original distance matrix. But how is this 0.11 divided between line segments w/A and w/B? Notice that the distance from any other point in the tree (C, D, E or F) is the same to A or B; for example, from C it is 0.30 to A and 0.30 to B. In other words, the difference between C/A and C/B is zero (C/A - C/B = 0). This means node w must be equidistant between A and B, so w/A or w/B must be A/B divided by 2, or 0.055:

Branches x/C and x/D:

We know the total length of C/D is 0.17 from the original distance matrix. Solving the lengths of the branches from the common node x to C and D isn't as easy as it was for A and B, because the differences in distances between any other part of the tree (A, B, E or F) to C or D aren't exactly the same. So we'll have to take averages:

   A/C = 0.30
   A/D = 0.23    so A/C - A/D = 0.30 - 0.23 = 0.07, which is the difference in
                                                     branch length between x/C and x/D

   B/C = 0.30
   B/D = 0.23        B/C - B/D = 0.30 - 0.23 = 0.07

   E/C = 0.38
   E/D = 0.30        E/C - E/D = 0.08

   F/C - F/D = 0.47 - 0.38 = 0.09

   Average (0.07, 0.07, 0.08, 0.09) = 0.0775, or about 0.08 - this is the average amount by which x/C is longer than x/D

Because the total length of C/D (think of this as C/x/D) is 0.170, and x/C is 0.08 longer than x/D, then x/D is (0.170-0.08)/2 = 0.045. x/C is this same 0.045 plus 0.08 = 0.125.

Branches z/E and z/F:

These two branches are solved the same way as the A/B and C/D branches, so let's do them as well. First, find the difference in length between z/E and z/F:

   A/E - A/F = 0.30 - 0.38 = -0.08
   B/E - B/F = 0.30 - 0.38 = -0.08
   C/E - C/F = 0.38 - 0.47 = -0.09
   D/E - D/F = 0.30 - 0.38 = -0.08
   -------------------------------
                   Average = -0.08

The length of E/z/F (E/F) is 0.30, and the average difference between anywhere else in the tree and E or F is -0.08. The negative sign means the branch to F is longer than the branch to E. So z/E = (0.30-0.08)/2 = 0.11, and z/F = 0.11+0.08 = 0.19.
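
The two averaged splits (x/C and x/D, then z/E and z/F) can be sketched in Python as follows. The helper name `split_pair` is illustrative. The sketch does not round the average difference before halving, so its results differ from the hand-rounded numbers above in the third decimal place.

```python
# Original Jukes & Cantor distance matrix for the bonus example.
original = {
    ("A", "B"): 0.11,
    ("A", "C"): 0.30, ("B", "C"): 0.30,
    ("A", "D"): 0.23, ("B", "D"): 0.23, ("C", "D"): 0.17,
    ("A", "E"): 0.30, ("B", "E"): 0.30, ("C", "E"): 0.38, ("D", "E"): 0.30,
    ("A", "F"): 0.38, ("B", "F"): 0.38, ("C", "F"): 0.47, ("D", "F"): 0.38,
    ("E", "F"): 0.30,
}

def dist(a, b):
    """Look up a distance regardless of the order of the two names."""
    return original.get((a, b), original.get((b, a)))

def split_pair(i, j, others):
    """Return (branch to i, branch to j, average difference).

    The average difference is how much farther the other sequences are
    from i than from j; a negative value means j has the longer branch.
    """
    diffs = [dist(k, i) - dist(k, j) for k in others]
    delta = sum(diffs) / len(diffs)
    total = dist(i, j)
    return (total + delta) / 2, (total - delta) / 2, delta

x_c, x_d, delta_cd = split_pair("C", "D", list("ABEF"))
z_e, z_f, delta_ef = split_pair("E", "F", list("ABCD"))

print(f"x/C = {x_c:.3f}, x/D = {x_d:.3f}, average difference = {delta_cd:.4f}")
print(f"z/E = {z_e:.3f}, z/F = {z_f:.3f}, average difference = {delta_ef:.4f}")
```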

Internal branches w/y, x/y, and z/y:

To solve the lengths of the internal branches, the same process is applied to groups of branches, and then the average lengths outside the internal nodes are subtracted. To determine the lengths of w/y, x/y, and z/y, the first step is to determine the total lengths of w/x, x/z and z/w, then figure out where along these to place y.

To solve for the length of w/x, the average distance between AB and CD is looked up in the reduced matrix (0.265), then the average w/A and w/B distance (AB/2 =0.11/2 = 0.055) and the average x/C and x/D distance (CD/2 =0.17/2 = 0.085) are subtracted to leave the w/x distance: 0.265 - 0.055 - 0.085 = 0.125.

To solve the length of x/z, the average CD/EF distance is calculated (we didn't do this in the reduced matrices, but it's the same process; it's 0.38) and then x/CD (CD/2 = 0.085) and z/EF (EF/2 = 0.30/2 = 0.150) are subtracted to give x/z = 0.145. Likewise, z/w is EF/AB (0.34) minus w/AB (0.055) and z/EF (0.15) = 0.135.

So: w/x = 0.125, x/z = 0.145 and z/w = 0.135. To determine where on any of these line segments node y is, pick any of the line segments, and calculate what the difference is to each end of this segment from the other branch. For example, for branch w/x:

   z/w - z/x = 0.135 - 0.145 = -0.010 (z/w is shorter than z/x by 0.010) 

   y/w = (w/x - 0.010)/2 = (0.125-0.010)/2 = 0.058

   y/x = y/w + 0.010 = 0.068

Now we know all of the line segment lengths but one, y/z. We can get this by subtracting the known length of w/y from w/z, or the known length of x/y from x/z; either way you get the same answer:

   y/z = w/z - w/y = 0.135 - 0.058 = 0.077

   y/z = x/z - y/x = 0.145 - 0.068 = 0.077
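
As a sanity check, you can add up the branch lengths along the path between every pair of sequences and compare the totals to the original distance matrix. Here is a Python sketch of that check; it assumes the topology described in the text (w joins A and B, x joins C and D, z joins E and F, and y connects w, x and z) and uses the rounded branch lengths worked out above.

```python
# Branch lengths worked out above (w, x, y, z are the internal nodes).
edges = {
    ("w", "A"): 0.055, ("w", "B"): 0.055,
    ("x", "C"): 0.125, ("x", "D"): 0.045,
    ("z", "E"): 0.110, ("z", "F"): 0.190,
    ("y", "w"): 0.058, ("y", "x"): 0.068, ("y", "z"): 0.077,
}

# Original Jukes & Cantor distance matrix for the bonus example.
original = {
    ("A", "B"): 0.11,
    ("A", "C"): 0.30, ("B", "C"): 0.30,
    ("A", "D"): 0.23, ("B", "D"): 0.23, ("C", "D"): 0.17,
    ("A", "E"): 0.30, ("B", "E"): 0.30, ("C", "E"): 0.38, ("D", "E"): 0.30,
    ("A", "F"): 0.38, ("B", "F"): 0.38, ("C", "F"): 0.47, ("D", "F"): 0.38,
    ("E", "F"): 0.30,
}

# Build an undirected adjacency map from the edge list.
graph = {}
for (u, v), length in edges.items():
    graph.setdefault(u, {})[v] = length
    graph.setdefault(v, {})[u] = length

def path_length(start, goal, parent=None):
    """Sum branch lengths along the single path between two nodes of a tree."""
    if start == goal:
        return 0.0
    for nxt, length in graph[start].items():
        if nxt == parent:
            continue  # don't walk back the way we came
        rest = path_length(nxt, goal, start)
        if rest is not None:
            return length + rest
    return None  # dead end: goal is not down this branch

errors = {pair: abs(path_length(*pair) - d) for pair, d in original.items()}
for pair, err in sorted(errors.items(), key=lambda item: -item[1]):
    print(pair, round(err, 3))
```

The tree distances won't match the matrix exactly, because the averaging steps smooth over small inconsistencies in the data, but every pair should come out close.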

So, our final tree looks like this:

... or, like this with the branches drawn to scale: