Alternative substitution models



What's the point?

  1. To define a substitution model, and why it is important
  2. To describe some alternative substitution models
  3. To describe some of the issues introduced by gaps
  4. To describe how G+C bias can be countered in trees
  5. To define long-branch attraction and how it can occur

In the previous modules we talked about the Jukes & Cantor method for estimating evolutionary distance from sequence similarity. This is a simple method, but there are several other, more sophisticated methods.

In the Jukes & Cantor method, any change is scored equivalently - for each position in a pairwise comparison, the bases are either a match or not. A commonly-used alternative is the Kimura 2-parameter model, in which transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or pyrimidine to purine) are scored differently because transitions are much more common than transversions. These scores are based on pre-sifting the alignment to determine the relative frequency of transitions and transversions, and the two types of change are scored accordingly.

Kimura 2-parameter model
Cheezy hand-drawing by Jim Brown

For example, if you see in your alignment that transitions occur twice as frequently as transversions, then transversions might be scored as a full mismatch (-1), whereas transitions would count as only half a mismatch (-0.5).
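The counting step behind this can be sketched in a few lines of Python. This is only an illustration, not the code used by any particular treeing package: the function names are made up for this example, and the distance formula is the standard published closed form for the Kimura 2-parameter model, which is not spelled out in the text above.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "U"}
BASES = PURINES | PYRIMIDINES


def count_changes(seq1, seq2):
    """Return (compared, transitions, transversions) for two aligned sequences.

    Hypothetical helper for illustration. Positions with a gap or any symbol
    other than A, G, C, U are skipped. T is treated as U.
    """
    s1 = seq1.upper().replace("T", "U")
    s2 = seq2.upper().replace("T", "U")
    compared = transitions = transversions = 0
    for a, b in zip(s1, s2):
        if a not in BASES or b not in BASES:
            continue  # gap or ambiguous symbol: neither match nor mismatch
        compared += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1  # purine<->pyrimidine
    return compared, transitions, transversions


def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance: -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)).

    P and Q are the observed proportions of transitions and transversions.
    """
    n, ts, tv = count_changes(seq1, seq2)
    if n == 0:
        raise ValueError("no comparable positions")
    p = ts / n
    q = tv / n
    inner = (1 - 2 * p - q) * math.sqrt(1 - 2 * q) if 1 - 2 * q > 0 else 0.0
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -0.5 * math.log(inner)


if __name__ == "__main__":
    a = "AGCUAGCUAG"
    b = "GGCUAGCUAC"  # one transition (A->G) and one transversion (G->C)
    print(count_changes(a, b))
    print(k2p_distance(a, b))
```

Note that the corrected distance comes out larger than the raw fraction of mismatches, because the model allows for changes that have been hidden by later changes at the same position.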

You can even have a 6-parameter model, in which each type of substitution (G-A, G-C, G-U, A-U, A-C, U-C) is scored differently.

6-parameter model - another cheezy hand drawing by Jim Brown

It is also possible to "weight" the score of each position (column) in an alignment differently based on how conserved that position is; a change in a conserved position is then scored as a greater change than one in a more variable position. This requires alignments with lots of sequences, so that variability at each position can be measured reliably, and so very often these weights are pre-determined for the class of RNA being analysed. The "Weighbor" algorithm used by the RDP does this; the name stands for "weighted neighbor-joining". Distance matrices from protein alignments usually use a scoring table derived from the observed relative frequencies with which any amino acid is converted to another in a huge collection of aligned protein sequences, e.g. the PAM tables.

There are also different ways gaps can be used. In most treeing algorithms, gaps are ignored - these positions are counted as neither a match nor a mismatch. This isn't because they aren't important; in fact, because insertions and deletions are less common than nucleotide substitutions, they are potentially more informative. But it's not clear how to deal with gaps, for a variety of reasons. The obvious case is where the alignment contains sequence fragments instead of full-length sequences. Partial sequences have two kinds of gaps: those for bases that aren't present in that sequence, and those for bases that lie outside of the region for which the sequence is available. The algorithm can't distinguish between these.
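As a rough illustration of what "ignoring gaps" means in practice, here is a minimal Python sketch. The function name and the use of "-" as the gap character are assumptions made for this example.

```python
def p_distance_ignoring_gaps(seq1, seq2, gap="-"):
    """Fraction of mismatches over positions where neither sequence has a gap.

    Hypothetical helper for illustration; assumes the sequences are already
    aligned and that `gap` is the gap character.
    """
    compared = mismatches = 0
    for a, b in zip(seq1, seq2):
        if a == gap or b == gap:
            continue  # counted as neither a match nor a mismatch
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return mismatches / compared


if __name__ == "__main__":
    full = "AGCUAGCUAG"
    frag = "---UAGGU--"  # a fragment: the end gaps are missing data, not deletions
    print(p_distance_ignoring_gaps(full, frag))
```

Nothing in the fragment tells the function that the gaps at the ends are missing data rather than real deletions, which is exactly the problem described above. Skipping every gapped position sidesteps the question, at the cost of throwing away whatever information the real insertions and deletions carry.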

An alignment with partial sequences
From Jim Brown, and the RNase P Database (SeaView is the alignment editor)

It is also difficult to deal with the fact that adjacent gaps aren't independent - they represent the insertion or deletion of more than one base at the same time, not the one-at-a-time insertion/deletion of bases. Sophisticated algorithms will use a large scoring penalty for a gap, but then only a very small additional penalty for each additional adjacent gap. In addition, it is not clear how to deal with variation in where the 5' and 3' ends of the RNA are; for example, some RNase P RNAs have the rho-independent terminator stem/loop at the end of the RNA removed and some do not (and in at least some organisms, the RNA exists in both versions). What should be done with all the gaps in aligned RNAs in which this structural element is removed?
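The "large penalty for a gap, small penalty for each additional adjacent gap" idea is usually called an affine gap penalty. A minimal sketch in Python follows; the penalty values (5.0 to open a gap, 0.5 to extend it) and the function name are arbitrary choices for illustration, not values taken from any particular program.

```python
import re


def affine_gap_penalty(aligned_seq, gap_open=5.0, gap_extend=0.5, gap="-"):
    """Total gap penalty for one aligned sequence under an affine scheme.

    Each run of adjacent gaps costs `gap_open` for its first position plus
    `gap_extend` for every further position. Hypothetical helper with
    illustrative default values.
    """
    total = 0.0
    for run in re.finditer(re.escape(gap) + "+", aligned_seq):
        length = run.end() - run.start()
        total += gap_open + gap_extend * (length - 1)
    return total


if __name__ == "__main__":
    # One 3-base gap and one 1-base gap.
    print(affine_gap_penalty("AG---CU-A"))
    # Treating every gap position as independent is much harsher:
    print(5.0 * "AG---CU-A".count("-"))
```

With these values a three-base gap costs only slightly more than a one-base gap, reflecting the idea that it most likely arose from a single insertion or deletion event.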

These and any other method for estimating evolutionary distance amount to an attempt to describe how sequences change. In other words, they are mathematical models of the process of evolution of these sequences, and are therefore usually called "substitution models". Choice of an appropriate substitution model is critical, and often underappreciated.

Special case - G+C bias

Sometimes, even in rRNA sequences, the sequences change adaptively. The biggest example is the tendency of sequences to differ in G+C vs A+U content, either because the genome has an unusual G+C content (i.e. there is pressure toward either GC or AT richness in the DNA) or because the organism is a thermophile and so prefers G=C basepairs over A=U in its RNAs. This can cause havoc in a tree. A way around this is to do a transversion analysis, which is just a conversion of the sequences in the alignment so that A=G and C=U (commonly all G's are converted to A's and all C's to U's). Trees are generated from these alignments in the usual fashion. These trees are, of course, based on less data - you've thrown out more than half of the phylogenetic information in the alignment - but should be free of G+C bias artifacts.
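The recoding itself is trivial. Here is a minimal Python sketch, assuming sequences are plain strings; the function name is made up for this example, and treating T as U is an added convenience.

```python
# Map every purine to A and every pyrimidine to U (T is treated as U).
_RECODE = str.maketrans({"G": "A", "C": "U", "T": "U"})


def transversion_recode(seq):
    """Collapse a sequence to purine/pyrimidine only: G->A, C->U.

    Hypothetical helper for illustration. Gaps and other symbols are left
    unchanged.
    """
    return seq.upper().translate(_RECODE)


if __name__ == "__main__":
    print(transversion_recode("GGCAUC-A"))
    # Two sequences that differ only by transitions become identical:
    print(transversion_recode("AGCU") == transversion_recode("GACU"))
```

After recoding, the only differences left between sequences are transversions, so any treeing method run on the recoded alignment sees no signal from G-for-A or C-for-U swaps.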

Long-branch attraction

One of the things substitution models fight is a treeing artifact called long-branch attraction, which affects treeing algorithms generally and is the result (primarily) of an underestimation of the evolutionary distance between distantly-related sequences. This underestimation results in a tendency for the longest branches in a tree to cluster together artifactually (and so also in the artificial clustering of short branches). Here is a very simple example of how long-branch attraction can happen:

Long-branch attraction
How long-branch attraction occurs in neighbor-joining trees
Cheezy hand drawing by Jim Brown

If the sub-tree to the left is the true representation of how these sequences are related, imagine what would happen in a neighbor-joining analysis. Sequences A and B are more alike (have a smaller evolutionary distance) than either is to C, and so they will be erroneously joined as in the tree to the right. This happens because of the difference in evolutionary rates in the branches.
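To see why long distances get underestimated in the first place, it helps to look at how the observed fraction of differing positions behaves as the true distance grows. The Python sketch below assumes the Jukes & Cantor model from the previous modules and uses its standard formulas; the function names are made up for this example.

```python
import math


def expected_observed_difference(true_distance):
    """Expected fraction of differing positions under Jukes & Cantor.

    `true_distance` is in substitutions per position. Hypothetical helper
    for illustration.
    """
    return 0.75 * (1.0 - math.exp(-4.0 * true_distance / 3.0))


def jc_distance(observed_difference):
    """Jukes & Cantor correction: estimated distance from observed difference."""
    if observed_difference >= 0.75:
        raise ValueError("observed difference is saturated (>= 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * observed_difference / 3.0)


if __name__ == "__main__":
    for d in (0.05, 0.2, 0.5, 1.0, 2.0):
        p = expected_observed_difference(d)
        # The observed difference falls further behind as d grows.
        print(d, round(p, 3), round(p / d, 3))
```

The observed difference tracks short distances closely but flattens out toward 0.75 for long ones, so any method that under-corrects will make long branches look closer together than they really are. A substitution model that fits the data well shrinks this gap; one that fits poorly leaves it open.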