Deciding on a sequence for analysis - Molecular Clocks


What's the point?

  1. What are the features of a useful molecular clock?
  2. What are some of the attributes that result in clock-like behavior?
  3. What are some of the features of a molecule that make it a poor clock?
  4. Why is ssu-rRNA such a good molecular clock?

The sequences of genes, RNAs, or proteins contain two very different kinds of information: structural/functional information, and historical information. Think of it this way: any particular amino acid in a specific protein is what it is in part because it facilitates the formation of the correct structure and function of the protein. This is structural/functional information. But usually there are a number of alternatives that might function just as well. The reason it is what it is, and not any of these alternatives, is that it was inherited from a successful ancestor. This is historical information. Comparisons amongst an aligned collection of homologous sequences can be used to sort out both the structure of the functional molecule (especially for RNAs) and the historical relationships among the sequences, i.e. a phylogenetic tree.

Phylogenetic trees are usually generated using alignments of single genes, RNAs or proteins, but no such sequence is either ideal or universally useful for the generation of informative phylogenetic trees. That said, some sequences do carry more phylogenetic information than others; these sequences can be called “molecular clocks”.

Features required of a good molecular clock

Clock-like behavior

The sequences of genes, RNAs and proteins change over time. If this change is entirely random (within the constraint of the structure and function of the molecule - i.e. by genetic drift), then the amount of divergence between the versions of a particular sequence in two organisms should be a measure of how long ago these organisms diverged from their common ancestor. If this is true, then these sequences can be said to exhibit clock-like behavior.

Figure: evolutionary distance vs similarity. The relationship between sequence similarity and time for a good molecular clock. Cheezy hand drawing by Jim Brown.
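
To make the relationship between similarity and divergence concrete, here is a minimal Python sketch. It assumes the Jukes-Cantor model, which is one simple and widely used correction for multiple substitutions at the same site; the text above does not say which model underlies the figure, so treat that choice as an assumption. The function names are made up for illustration.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of aligned positions that differ, skipping positions with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # a gap in either sequence carries no substitution information
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return differences / compared


def jukes_cantor_distance(p):
    """Estimated substitutions per site under the Jukes-Cantor model (an assumption).

    Observed differences saturate as sequences diverge, so the estimated
    distance grows faster than p and is undefined once p reaches 0.75.
    """
    if p <= 0.0:
        return 0.0
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```

For example, `jukes_cantor_distance(p_distance(seq_x, seq_y))` gives an estimated evolutionary distance for two aligned sequences. The estimate is always at least as large as the raw fraction of differences, because some sites will have changed more than once.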

Clock-like behavior depends mostly on functional constancy of the sequence; a change in the function (or functional properties) leads to large, non-random sequence change, i.e. adaptation. Clock-like behavior also depends on the sequence being long enough to provide statistically significant information, and being composed of a large number of independently evolving “bits”, so that random changes in one part of the sequence don’t influence changes in other parts. The sequence must also have an appropriate amount of variation: too little doesn’t provide enough difference to be statistically meaningful, whereas too much makes alignment difficult or impossible and decreases the reliability of the treeing algorithm (see Chapter 4 on evolutionary models). Non-functional sequences (e.g. introns) usually change too fast for analysis except among the very closest of relatives.
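
One rough way to get a feel for how much variation an alignment contains is to count the columns in which the sequences disagree. The sketch below is only an illustration: the function name is hypothetical, gaps are simply ignored, and no threshold for "too little" or "too much" variation is implied, since that depends on the sequences and the question being asked.

```python
def variable_site_fraction(alignment):
    """Fraction of alignment columns that contain more than one kind of residue.

    `alignment` is a list of equal-length aligned sequences; gap characters
    ("-") are ignored when deciding whether a column is variable.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if length == 0:
        raise ValueError("alignment has no columns")
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must be aligned to the same length")

    variable = 0
    for column in zip(*alignment):
        residues = {c.upper() for c in column if c != "-"}
        if len(residues) > 1:
            variable += 1
    return variable / length
```

A value near 0 means the sequences are nearly identical, and a value near 1 means almost every column varies, which is usually a sign that the alignment itself deserves a closer look.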

Phylogenetic range

In order to be useful for a phylogenetic analysis, a sequence must be present and identifiable in all of the organisms to be analyzed, and must exhibit clock-like behavior within this range. Watch out for gene families, because each member of the family is probably specialized for a slightly different function and it is often difficult to identify the correct ortholog or confirm that it really does have the same function.

Absence of horizontal transfer

This means that the gene must be acquired only by inheritance from parent to offspring, not by transfer from one organism to another. Genes encoding antibiotic resistance are an example of frequently horizontally transferred genes, but any gene has the potential to be transferred horizontally. You can still generate a tree with sequences that have been horizontally transferred, and if the sequence is otherwise a good molecular clock, the resulting tree is perfectly valid. However, it reflects the phylogenetic relationships between the sequences, not between the organisms that carry them.

The availability of sequence information

It is of great pragmatic importance to choose, whenever possible, a sequence for which much of the required data is already available and annotated, and perhaps already aligned. If you’re interested in the phylogenetic placement of organism X, it’s better if you don’t have to obtain or identify the sequence data yourself for a large number of organisms it might (or might not) be related to.

The standard: small subunit ribosomal RNA

In most cases, the best molecular clock for phylogenetic analysis is the small subunit ribosomal RNA (ssu-rRNA). This sequence is always the best starting point; only after you know where your organism resides in an ssu-rRNA phylogenetic tree can you decide what other sequences might provide additional information.

The ssu-rRNA is so often the sequence of choice because:

  1. it is present in all living cells,
  2. it has the same function in all cells,
  3. it is composed of 1500-2000 residues: long enough to be statistically useful but not so long as to be onerous to sequence,
  4. it is made up of ca. 50 independently-evolving helices and ca. 500 independently-evolving base-pairs,
  5. it is conserved enough in sequence & structure to be easily and accurately aligned,
  6. it contains both rapidly and slowly evolving regions; the rapidly-evolving regions are useful for determining close relationships, whereas the slowly-evolving regions are useful for determining distant relationships,
  7. horizontal transfer of rRNA genes is exceedingly rare (most genes of the central information processing pathways of the cell are also resistant to horizontal transfer), and...
  8. there are huge datasets of sequences, alignments, and analysis tools available.

Here is the secondary structure of the ssu-rRNA of Escherichia coli:

Figure: E. coli ssu-rRNA secondary structure. From Robin Gutell and the Comparative RNA Database (www.rna.icmb.utexas.edu)

Deciding what organisms to include

Usually, deciding what organisms to include is part of the treeing process rather than something to decide at this stage. We’ll see later that most often you start out by generating a tree with representatives from a wide range of organisms scattered around the tree, in order to identify what kind of organism you have in very general terms. You then replace most of these disparate representatives with representatives that you now know are likely to be closely related. The resulting tree gives you more specific information about which group your organism belongs to, which can be used again to choose even closer relatives, and so on until you’re satisfied with the representation of the tree.

For example, if you have a new organism to identify, an initial tree containing one or two representatives from each bacterial Phylum might show you that your organism is a member of the Firmicutes. With this information in hand, a second tree populated by representatives of each Order and Family of Firmicutes might show you that you have a member of the Family Veillonellaceae. From there, you could populate a final tree with most of the species in this Family.
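
The successive narrowing described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not a treeing method: the function name, the record layout, and every entry in the toy table are made up, apart from the Firmicutes and Veillonellaceae names taken from the example above. In practice you would examine the tree after each round before deciding how to narrow the next one.

```python
from collections import defaultdict

# Toy, made-up records: "seq_*", "phylum_Y", "family_X" and "family_Z" are placeholders.
TOY_RECORDS = [
    {"name": "seq_1", "phylum": "Firmicutes", "family": "Veillonellaceae"},
    {"name": "seq_2", "phylum": "Firmicutes", "family": "Veillonellaceae"},
    {"name": "seq_3", "phylum": "Firmicutes", "family": "family_X"},
    {"name": "seq_4", "phylum": "phylum_Y", "family": "family_Z"},
    {"name": "seq_5", "phylum": "phylum_Y", "family": "family_Z"},
]


def pick_representatives(records, rank, per_group=1, within=None):
    """Pick up to `per_group` records from each group at the given rank.

    `within` optionally restricts the choice to records matching every
    rank/name pair in it, e.g. {"phylum": "Firmicutes"}.
    """
    within = within or {}
    groups = defaultdict(list)
    for rec in records:
        if all(rec.get(key) == value for key, value in within.items()):
            groups[rec[rank]].append(rec)

    chosen = []
    for group_name in sorted(groups):
        chosen.extend(groups[group_name][:per_group])
    return chosen


# Round 1: a broad tree, one representative per phylum.
broad = pick_representatives(TOY_RECORDS, "phylum")
# Round 2: once the phylum is known, one representative per family within it.
narrower = pick_representatives(TOY_RECORDS, "family", within={"phylum": "Firmicutes"})
# Round 3: once the family is known, every sequence in that family.
final = pick_representatives(TOY_RECORDS, "name", within={"family": "Veillonellaceae"})
```

Each round produces the list of sequences to align and tree next; the choice of which group to narrow into always comes from looking at the previous tree.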