Alternatives to ssu-rRNA

What's the point?

  1. to be able to describe why ssu-rRNA is not always the best choice for phylogenetic analysis
  2. to describe how other RNA sequences can be used for phylogenetic analysis
  3. to describe how protein sequences can be used for phylogenetic analysis - and why these are not as good (in general) as RNA-encoding sequences
  4. to describe how multiple sequences can be used at once to generate a phylogenetic tree

ssu-rRNA cannot be used to distinguish closely-related organisms

Molecular phylogenetic analysis using ssu-rRNA sequences is generally the most useful method for determining the phylogeny of an organism, and is also one of the most definitive ways to verify the identity of an organism. For example, in some labs, whenever a new organism arrives from another lab, whether from across the hall or across the world, from someone well known and trusted or from someone asked out of the blue, from the American Type Culture Collection (ATCC) or the DSMZ (the German equivalent), the very first thing done is to streak it out for purity and then sequence its ssu-rRNA to make sure it is what it should be. This is usually a lesson learned the hard way.

But although ssu-rRNA molecular phylogenetic analysis is usually the most useful method for identifying organisms, and the method you should at least consider to start with, it is not always the best method, nor the final word.

The main problem with using ssu-rRNA to identify organisms is that close relatives cannot be distinguished using rRNA sequences. As a general rule of thumb, ssu-rRNA analysis can determine the genus, but not the species (at least not reliably), of an organism. Sometimes phenotypically distinct but closely-related species can have very similar or even identical rRNA sequences; this occurs in the genus Bacillus, for example. On the other hand, often even the rRNA operons in the same organism vary to some extent. Usually this is just a handful of differences, but how much variation exists, and how this clusters in the populations, is usually unexplored. In a very few cases, different rRNA operons in the same organism may be quite different, either because the gene conversion process has failed to keep them evolving in concert, or because the organism is a recent hybrid of two species or has recently acquired rRNA genes by horizontal transfer (yes, it does, rarely, happen).

Another part of the problem is the larger issue of how species are defined in non-sexually reproducing organisms. As problematic as the "species concept" is in the world of sexually-reproducing species (a topic of discussion at length in The Origin of Species), no rational species concept exists for organisms in which reproduction and sexual exchange are not linked - including Bacteria, Archaea, and most unicellular eukaryotes. The result is that species are typically defined arbitrarily in the microbial world. These arbitrary boundaries are not very closely related to how different these "species" are in terms of rRNA sequence.

Generally speaking, ssu-rRNA sequences that are greater than about 97% identical cannot be meaningfully distinguished, and their relationships cannot be reliably determined in trees. The ability to distinguish and relate very closely related organisms is critically important for medical and food microbiologists; for example, different strains of Staphylococcus aureus are very different in their pathogenicity, and must be carefully identified. Likewise, in food fermentations using natural microbes, such as in cheese, different very closely-related strains make the difference between a perfect Stilton and spoiled milk.
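The 97% rule of thumb can be checked with a few lines of Python. This is only a minimal sketch: it assumes the two sequences have already been aligned to the same length, it ignores any column with a gap in either sequence, and the function names (percent_identity, too_close_for_ssu_rrna) are made up for this illustration.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are simply not counted
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


def too_close_for_ssu_rrna(seq_a: str, seq_b: str, threshold: float = 97.0) -> bool:
    """True if the pair is above the rule-of-thumb identity threshold."""
    return percent_identity(seq_a, seq_b) > threshold
```

A pair for which too_close_for_ssu_rrna returns True is a candidate for one of the alternative sequences discussed below. The threshold is a rule of thumb, not a sharp boundary.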

So, although ssu-rRNA is the standard method for general phylogenetic analysis, other methods are often needed for fine-scale analyses or identifications. In many cases, alternative sequences are useful. Other non-sequence-based approaches are described in the module on non-sequence-based alternatives.

Alternative sequences

In cases where ssu-rRNA sequences are too close to give reliable trees, it is common practice to use other molecular sequences in the same way to create phylogenetic trees.

Other RNAs

In most cases, it would be best to use other RNA-encoding genes that retain the advantages of ssu-rRNA sequences but that evolve faster than ssu-rRNA. The large subunit rRNA (lsu-rRNA) is sometimes used, but has not proven to be as robust or useful as the ssu-rRNA. The “other” lsu-rRNA, the 5S rRNA, is too short at ca. 120 nt to be useful in most circumstances. Transfer RNAs are both too short and too highly conserved to be generally useful for phylogenetic analysis. Many other larger RNAs, such as the SRP RNA and the tmRNA, are not conserved enough in structure to be reliably aligned with the precision required for good phylogenetic analysis.

However, the RNA subunit of ribonuclease P (a catalytic RNA involved in tRNA biosynthesis) has been used extensively to examine relationships in the Bacteria and Archaea. The RNase P RNA in Bacteria evolves about 6-fold faster than does the ssu-rRNA overall, too fast to be useful for probing the deep branches of the tree. However, RNase P RNA provides resolution amongst close relatives not available by analysis of ssu-rRNA, and is also sometimes used to test relationships in ssu-rRNA trees.

M.thermoautotrophicus RNase P RNA
From the RNase P database, Jim Brown

Archaeal RNase P RNA tree
From JK Harris, ES Haas, DA Williams, & JW Brown 2001 RNA 7:220-232

It is common, however, that even these RNAs are too highly conserved for molecular phylogenetic analysis to give the resolution required. In this situation, alternative approaches are required.

rRNA spacer sequence analysis

This is a lot like ssu-rRNA analysis, but takes advantage of the fact that, in most organisms, the small and large subunit rRNA genes are directly adjacent to each other in the genome, in this order, with a small spacer in between. This makes it easy to PCR amplify and sequence this highly variable spacer using primers that hybridize to highly conserved sequences at the 3´-end of the ssu-rRNA and the 5´-end of the lsu-rRNA.

rRNA operon spacer
Schematic of an rRNA operon, with the intergenic spacer indicated (JWBrown)

This small spacer sequence evolves very quickly, and can be used to distinguish different closely-related species, or even strains of the same species. This sequence is often used to analyze the relationships between animal species, using the spacer sequence from the mitochondrial DNA; mitochondrial genes evolve much faster than nuclear genes.
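The PCR step described above can be mimicked in silico. The sketch below is a toy, under stated assumptions: it requires exact primer matches (real primers tolerate some mismatches), it searches only the given strand, and the function names and any sequences passed to it are hypothetical rather than real rRNA primers.

```python
def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA string containing only A, C, G and T."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(dna.upper()))


def extract_spacer(template: str, forward_primer: str, reverse_primer: str):
    """Return the sequence between the two primer sites, or None if a site is missing.

    The forward primer matches the top strand (3'-end of the ssu-rRNA gene).
    The reverse primer matches the bottom strand, so its site on the top
    strand reads as the primer's reverse complement (5'-end of the lsu-rRNA gene).
    """
    template = template.upper()
    fwd_site = forward_primer.upper()
    rev_site = reverse_complement(reverse_primer)

    start = template.find(fwd_site)
    if start == -1:
        return None
    spacer_start = start + len(fwd_site)

    # Look for the reverse primer site only downstream of the forward site.
    end = template.find(rev_site, spacer_start)
    if end == -1:
        return None
    return template[spacer_start:end]
```

Because a genome usually carries several rRNA operons, a function like this would have to be run on each operon separately, and the resulting spacers sorted by type before they are compared.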

The main disadvantage of this approach is that there are usually many copies of the rRNA gene cluster, and the spacer sequences usually aren't the same in the different clusters. For example, E. coli has 7 rRNA operons, 3 of which have one spacer type and 4 of which have the other. So you have to be sure to obtain and compare the corresponding spacers for the analysis to work. This is usually easy; these spacers generally have tRNA genes in them, and these can serve to indicate which spacer sequence corresponds to which. In most Bacteria, one spacer type contains a single gene for tRNA(Glu), and the other spacer type contains genes for tRNA(Ile) and tRNA(Ala). Although a genome might have several of each, all of the spacers within a class generally evolve together, and so it is not necessary to sequence all of the spacers in the genome or to identify which rRNA operon each one comes from.

Protein sequence analysis

You might imagine that alignments and trees based on the amino acid sequences of proteins would be more informative than those based on DNA or RNA sequences, because there are only 4 nucleic acid bases but 20 standard amino acids (BTW, there are at least 2 known "non-standard" amino acids that are built into proteins during translation, selenocysteine and pyrrolysine). But despite the lower information content of nucleic acid sequences, there are fundamental reasons why non-mRNAs are generally better for phylogenetic analysis:

  1. Protein sequences are aligned automatically by programs like CLUSTAL that maximize the pair-wise similarities between sequences. This is not an advantage. It is objective, but it is far less accurate than alignment based on structure, which is how RNAs are aligned unless the sequences are very similar. Add a new sequence to an automatic alignment and the rest of the alignment will change, which is not the case for an RNA alignment. RNAs can be aligned on the basis of structure because they (unlike proteins) have a predictable, well-defined secondary structure. A few protein alignments have been made on the basis of solved or predicted three-dimensional structure, but this is a very rare exception.
  2. The basepairs and helices of an RNA are independently-evolving substructures, a prerequisite for a molecule to exhibit clock-like behavior. Protein structure is much less hierarchical than is RNA structure - a change in any internal amino acid can have significant effects throughout its domain. In other words, the amino acid changes in a protein are not as independent as base (actually base-pair) changes in an RNA.
  3. Although there is more information per position in a protein alignment, there are usually more phylogenetically-informative positions in an RNA (at least ssu-rRNA) sequence. These RNAs are longer (in terms of the number of residues) than most proteins, and more of the residues can be reliably aligned and yet are variable.
  4. Gene families are much more of an issue for proteins than RNAs.
  5. Horizontal transfer seems to be more frequent in the case of protein-encoding genes than for RNA-encoding genes. Amongst protein-encoding genes, those involved in metabolism are more likely to have been horizontally transferred than those of “information processing” (DNA replication, transcription, and translation).

Nevertheless, in some situations, protein-based trees can be informative. Usually it is the encoded amino-acid sequence rather than the DNA sequence of the encoding gene that is used for phylogenetic analysis. Highly-conserved proteins that have proven to be useful for phylogenetic analysis include RNA polymerase subunits, DNA polymerase subunits, ATPase subunits, glycolytic enzymes, electron-transport enzymes, and stress-response proteins.

Catenated alignments

It is also possible to use more than one gene at a time for an analysis. The alignment just contains more than one molecular sequence in each row, i.e. the sequences are catenated. Very often such catenated alignments yield trees that are more reliable than the trees of the individual sequences of which they are made. For example, the alignment might be composed of ssu-rRNA, lsu-rRNA and RNase P RNA sequences all together, or of the entire set of ribosomal proteins and RNA polymerase subunits.
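Catenation itself is mechanically simple, as the following minimal sketch suggests. It assumes each gene has already been aligned separately and is supplied as a dictionary mapping organism name to aligned sequence. Padding an organism that is missing from one gene with gaps is a choice made for this illustration, not something prescribed above, and the function name is hypothetical.

```python
def catenate_alignments(alignments, gap="-"):
    """Join several per-gene alignments into one row per organism.

    alignments: list of dicts, each mapping organism name -> aligned sequence.
    Returns a dict mapping organism name -> catenated aligned sequence.
    """
    # Collect organism names in the order they are first seen.
    taxa = []
    for alignment in alignments:
        for name in alignment:
            if name not in taxa:
                taxa.append(name)

    rows = {name: [] for name in taxa}
    for alignment in alignments:
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError("every sequence within one alignment must have the same length")
        width = lengths.pop()
        for name in taxa:
            # Assumption: an organism lacking this gene is padded with gaps.
            rows[name].append(alignment.get(name, gap * width))

    return {name: "".join(parts) for name, parts in rows.items()}
```

The catenated rows can then be handed to whatever tree-building method would have been used for a single-gene alignment.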

The extreme version of using catenated alignments is to make alignments composed of ALL of the genes of the organisms to be analyzed, i.e. the entire genome. This is a lot of data, but is possible (just) with modern computing power. The problem is that you're throwing in good molecular clocks and poor ones all together, genes that are frequently transferred horizontally and those that are not, and it is not clear that these trees are going to be generally better than rRNA trees. This is a very promising area of active research.