Assembling sequences in a multiple sequence alignment
( = identifying homologous residues)

OLD Audio recording

Video recording (.mov format, 6.4Gbytes)
Video recording (480p .mp4 format, 0.3Gbytes)
Video recording (1280p .mp4 format, 1.2Gbytes)

 

What is the point?

  1. To be able to describe what a sequence alignment is
  2. To be able to align sequences manually ("by eye") on the basis of gross similarity
  3. To be able to align RNA structures
  4. To be able to draw RNA structures from a structural alignment

There is a sequence alignment problem set that tests these skills.


The raw material used by a phylogenetic tree generating program is an “alignment”. A sequence alignment is a 2-dimensional matrix of multiple sequences. Each sequence is in a line (row) of the matrix. Each position (column) in an alignment contains homologous (corresponding) residues of each sequence. Gaps (usually shown as dashes) are added where needed to maintain the alignment - these gaps represent “absent” bases in a sequence that are present in some other sequence(s) in the alignment. They are sometimes referred to as "indels" - "INsertions/DELetionS".
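To make the "matrix" idea concrete, here is a minimal Python sketch. The three sequences are made up for illustration (they are not real rRNA data), and the helper names `columns` and `ungapped` are hypothetical, not part of any alignment package.

```python
# Toy alignment (invented sequences): one row per sequence, all rows the same length.
alignment = {
    "seqA": "GGCUA-CGAU",
    "seqB": "GGCUAUCGAU",
    "seqC": "GGUUA-CG-U",
}

def columns(aln):
    """Return the alignment columns as a list of strings, one per position."""
    rows = list(aln.values())
    width = len(rows[0])
    # Gaps are what make every row the same length; refuse ragged input.
    if any(len(row) != width for row in rows):
        raise ValueError("all rows of an alignment must have the same length")
    return ["".join(col) for col in zip(*rows)]

def ungapped(row):
    """Recover the original sequence by dropping the gap characters."""
    return row.replace("-", "")

cols = columns(alignment)
print(cols[5])                      # the column where only seqB has a residue
print(ungapped(alignment["seqC"]))  # seqC as it would appear before alignment
```

Each string in `cols` is one column of homologous positions; removing the dashes from a row gives back the unaligned sequence.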

alignment example
A chunk of an ssu-rRNA sequence alignment
Cheezy image from Jim Brown, using Se-Al and the RDP representative bacterial ssu-rRNA alignment

Most alignments are generated using computer programs that align sequences using algorithms (e.g. CLUSTAL) that attempt to maximize the similarity (measured in a variety of ways) of all of the sequences. Where the sequences in an alignment are very similar, this approach can generate very good alignments. This is especially true for protein-encoding sequences, with 20 possible amino acids and good scoring matrices to count how similar/different any two amino acids are from each other. This is less true for gene or RNA sequences, with only 4 possible bases and where similarity between pairs of bases is less meaningful chemically.
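As a rough illustration of what "maximizing similarity" means, the sketch below aligns just two sequences with simple global dynamic programming (the Needleman-Wunsch approach). It is not how CLUSTAL works internally: CLUSTAL is a progressive multiple-alignment program and is considerably more elaborate. The scores used here (match +1, mismatch -1, gap -2) are arbitrary assumptions, and `global_align` is a hypothetical name.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning residue i of a with residue j of b (1-based).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

row_a, row_b, best = global_align("GGCUACGAU", "GGCUAUCGAU")
print(row_a)
print(row_b)
print(best)
```

With only 4 bases and a flat match/mismatch score, many different gap placements can score almost equally well, which is one reason nucleotide alignments are harder to get right automatically than protein alignments.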

Very often, however, RNA alignments are either created by hand or at least "tweaked" manually. Sequences must be fairly similar in sequence and length to be readily alignable 'by eye', or even by computer alignment programs (e.g. CLUSTAL). Thank goodness, most of the length of ssu-rRNAs is highly conserved and can (with experience) be manually aligned without much trouble.

Some of the tricks to aligning sequences by hand are:

  • Sequences are often aligned sequentially - start by aligning the two most similar sequences, then add sequences to the alignment one at a time after this, starting with the sequences most similar to those already aligned and finishing with the most distantly related sequences. Likewise, if you're adding a single sequence to an existing alignment, start by identifying the most similar sequence in the alignment & use that sequence as a guide.
  • Alternatively, you can identify conserved blocks of sequence in all of the sequences, and align these. You have now broken the alignment problem into smaller, easier chunks. Add gaps as needed to align the space between pre-aligned chunks according to the criteria below.
  • Start out by finding patches of very similar sequences and align these, then work out in both directions from these, adding gaps sparingly when needed. Everything after this is about rearranging (and potentially adding or removing) these gaps.
  • Where there are sequence differences, slide the gaps around to keep purines (G, A) aligned with purines & pyrimidines (C, U) aligned with pyrimidines.
  • Try also to keep differences together in variable sequence positions, and align gaps together in columns wherever possible. A single gap of two positions is a lot better than two separate gaps of one position each!
  • Try to keep what look like conserved positions (columns) conserved, and all things being equal put differences into positions already known to be variable.
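Two of the rules of thumb above (keep gaps together, and keep purines with purines and pyrimidines with pyrimidines) can be expressed as simple counts. The sketch below is only an illustration: the sequences are invented, and the helper names `gap_runs`, `classify` and `tally` are hypothetical. A purine/purine or pyrimidine/pyrimidine difference is conventionally called a "transition", and a purine/pyrimidine difference a "transversion".

```python
import re

PURINES = set("GA")
PYRIMIDINES = set("CU")

def gap_runs(row):
    """Count separate gaps; a run of adjacent dashes counts as one gap."""
    return len(re.findall(r"-+", row))

def classify(x, y):
    """Classify one aligned pair of characters."""
    if "-" in (x, y):
        return "gap"
    if x == y:
        return "match"
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def tally(row1, row2):
    """Count matches, transitions, transversions and gap positions between two rows."""
    counts = {"match": 0, "transition": 0, "transversion": 0, "gap": 0}
    for x, y in zip(row1, row2):
        counts[classify(x, y)] += 1
    return counts

reference = "GCAAUCCG"
candidate_1 = "GC-A-CCG"   # two separate one-position gaps
candidate_2 = "GCA--CCG"   # one two-position gap, same sequence

for cand in (candidate_1, candidate_2):
    print(cand, tally(reference, cand), "separate gaps:", gap_runs(cand))
```

Both candidate placements match the reference at the same number of positions, so similarity alone cannot choose between them; the second is preferred because it needs only one gap.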

In the case of RNAs, however, advanced alignment algorithms (e.g. infeRNAl) can use the secondary structures of the RNAs to align sequences. The ability to use well-defined secondary structures to identify homologous residues (i.e. to align sequences) is one of the key advantages of RNA over protein for phylogenetic analysis. In other words, you can use the secondary structure of the RNA to identify homologous parts of the RNA, rather than relying only on sequence similarity.

For example, look at these two RNase P RNAs with very different sequences but very similar secondary structures:

Reutrophus P RNA
Ralstonia (previously Alcaligenes, Pseudomonas) eutrophus RNase P RNA
Taken from the RNase P Database - Jim Brown

Mthermoautotrophicus P RNA
Methanothermobacter thermoautotrophicus (previously Methanobacterium thermoautotrophicum strain delta H) RNase P RNA
Taken from the RNase P Database - Jim Brown

This works because, in general, it doesn't matter (much) to the RNA what the bases in the helices are; what matters is that these bases are complementary so that they can form the helix. As a result, the secondary structure of an RNA is much more conserved than its sequence, because co-evolution of the bases that form base-pairs maintains the secondary structure as the sequence changes. Variation in the length of the RNA is usually due to lengthening or shortening of hairpins. So it's usually possible to keep track of homologous parts of RNA structures even if the sequences are completely different.
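The co-evolution argument can be checked column by column. In the hedged sketch below, two alignment columns proposed to form a base-pair are compared across all sequences. The columns are invented, the function name `column_pair_support` is hypothetical, and counting G-U ("wobble") pairs as acceptable alongside the Watson-Crick pairs is an assumption added here rather than something stated above.

```python
# Pairs treated as able to form a helix position: Watson-Crick plus G-U wobble (assumption).
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}

def column_pair_support(col_i, col_j):
    """Return (sequences that can pair, number of distinct pair types seen)."""
    observed = list(zip(col_i, col_j))
    pairable = sum(1 for pair in observed if pair in PAIRS)
    kinds = {pair for pair in observed if pair in PAIRS}
    return pairable, len(kinds)

# Four sequences: the bases differ in every row, but each row can still pair.
col_5prime = "GCAU"
col_3prime = "CGUA"
print(column_pair_support(col_5prime, col_3prime))

# Two columns that vary independently do not stay complementary.
print(column_pair_support("GCAU", "GGGG"))
```

A pair of columns where every sequence can pair, but with several different pair types, is the signature of compensating changes: the sequence has diverged while the helix has been kept.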

example RNA structure alignment

Archaeal RNase P RNA P3 alignment
Jim Brown (from the RNase P Database)

This is an RNA alignment based on secondary structure - stem/loop P3 of RNase P RNA. In this example, the first 6 rows aren't sequences - they're annotations. The first three are just a reference numbering - in this case, Methanothermobacter thermoautotrophicus (Mthermo) is the reference sequence. The row marked "helices" indicates the secondary structure: the 5' strand of P3, followed by the loop and then the 3' strand. Each basepair in this stem/loop is indicated by matching left- and right-facing parentheses in the following row, and the pairs are labeled alphabetically (for human readability) in the subsequent row. In this type of alignment, the secondary structures of all of the RNAs are directly encoded in the alignment - if residue n (e.g. 24) of any sequence pairs to residue m (e.g. 29), then so should the corresponding homologous residues in all sequences.

helices
Archaeal RNase P RNA P3 secondary structures
Jim Brown (generated from .ct files from the RNase P Database)

Given this type of alignment, a computer can parse out any of the RNAs as secondary structures. Conversely, given a pre-existing alignment and an RNA with the same secondary structure, a computer algorithm can add this sequence correctly to the alignment. This is what infeRNAl does - it takes a sequence and tries to fold it into the correct secondary structure. If it can do so, it then threads this sequence into the alignment based on this structure.
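The "parsing out" step is easy to sketch. The Python below reads a row of matching parentheses, works out which alignment columns pair, and then translates those columns into residue numbers for an individual sequence. It is only an illustration of the idea, not infeRNAl's method or file format: the parenthesis row and both sequences are invented, and the function names `pair_columns` and `sequence_pairs` are hypothetical.

```python
def pair_columns(brackets):
    """Return sorted (5' column, 3' column) pairs from a row of parentheses."""
    stack, pairs = [], []
    for col, ch in enumerate(brackets):
        if ch == "(":
            stack.append(col)          # remember an unpaired 5' position
        elif ch == ")":
            if not stack:
                raise ValueError("unmatched ')' at column %d" % col)
            pairs.append((stack.pop(), col))  # pairs with the most recent '('
    if stack:
        raise ValueError("unmatched '(' at column %d" % stack[-1])
    return sorted(pairs)

def sequence_pairs(row, col_pairs):
    """Translate column pairs into 1-based residue numbers for one sequence."""
    number, n = {}, 0
    for col, ch in enumerate(row):
        if ch != "-":                  # gaps do not consume a residue number
            n += 1
            number[col] = n
    # A pair is kept only if this sequence has a residue in both columns.
    return [(number[i], number[j]) for i, j in col_pairs
            if i in number and j in number]

structure = "(((.....)))"
rows = {
    "seqA": "GGCUUCG-GCC",
    "seqB": "CAGGCAAUCUG",
}

col_pairs = pair_columns(structure)
for name, row in rows.items():
    print(name, sequence_pairs(row, col_pairs))
```

The two toy sequences share no paired bases in common, yet the same three columns pair in both; only the residue numbers differ, because seqA has a gap in the loop.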

“Gold standard” alignments of RNAs are crafted on the basis of both sequence and structure, by the hand of specialists with years of experience with particular RNAs, rather than computer algorithms.