Obtaining the required sequence data - databases and experimental determination

What's the point?

  1. Where does sequence data come from?
  2. How does PCR work?
  3. How does DNA sequencing work?

Sequences for a phylogenetic analysis usually come from two sources: electronic databases and your own experimental results. Ideally, all of the sequence data needed can be obtained from databases, or at least all of the data needed except that of the specific organism(s) of interest. However, you might find that some other sequences you need for comparison aren’t available - if this is the case, you may need to obtain them yourself experimentally.


For most sequences that are commonly used for phylogenetic analysis, there are specialized databases of pre-aligned sequences. There are several databases of ssu-rRNAs and their alignments: the Ribosomal Database Project is a prime example. As of this writing, the RDP contains 1.3 million aligned ssu-rRNA sequences and a suite of software tools to access this data, including a taxonomic browser that can be used to collect any desired aligned sequences for further analysis. Databases of this type are usually the best starting point to collect an initial data set.

Very often, however, there is additional useful information that is not yet available in these specialized databases. BLAST searches of general sequence databases (e.g. GenBank), most often through the NCBI (National Center for Biotechnology Information) web site, will often identify additional useful sequences.

Obtaining sequences experimentally

The most commonly used method to obtain DNA for sequence analysis is the Polymerase Chain Reaction (PCR). PCR amplifies genes exponentially - a single molecule of a gene, embedded in the rest of the genomic DNA, is specifically amplified to a million or more molecules in just a couple of hours! In a PCR reaction, 3 steps (denaturation, primer annealing, and DNA polymerization) are cycled over and over, each cycle doubling the amount of the specific DNA fragment.

How PCR works
Cheezy drawing by Jim Brown
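
The doubling described above is easy to check with a little arithmetic. The sketch below is only an illustration: the function names and the `efficiency` parameter are assumptions made for this example (1.0 means every cycle doubles the target perfectly), and real reactions level off in later cycles as primers and dNTPs run out.

```python
import math

def copies_after(cycles, start_copies=1, efficiency=1.0):
    """Copies of the target after a number of cycles.

    efficiency=1.0 is ideal doubling; lower values model imperfect cycles.
    """
    return start_copies * (1.0 + efficiency) ** cycles

def cycles_needed(target_copies, start_copies=1, efficiency=1.0):
    """Smallest whole number of cycles that reaches at least target_copies."""
    ratio = target_copies / start_copies
    return math.ceil(math.log(ratio) / math.log(1.0 + efficiency))

for n in (10, 20, 30):
    print(n, "cycles:", int(copies_after(n)), "copies")

print("cycles to reach a million copies:", cycles_needed(1_000_000))
```

With ideal doubling, a single starting molecule passes one million copies at cycle 20, which is why a couple of hours of cycling is enough.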

The PCR product DNA is then “sequenced” (i.e. its nucleotide sequence is determined), often using the same oligonucleotide primers that were used in the PCR reaction. Sequencing involves denaturing the DNA, annealing an oligonucleotide primer, and extending from this primer with DNA polymerase in the presence of dNTPs and small amounts of 'chain terminator' dideoxynucleotides (analogs of dNTPs that DNA polymerase cannot continue extending from). Usually this process is carried out by a commercial service rather than in a research lab.

1. Denature the DNA (separate the strands) with heat or high pH.

2. Anneal an oligonucleotide primer complementary to the DNA:

The annealing step of PCR
Cheezy drawing by Jim Brown

3. Add all 4 dNTPs and a small amount of each of the 4 dideoxynucleotides (ddNTPs), each with a different fluorescent 'tag', and DNA polymerase:

How fluorescent-dideoxy-sequencing works
Cheezy drawing by Jim Brown

4. Run the sample on a high-resolution gel or in a capillary tube that can separate DNAs that differ in length by only a single base:

What one lane of a fluorescent sequencing gel looks like
Cheezy drawing by Jim Brown

A fluorometer at the bottom of the gel or end of the capillary detects the termination dyes as they run past. The connected computer collects this data and 'reads' the sequence from the pattern of peaks. The output from the computer looks like this:

DNA sequence tracing
Clipped from a sequence by Jim Brown, reaction performed by MWG BioTech

(notice that the colors used here don't match the example)
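
The four steps above can be mimicked in a few lines of code. This is a toy model rather than a description of any real base-calling software: the template sequence is made up, the function names are hypothetical, and it assumes that every possible terminated product is present and cleanly separated by size.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the complementary strand, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def terminated_fragments(template, primer):
    """List (length, terminal_base) for each possible chain-terminated product.

    template: the strand being copied, written 5'->3'.
    primer:   oligonucleotide, written 5'->3', that anneals to the template.
    """
    product = reverse_complement(template)  # full-length copy the polymerase could make
    start = product.find(primer)            # where the primer sits in that copy
    if start < 0:
        raise ValueError("primer is not complementary to the template")
    first_added = start + len(primer)       # first base added by the polymerase
    # A ddNTP can stop extension at any added position, giving one fragment each.
    return [(pos - start + 1, product[pos])
            for pos in range(first_added, len(product))]

def read_by_size(fragments):
    """Order fragments from shortest to longest and read off the tagged end bases."""
    return "".join(base for _, base in sorted(fragments))

template = "ACGTTGCATGGCCTAAGCTTGGATCCGATTACA"  # made-up sequence
primer = reverse_complement(template[-10:])       # anneals at the template's 3' end
fragments = terminated_fragments(template, primer)
random.shuffle(fragments)                         # in the tube the products are all mixed
print(read_by_size(fragments))
```

Sorting by length plays the role of the gel or capillary, and the terminal base of each fragment plays the role of the fluorescent tag. Note that the sequence read is that of the newly made strand, which is the complement of the template.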

Each reaction typically yields 500-800 bases of reliable sequence data, so it is usually necessary to use several primers spaced along the length of the molecule to get the complete sequence of an rRNA gene. It is also usually expected that you will sequence both strands of the DNA to confirm the sequence.
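
As a rough planning aid, the number of sequencing reactions can be estimated from the read length. The numbers below are assumptions for illustration only: a gene of about 1500 bases (roughly the size of a bacterial ssu-rRNA gene), the 500-800 base range quoted above, and a 50-base overlap between neighboring reads so they can be joined. In practice, primer positions are dictated by where good primer sites exist, and readable sequence starts some distance downstream of each primer.

```python
import math

def reads_per_strand(gene_length, reliable_read=500, overlap=50):
    """Minimum number of reads to cover one strand with overlapping reads."""
    if gene_length <= reliable_read:
        return 1
    step = reliable_read - overlap  # new bases gained by each additional read
    return 1 + math.ceil((gene_length - reliable_read) / step)

def read_starts(gene_length, reliable_read=500, overlap=50):
    """Evenly spaced start positions (0-based) for the reads on one strand."""
    step = reliable_read - overlap
    count = reads_per_strand(gene_length, reliable_read, overlap)
    return [i * step for i in range(count)]

gene_length = 1500  # assumed length, for illustration
for read_len in (500, 800):
    n = reads_per_strand(gene_length, reliable_read=read_len)
    print(read_len, "base reads:", n, "per strand,", 2 * n, "for both strands")

print("start positions with 500-base reads:", read_starts(gene_length))
```

Under these assumptions, short reads call for about twice as many primers as long reads, and sequencing both strands doubles the total again.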