W1_L03
Phylogenetics 101 — Part B
Core Idea (bird's-eye view)
Lesson 02 was about trees as objects (Newick/Nexus). Lesson 03 is about the data that goes into building those trees: molecular sequences. The practical is short and format-focused (FASTA, PHYLIP, Nexus-DATA), but it sits on top of a heavy theoretical layer you must carry into the meeting:
- Phylogenetics can use any heritable character, but molecular characters dominate because they are abundant, cheap, objective, standardized, and resolve both shallow and deep relationships.
- Characters used must be: heritable, independent, informative, clearly coded, homologous, and ideally non-homoplasious.
- Homology (similarity from shared ancestry) is the prerequisite. Homoplasy (convergence, reversal) is the enemy — it produces character conflict and misleads inference.
- Genetic distance between two taxa = sum of changes along the two branches connecting them to their common ancestor (SP0):
  d(SP1, SP2) = d(SP0, SP1) + d(SP0, SP2)
- Molecular clock hypothesis (Kimura, neutral theory): neutral mutations accumulate at a roughly constant rate, so older splits accumulate more differences. This is what allows dating divergences, but rates vary across genes, lineages, and time, so calibrations (fossils, known splits) are needed.
- Historical anchors: Nuttall (1904, immunological distances), Cavalli-Sforza & Edwards (1963, first computed trees), Fitch & Margoliash (1967, first modern sequence-distance tree).
The practical itself is teaching you how molecular data is physically stored on disk so you can feed it into alignment and tree-building tools later in the course.
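As a small worked illustration of the distance and clock ideas above, the sketch below computes the proportion of differing sites (the p-distance) between two pre-aligned sequences and converts it to a divergence time under a strict clock. The function names and the substitution rate are hypothetical, chosen only for illustration. The p-distance ignores multiple changes at the same site, so it can only underestimate the true number of changes.

```python
def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (equal length)")
    diffs = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return diffs / len(seq_a)


def strict_clock_time(distance, rate):
    """Divergence time under a strict clock: d = 2 * rate * t, so t = d / (2 * rate).

    The factor 2 reflects the two branches (SP0 to SP1 and SP0 to SP2).
    """
    return distance / (2 * rate)


sp1 = "ATGCGTACGTAGCTAGCTACGATCG"
sp2 = "ATGCGTACGTAGCTAGATACGATCG"

d = p_distance(sp1, sp2)  # 1 difference over 25 sites
# ASSUMPTION: invented rate, in substitutions per site per million years
rate = 0.01
print(d, strict_clock_time(d, rate))
```

If the rate is not constant across lineages, the time returned by `strict_clock_time` is not meaningful, which is why calibrations are needed in practice.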
Inputs
The practical introduces three sequence file formats. The "input" here is raw sequence data — nucleotides (DNA/RNA) or amino acids — for multiple taxa.
1. FASTA
The simplest and most common format. Each sequence has a header line starting with > followed by the sequence on the next line(s).
>Taxon1
ATGCGTACGTAGCTAGCTACGATCG
>Taxon2
ATGCGTACGTAGCTAGATACGATCG
>Taxon3
ATGCGTACGTAGCTAGCTACGATCC
Key points:
- Header line must start with >.
- No length declaration, no strict columns. Very flexible.
- Used everywhere — alignments, databases (NCBI, UniProt), tool inputs.
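The FASTA rules above are simple enough to read by hand. The following is a minimal sketch of a reader, assuming plain text input and no comment lines; `parse_fasta` is a hypothetical helper name, not part of any library.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may be wrapped over several lines; collect the pieces
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>Taxon1
ATGCGTACGTAG
CTAGCTACGATCG
>Taxon2
ATGCGTACGTAGCTAGATACGATCG
"""
for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Note that the reader joins wrapped lines but does nothing to check that the sequences have equal length: FASTA itself does not require it.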
2. PHYLIP
A strict column-based format originally for the PHYLIP software package.
3 25
Taxon1    ATGCGTACGTAGCTAGCTACGATCG
Taxon2    ATGCGTACGTAGCTAGATACGATCG
Taxon3    ATGCGTACGTAGCTAGCTACGATCC
Key points:
- First line: <number of sequences> <number of characters>.
- Taxon names traditionally limited to 10 characters; longer names get truncated. This is a real source of bugs.
- Two flavors exist: sequential (each taxon's full sequence before the next taxon starts) and interleaved (blocks of all taxa, repeating). The example above fits in a single block, so it reads the same in both flavors.
- Less flexible than FASTA but still expected by many phylogenetics tools.
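To make the name-length problem concrete, here is a hedged sketch of a writer for sequential strict PHYLIP that refuses to produce a file when truncation would make two names identical. `write_strict_phylip` and the example taxon names are hypothetical. Many modern tools accept a "relaxed" variant with longer names, which this sketch does not cover.

```python
def write_strict_phylip(records, name_width=10):
    """Format (name, sequence) pairs as sequential strict PHYLIP text.

    Names are cut or padded to exactly `name_width` columns. Raises
    ValueError if truncation would make two names identical.
    """
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")

    seen = {}
    lines = [f"{len(records)} {lengths.pop()}"]  # header: ntax nchar
    for name, seq in records:
        short = name[:name_width]
        if short in seen:
            raise ValueError(
                f"{name!r} and {seen[short]!r} both truncate to {short!r}"
            )
        seen[short] = name
        lines.append(short.ljust(name_width) + seq)
    return "\n".join(lines) + "\n"


print(write_strict_phylip([("Taxon1", "ATGCG"), ("Taxon2", "ATGCA")]))

try:
    write_strict_phylip([("Homo_sapiens_A", "ATGCG"), ("Homo_sapiens_B", "ATGCA")])
except ValueError as err:
    print("refused:", err)
```

Failing loudly is a design choice here: a silent truncation would leave two rows with the same label, and the problem would only surface later in the tree.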
3. Nexus (DATA block)
The same Nexus container from Lesson 02, but here storing a sequence matrix instead of a tree.
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=3 NCHAR=25;
FORMAT DATATYPE=DNA MISSING=? GAP=-;
MATRIX
Taxon1 ATGCGTACGTAGCTAGCTACGATCG
Taxon2 ATGCGTACGTAGCTAGATACGATCG
Taxon3 ATGCGTACGTAGCTAGCTACGATCC
;
END;
Key points:
- Block-based: BEGIN DATA; ... END;.
- Declares dimensions (NTAX, NCHAR), datatype (DNA/RNA/protein), and the symbols used for missing data and gaps.
- Most flexible — can store sequences, trees, character sets, and analysis commands all in one file. This is why programs like MrBayes and PAUP* use it natively.
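The pieces of a DATA block listed above can be assembled mechanically. The sketch below is one possible writer, assuming taxon names are single tokens without spaces or punctuation (names that need quoting are not handled); `write_nexus_data` is a hypothetical helper name.

```python
def write_nexus_data(records, datatype="DNA", missing="?", gap="-"):
    """Format (name, sequence) pairs as a minimal Nexus DATA block."""
    nchar = len(records[0][1])
    if any(len(seq) != nchar for _, seq in records):
        raise ValueError("all sequences must have the same length")
    width = max(len(name) for name, _ in records)  # pad names so columns line up
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(records)} NCHAR={nchar};",
        f"  FORMAT DATATYPE={datatype} MISSING={missing} GAP={gap};",
        "  MATRIX",
    ]
    for name, seq in records:
        lines.append(f"    {name.ljust(width)}  {seq}")
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"


print(write_nexus_data([("Taxon1", "ATGCG"), ("Taxon2", "ATGCA")]))
```

NTAX and NCHAR are computed from the data rather than typed by hand, since a mismatch between the declared and actual dimensions is an easy mistake to make when editing Nexus files manually.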
Outputs
The practical does not "produce" anything computed — its outputs are conceptual:
- The ability to recognize at a glance which format a file is in.
- Understanding that the same sequence data can be written in any of these three formats and converted between them losslessly (modulo PHYLIP's name-length limit).
- Knowing which format each tool expects: FASTA for aligners (MAFFT), PHYLIP for many tree builders (older IQ-TREE, RAxML), Nexus for Bayesian tools (MrBayes) and tree storage.
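Recognizing a format "at a glance" mostly comes down to the first non-blank line. A rough sketch of that heuristic, with a hypothetical function name, could look like this; it is a guess based on the three formats in this lesson, not a validator.

```python
def guess_format(text):
    """Guess 'fasta', 'nexus', 'phylip', or 'unknown' from the first non-blank line."""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            return "fasta"
        if line.upper().startswith("#NEXUS"):
            return "nexus"
        fields = line.split()
        # PHYLIP starts with two integers: number of sequences, number of characters
        if len(fields) == 2 and all(f.isdigit() for f in fields):
            return "phylip"
        return "unknown"
    return "unknown"


print(guess_format(">Taxon1\nATGC\n"))
print(guess_format("#NEXUS\nBEGIN DATA;\n"))
print(guess_format("3 25\nTaxon1    ATGC\n"))
```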
Interpretations
- Format choice is downstream-driven: you pick the format the next tool in your pipeline wants. Conversions are routine.
- PHYLIP's 10-character name limit is the most common gotcha. If your taxa have long names, they get silently truncated, and downstream you may end up with duplicated or confused labels.
- FASTA has no built-in concept of alignment: the sequences in a FASTA file may or may not be aligned. An "aligned FASTA" is just a FASTA where every sequence has the same length (gaps usually shown as -). This matters because phylogenetics needs aligned data, not raw sequences.
- Nexus is a container, not a single-purpose format: the same #NEXUS file can hold a DATA block (Lesson 03 practical) or a TREES block (Lesson 02 practical), or both at once. This is why Nexus shows up in both lessons.
- Connecting back to theory: each column in an aligned matrix is a character; each cell is a character state. The whole matrix is the input to every algorithm in this course. The quality of phylogenetic inference is bounded by how well the columns represent homologous positions, which is the entire job of sequence alignment (Lesson 05).
- On the molecular clock: the data in these files is what you measure distance on. Whether that distance reflects time depends on whether the rate of change has been constant — which the prof has flagged as usually NOT being the case in practice.
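The "columns are characters" view can be made concrete with a short sketch. The helper names below are hypothetical, and equal length is only a necessary condition for alignment: it does not prove that the columns are homologous. Gap and missing-data symbols are treated as ordinary states here, which is a simplification.

```python
def is_aligned(sequences):
    """True if every sequence has the same length (necessary, not sufficient)."""
    return len({len(s) for s in sequences}) <= 1


def variable_columns(sequences):
    """Return 0-based indices of columns with more than one character state."""
    if not is_aligned(sequences):
        raise ValueError("sequences differ in length; align them first")
    # zip(*sequences) walks the matrix column by column
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


matrix = [
    "ATGCGTACGTAGCTAGCTACGATCG",  # Taxon1
    "ATGCGTACGTAGCTAGATACGATCG",  # Taxon2
    "ATGCGTACGTAGCTAGCTACGATCC",  # Taxon3
]
print(is_aligned(matrix), variable_columns(matrix))
```

For the three example taxa used in this lesson, only two of the 25 columns vary; all the distance signal between them comes from those two characters.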
Lexicon to keep handy
- homology / homoplasy / convergence / reversal
- heritable, independent, informative, clear, homologous, non-homoplasious (the six character criteria)
- genetic distance, molecular clock, neutral theory
- FASTA, PHYLIP (sequential vs interleaved), Nexus DATA block
- aligned vs unaligned sequences
- NTAX, NCHAR, DATATYPE, GAP, MISSING
Possible Exam Questions
(to be filled in during the meeting)