W1_L03
Phylogenetics 101 — Part B
phylogenetics 101 - sequences
Sequence data used in phylogenies typically consists of nucleotide (DNA or RNA) or amino acid sequences that provide information about the genetic makeup of organisms. These sequences are compared across species or individuals to infer evolutionary relationships. Key points include:
Types of Sequence Data:
- Nucleotide Sequences: DNA (or RNA) sequences from specific genes or non-coding regions.
- Protein Sequences: Amino acid sequences derived from translated genes, useful for studying more conserved features.
The FASTA Format:
A simple, text-based format where each sequence is prefaced by a header line starting with a ">" character. It is widely used for sequence storage and exchange.
>Taxon1
ATGCGTACGTAGCTAGCTACGATCG
>Taxon2
ATGCGTACGTAGCTAGATACGATCG
>Taxon3
ATGCGTACGTAGCTAGCTACGATCC
The PHYLIP Format:
A format originally designed for the PHYLIP software package. It has a strict column-based format, making it less flexible than FASTA but still widely used for phylogenetic analyses.
3 25
Taxon1 ATGCGTACGTAGCTAGCTACGATCG
Taxon2 ATGCGTACGTAGCTAGATACGATCG
Taxon3 ATGCGTACGTAGCTAGCTACGATCC
Taxon1 ATGCGTACGTAGCTAGCTACGATCG
Taxon2 ATGCGTACGTAGCTAGATACGATCG
Taxon3 ATGCGTACGTAGCTAGCTACGATCC
The first line indicates the number of sequences (e.g., 3) and the number of characters (e.g., 25). Taxon names typically have a 10-character limit, so they may be truncated.
The Nexus Format:
A more versatile file format that not only stores sequence data but can also include additional information like phylogenetic trees, taxa information, and analysis commands in different blocks. Example of a DATA Block in Nexus:
#NEXUS
BEGIN DATA;
DIMENSIONS NTAX=3 NCHAR=25;
FORMAT DATATYPE=DNA MISSING=? GAP=-;
MATRIX
Taxon1 ATGCGTACGTAGCTAGCTACGATCG
Taxon2 ATGCGTACGTAGCTAGATACGATCG
Taxon3 ATGCGTACGTAGCTAGCTACGATCC
;
END;