← Back to syllabus

W7_L13

Stochastic and systematic bias

Practical Notes

stochastic and systematic bias

To immidiatly grasp of one of the more frquent phylogenetic artifacts - long branch attractio or LBA - open the file bias.aln in the folder bias.aln. I have duplicated two sequences Ptetracirrhus_bias and Ovulgaris_bias from the original Ptetracirrhus and Ovulgaris , and I wish you to insert as may As and Ts as you can in these squences,

This silly approach is kind of simulating an accellerated evolutionary rate of change in these two lineage, also in a AT prone manner ...

Then infer a quick tree using iqtree -s data/example_bias/biased.aln -fast and take a look at the tree ...

Then we will move to a very problematic real-life dataset. The Enterobacterales order of Gammaproteobacteria is a bacteria Order which includes some notable species such as E. coli or Y. pestis. They have a huge disparity in lifestyles, including several independent shifts to endosymbiosis. Endosymbionts are characteryzed by a high rate of substitutions and a strong bias towards AT.


mixture models

Let's start by concatenating again the MSAs by:

AMAS.py concat -i data/example_bias/alignments/* -t bias_concat.faa.aln -f fasta -d aa -fast

Then we can perform two inference, skipping model selection and choosing some models a priori:

  • perform a first inference assuming an unpartitioned model with precasted aminoacid matrix:
iqtree -s bias_concat.faa.aln -mset LG -mrate G4 --prefix bias_simple_model -nt auto -fast
  • perform a second inference using a profile mixtutre model:
iqtree -s bias_concat.faa.aln -mset LG+C10 -mrate G4 --prefix bias_profile_mixt -nt auto -fast

task!

Are the two trees different? It is kind of hard to tell ... maybe you could use ete3 to calculate some normalyzed RF distances. Otherwise, you have a file named annotation.tsv in thedata/example_bias/alignements/ folder and you might want to try import it into Figtree alongside with the two trees and then colour tips based on wether species are endosymionts or free-living.


data recoding

Let's try to recode our data, using a custom python script:

python data/example_bias/recode_protein_sequences.py -i bias_concat.faa.aln -o ../../bias_concat.rec.aln -s RS4

By specifying -s RS4 this simple script substitute:

  • AGNPST with A; Alanine, Glycine, Asparagine, Proline, Serine and Threonine share polarity.
  • CHWY with C; Cysteine, Histidine, Tryptophan and Tyrosine share aromaticity.
  • DEKQR to G; Aspartic Acid, Glutamic Acid, Lysine, Glutamine and Arginine share the charge.
  • FILMVR to T; Phenylalanine, Isoleucine, Leucine, Methionine, Valine share hydrophobicity.

Then we can perform an inference - as they were nucleotides - using:

iqtree -s bias_concat.rec.aa.aln -m MFP -nt auto

HRS assumptions

Traditional model relyes on the assumption that evolution of DNA sequences follow a markov process. However, again, it may not always be true. For this reason IQ-TREE implements three matched-pairs tests of symmetry (Naser-Khdour et al., 2019) to test the two assumptions of stationarity and homogeneity. Let's perform this test using:

iqtree -S data/example_bias/alignments --symtest-only

The three important columns in the data/example_bias/alignments.symtest.csv output are:

  • SymPval: a small p-value (p < 0.05) indicates that the assumptions of stationarity or homogeneity or both is rejected.
  • MarPval: a small p-value means that the assumption of stationarity is rejected.
  • IntPval: a small p-value means that the homogeneity assumption is reject.

task!

Which alignments reject the two assumptions of stationarity and homogeneity?


LMAP

Likelihood MAPping can be uesd both to test some specific hypotheses, but also to understand how informative a gene is.

We can perform such an analysis on all our genes using a for loop:

for i in $(ls data/example_bias/alignments/*aln); do iqtree -s $i -lmap 5000 -n 0 -m LG; done

Here:

  • -lmap Number of quartets for likelihood mapping analysis
  • -n 0 tells IQ-TREE to stop the analysis right after running the likelihood mapping

... and we assumed an LG model to spped up computations!

task!

Looking at the .svg files, which alignments is the least phylogenetically infomative?


task!

Now that you have gather some informations on your MSAs, you can perform phylogenetic subsampling! Concatenate only the partitions which do respect phylogenetic assumptions and which have a strong phylogenetic signal ... and see what happens :-P

task!

I have attache my two best phylogenetic hypotheses for Enterobacterlaes ...

  • concatenation_all_genes_site_heterogeneous_SR4.nwk was inferred using the concatenation of 103 single copy and ubiquitous genes (23,531 alignment positions) with RS4 recoding in combiantion with mixture models (for every C10-C60 matrix I summed the frequencies of all aminoacids in a RS4 category).
  • SpeciesRax_all_genes_SR4.nwk was inferred using an approach that leverages orthologs and paralogs (SpeciesRax) in combinatio with RS4 recoding. Here 10,564 orthogroups were used, for a total of 2,869,736 alignment positions.

This is some heavy phylogenetics, and the computation power used was a lot.


main