← Back to syllabus

W3_L07

MK models of molecular evolution

Practical Notes

MK models of molecular evolution

Before starting you should have an aligned & filtered MSA! you can find some here

For all ML analyses we will use IQ-TREE. Some of the greates advantages of this software are:

  • It can do almost everything (take a look at the manual).
  • Each analysis can be highly customizable.
  • Really faster compared to other softwares.
  • Very easy to run as you want (if you study a bit the manual).

For a quick-and-dirty Model Selection on an MSA we can just type:

iqtree2 -s data/example_alignements/N0.HOG0000096.ref.aln -m TESTONLY

You can see that it instantly prints some usefull stats which reconnect to concepts we stubled upon earlier - parsimony-informative sites! Some other are new, like distinct patterns, which refers to the unique configurations of nucleotides or amino acids observed across all sequences at specific alignment positions. The test performed here indicates no significant compositional heterogeneity among the sequences, this is somethin which we will get back to in the lesson on biases.

Alignment most likely contains protein sequences
WARNING: 2 sites contain only gaps or ambiguous characters.
Alignment has 8 sequences with 503 columns, 358 distinct patterns
109 parsimony-informative, 220 singleton sites, 174 constant sites
                Gap/Ambiguity  Composition  p-value
   1  A_granulata      23.06%    passed    100.00%
   2  B_glabrata       24.85%    passed     99.23%
   3  C_virginica      23.06%    passed     99.90%
   4  H_robusta         9.34%    passed     66.99%
   5  L_gigantea       22.86%    passed    100.00%
   6  O_bimaculoides   24.85%    passed     98.19%
   7  P_fucata         30.82%    passed     99.82%
   8  S_constricta     24.06%    passed     99.88%
****  TOTAL            22.86%  0 sequences failed composition chi2 test (p-value<5%; df=19)

You can appreciate from what is shown in the standard outpout that the ptogram is doing a tree before model selection! And it does use both parsimony and likelihood ... However, when the models started to be tested, and by now you should have a grasp of what the text here means!


 No. Model         -LnL         df  AIC          AICc         BIC
  1  LG            4783.673     13  9593.346     9594.090     9648.214
  2  LG+I          4762.582     14  9553.164     9554.025     9612.252
  3  LG+G4         4730.778     14  9489.556     9490.416     9548.644
  4  LG+I+G4       4731.634     15  9493.267     9494.253     9556.576
  7  LG+F+G4       4680.996     33  9427.992     9432.777     9567.272
  8  LG+F+I+G4     4681.699     34  9431.398     9436.484     9574.898

The -m TESTONLY word stands for standard model selection, which tells IQ-TREE to perform ModelFinder without taking into consideration FreeRate model (look at them here. You should just add -m MF flag to compute a full Model Selection): this tool computes the log-likelihoods of an initial parsimony tree for many different models and the Akaike information criterion (AIC), corrected Akaike information criterion (AICc), and the Bayesian information criterion (BIC). Then ModelFinder chooses the model that minimizes the BIC score (you can also change to AIC or AICc by adding the option -AIC or -AICc, respectively)

The output are:

  • ...ckp.gz: checkpoint file.
  • ...iqtree: summary human readble file.
  • ...log: log file.
  • ...model.gz computer readble result file.
  • ...treefile Tree used for ModelFinder.

Let's have a look at the iqtree summary file, where we can find a lot of usefull informations:

cat data/example_alignements/N0.HOG0000096.ref.aln.iqtree
  • Q.insect: This denotes a user-defined or empirical amino acid substitution matrix obtained from insect sequences. It is expected to better represent the evolutionary patterns of particular groups, but they are just numbers. Do not be surprise when Q.plant is your best model for an animal!

  • +F: This component indicates that the model incorporates empirical amino acid frequencies, calculated directly from the provided alignment data. Alternatives are: +FQ = equal base frequencies; +FO = optimized base frequencies by maximum-likelihood.

  • +G4: This signifies the use of a discrete Gamma distribution with four rate categories to model rate heterogeneity across sites.Additionally: +I allows for a proportion of invariable sites and +R is the FreeRate model that generalizes the +G model by relaxing the assumption of Gamma-distributed rates.

The -m flag can also specify a model name to use during the analyses, which can be a priori specified by the user (here's a list of models implemented in ModelFinder). Usually this is done to save computational time when the best-fit model has been already pre-computed.


Quiz!

  • Is JTT or LG a better model for our alignement?
  • Here you can find a nucleotide alignement ... can you tell me the probability of A↔C?!?

main