W3_L07
MK models of molecular evolution
MK models of molecular evolution
Before starting you should have an aligned & filtered MSA! you can find some here
For all ML analyses we will use IQ-TREE. Some of the greates advantages of this software are:
- It can do almost everything (take a look at the manual).
- Each analysis can be highly customizable.
- Really faster compared to other softwares.
- Very easy to run as you want (if you study a bit the manual).
For a quick-and-dirty Model Selection on an MSA we can just type:
iqtree2 -s data/example_alignements/N0.HOG0000096.ref.aln -m TESTONLY
You can see that it instantly prints some usefull stats which reconnect to concepts we stubled upon earlier - parsimony-informative sites! Some other are new, like distinct patterns, which refers to the unique configurations of nucleotides or amino acids observed across all sequences at specific alignment positions. The test performed here indicates no significant compositional heterogeneity among the sequences, this is somethin which we will get back to in the lesson on biases.
Alignment most likely contains protein sequences
WARNING: 2 sites contain only gaps or ambiguous characters.
Alignment has 8 sequences with 503 columns, 358 distinct patterns
109 parsimony-informative, 220 singleton sites, 174 constant sites
Gap/Ambiguity Composition p-value
1 A_granulata 23.06% passed 100.00%
2 B_glabrata 24.85% passed 99.23%
3 C_virginica 23.06% passed 99.90%
4 H_robusta 9.34% passed 66.99%
5 L_gigantea 22.86% passed 100.00%
6 O_bimaculoides 24.85% passed 98.19%
7 P_fucata 30.82% passed 99.82%
8 S_constricta 24.06% passed 99.88%
**** TOTAL 22.86% 0 sequences failed composition chi2 test (p-value<5%; df=19)
You can appreciate from what is shown in the standard outpout that the ptogram is doing a tree before model selection! And it does use both parsimony and likelihood ... However, when the models started to be tested, and by now you should have a grasp of what the text here means!
No. Model -LnL df AIC AICc BIC
1 LG 4783.673 13 9593.346 9594.090 9648.214
2 LG+I 4762.582 14 9553.164 9554.025 9612.252
3 LG+G4 4730.778 14 9489.556 9490.416 9548.644
4 LG+I+G4 4731.634 15 9493.267 9494.253 9556.576
7 LG+F+G4 4680.996 33 9427.992 9432.777 9567.272
8 LG+F+I+G4 4681.699 34 9431.398 9436.484 9574.898
The -m TESTONLY word stands for standard model selection, which tells IQ-TREE to perform ModelFinder without taking into consideration FreeRate model (look at them here. You should just add -m MF flag to compute a full Model Selection): this tool computes the log-likelihoods of an initial parsimony tree for many different models and the Akaike information criterion (AIC), corrected Akaike information criterion (AICc), and the Bayesian information criterion (BIC). Then ModelFinder chooses the model that minimizes the BIC score (you can also change to AIC or AICc by adding the option -AIC or -AICc, respectively)
The output are:
- ...ckp.gz: checkpoint file.
- ...iqtree: summary human readble file.
- ...log: log file.
- ...model.gz computer readble result file.
- ...treefile Tree used for ModelFinder.
Let's have a look at the iqtree summary file, where we can find a lot of usefull informations:
cat data/example_alignements/N0.HOG0000096.ref.aln.iqtree
-
Q.insect: This denotes a user-defined or empirical amino acid substitution matrix obtained from insect sequences. It is expected to better represent the evolutionary patterns of particular groups, but they are just numbers. Do not be surprise when Q.plant is your best model for an animal!
-
+F: This component indicates that the model incorporates empirical amino acid frequencies, calculated directly from the provided alignment data. Alternatives are: +FQ = equal base frequencies; +FO = optimized base frequencies by maximum-likelihood.
-
+G4: This signifies the use of a discrete Gamma distribution with four rate categories to model rate heterogeneity across sites.Additionally: +I allows for a proportion of invariable sites and +R is the FreeRate model that generalizes the +G model by relaxing the assumption of Gamma-distributed rates.
The -m flag can also specify a model name to use during the analyses, which can be a priori specified by the user (here's a list of models implemented in ModelFinder). Usually this is done to save computational time when the best-fit model has been already pre-computed.
Quiz!
- Is JTT or LG a better model for our alignement?
- Here you can find a nucleotide alignement ... can you tell me the probability of A↔C?!?