W6_L11
Complex substitution models
complex substitution models
Sequence data that have evolved under heterotachy - i.e. rate variation across sites and lineages - are known to mislead phylogenetic inference. Tyipically, analyses which do not assume a single set of branchlengths across partitions and include the gamma parameter deal well with heterotachy. However, more complex substitution models can be essential for accurately modeling this evolutionary process.
mixture models
Both partition and mixture models allow more than one substitution model along the sequences ... but what is the difference?!
-
a partition model assigns each alignment site a given specific model
-
a mixture model will compute each site probability (or weight) of belonging to each mixture class (also categories or components)
Mixture models account for heterogeneity in the evolutionary process across sites in a sequence alignment.
To start, use the following command:
iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m "MIX{JC,GTR}+G4" -wspmr --prefix mixture
Here, we specfied a mixture model via the MIX keyword in the model string -m "MIX{JC,GTR}+G4". The components - a JC model, and a GTR model - are given in curly brackets and separated with a comma. Furthermore, other than the two mixture components we specify four Gamma rate categories using +G4. Overall there are eight mixture components.
IQ-TREE will then estimate the parameters of each mixture components as well as their weights: the proportion of sites belonging to each component. The flag -wspmr can be used to write site posterior probabilities per mixture class and rate category to a .siteprob file. Take a look at it using head mixture.siteprob.
As for traditional model selection, also when using mixture models we need to find some objective criteria. MixtureFinder is an approach to select the optimum number of classes and the substitution model in each class for a mixture model of Q matrices (Ren et al. 2024).
iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m MIX+MFP -mset JC,GTR
-mrate E,I,G,I+G -qmax 5 --prefix mixture
Here:
-m MIX+MFPselect mixture model and then do the tree search-msetselect only among the specified substitution models-mrateselect among the rate heterogeneity across sites models-qmaxthe maximum number of Q-mixture classes
profile mixture models
Standard models assume that all sites in a protein alignment share the same equilibrium frequencies. Profile mixture models fix this by allowing different sites to evolve under different amino acid compositions. These empirical protein models improve model fit and reducing systematic errors like long-branch attraction.
iqtree -s data/example_alignements/aa/concatenated.out -mset LG+C60 -mrate G4 --prefix mixture
This command will use the LG model in combination with 60 empirically derived profiles - each a 20-dimensional vector of amino acid frequencies - originally developed from large datasets of protein alignments (Le et al. 2008).
heterotachy models
A cool model to adressed heterotachy is the General Heterogeneous evolution On a Single Topology (GHOST) model (Crotty et al. 2020).
GHOST is an edge-unlinked mixture model consisting of several site classes, each having a separate set of model parameters and edge lengths on the same tree topology. In contrast to an edge-unlinked partition model, the GHOST model does not require the a priori data partitioning, a possible source of model misspecification.
The GHOST model with k mixture classes is executed by adding +H to the flag -m. For example if one wants to fit a GHOST model with 4 classes in conjunction with the GTR one would use the following command:
iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m GTR+H4 --prefix GHOST
By default the above command will link GTR parameters across all classes. If you want to unlink GTR parameters, so that IQ-TREE estimates them separately for each class, replace +H4 by *H4:
iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m GTR*H4 --prefix GHOST
Note that this infers one set of empirical base frequencies and apply those to all classes. If one wishes to infer separate base frequencies for each class then the +FO option is required:
iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m GTR+FO*H4 --prefix GHOST
Now take a look at the SUBSTITUTION PROCESS block within the .iqtree file and discuss it!
Mixture model of substitution: MIX{GTR+FO,GTR+FO,GTR+FO,GTR+FO}*H4
No Component Rate Weight Parameters
1 GTR 1.0000 1.0000 GTR{3.23531,5.09771,5.63821,4.34027,10.9183}+FO{0.134545,0.200869,0.313036,0.35155}
2 GTR 1.0000 1.0000 GTR{43.4898,4.25695,4.84095,13.5686,1.97549}+FO{0.262485,0.423655,7.3671e-05,0.313786}
3 GTR 1.0000 1.0000 GTR{0.138133,1.71075,0.562908,0.777249,5.75425}+FO{0.422682,0.235491,0.183529,0.158298}
4 GTR 1.0000 1.0000 GTR{0.0031294,99.9989,0.0015635,99.9995,0.000203349}+FO{0.457386,0.457386,4.57386e-05,0.0851817}
Model of rate heterogeneity: Rate heterotachy with 4 categories
Heterotachy weights: 0.5729 0.0209 0.1846 0.2215
Heterotachy tree lengths: 0.3894 12.9526 3.2180 0.2476
Category Relative_rate Proportion
1 1.0000 0.5729
2 1.0000 0.0209
3 1.0000 0.1846
4 1.0000 0.2215
Again, the -wspmr option will generate a .siteprob output file.