← Back to syllabus

W6_L11

Complex substitution models

Practical Notes

complex substitution models

Sequence data that have evolved under heterotachy - i.e. rate variation across sites and lineages - are known to mislead phylogenetic inference. Tyipically, analyses which do not assume a single set of branchlengths across partitions and include the gamma parameter deal well with heterotachy. However, more complex substitution models can be essential for accurately modeling this evolutionary process.


mixture models

Both partition and mixture models allow more than one substitution model along the sequences ... but what is the difference?!

  • a partition model assigns each alignment site a given specific model

  • a mixture model will compute each site probability (or weight) of belonging to each mixture class (also categories or components)

Mixture models account for heterogeneity in the evolutionary process across sites in a sequence alignment.

To start, use the following command:

iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m "MIX{JC,GTR}+G4" -wspmr --prefix mixture

Here, we specfied a mixture model via the MIX keyword in the model string -m "MIX{JC,GTR}+G4". The components - a JC model, and a GTR model - are given in curly brackets and separated with a comma. Furthermore, other than the two mixture components we specify four Gamma rate categories using +G4. Overall there are eight mixture components.

IQ-TREE will then estimate the parameters of each mixture components as well as their weights: the proportion of sites belonging to each component. The flag -wspmr can be used to write site posterior probabilities per mixture class and rate category to a .siteprob file. Take a look at it using head mixture.siteprob.


As for traditional model selection, also when using mixture models we need to find some objective criteria. MixtureFinder is an approach to select the optimum number of classes and the substitution model in each class for a mixture model of Q matrices (Ren et al. 2024).

iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m MIX+MFP -mset JC,GTR 
-mrate E,I,G,I+G -qmax 5 --prefix mixture

Here:

  • -m MIX+MFP select mixture model and then do the tree search
  • -mset select only among the specified substitution models
  • -mrate select among the rate heterogeneity across sites models
  • -qmax the maximum number of Q-mixture classes

profile mixture models

Standard models assume that all sites in a protein alignment share the same equilibrium frequencies. Profile mixture models fix this by allowing different sites to evolve under different amino acid compositions. These empirical protein models improve model fit and reducing systematic errors like long-branch attraction.

iqtree -s data/example_alignements/aa/concatenated.out -mset LG+C60 -mrate G4 --prefix mixture 

This command will use the LG model in combination with 60 empirically derived profiles - each a 20-dimensional vector of amino acid frequencies - originally developed from large datasets of protein alignments (Le et al. 2008).


heterotachy models

A cool model to adressed heterotachy is the General Heterogeneous evolution On a Single Topology (GHOST) model (Crotty et al. 2020).

GHOST is an edge-unlinked mixture model consisting of several site classes, each having a separate set of model parameters and edge lengths on the same tree topology. In contrast to an edge-unlinked partition model, the GHOST model does not require the a priori data partitioning, a possible source of model misspecification.

The GHOST model with k mixture classes is executed by adding +H to the flag -m. For example if one wants to fit a GHOST model with 4 classes in conjunction with the GTR one would use the following command:

iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m GTR+H4 --prefix GHOST

By default the above command will link GTR parameters across all classes. If you want to unlink GTR parameters, so that IQ-TREE estimates them separately for each class, replace +H4 by *H4:

iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m GTR*H4 --prefix GHOST

Note that this infers one set of empirical base frequencies and apply those to all classes. If one wishes to infer separate base frequencies for each class then the +FO option is required:

iqtree -s data/example_alignements/nt/Rhabdomeric1.nt.aln -m GTR+FO*H4 --prefix GHOST

Now take a look at the SUBSTITUTION PROCESS block within the .iqtree file and discuss it!

Mixture model of substitution: MIX{GTR+FO,GTR+FO,GTR+FO,GTR+FO}*H4

  No  Component      Rate    Weight   Parameters
   1  GTR           1.0000   1.0000   GTR{3.23531,5.09771,5.63821,4.34027,10.9183}+FO{0.134545,0.200869,0.313036,0.35155}
   2  GTR           1.0000   1.0000   GTR{43.4898,4.25695,4.84095,13.5686,1.97549}+FO{0.262485,0.423655,7.3671e-05,0.313786}
   3  GTR           1.0000   1.0000   GTR{0.138133,1.71075,0.562908,0.777249,5.75425}+FO{0.422682,0.235491,0.183529,0.158298}
   4  GTR           1.0000   1.0000   GTR{0.0031294,99.9989,0.0015635,99.9995,0.000203349}+FO{0.457386,0.457386,4.57386e-05,0.0851817}

Model of rate heterogeneity: Rate heterotachy with 4 categories
Heterotachy weights:      0.5729 0.0209 0.1846 0.2215
Heterotachy tree lengths: 0.3894 12.9526 3.2180 0.2476

 Category  Relative_rate  Proportion
  1         1.0000         0.5729
  2         1.0000         0.0209
  3         1.0000         0.1846
  4         1.0000         0.2215

Again, the -wspmr option will generate a .siteprob output file.


main