← Back to syllabus

W3_L08

Maximum Likelihood (ML)

Practical Notes

Maximum likelihood (ML)

Concatenation of MSAs

Let's start by having a look at the AMAS help:

AMAS.py --help
usage: AMAS <command> [<args>]

The AMAS commands are:
  concat      Concatenate input alignments
  convert     Convert to other file format
  replicate   Create replicate data sets for phylogenetic jackknife
  split       Split alignment according to a partitions file
  summary     Write alignment summary
  remove      Remove taxa from alignment
  translate   Translate DNA alignment into protein alignment
  trim        Remove columns from alignment

Use AMAS <command> -h for help with arguments of the command of interest

positional arguments:
  command     Subcommand to run

optional arguments:
  -h, --help  show this help message and exit

and specifically, to the concat function:

AMAS.py concat --help
usage: AMAS.py [-h] [-p CONCAT_PART] [-t CONCAT_OUT] [-u {fasta,phylip,nexus,phylip-int,nexus-int}] [-y {nexus,raxml,unspecified}] [-e] [-c CORES] -i IN_FILES [IN_FILES ...] -f
               {fasta,phylip,nexus,phylip-int,nexus-int} -d {aa,dna}

Concatenate input alignments

optional arguments:
  -h, --help            show this help message and exit
  -p CONCAT_PART, --concat-part CONCAT_PART
                        File name for the concatenated alignment partitions. Default: 'partitions.txt'
  -t CONCAT_OUT, --concat-out CONCAT_OUT
                        File name for the concatenated alignment. Default: 'concatenated.out'
  -u {fasta,phylip,nexus,phylip-int,nexus-int}, --out-format {fasta,phylip,nexus,phylip-int,nexus-int}
                        File format for the output alignment. Default: fasta
  -y {nexus,raxml,unspecified}, --part-format {nexus,raxml,unspecified}
                        Format of the partitions file. Default: 'unspecified'
  -e, --check-align     Check if input sequences are aligned. Default: no check
  -c CORES, --cores CORES
                        Number of cores used. Default: 1

required arguments:
  -i IN_FILES [IN_FILES ...], --in-files IN_FILES [IN_FILES ...]
                        Alignment files to be taken as input. You can specify multiple files using wildcards (e.g. --in-files *fasta)
  -f {fasta,phylip,nexus,phylip-int,nexus-int}, --in-format {fasta,phylip,nexus,phylip-int,nexus-int}
                        The format of input alignment
  -d {aa,dna}, --data-type {aa,dna}
                        Type of data

Now we can concatenate our alignments:

AMAS.py concat -i data/example_alignements/aa/*ref.aln -f fasta -d aa --part-format nexus
mv concatenated.out  data/example_alignements/aa/concatenated.out
mv partitions.txt  data/example_alignements/aa/partitions.txt

As a results you should have two files, a merged MSA and a partition file, where are stored the coordinate of the gene boundaries. Let's have a look at it:

#NEXUS

Begin sets;
	charset p1_N0 = 1-1532;
	charset p2_N0 = 1533-1771;
	charset p3_N0 = 1772-3456;
	charset p4_N0 = 3457-3876;
	charset p5_N0 = 3877-4845;
	charset p6_N0 = 4846-9727;
	charset p7_N0 = 9728-10328;
	charset p8_N0 = 10329-11046;
	charset p9_N0 = 11047-11594;
	charset p10_N0 = 11595-12097;
end;

NOTE: We are working with proteins so our partitions file just store the starting and the ending position of each gene. However, if we were using nucleotide sequences of PCGs, we might also want the coordinates of all the first, second and third codon positions. Indeed, they evolve differently due to the gen code degeneracy and codon usage bias.

Now we are ready to carry out our model selection in a Maximum Likelihood framework...


model selection

What we've seen until now is the process through which we select the "best" model of evolution for our sequence data, according to a metric of choice. However, when using concatenated alignments we don't want to loose a priori the information of the single genes boundaries. So let's use our partition file:

iqtree -s data/example_alignements/concatenated.out -Q data/example_alignements/partitions.txt -m TESTONLY --prefix edge-unlinked

where the -Q option allows to take into consideration heterotachy ( allowing each partitions to have its own set of branch lengths. NB very parameter rich, take a look at this pubblication). Options are:

  • -p edge-linked proportional partition model - all partitions share the same set of branch lengths - unrealistic!
  • -q edge-linked equal partition model - each partition to has its evolution rate but in a proportional way - recommended!
  • -Q edge-unlinked partition model - each partition has its own set of branchlengths - too complex!
  • -S separate tree inference ...

The resulting file should look something like this:

#nexus
begin sets;
  charset p1_N0 = 1-1532;
  charset p2_N0 = 1533-1771;
  charset p3_N0 = 1772-3456;
  charset p4_N0 = 3457-3876;
  charset p5_N0 = 3877-4845;
  charset p6_N0 = 4846-9727;
  charset p7_N0 = 9728-10328;
  charset p8_N0 = 10329-11046;
  charset p9_N0 = 11047-11594;
  charset p10_N0 = 11595-12097;
  charpartition mymodels =
    Q.insect+G4: p1_N0,
    Q.yeast+I+G4: p2_N0,
    LG+F+I+G4: p3_N0,
    Q.insect+G4: p4_N0,
    LG+G4: p5_N0,
    Q.insect+I+G4: p6_N0,
    WAG+G4: p7_N0,
    LG+F+G4: p8_N0,
    Q.insect+G4: p9_N0,
    Q.insect+G4: p10_N0;
end;

Here we can see the best-fit model for each partition, but take a look at the iqtree file for detailed informations.

Now, we can rerun the analyses using the -q option:

iqtree -s data/example_alignements/concatenated.out -q data/example_alignements/partitions.txt -m TESTONLY --prefix edge-linked

quiz!

  • The choice between an edge-linked and edge-unlinked model can be made by looking at their BIC scores. Which one is best? Remember, the lower the BIC score, the higher the fitness of the model.

partitioned inference:

There are several reasons for which we wanto to merge partitions which can be described by similar models of evolution, possibly including a better estimation of model parameters (less parameters to estimates = better estimation). ModelFinder implements a greedy strategy (Lanfear et al., 2012) that starts with the full partition model and subsequentially merges two genes until the model fit does not increase any further. Since this process can be very long when analyszing a lot of partitions, ModelFinder implements the relaxed hierarchical clustering algorithm (Lanfear et al., 2014), which is invoked via -rcluster option and can save a great amount of computational time.

To carry out simultaneously model of evolution & partitioning scheme selection let's use:

iqtree -s data/example_alignements/concatenated.out -Q data/example_alignements/partitions.txt -m MFP -rcluster 10

where with -rcluster 10 we are only examining the top 10% partition merging schemes (better to carry on a full greedy search in real scenarios).

Now we can have a look at the resulting partitioning scheme ...


further reading:

Some great basic and advanced tutorials are available from the authors themselves on ModelFinder and IQ-TREE.


main