W3_L08
Maximum Likelihood (ML)
Maximum likelihood (ML)
Concatenation of MSAs
Let's start by having a look at the AMAS help:
AMAS.py --help
usage: AMAS <command> [<args>]
The AMAS commands are:
concat Concatenate input alignments
convert Convert to other file format
replicate Create replicate data sets for phylogenetic jackknife
split Split alignment according to a partitions file
summary Write alignment summary
remove Remove taxa from alignment
translate Translate DNA alignment into protein alignment
trim Remove columns from alignment
Use AMAS <command> -h for help with arguments of the command of interest
positional arguments:
command Subcommand to run
optional arguments:
-h, --help show this help message and exit
and specifically, to the concat function:
AMAS.py concat --help
usage: AMAS.py [-h] [-p CONCAT_PART] [-t CONCAT_OUT] [-u {fasta,phylip,nexus,phylip-int,nexus-int}] [-y {nexus,raxml,unspecified}] [-e] [-c CORES] -i IN_FILES [IN_FILES ...] -f
{fasta,phylip,nexus,phylip-int,nexus-int} -d {aa,dna}
Concatenate input alignments
optional arguments:
-h, --help show this help message and exit
-p CONCAT_PART, --concat-part CONCAT_PART
File name for the concatenated alignment partitions. Default: 'partitions.txt'
-t CONCAT_OUT, --concat-out CONCAT_OUT
File name for the concatenated alignment. Default: 'concatenated.out'
-u {fasta,phylip,nexus,phylip-int,nexus-int}, --out-format {fasta,phylip,nexus,phylip-int,nexus-int}
File format for the output alignment. Default: fasta
-y {nexus,raxml,unspecified}, --part-format {nexus,raxml,unspecified}
Format of the partitions file. Default: 'unspecified'
-e, --check-align Check if input sequences are aligned. Default: no check
-c CORES, --cores CORES
Number of cores used. Default: 1
required arguments:
-i IN_FILES [IN_FILES ...], --in-files IN_FILES [IN_FILES ...]
Alignment files to be taken as input. You can specify multiple files using wildcards (e.g. --in-files *fasta)
-f {fasta,phylip,nexus,phylip-int,nexus-int}, --in-format {fasta,phylip,nexus,phylip-int,nexus-int}
The format of input alignment
-d {aa,dna}, --data-type {aa,dna}
Type of data
Now we can concatenate our alignments:
AMAS.py concat -i data/example_alignements/aa/*ref.aln -f fasta -d aa --part-format nexus
mv concatenated.out data/example_alignements/aa/concatenated.out
mv partitions.txt data/example_alignements/aa/partitions.txt
As a results you should have two files, a merged MSA and a partition file, where are stored the coordinate of the gene boundaries. Let's have a look at it:
#NEXUS
Begin sets;
charset p1_N0 = 1-1532;
charset p2_N0 = 1533-1771;
charset p3_N0 = 1772-3456;
charset p4_N0 = 3457-3876;
charset p5_N0 = 3877-4845;
charset p6_N0 = 4846-9727;
charset p7_N0 = 9728-10328;
charset p8_N0 = 10329-11046;
charset p9_N0 = 11047-11594;
charset p10_N0 = 11595-12097;
end;
NOTE: We are working with proteins so our partitions file just store the starting and the ending position of each gene. However, if we were using nucleotide sequences of PCGs, we might also want the coordinates of all the first, second and third codon positions. Indeed, they evolve differently due to the gen code degeneracy and codon usage bias.
Now we are ready to carry out our model selection in a Maximum Likelihood framework...
model selection
What we've seen until now is the process through which we select the "best" model of evolution for our sequence data, according to a metric of choice. However, when using concatenated alignments we don't want to loose a priori the information of the single genes boundaries. So let's use our partition file:
iqtree -s data/example_alignements/concatenated.out -Q data/example_alignements/partitions.txt -m TESTONLY --prefix edge-unlinked
where the -Q option allows to take into consideration heterotachy ( allowing each partitions to have its own set of branch lengths. NB very parameter rich, take a look at this pubblication). Options are:
-pedge-linked proportional partition model - all partitions share the same set of branch lengths - unrealistic!-qedge-linked equal partition model - each partition to has its evolution rate but in a proportional way - recommended!-Qedge-unlinked partition model - each partition has its own set of branchlengths - too complex!-Sseparate tree inference ...
The resulting file should look something like this:
#nexus
begin sets;
charset p1_N0 = 1-1532;
charset p2_N0 = 1533-1771;
charset p3_N0 = 1772-3456;
charset p4_N0 = 3457-3876;
charset p5_N0 = 3877-4845;
charset p6_N0 = 4846-9727;
charset p7_N0 = 9728-10328;
charset p8_N0 = 10329-11046;
charset p9_N0 = 11047-11594;
charset p10_N0 = 11595-12097;
charpartition mymodels =
Q.insect+G4: p1_N0,
Q.yeast+I+G4: p2_N0,
LG+F+I+G4: p3_N0,
Q.insect+G4: p4_N0,
LG+G4: p5_N0,
Q.insect+I+G4: p6_N0,
WAG+G4: p7_N0,
LG+F+G4: p8_N0,
Q.insect+G4: p9_N0,
Q.insect+G4: p10_N0;
end;
Here we can see the best-fit model for each partition, but take a look at the iqtree file for detailed informations.
Now, we can rerun the analyses using the -q option:
iqtree -s data/example_alignements/concatenated.out -q data/example_alignements/partitions.txt -m TESTONLY --prefix edge-linked
quiz!
- The choice between an edge-linked and edge-unlinked model can be made by looking at their BIC scores. Which one is best? Remember, the lower the BIC score, the higher the fitness of the model.
partitioned inference:
There are several reasons for which we wanto to merge partitions which can be described by similar models of evolution, possibly including a better estimation of model parameters (less parameters to estimates = better estimation). ModelFinder implements a greedy strategy (Lanfear et al., 2012) that starts with the full partition model and subsequentially merges two genes until the model fit does not increase any further. Since this process can be very long when analyszing a lot of partitions, ModelFinder implements the relaxed hierarchical clustering algorithm (Lanfear et al., 2014), which is invoked via -rcluster option and can save a great amount of computational time.
To carry out simultaneously model of evolution & partitioning scheme selection let's use:
iqtree -s data/example_alignements/concatenated.out -Q data/example_alignements/partitions.txt -m MFP -rcluster 10
where with -rcluster 10 we are only examining the top 10% partition merging schemes (better to carry on a full greedy search in real scenarios).
Now we can have a look at the resulting partitioning scheme ...
further reading:
Some great basic and advanced tutorials are available from the authors themselves on ModelFinder and IQ-TREE.