W6_L12
Discordance, ILS & the coalescent
discordance, ILS the coalescencent
gene tree(s) inference
Coalescent -based methods are (mostly) based on gene trees .. so let's start by inferring some:
iqtree2 -S data/example_alignements/aa --prefix loci
Then we can see that there ten different trees have been produced in the file loci.treefile. It is istantly clear that these trees are different ... but what if we want to measure their discordance?
For this purpose, IQ-TREE implements the Robinson-Foulds distances calculation, a metric able to compare topologies between binary and unrooted trees (see here for an in-depth explanation on computational methods). Let's calculate them using:
iqtree -t loci.treefile -rf_all
The result of the command is written in the loci.treefile.rfdist file. However ... these distances are not normalyzed ...
task!
A friend of yours has told you that
ete3implements a lot of distance metrics between trees ... install it and calculate some normalyzed distances.
Eventually, reflect which can be the (biological) reasons for which gene tree(s) can differ from the species tree. We will explore the technical aspect in thr practical no. 14.
coalescence-based species-tree inference
Now that we have some gene trees we can move to the actual coalescence-based inference, using ASTRAL. That is a summary coalescent-based method for inferring species trees from a set of gene trees. Run it with:
astral -i loci.treefile
This is what will be printed on the screen, reflecting that 10 gene trees were provided as input, containing a total of 8 species. ASTRAL will ran 4 rounds of optimization to explore tree space and 4 trees were sampled or explored per round.
#Genetrees: 10
#Species: 8
#Rounds: 4
#Samples: 4
#Threads: 1
Then:
#NNI moves:0/36
(((((P_fucata,C_virginica),S_constricta),(B_glabrata,L_gigantea)),(A_granulata,O_bimaculoides)),H_robusta);
#NNI moves:0/36
((((((P_fucata,C_virginica),S_constricta),(L_gigantea,B_glabrata)),H_robusta),O_bimaculoides),A_granulata);
#NNI moves:0/36
((((((A_granulata,O_bimaculoides),H_robusta),(B_glabrata,L_gigantea)),S_constricta),P_fucata),C_virginica);
#NNI moves:0/36
((((((O_bimaculoides,A_granulata),H_robusta),(L_gigantea,B_glabrata)),S_constricta),P_fucata),C_virginica);
Initial score: 470
Initial tree: ((O_bimaculoides,(H_robusta,((L_gigantea,B_glabrata),(S_constricta,(P_fucata,C_virginica))))),A_granulata);
*** Subsample Process ***
#NNI moves:0/36
((((((C_virginica,P_fucata),S_constricta),(L_gigantea,B_glabrata)),H_robusta),O_bimaculoides),A_granulata);
#NNI moves:0/36
((((((O_bimaculoides,A_granulata),H_robusta),(L_gigantea,B_glabrata)),S_constricta),P_fucata),C_virginica);
#NNI moves:0/36
(((((A_granulata,O_bimaculoides),H_robusta),((P_fucata,C_virginica),S_constricta)),L_gigantea),B_glabrata);
#NNI moves:0/36
(((((C_virginica,P_fucata),S_constricta),(L_gigantea,B_glabrata)),(A_granulata,O_bimaculoides)),H_robusta);
Current score: 470
Current tree: ((O_bimaculoides,((((C_virginica,P_fucata),S_constricta),(L_gigantea,B_glabrata)),H_robusta)),A_granulata);
Final Tree: ((O_bimaculoides,((((C_virginica,P_fucata),S_constricta),(L_gigantea,B_glabrata)),H_robusta)),A_granulata);
((O_bimaculoides:0.817934,((((C_virginica:0.221151,P_fucata:0.251133)0.999704:0.167198,S_constricta:0.415062)0.974288:0.027019,(L_gigantea:0.304880,B_glabrata:0.491562)0.821452:0.029472)0.791118:0.043285,H_robusta:0.724897)0.495476:0.007902):0.143127,A_granulata:0.143127);
Score: 470
NI Moves: #NNI moves: 0/36 indicates: ASTRAL explored 36 possible Nearest Neighbor Interchange (NNI) rearrangements on each proposed tree. It accepted 0 moves — meaning the topology wasn't improved by these moves, because it had already found an optimal soloution.
ASTRAL starts the optimization of an initial tree, which gets a score. The higher the score, the better the fit! But how is the score calculated? From each gene tree, ASTRAL extracts all quartets. Then, for each quartet in the candidate species tree, ASTRAL checks how many gene trees support the same quartet topology. The final ASTRAL score is the sum of all matches across all quartets and all gene trees. Here, the final species tree topology explains 470 quartets from the gene trees — i.e., for 470 quartet cases, the gene tree topology matches the inferred species tree. This is the raw count of agreement, not a percentage or likelihood. A higher score means more agreement between gene trees and species tree which implies stronger support, under the multispecies coalescent model.
During the *** Subsample Process *** ASTRAL creates random subsets of loci (gene trees) and taxa (species). This is repeated multiple times (e.g. 4 rounds in our run) The goal is to generate smaller “views” of the data that still reflect key evolutionary signals but are computationally cheaper to analyze. Then ASTRAL infers a species tree using only the subset of taxa and/or gene trees: these subsampled trees are often less prone to noise and can highlight different quartet relationships. ASTRAL aggregates these mini-trees to extract frequent and informative bipartitions. These identified from the subsampled trees are used to generate new candidate species trees, which might include new groupings not found in the initial tree.
If you did not specify any output file, the final tree will be printed to the screen in Newick format, with:
- branch lengths: numbers after colons typically represent coalescent units (proxy for time).
- branch support: numbers between parentheses represent local posterior probability (LPP) for that internal node.
use of paralogs in species-tree inference
ASTRAL-Pro is an extension of ASTRAL specifically designed to handle gene duplication and loss - i.e., it can incorporate paralogous gene copies into species tree inference.
We have to start by moving in the folder data/example_paralogs and then align all the genes using:
for i in *fa; do mafft $i > $i.aln; done
Then, we can infer gene trees using IQTREE:
iqtree -S data/example_paralogs/ --prefix paralogs
An essential step is the reformmating of loci for what is needed from one program (IQTREE) to what is needed to another program (ASTRAL). The problem here is that IQTREE does niot allow tips to be named the same, so the gene IDs were kept until here, but ASTRAL wants all tips in genetrees from each species to be renamed in the same way. You can reformat the gene trees from IQTREE for ASTRAL, using:
sed -E 's/\|[^:]+//g' paralogs.treefile > paralogs.ref.nwk
Then, similarly to what we have done before, we can recall ASTRAL-PRO:
astral-pro -i paralogs.ref.nwk
ASTRAL-Pro can handle multiple gene copies per species by accounting for gene duplication and loss alongside incomplete lineage sorting. It recognizes paralogous copies from the same species and groups gene copies into equivalence classes — treating those with similar evolutionary histories as functionally the same. From each gene tree, ASTRAL-Pro identifies orthologous subsets and extracts the most ortholog-like quartets. After this filtering, it runs the same quartet-based inference as standard ASTRAL, using only the informative signal.
task!
Are the two trees inferred using only single copy orthologs or only paralogs the same?!