W3_L09
Bayesian Inference (BI)
Bayesian Inference (BI)
In this lesson we will use the infamous Phylobayes (Lartillot et al. 2013) to understand the underlying principles of Bayesian Inference in phylogenetics. Remember that in a Bayesian framework we estimate parameters from their posterior distribution, instead of finding the best single estimates as in a Maximum Likelihood framework.
MSA format conversion:
At this stage we have already used .fas and .nxs formats. In this lesson we will also need a .phy - a phylip-formatted file. This format name comes from a very ancient piece of software - called of course PHYLIP - which is over 44 years old! You can do it quickly using:
AMAS.py convert -d dna -u phylip -f fasta -i data/example_alignements/Rhabdomeric1.nt.aln
Bayesian Inference:
Having a .phy alignement we can quickly lunch the BI inference ... twice! The reason for that is that we will need to asses the convergence of the two chains, as you should have learned from the practicals! So:
mpirun -np 2 pb_mpi -d Rhabdomeric1.nt.aln-out.phy chain1
and
mpirun -np 2 pb_mpi -d Rhabdomeric1.nt.aln-out.phy chain2
... and then take a looong break, just bcause BI is very computationally intensive, even on such small datasets! Btw if you do not know how to execute two commands in the shell ... just open two shells using ctrl + t!
... while you wait, you can have fun with (this)[https://plewis.github.io/applets/mcmc-robot/] MCMC thing!
post-inference diagnostics
We can kill the two processes by just pressing ctrl + C. Then take a look to what happened during the break:
-
.chaincontains the state of the MCMC chain, including the current tree topology, branch lengths, and model parameters. It allows for the resumption of the MCMC run from the last saved state after an interruption. -
.tracecontains trace information of various summary statistics (e.g., log-likelihood, tree length, number of components in the mixture model) across MCMC iterations. It is used to assess convergence and mixing by plotting these statistics over the MCMC run. -
.treeliststores the list of sampled tree topologies from the MCMC run. These trees are used to construct consensus trees and to estimate posterior probabilities of clades.
You can do a first diagnostic using the command tracecomp and the name of your chain. Some parameters have decent ESS!
setting upper limit to : 1231
Rhabd1.trace burnin : 246 sample size : 985
name effsize rel_diff
loglik 985 0
length 17 0
alpha 15 0
Nmode 236 0
statent 985 0
statalpha 840 0
rrent 79 0
rrmean 135 0
Then you can also compare the two traces by specifing them both with tracecomp Rhabd1 Rhabd2. As you can see, ESS are much worse and only the minimum shared number of trees is considered for both chains.
setting upper limit to : 451
Rhabd1.trace burnin : 90 sample size : 361
Rhabd2.trace burnin : 90 sample size : 361
name effsize rel_diff
loglik 17 0.480347
length 49 0.337666
alpha 27 0.0793266
Nmode 37 0.0322054
statent 17 0.444933
statalpha 20 0.339387
rrent 58 0.27114
rrmean 308 0.123704
exercise!
-
How can you adjust the burnin?! Just recall the
tracecompcommands to see the options! -
Why when we combined the two runs the ESS fell to the floor?
... a nice way to carry out this post-inference diagnostics (that is often done during the inference) is Tracer,
which offers an easy way to explore what's going on in our inference. You should have it allready installed and you have to grag and drop the two .trace files into its window. What you whish to see is the famous fuzzy caterpillar - by far the most loved animals by phylogeneticists - which implies that there's no autocorrelation in our analysis and that it has reached stationarity.
exercise!
-
Take a look in Tracer at an awful and coolsampling coming fromt two distinct runs, to get a sense of how much they can be different. Has the target distribution been reached?
-
And also ... any guess why there are a ton of parameters here? Come ask to me!
consensus tree building
Now let's build our consensus tree and take a look at it! It will be terrible, you are advised ...
bpcomp Rhabd1 Rhabd2
This generates two very interesting files:
-
bpcomp.bplistlists all bipartitions. -
bpcomp.con.treis the consensus tree.
Probably in many of your trees you will see politomies (i.e more than two branches descending from a single node). Remember that MrBayes doesn't produce multifurcating trees during the run! When we have used the bpcomp command we have also sumarized all sampled trees in a consensus one (after burn-in discarding). While doing this it has calculated the Posterior Probabilities of each node/bipartition (i.e The proportion of the time that the bipartition is found in the sampled trees). If the posterior probability of a node was below the default cutoff of 0.5, and by default the command collapsed that branch in a politomy. However ... phylobayes is quite opaque on how it builds the consensus tree building and you might want to rely on the mighty Treeannotator, which you should have allready installed.
exercise(s)!
-
Can you interpret the
bpcomp.bplistfile? What do you think..*.**..represents? What us the need for that?! Let's discuss! -
Use Treeannotator to build some different conensus trees ... it seems you can play around also with node heights ...
further reading:
The PhyloBayes manual! Have fun with that! :-P