W3_L09

Bayesian Inference (BI)

Slides Practicals

Bayesian Inference (BI)

In this lesson we will use the infamous Phylobayes (Lartillot et al. 2013) to understand the underlying principles of Bayesian Inference in phylogenetics. Remember that in a Bayesian framework we estimate parameters from their posterior distribution, instead of finding the best single estimates as in a Maximum Likelihood framework.


MSA format conversion:

At this stage we have already used .fas and .nxs formats. In this lesson we will also need a .phy - a phylip-formatted file. This format name comes from a very ancient piece of software - called of course PHYLIP - which is over 44 years old! You can do it quickly using:

AMAS.py convert -d dna -u phylip -f fasta -i data/example_alignements/nt/Rhabdomeric1.nt.aln

Bayesian Inference:

Having a .phy alignment we can quickly launch the BI inference ... twice! The reason for that is that we will need to assess the convergence of the two chains, as you should have learned from the theoretical lesson! So:

 mpirun -np 2 pb_mpi -d Rhabdomeric1.nt.aln-out.phy chain1

and

 mpirun -np 2 pb_mpi -d Rhabdomeric1.nt.aln-out.phy chain2

... and then take a looong break, just because BI is very computationally intensive, even on such small datasets! Btw if you do not know how to execute two commands in the shell ... just open two shells using ctrl + t!

... while you wait, you can have fun with this MCMC thing!


post-inference diagnostics

We can kill the two processes by just pressing ctrl + C. Then take a look at what happened during the break:

  • .chain contains the state of the MCMC chain, including the current tree topology, branch lengths, and model parameters. It allows for the resumption of the MCMC run from the last saved state after an interruption.​

  • .trace contains trace information of various summary statistics (e.g., log-likelihood, tree length, number of components in the mixture model) across MCMC iterations. It is used to assess convergence and mixing by plotting these statistics over the MCMC run.​

  • .treelist stores the list of sampled tree topologies from the MCMC run. These trees are used to construct consensus trees and to estimate posterior probabilities of clades.​

You can do a first diagnostic using the command tracecomp and the name of your chain. Some parameters have decent ESS!

setting upper limit to : 1231
chain1.trace	 burnin : 246	sample size : 985
name                effsize	rel_diff

loglik              985		0
length              17		0
alpha               15		0
Nmode               236		0
statent             985		0
statalpha           840		0
rrent               79		0
rrmean              135		0

Then you can also compare the two traces by specifying them both with tracecomp Rhabd1 Rhabd2. As you can see, ESS are much worse and only the minimum shared number of trees is considered for both chains.

setting upper limit to : 451
chain1.trace	 burnin : 90	sample size : 361
chain2.trace	 burnin : 90	sample size : 361
name                effsize	rel_diff

loglik              17		0.480347
length              49		0.337666
alpha               27		0.0793266
Nmode               37		0.0322054
statent             17		0.444933
statalpha           20		0.339387
rrent               58		0.27114
rrmean              308		0.123704

exercise!

  • How can you adjust the burnin?! Just recall the tracecomp commands to see the options!

  • Why when we combined the two runs the ESS fell to the floor?

... a nice way to carry out this post-inference diagnostics (that is often done during the inference) is Tracer, which offers an easy way to explore what's going on in our inference. You should have it already installed and you have to drag and drop the two .trace files into its window. What you wish to see is the famous fuzzy caterpillar - by far the most loved animals by phylogeneticists - which implies that there's not (much) autocorrelation in our analysis and that it has reached stationarity.

exercise!

  • Take a look in Tracer at an awful and cool sampling coming from two distinct runs, to get a sense of how much they can be different. Has the target distribution been reached?

  • And also ... any guess why there are a ton of parameters here? Come ask me!


consensus tree building

Now let's build our consensus tree and take a look at it! It will be terrible, you are advised ...

bpcomp chain1 chain2

This generates two very interesting files:

  • bpcomp.bplist lists all bipartitions.

  • bpcomp.con.tre is the consensus tree.

Probably in some of your trees you will see polytomies (i.e. more than two branches descending from a single node). Remember that phylobayes doesn't produce multifurcating trees during the run! When we have used the bpcomp command we have also summarized all sampled trees in a consensus one (after burn-in discarding). While doing this it has calculated the Posterior Probabilities of each node/bipartition (i.e. The proportion of the time that the bipartition is found in the sampled trees). If the posterior probability of a node was below the default cutoff of 0.5, and by default the command collapsed that branch in a politomy. However ... phylobayes is quite opaque on how it builds the consensus tree building and you might want to rely on Treeannotator.

exercise(s)!

  • Can you interpret the bpcomp.bplist file? What do you think ..*.**.. represents? What is the need for that?! Let's discuss!

  • Use Treeannotator to build some different consensus trees ... it seems you can play around also with node heights ...


further reading:

The PhyloBayes manual! Have fun with that! :-P


main