← Back to syllabus

W8_L15

Inferring selection

Practical Notes

inferring selection

Here we are going to carry out likelihood estimations of the dN/dS ratio (Goldman and Yang 1994) using CODEML, which is part of the PAML package (Phylogenetic Analysis Using Maximum Likelihood- Ziheng Yang 2007). PAML is an extremely versatile piece of software whose features include:

  • estimating synonymous and nonsynonymous rates
  • testing hypotheses concerning dN/dS rate ratios
  • ancestral sequence reconstruction (DNA, codon, or AAs)
  • various clock models
  • simulating nucleotide, codon, or AA sequence data sets
  • ...

Just an extensive reding of the manual will disclose all of its possibilities! However, several great and newer alternative exist, the most notable being HyPhy (Pond et al. 2019) which is also implemented in the Datamonkey web server.


Biologically, we will try to answer the question: Do squid species inhabiting the depths of the sea have specific adaptations to low-light environments? In this tutorial, we will analyze a gene encoding a rhabdomeric opsin—possibly one of the primary opsins used in the visual system of a given organism. Below are the species we will consider:

Among them the deep-sea species are:

  • Emoschata
  • Sunicirrhus
  • Ptetracirrhus

We start by moving to the folder data/example_selection/ and inferring a codon based .fasta alignement using prank with the command:

prank -d=Rhabdomeric1.fasta -o=Rhabdomeric1.aln -codon

Which will give us back the codon alignement Rhabdomeric1.aln.best.fas.

Then we will need a .ctl control file, which is how we configure a CODEML analysis. Here is how it looks like:

seqfile = 						* sequence data filename
treefile = 						* tree structure file name
outfile = 						* main result file name
noisy = 						* 0,1,2,3,9: how much rubbish on the screen
verbose = 						* 1:detailed output
runmode = 						* 0:user defined tree
seqtype = 						* 1:codons
CodonFreq =       					* 0:equal, 1:F1X4, 2:F3X4, 3:F61
model = 						* 0:one omega ratio for all branches
NSsites = 						* 0:one omega ratio (M0 in Tables 2 and 4)
icode = 						* 0:universal code
fix_kappa =       					* 1:kappa fixed, 0:kappa to be estimated
kappa = 						* initial or fixed kappa
fix_omega = 	            				* 1:omega fixed, 0:omega to be estimated
omega =  						* initial omega

In this analysis we are going to compare two models:

  • m0 where all branches have the same omega value.
  • m2 where different branches can have different omega values.

These analyses are already configured in the files Rhabdomeric1.codeml.m0.ctl and Rhabdomeric1.codeml.m2.ctl , which you will find in the folder MP25/data/example_selection/. The only difference between the two is model = 0 and model = 2.

Now we only miss a .nwk tree, which can be inferred using the command: iqtree -s Rhabdomeric1.aln.best.fas.

This analysis will generate the file Rhabdomeric1.aln.best.fas.treefile, in which we need to specify the branches we hypothesize to have a different dNdS — the foreground — from all remaining branches - the background. We can tag these terminal branches by adding #1 to the deep-sea species, so that the .nwk tree will lok something like this:

((Ovulgaris:0.0311284126,((Cmacropus:0.0200158135,Ptetracitrrhus#1:0.0311411210):0.0066669180,Sunicirrhus#1:0.0305435995):00196030089):0.0174318010,(Emoschata#1:0.0293746956,Ecirrhosa:0.0166299302):0.0530600246,Aargo:0.0915781402);

To run CODEML all we need to do is type codem Rhabdomeric1.codeml.m0.ctl and codeml Rhabdomeric1.codeml.m2.ctl, and then take a look at the outputs:

  • the .out file which we specified is what we are really interested in.
  • 2NG.t, 2NG.dS, 2NG.dN subs rates, dS & dN distances
  • rub, rst, rst1 and lnf are some rather misterios intermediate files :-P

Looking at the Rhabdomeric1.codeml.m0.out we can see:

  • a summary of codon usage counts:
------------------------------------------------------------------------------
Phe F TTT      55 | Ser S TCT      80 | Tyr Y TAT      40 | Cys C TGT      30
      TTC     122 |       TCC      64 |       TAC     114 |       TGC      26
Leu L TTA      11 |       TCA      46 | *** * TAA       0 | Trp W TGA       0
      TTG      51 |       TCG       6 |       TAG       0 |       TGG      74
------------------------------------------------------------------------------
Leu L CTT      50 | Pro P CCT      50 | His H CAT      31 | Arg R CGT      14
      CTC      33 |       CCC      57 |       CAC      11 |       CGC       0
      CTA       5 |       CCA     153 | Gln Q CAA     132 |       CGA       0
      CTG      23 |       CCG      24 |       CAG      67 |       CGG       0
------------------------------------------------------------------------------
Ile I ATT      99 | Thr T ACT      14 | Asn N AAT      42 | Ser S AGT       7
      ATC     137 |       ACC      44 |       AAC      71 |       AGC      12
Met M ATA       7 |       ACA      30 | Lys K AAA      98 | *** * AGA       0
      ATG     170 |       ACG      15 |       AAG      43 |       AGG       0
------------------------------------------------------------------------------
Val V GTT      56 | Ala A GCT     142 | Asp D GAT      58 | Gly G GGT     104
      GTC      71 |       GCC      90 |       GAC      19 |       GGC      72
      GTA      42 |       GCA      90 | Glu E GAA     102 |       GGA      59
      GTG      18 |       GCG      11 |       GAG      29 |       GGG       7
------------------------------------------------------------------------------
  • a summary of base frequencies onthe different codon positions:
position  1:    T:0.22986    C:0.20780    A:0.25224    G:0.31010
position  2:    T:0.30371    C:0.29284    A:0.27398    G:0.12948
position  3:    T:0.27877    C:0.30147    A:0.24776    G:0.17199
Average         T:0.27078    C:0.26737    A:0.25799    G:0.20386
  • statistics for each branch in the tree
  branch          t       N       S   dN/dS      dN      dS  N*dN  S*dS

   8..1      0.031  1034.0   349.0  0.0983  0.0031  0.0319   3.2  11.1
   8..9      0.017  1034.0   349.0  0.0983  0.0018  0.0178   1.8   6.2
   9..10     0.053  1034.0   349.0  0.0983  0.0053  0.0543   5.5  18.9
  10..2      0.029  1034.0   349.0  0.0983  0.0030  0.0301   3.1  10.5
  10..4      0.017  1034.0   349.0  0.0983  0.0017  0.0170   1.7   5.9
   9..7      0.092  1034.0   349.0  0.0983  0.0092  0.0937   9.5  32.7
   8..11     0.020  1034.0   349.0  0.0983  0.0020  0.0201   2.0   7.0
  11..12     0.007  1034.0   349.0  0.0983  0.0007  0.0068   0.7   2.4
  12..3      0.020  1034.0   349.0  0.0983  0.0020  0.0205   2.1   7.1
  12..5      0.031  1034.0   349.0  0.0983  0.0031  0.0319   3.2  11.1
  11..6      0.031  1034.0   349.0  0.0983  0.0031  0.0313   3.2  10.9

... when we instead look at these parameter in the Rhabdomeric1.codeml.m2.out we can see that the three branches in the forground have a higher dNdS value ...

 branch          t       N       S   dN/dS      dN      dS  N*dN  S*dS

   8..1      0.031  1034.0   349.0  0.0730  0.0025  0.0338   2.6  11.8
   8..9      0.017  1034.0   349.0  0.0730  0.0014  0.0189   1.4   6.6
   9..10     0.053  1034.0   349.0  0.0730  0.0042  0.0576   4.4  20.1
  10..2      0.029  1034.0   349.0  0.1874  0.0047  0.0250   4.8   8.7
  10..4      0.017  1034.0   349.0  0.0730  0.0013  0.0181   1.4   6.3
   9..7      0.092  1034.0   349.0  0.0730  0.0073  0.0994   7.5  34.7
   8..11     0.020  1034.0   349.0  0.0730  0.0016  0.0213   1.6   7.4
  11..12     0.007  1034.0   349.0  0.0730  0.0005  0.0072   0.5   2.5
  12..3      0.020  1034.0   349.0  0.0730  0.0016  0.0217   1.6   7.6
  12..5      0.031  1034.0   349.0  0.1874  0.0050  0.0265   5.1   9.2
  11..6      0.031  1034.0   349.0  0.1874  0.0049  0.0259   5.0   9.1

... however ... how can we find some statistical support for this claim?! The brief answer is that we need to perform a Likelihood Ratio Test and for that purpose I prepared a LRT_from_codeml.py python executable which, when fed the two CODEML output files, will perform the LRT and then give back a p value. So we can just type:

python LRT_from_codeml.py Rhabdomeric1.codeml.m0.out Rhabdomeric1.codeml.m2.out

Wow! It looks like the Rhodopsin of deep sea squids underwnt a more relaxed selection regime compare to those which live in shallow water! However ...

Task!

Try testing the m1 model, which assumes a different omega value for every branch in the phylogeny. While this model includes a large number of parameters and is generally discouraged, it can offer useful exploratory insights to help refine your hypothesis.

If you wnat to learn more on inferring selection regimws using PAML, read this paper!


main