W8_L15
Inferring selection
inferring selection
Here we are going to carry out likelihood estimations of the dN/dS ratio (Goldman and Yang 1994) using CODEML, which is part of the PAML package (Phylogenetic Analysis Using Maximum Likelihood- Ziheng Yang 2007). PAML is an extremely versatile piece of software whose features include:
- estimating synonymous and nonsynonymous rates
- testing hypotheses concerning dN/dS rate ratios
- ancestral sequence reconstruction (DNA, codon, or AAs)
- various clock models
- simulating nucleotide, codon, or AA sequence data sets
- ...
Just an extensive reding of the manual will disclose all of its possibilities! However, several great and newer alternative exist, the most notable being HyPhy (Pond et al. 2019) which is also implemented in the Datamonkey web server.
Biologically, we will try to answer the question: Do squid species inhabiting the depths of the sea have specific adaptations to low-light environments? In this tutorial, we will analyze a gene encoding a rhabdomeric opsin—possibly one of the primary opsins used in the visual system of a given organism. Below are the species we will consider:
- Ovulgaris - Octopus vulgaris
- Emoschata - Eledone moschata
- Cmacropus - Callistoctopus macropus
- Ecirrhosa - Eledone cirrhosa
- Ptetracitrrhus - Pteroctopus tetracirrhus
- Sunicirrhus - Scaeurgus unicirrhus
- Aargo - Argonauta argo
Among them the deep-sea species are:
- Emoschata
- Sunicirrhus
- Ptetracirrhus
We start by moving to the folder data/example_selection/ and inferring a codon based .fasta alignement using prank with the command:
prank -d=Rhabdomeric1.fasta -o=Rhabdomeric1.aln -codon
Which will give us back the codon alignement Rhabdomeric1.aln.best.fas.
Then we will need a .ctl control file, which is how we configure a CODEML analysis. Here is how it looks like:
seqfile = * sequence data filename
treefile = * tree structure file name
outfile = * main result file name
noisy = * 0,1,2,3,9: how much rubbish on the screen
verbose = * 1:detailed output
runmode = * 0:user defined tree
seqtype = * 1:codons
CodonFreq = * 0:equal, 1:F1X4, 2:F3X4, 3:F61
model = * 0:one omega ratio for all branches
NSsites = * 0:one omega ratio (M0 in Tables 2 and 4)
icode = * 0:universal code
fix_kappa = * 1:kappa fixed, 0:kappa to be estimated
kappa = * initial or fixed kappa
fix_omega = * 1:omega fixed, 0:omega to be estimated
omega = * initial omega
In this analysis we are going to compare two models:
m0where all branches have the same omega value.m2where different branches can have different omega values.
These analyses are already configured in the files Rhabdomeric1.codeml.m0.ctl and Rhabdomeric1.codeml.m2.ctl , which you will find in the folder MP25/data/example_selection/. The only difference between the two is model = 0 and model = 2.
Now we only miss a .nwk tree, which can be inferred using the command: iqtree -s Rhabdomeric1.aln.best.fas.
This analysis will generate the file Rhabdomeric1.aln.best.fas.treefile, in which we need to specify the branches we hypothesize to have a different dNdS — the foreground — from all remaining branches - the background. We can tag these terminal branches by adding #1 to the deep-sea species, so that the .nwk tree will lok something like this:
((Ovulgaris:0.0311284126,((Cmacropus:0.0200158135,Ptetracitrrhus#1:0.0311411210):0.0066669180,Sunicirrhus#1:0.0305435995):00196030089):0.0174318010,(Emoschata#1:0.0293746956,Ecirrhosa:0.0166299302):0.0530600246,Aargo:0.0915781402);
To run CODEML all we need to do is type codem Rhabdomeric1.codeml.m0.ctl and codeml Rhabdomeric1.codeml.m2.ctl, and then take a look at the outputs:
- the
.outfile which we specified is what we are really interested in. 2NG.t,2NG.dS,2NG.dNsubs rates, dS & dN distancesrub,rst,rst1andlnfare some rather misterios intermediate files :-P
Looking at the Rhabdomeric1.codeml.m0.out we can see:
- a summary of codon usage counts:
------------------------------------------------------------------------------
Phe F TTT 55 | Ser S TCT 80 | Tyr Y TAT 40 | Cys C TGT 30
TTC 122 | TCC 64 | TAC 114 | TGC 26
Leu L TTA 11 | TCA 46 | *** * TAA 0 | Trp W TGA 0
TTG 51 | TCG 6 | TAG 0 | TGG 74
------------------------------------------------------------------------------
Leu L CTT 50 | Pro P CCT 50 | His H CAT 31 | Arg R CGT 14
CTC 33 | CCC 57 | CAC 11 | CGC 0
CTA 5 | CCA 153 | Gln Q CAA 132 | CGA 0
CTG 23 | CCG 24 | CAG 67 | CGG 0
------------------------------------------------------------------------------
Ile I ATT 99 | Thr T ACT 14 | Asn N AAT 42 | Ser S AGT 7
ATC 137 | ACC 44 | AAC 71 | AGC 12
Met M ATA 7 | ACA 30 | Lys K AAA 98 | *** * AGA 0
ATG 170 | ACG 15 | AAG 43 | AGG 0
------------------------------------------------------------------------------
Val V GTT 56 | Ala A GCT 142 | Asp D GAT 58 | Gly G GGT 104
GTC 71 | GCC 90 | GAC 19 | GGC 72
GTA 42 | GCA 90 | Glu E GAA 102 | GGA 59
GTG 18 | GCG 11 | GAG 29 | GGG 7
------------------------------------------------------------------------------
- a summary of base frequencies onthe different codon positions:
position 1: T:0.22986 C:0.20780 A:0.25224 G:0.31010
position 2: T:0.30371 C:0.29284 A:0.27398 G:0.12948
position 3: T:0.27877 C:0.30147 A:0.24776 G:0.17199
Average T:0.27078 C:0.26737 A:0.25799 G:0.20386
- statistics for each branch in the tree
branch t N S dN/dS dN dS N*dN S*dS
8..1 0.031 1034.0 349.0 0.0983 0.0031 0.0319 3.2 11.1
8..9 0.017 1034.0 349.0 0.0983 0.0018 0.0178 1.8 6.2
9..10 0.053 1034.0 349.0 0.0983 0.0053 0.0543 5.5 18.9
10..2 0.029 1034.0 349.0 0.0983 0.0030 0.0301 3.1 10.5
10..4 0.017 1034.0 349.0 0.0983 0.0017 0.0170 1.7 5.9
9..7 0.092 1034.0 349.0 0.0983 0.0092 0.0937 9.5 32.7
8..11 0.020 1034.0 349.0 0.0983 0.0020 0.0201 2.0 7.0
11..12 0.007 1034.0 349.0 0.0983 0.0007 0.0068 0.7 2.4
12..3 0.020 1034.0 349.0 0.0983 0.0020 0.0205 2.1 7.1
12..5 0.031 1034.0 349.0 0.0983 0.0031 0.0319 3.2 11.1
11..6 0.031 1034.0 349.0 0.0983 0.0031 0.0313 3.2 10.9
... when we instead look at these parameter in the Rhabdomeric1.codeml.m2.out we can see that the three branches in the forground have a higher dNdS value ...
branch t N S dN/dS dN dS N*dN S*dS
8..1 0.031 1034.0 349.0 0.0730 0.0025 0.0338 2.6 11.8
8..9 0.017 1034.0 349.0 0.0730 0.0014 0.0189 1.4 6.6
9..10 0.053 1034.0 349.0 0.0730 0.0042 0.0576 4.4 20.1
10..2 0.029 1034.0 349.0 0.1874 0.0047 0.0250 4.8 8.7
10..4 0.017 1034.0 349.0 0.0730 0.0013 0.0181 1.4 6.3
9..7 0.092 1034.0 349.0 0.0730 0.0073 0.0994 7.5 34.7
8..11 0.020 1034.0 349.0 0.0730 0.0016 0.0213 1.6 7.4
11..12 0.007 1034.0 349.0 0.0730 0.0005 0.0072 0.5 2.5
12..3 0.020 1034.0 349.0 0.0730 0.0016 0.0217 1.6 7.6
12..5 0.031 1034.0 349.0 0.1874 0.0050 0.0265 5.1 9.2
11..6 0.031 1034.0 349.0 0.1874 0.0049 0.0259 5.0 9.1
... however ... how can we find some statistical support for this claim?! The brief answer is that we need to perform a Likelihood Ratio Test and for that purpose I prepared a LRT_from_codeml.py python executable which, when fed the two CODEML output files, will perform the LRT and then give back a p value. So we can just type:
python LRT_from_codeml.py Rhabdomeric1.codeml.m0.out Rhabdomeric1.codeml.m2.out
Wow! It looks like the Rhodopsin of deep sea squids underwnt a more relaxed selection regime compare to those which live in shallow water! However ...
Task!
Try testing the m1 model, which assumes a different omega value for every branch in the phylogeny. While this model includes a large number of parameters and is generally discouraged, it can offer useful exploratory insights to help refine your hypothesis.
If you wnat to learn more on inferring selection regimws using PAML, read this paper!