W2_L04
Orthology inference and taxon sampling
orthology infererence and taxon sampling
What are orthologous and paralogous genes?
- Orthologs are commonly defined as pairs of genes that started diverging after speciations events.
- Paragols genes, on contrary, started diverging after duplication events.
... and orthogroups?
An orthogroups is a group of orthologs genes descending from the last common ancestor (LCA) of a group of species. (i.e. extension of concept of orthology to multiple species). An orthogroup is always defined by a reference speciation event!
Dataset
In these practical we are going to use the proteoms from 7 Mollusk species taken from public database plus one Annelidae as outgroup:
- Biomphalaria glabrata (Gasteropoda)
- Acanthopleura granulata (Poliplacophora)
- Crassostrea virginica (Bivalvia)
- Lottia gigantea (Gasteropoda)
- Octopus bimaculoides (Cephalopoda)
- Sinivicula constricta (Bivalvia)
- Pinctada fucata (Bivalvia)
- Helobdella robusta (Anellidae)
Let's have a look at the first ten headers of one of the proteoms (all are stored in the Data/Proteoms folder):
for file in data/proteoms/*.pep; do grep ">" $file | head -n 3; done
As you can see the headers have been simplified compared to the NCBI default. Indeed, some programs write many files, so it is a good practice to try to reduce their weight. Furthermore, this also makes much easier to immediatly understand the species from which the sequences come from. However, it is mandatory to leave a unique reference to the original proteome in order to be able to track back quickly the full original header if necessary.
An important note: During these lessons we will probably work only with amminoacid sequences (depending on time). Our species are very distant related (see here for a recent phylogenomic and divergence time estimation) so protein can give enough phylogentic informations avoiding substial substitution saturation phenomena(be aware of this corollary). However, if we are dealing with more closely related species, we would prefered nucleotide sequences which convey more information (due to the genetic code degeneracy).
Orthofinder
In brief, the Orthofinder alghoritm is subdivided into 4 major steps:
- Orthogroups inference using reciprocal best blast hit, costruction of a graph and clustering genes into ortogroups.
- Inference and rooting of specie tree.
- gene tree inference and rooting for each orthogroup.
- Inference of orthologs and gene duplication events reconciling rooted specie and gene trees.
The detailed explanation of each step is not the aim of this course, hoever if you are interested in orthology inference you should have a look at the two Orthofinder papers (Emms and Kelly, 2015 and Emms and Kelly, 2019).
One of the most valuable aspect about Orthofinder is its usability. A quick-and-dirty Orthofinder analyses can be simply run with:
orthofinder -f <proteoms_folder>
However, we can also tune a lot of parameters. Let's take a look at them (after activating the correct environment):
orthofinder --help
OrthoFinder version 2.5.2 Copyright (C) 2014 David Emms
SIMPLE USAGE:
Run full OrthoFinder analysis on FASTA format proteomes in <dir>
orthofinder [options] -f <dir>
Add new species in <dir1> to previous run in <dir2> and run new analysis
orthofinder [options] -f <dir1> -b <dir2>
OPTIONS:
-t <int> Number of parallel sequence search threads [Default = 4]
-a <int> Number of parallel analysis threads
-d Input is DNA sequences
-M <txt> Method for gene tree inference. Options 'dendroblast' & 'msa'
[Default = dendroblast]
-S <txt> Sequence search program [Default = diamond]
Options: blast, diamond, diamond_ultra_sens, blast_gz, mmseqs, blast_nucl
-A <txt> MSA program, requires '-M msa' [Default = mafft]
Options: mafft, muscle
-T <txt> Tree inference method, requires '-M msa' [Default = fasttree]
Options: fasttree, raxml, raxml-ng, iqtree
-s <file> User-specified rooted species tree
-I <int> MCL inflation parameter [Default = 1.5]
-x <file> Info for outputting results in OrthoXML format
-p <dir> Write the temporary pickle files to <dir>
-1 Only perform one-way sequence search
-X Don't add species names to sequence IDs
-y Split paralogous clades below root of a HOG into separate HOGs
-z Don't trim MSAs (columns>=90% gap, min. alignment length 500)
-n <txt> Name to append to the results directory
-o <txt> Non-default results directory
-h Print this help text
WORKFLOW STOPPING OPTIONS:
-op Stop after preparing input files for BLAST
-og Stop after inferring orthogroups
-os Stop after writing sequence files for orthogroups
(requires '-M msa')
-oa Stop after inferring alignments for orthogroups
(requires '-M msa')
-ot Stop after inferring gene trees for orthogroups
WORKFLOW RESTART COMMANDS:
-b <dir> Start OrthoFinder from pre-computed BLAST results in <dir>
-fg <dir> Start OrthoFinder from pre-computed orthogroups in <dir>
-ft <dir> Start OrthoFinder from pre-computed gene trees in <dir>
LICENSE:
Distributed under the GNU General Public License (GPLv3). See License.md
CITATION:
When publishing work that uses OrthoFinder please cite:
Emms D.M. & Kelly S. (2019), Genome Biology 20:238
If you use the species tree in your work then please also cite:
Emms D.M. & Kelly S. (2017), MBE 34(12): 3267-3278
Emms D.M. & Kelly S. (2018), bioRxiv https://doi.org/10.1101/267914
There is a lot of stuff! IMO the most important parameters to take in mind are: -M, -S, -T and -s. However for our purpose a quick-and-dirty orthofinder run is enough (but in real situation is better to play with parameters and find the more suitable for our data).
Now run our Orthofinder search:
orthofinder -o Analyses/My_Orthology.Inference -f Data/Proteoms
NB: the -o option will create a new output directory inside /Analyses.
Orthofinder will output a lot of folders/files, overwhelming you with informations. Some of the more important are:
1: Summary tsv:
cat Analyses/My_Orthology.Inference/Results_Mar30/Comparative_Genomics_Statistics/Statistics_Overall.tsv
Number of species 8
Number of genes 3534
Number of genes in orthogroups 3129
Number of unassigned genes 405
Percentage of genes in orthogroups 88.5
Percentage of unassigned genes 11.5
Number of orthogroups 101
Number of species-specific orthogroups 23
Number of genes in species-specific orthogroups 1136
Percentage of genes in species-specific orthogroups 32.1
Mean orthogroup size 31.0
Median orthogroup size 8.0
G50 (assigned genes) 111
G50 (all genes) 97
O50 (assigned genes) 11
O50 (all genes) 13
Number of orthogroups with all species present 56
Number of single-copy orthogroups 49
Date 2021-03-30
Orthogroups file Orthogroups.tsv
Unassigned genes file Orthogroups_UnassignedGenes.tsv
Per-species statistics Statistics_PerSpecies.tsv
Overall statistics Statistics_Overall.tsv
Orthogroups shared between species Orthogroups_SpeciesOverlaps.tsv
Average number of genes per-species in orthogroup Number of orthogroups Percentage of orthogroups Number of genes Percentage of genes
<1 22 21.8 58 1.9
'1 53 52.5 437 14.0
'2 1 1.0 18 0.6
'3 0 0.0 0 0.0
'4 0 0.0 0 0.0
'5 1 1.0 46 1.5
'6 1 1.0 54 1.7
'7 5 5.0 301 9.6
'8 1 1.0 66 2.1
'9 0 0.0 0 0.0
'10 2 2.0 167 5.3
11-15 9 8.9 973 31.1
16-20 2 2.0 292 9.3
21-50 4 4.0 717 22.9
51-100 0 0.0 0 0.0
101-150 0 0.0 0 0.0
151-200 0 0.0 0 0.0
201-500 0 0.0 0 0.0
501-1000 0 0.0 0 0.0
'1001+ 0 0.0 0 0.0
Number of species in orthogroup Number of orthogroups
1 23
2 14
3 3
4 2
5 1
6 0
7 2
8 56
Try to perform the following exercise(s):
- How many single copy Orthogroups have been founded?
- Which is the percentage of assigned genes?
- In your opinion, how can the inclusion of more taxa influence these two statistics?
2: Orthogroups.GeneCount.tsv :
For simplicity I have printed only the first 30 rows, but take a look at the complete file....
cat Analyses/My_Orthology.Inference/Results_Mar30/Orthogroups/Orthogroups.GeneCount.tsv | head -n 30
Orthogroup A_granulata_2 B_glabrata_2 C_virginica_2 H_robusta_2 L_gigantea_2 O_bimaculoides_2 P_fucata_2 S_constricta_2 Total
OG0000000 1 3 1 1 2 175 1 3 187
OG0000001 7 1 65 2 11 0 46 50 182
OG0000002 0 5 167 0 3 0 2 0 177
OG0000003 0 0 0 171 0 0 0 0 171
OG0000004 0 157 0 0 0 0 0 0 157
OG0000005 0 0 0 0 135 0 0 0 135
OG0000006 17 12 16 22 19 7 20 13 126
OG0000007 0 0 0 0 124 0 0 0 124
OG0000008 0 0 2 0 0 0 116 0 118
OG0000009 14 12 12 29 13 7 13 12 112
OG0000010 0 0 0 111 0 0 0 0 111
OG0000011 0 0 0 0 0 0 103 0 103
OG0000012 0 1 0 1 1 0 3 91 97
OG0000013 0 0 0 0 0 94 0 0 94
OG0000014 87 0 0 0 0 0 1 0 88
OG0000015 0 0 83 0 0 0 3 0 86
OG0000016 0 0 81 0 0 0 0 0 81
OG0000017 66 0 0 0 0 0 0 0 66
OG0000018 10 14 5 2 1 1 20 8 61
OG0000019 0 0 2 0 1 0 58 0 61
OG0000020 4 4 3 39 3 1 2 4 60
OG0000021 4 16 10 1 2 8 11 8 60
OG0000022 6 4 11 2 10 4 10 12 59
OG0000023 16 3 2 0 9 6 2 16 54
OG0000024 0 0 0 46 0 0 0 0 46
OG0000025 0 0 5 0 1 0 12 0 18
OG0000026 0 0 1 0 0 0 12 1 14
OG0000027 2 0 0 0 3 0 4 2 11
OG0000028 0 0 0 11 0 0 0 0 11
Try to perform the following exercise(s):
- Are present specie-specific gene families? If yes ,by which OG are they rappresented?
- Can you identify some possible specie-specific gene families expansion and/or contractions?
- Can you infer events of gene family loss and gain?
3. Infered species tree
Orthofinder produce by default a rooted specie tree, indeed it is necessery to assign duplication or speciation events of the nodes of the gene trees.
Now, let's download it Analyses/My_Orthology.Inference/Results_Apr09/Species_Tree/SpeciesTree_rooted.txt and open it with Figtree. Remember that H.robusta should be our outgroup.
Try to perform the following exercise(s):
- How is the inferred species tree? Is it choerent with the phylogenomics of Kocot et al., 2011?
- Which are the most controversial branches/nodes of the tree?
4. 1-to-1 Orthogroups sequences
Inside this folder, placed in /Analyses/My_Orthology.Inference/Results_Mar30/Single_Copy_Orthologue_Sequences you can find fasta sequences of our 1-to-1 orthologs. As already said in phylogenetic inference, ideally, we should analyze strictly orthologs (gene trees, in theory, have to follow the specie tree), so for our downtream analyses we are going to use just these files. However take in mind that when dealing with real data in a real scientific work maybe 1-to-1 orthologs are not enough to catch a good ammount of "phylogenetic signal", especiallly if our species are very distant related (Increase the probability to find less 1-to-1 Orthologs).
Some nice softwares like Phylotreepruner allow to use also in-paralogs which follow user-defined taxonomic constrains (using subtrees), but check their paper for more informations. Finally, in the last few years are being developed a group of very promising tools that allow to infere specie tree from whole gene families trees (with not only speciation bu also duplciation, loss, transfer and incomplete lineage sorting events). Take a look at SpecieRax and Astral-Pro.
Orthogroup sequences can instead be used more freerly if our aim is not phylogenetic inference, but comparative genomics, for example by annotating them with BLAST and / or Interproscan
Others nice software for Orthology inference are OrthoMCL and OMA. Take a look at them if you are interested in orthology inference.