← Back to syllabus

W2_L04

Orthology inference and taxon sampling

Practical Notes

orthology infererence and taxon sampling

What are orthologous and paralogous genes?

  • Orthologs are commonly defined as pairs of genes that started diverging after speciations events.
  • Paragols genes, on contrary, started diverging after duplication events.

... and orthogroups?

An orthogroups is a group of orthologs genes descending from the last common ancestor (LCA) of a group of species. (i.e. extension of concept of orthology to multiple species). An orthogroup is always defined by a reference speciation event!


Dataset

In these practical we are going to use the proteoms from 7 Mollusk species taken from public database plus one Annelidae as outgroup:

  • Biomphalaria glabrata (Gasteropoda)
  • Acanthopleura granulata (Poliplacophora)
  • Crassostrea virginica (Bivalvia)
  • Lottia gigantea (Gasteropoda)
  • Octopus bimaculoides (Cephalopoda)
  • Sinivicula constricta (Bivalvia)
  • Pinctada fucata (Bivalvia)
  • Helobdella robusta (Anellidae)

Let's have a look at the first ten headers of one of the proteoms (all are stored in the Data/Proteoms folder):

for file in data/proteoms/*.pep; do grep ">" $file | head -n 3; done

As you can see the headers have been simplified compared to the NCBI default. Indeed, some programs write many files, so it is a good practice to try to reduce their weight. Furthermore, this also makes much easier to immediatly understand the species from which the sequences come from. However, it is mandatory to leave a unique reference to the original proteome in order to be able to track back quickly the full original header if necessary.

An important note: During these lessons we will probably work only with amminoacid sequences (depending on time). Our species are very distant related (see here for a recent phylogenomic and divergence time estimation) so protein can give enough phylogentic informations avoiding substial substitution saturation phenomena(be aware of this corollary). However, if we are dealing with more closely related species, we would prefered nucleotide sequences which convey more information (due to the genetic code degeneracy).


Orthofinder

In brief, the Orthofinder alghoritm is subdivided into 4 major steps:

  1. Orthogroups inference using reciprocal best blast hit, costruction of a graph and clustering genes into ortogroups.
  2. Inference and rooting of specie tree.
  3. gene tree inference and rooting for each orthogroup.
  4. Inference of orthologs and gene duplication events reconciling rooted specie and gene trees.

The detailed explanation of each step is not the aim of this course, hoever if you are interested in orthology inference you should have a look at the two Orthofinder papers (Emms and Kelly, 2015 and Emms and Kelly, 2019).

One of the most valuable aspect about Orthofinder is its usability. A quick-and-dirty Orthofinder analyses can be simply run with:

orthofinder -f <proteoms_folder>

However, we can also tune a lot of parameters. Let's take a look at them (after activating the correct environment):

orthofinder --help

OrthoFinder version 2.5.2 Copyright (C) 2014 David Emms

SIMPLE USAGE:
Run full OrthoFinder analysis on FASTA format proteomes in <dir>
  orthofinder [options] -f <dir>

Add new species in <dir1> to previous run in <dir2> and run new analysis
  orthofinder [options] -f <dir1> -b <dir2>

OPTIONS:
 -t <int>        Number of parallel sequence search threads [Default = 4]
 -a <int>        Number of parallel analysis threads
 -d              Input is DNA sequences
 -M <txt>        Method for gene tree inference. Options 'dendroblast' & 'msa'
                 [Default = dendroblast]
 -S <txt>        Sequence search program [Default = diamond]
                 Options: blast, diamond, diamond_ultra_sens, blast_gz, mmseqs, blast_nucl
 -A <txt>        MSA program, requires '-M msa' [Default = mafft]
                 Options: mafft, muscle
 -T <txt>        Tree inference method, requires '-M msa' [Default = fasttree]
                 Options: fasttree, raxml, raxml-ng, iqtree
 -s <file>       User-specified rooted species tree
 -I <int>        MCL inflation parameter [Default = 1.5]
 -x <file>       Info for outputting results in OrthoXML format
 -p <dir>        Write the temporary pickle files to <dir>
 -1              Only perform one-way sequence search
 -X              Don't add species names to sequence IDs
 -y              Split paralogous clades below root of a HOG into separate HOGs
 -z              Don't trim MSAs (columns>=90% gap, min. alignment length 500)
 -n <txt>        Name to append to the results directory
 -o <txt>        Non-default results directory
 -h              Print this help text

WORKFLOW STOPPING OPTIONS:
 -op             Stop after preparing input files for BLAST
 -og             Stop after inferring orthogroups
 -os             Stop after writing sequence files for orthogroups
                 (requires '-M msa')
 -oa             Stop after inferring alignments for orthogroups
                 (requires '-M msa')
 -ot             Stop after inferring gene trees for orthogroups 

WORKFLOW RESTART COMMANDS:
 -b  <dir>         Start OrthoFinder from pre-computed BLAST results in <dir>
 -fg <dir>         Start OrthoFinder from pre-computed orthogroups in <dir>
 -ft <dir>         Start OrthoFinder from pre-computed gene trees in <dir>

LICENSE:
 Distributed under the GNU General Public License (GPLv3). See License.md

CITATION:
 When publishing work that uses OrthoFinder please cite:
 Emms D.M. & Kelly S. (2019), Genome Biology 20:238

 If you use the species tree in your work then please also cite:
 Emms D.M. & Kelly S. (2017), MBE 34(12): 3267-3278
 Emms D.M. & Kelly S. (2018), bioRxiv https://doi.org/10.1101/267914

There is a lot of stuff! IMO the most important parameters to take in mind are: -M, -S, -T and -s. However for our purpose a quick-and-dirty orthofinder run is enough (but in real situation is better to play with parameters and find the more suitable for our data). Now run our Orthofinder search:

orthofinder -o Analyses/My_Orthology.Inference -f Data/Proteoms 

NB: the -o option will create a new output directory inside /Analyses.

Orthofinder will output a lot of folders/files, overwhelming you with informations. Some of the more important are:

1: Summary tsv:

cat Analyses/My_Orthology.Inference/Results_Mar30/Comparative_Genomics_Statistics/Statistics_Overall.tsv
Number of species	8
Number of genes	3534
Number of genes in orthogroups	3129
Number of unassigned genes	405
Percentage of genes in orthogroups	88.5
Percentage of unassigned genes	11.5
Number of orthogroups	101
Number of species-specific orthogroups	23
Number of genes in species-specific orthogroups	1136
Percentage of genes in species-specific orthogroups	32.1
Mean orthogroup size	31.0
Median orthogroup size	8.0
G50 (assigned genes)	111
G50 (all genes)	97
O50 (assigned genes)	11
O50 (all genes)	13
Number of orthogroups with all species present	56
Number of single-copy orthogroups	49
Date	2021-03-30
Orthogroups file	Orthogroups.tsv
Unassigned genes file	Orthogroups_UnassignedGenes.tsv
Per-species statistics	Statistics_PerSpecies.tsv
Overall statistics	Statistics_Overall.tsv
Orthogroups shared between species	Orthogroups_SpeciesOverlaps.tsv

Average number of genes per-species in orthogroup	Number of orthogroups	Percentage of orthogroups	Number of genes	Percentage of genes
<1	22	21.8	58	1.9
'1	53	52.5	437	14.0
'2	1	1.0	18	0.6
'3	0	0.0	0	0.0
'4	0	0.0	0	0.0
'5	1	1.0	46	1.5
'6	1	1.0	54	1.7
'7	5	5.0	301	9.6
'8	1	1.0	66	2.1
'9	0	0.0	0	0.0
'10	2	2.0	167	5.3
11-15	9	8.9	973	31.1
16-20	2	2.0	292	9.3
21-50	4	4.0	717	22.9
51-100	0	0.0	0	0.0
101-150	0	0.0	0	0.0
151-200	0	0.0	0	0.0
201-500	0	0.0	0	0.0
501-1000	0	0.0	0	0.0
'1001+	0	0.0	0	0.0

Number of species in orthogroup	Number of orthogroups
1	23
2	14
3	3
4	2
5	1
6	0
7	2
8	56

Try to perform the following exercise(s):

  • How many single copy Orthogroups have been founded?
  • Which is the percentage of assigned genes?
  • In your opinion, how can the inclusion of more taxa influence these two statistics?

2: Orthogroups.GeneCount.tsv :

For simplicity I have printed only the first 30 rows, but take a look at the complete file....

cat Analyses/My_Orthology.Inference/Results_Mar30/Orthogroups/Orthogroups.GeneCount.tsv | head -n 30
Orthogroup	A_granulata_2	B_glabrata_2	C_virginica_2	H_robusta_2	L_gigantea_2	O_bimaculoides_2	P_fucata_2	S_constricta_2	Total
OG0000000	1	3	1	1	2	175	1	3	187
OG0000001	7	1	65	2	11	0	46	50	182
OG0000002	0	5	167	0	3	0	2	0	177
OG0000003	0	0	0	171	0	0	0	0	171
OG0000004	0	157	0	0	0	0	0	0	157
OG0000005	0	0	0	0	135	0	0	0	135
OG0000006	17	12	16	22	19	7	20	13	126
OG0000007	0	0	0	0	124	0	0	0	124
OG0000008	0	0	2	0	0	0	116	0	118
OG0000009	14	12	12	29	13	7	13	12	112
OG0000010	0	0	0	111	0	0	0	0	111
OG0000011	0	0	0	0	0	0	103	0	103
OG0000012	0	1	0	1	1	0	3	91	97
OG0000013	0	0	0	0	0	94	0	0	94
OG0000014	87	0	0	0	0	0	1	0	88
OG0000015	0	0	83	0	0	0	3	0	86
OG0000016	0	0	81	0	0	0	0	0	81
OG0000017	66	0	0	0	0	0	0	0	66
OG0000018	10	14	5	2	1	1	20	8	61
OG0000019	0	0	2	0	1	0	58	0	61
OG0000020	4	4	3	39	3	1	2	4	60
OG0000021	4	16	10	1	2	8	11	8	60
OG0000022	6	4	11	2	10	4	10	12	59
OG0000023	16	3	2	0	9	6	2	16	54
OG0000024	0	0	0	46	0	0	0	0	46
OG0000025	0	0	5	0	1	0	12	0	18
OG0000026	0	0	1	0	0	0	12	1	14
OG0000027	2	0	0	0	3	0	4	2	11
OG0000028	0	0	0	11	0	0	0	0	11

Try to perform the following exercise(s):

  • Are present specie-specific gene families? If yes ,by which OG are they rappresented?
  • Can you identify some possible specie-specific gene families expansion and/or contractions?
  • Can you infer events of gene family loss and gain?

3. Infered species tree

Orthofinder produce by default a rooted specie tree, indeed it is necessery to assign duplication or speciation events of the nodes of the gene trees. Now, let's download it Analyses/My_Orthology.Inference/Results_Apr09/Species_Tree/SpeciesTree_rooted.txt and open it with Figtree. Remember that H.robusta should be our outgroup.

Try to perform the following exercise(s):

  • How is the inferred species tree? Is it choerent with the phylogenomics of Kocot et al., 2011?
  • Which are the most controversial branches/nodes of the tree?

4. 1-to-1 Orthogroups sequences

Inside this folder, placed in /Analyses/My_Orthology.Inference/Results_Mar30/Single_Copy_Orthologue_Sequences you can find fasta sequences of our 1-to-1 orthologs. As already said in phylogenetic inference, ideally, we should analyze strictly orthologs (gene trees, in theory, have to follow the specie tree), so for our downtream analyses we are going to use just these files. However take in mind that when dealing with real data in a real scientific work maybe 1-to-1 orthologs are not enough to catch a good ammount of "phylogenetic signal", especiallly if our species are very distant related (Increase the probability to find less 1-to-1 Orthologs).

Some nice softwares like Phylotreepruner allow to use also in-paralogs which follow user-defined taxonomic constrains (using subtrees), but check their paper for more informations. Finally, in the last few years are being developed a group of very promising tools that allow to infere specie tree from whole gene families trees (with not only speciation bu also duplciation, loss, transfer and incomplete lineage sorting events). Take a look at SpecieRax and Astral-Pro.

Orthogroup sequences can instead be used more freerly if our aim is not phylogenetic inference, but comparative genomics, for example by annotating them with BLAST and / or Interproscan

Others nice software for Orthology inference are OrthoMCL and OMA. Take a look at them if you are interested in orthology inference.


main