W2_L05
Sequence alignment and filtering
sequence alignement and filtering
All phylogenetic methods require a set of homologous characters thus the first step is to infer which nucleotides / codons / aminocid residues are homologous to each other, so that differences among these nucleotides result only from changes that convey descent information. This process is called Multiple Sequence Alignment (MSA) and is often followed by an additional step of detection and exclusion of those alignment regions whose homology we are uncertain of. Due to the large amount of data which we process nowadays, this step is often overlooked and as a result it's quite easy to find misaligned loci in modern datasets. Nonetheless this is a key step, which may affect downstream analyses (see further reading for more).
For two sequences of length x and y, there are (x+y)!x! y! possible alignments.
For three sequences, of length x , y and z say, there are (x+y+z)!x! y! z! alignments.
For n sequences of length 10, this increases rapidly:
| n | alignments |
|---|---|
| 02 | 184756 |
| 03 | 5.55 × 10^12 |
| 04 | 4.71 × 10^21 |
| 05 | 4.83 × 10^31 |
| 06 | 3.64 × 10^42 |
| 07 | 1.45 × 10^54 |
| 08 | 2.38 × 10^66 |
| 09 | 4.94 × 10^85 |
| 10 | 2.35 × 10^92 |
In general it is not possible - even with really fast computers - to perform exhaustive multiple alignments, similarly to what happens with trees!
In this tutorial, we will use the most popular tools for multiple sequence alignment, MAFFT (Katoh and Standley 2013). And subsequently filter the alignments using Gblocks (Talavera & Castresana 2007).
Let's start with a look at one of our single copy orthogroups with the cat command:
cat data/Single_Copy_Orthologue_Sequences/N0.HOG0000020.fa
The sequences are not aligned yet: this is the reason why they contain no gaps and differ in length!
Aligning with MAFFT:
MAFFT is a very fast tool, which implements several different algorithms which makes it also very versatile.
Let's start using the code:
mafft --auto data/Single_Copy_Orthologue_Sequences/N0.HOG0000020.fa > data/Single_Copy_Orthologue_Sequences/N0.HOG0000020.aln
MAFFT contains many parameters, which you are encouraged to explore: an important one is the gap-opening penalty which by default is 1.53. Let's try a different value, using the code:
mafft --auto --op 7 data/Single_Copy_Orthologue_Sequences/N0.HOG0000020.fa > data/Single_Copy_Orthologue_Sequences/N0.HOG0000020.op7.aln
Last but not least: we leveraged the --auto flag of MAFFT, which automatically decides which is the best algorithm for carrying out the MSA, but several algorithms are implemented and the user can decide which one to rely on.
Try to perform the following exercise(s):
- Can you spot any difference between the two gap-opening penalty values using Aliview?
- Can you write a script to align all OGs?
Alignment Filtering:
While dealing modern phylogenetic dataset, which consists of hundred to thousands of alignments is not possible to have a manual curation for each one, and thus it is necessary to automatically remove alignment errors.
In this tutorial we will use Gblocks: this software will select blocks of conserved sites, which can be defined with many custom parameters, described in the manual
We will use the susequent parameters and leave the rest to the default:
- -t= p (protein) d (DNA) c (codons)
- -b4 Minimum Length Of A Block
- -b5 Allowed Gap Positions
Let's use this string one PCGs gene:
Gblocks data/Single_Copy_Orthologue_Sequences/N0.HOG0000020.aln -t=p -b4=5 -b5=a
The analyses will generate two kind of files:
-gb: filtered fasta files-gb.htm: for visualising which position got removed
Try to perform the following exercise(s):
- Open the .html file and take a look. Do you think the alignement got better?
- Perform the filtering again using
-b4=59 -b5=n. Do you think that what is left will be usefull for phylogenetics?