Bioinformatics for Biologist: More efficient ways to run all-against-all blast
All-against-all blast is used when we want to search the sequences against themselves using blast. It is useful to look for paralogous genes and alternative splicing isoforms within a dataset. Basically, it involves two simple steps: 1) Create the database; 2) Blast the sequence file to its own database. You can specify which blast and algorithm to use. It is similar to the normal blast except it uses the database created from its sequence file. An example of the commands are as follow:
./makeblastdb -in file1.fa -dbtype nucl -out database1
./blastn -task megablast -db file1 -query file1.fa -out results1.txt -evalue 1E-10 -outfmt 6
However, the process takes a long time and produce a large output file because it will report huge amount of alignments of the same sequence against itself ("Seq1 Seq1"). What we are really interested in are the alignments from "Seq1 Seq2" or "Seq1 Seq3" etc .When I run blastn -blastn on a 20MB file, I estimated that it will take 16 hours and produce several GB of output. When I realized the long running time required and large output file produced, I immediately stopped the analysis. When I googled this problem, I found very few resources available, namely one discussion on Biostar, one Python script and an old software NBLAST which was published in 2002.
Here is the step-by-step guide on how I do all-against-all blast more efficiently:
1) Reduce the time by spliting the sequence file into several files and run blast simultaneously on these files (as suggested in Biostar discussion).
Using AWK, I can split the files by first splitting the file using ">" and specifying the record numbers for each smaller files. This is an example for a file which contains 66902 sequences. This script removes the first record which is empty. It gives 16724 sequences to the first three files and 16730 sequences to the last file.
awk 'BEGIN{RS=">"}NR>=1&&NR<=16725{print ">"$0}' file1.fa > file1.fa1
awk 'BEGIN{RS=">"}NR>=16726&&NR<=33449{print ">"$0}' file1.fa > file1.fa2
awk 'BEGIN{RS=">"}NR>=33450&&NR<=50173{print ">"$0}' file1.fa > file1.fa3
awk 'BEGIN{RS=">"}NR>=50174{print ">"$0}' file.fa > file1.fa4
2) To remove the alignments resulted from the same sequences. Instead of specifying the output file in the blast command, I pipe the output to AWK which will remove the alignments that has the same subject and query sequence. I also performed additional filtering for alignments with at least 40% identity and at least 300 bp of alignment length .
blastn -task megablast -db database1 -query file1.fa1 -evalue 1E-10 -outfmt 6 | awk '$1!=$2 && $3>=40 && $4>=300' > file1_1.blast
Note: If you're dealing with sequences which contain several alternative spliceforms, you might consider using the following command. This will remove the alignments between contig001.1 and contig001.2.
blastn -task megablast -db database1 -query file1.fa1 -evalue 1E-10 -outfmt 6 | awk '{split($1,a,"."); split($1,b,"."); if (a[1]!=b[1] && $3>=40 && $4>=300) print }' > file1_1.blast
3) Lastly, we need to remove the redundant alignment from any two sequences. For seq1 and seq2, you will find two highly identical alignments, namely "Seq1 Seq2" and "Seq2 Seq1". However, these two alignments can differ slightly in alignment length etc. Therefore, the best way to remove them is by using total score (column 12).
awk '{c=$1"\t"$2"\t"$12 ; b= $2"\t"$1"\t"$12; if (c in a == 0 && b in a == 0) a[$1"\t"$2"\t"$12]=$0}END{for (i in a) print a[i]}' file1_1.blast > file1_1fil.blast
Related post:
1 comments:
Post a Comment