Sunday, October 13, 2013

Bioinformatics for Biologist: More efficient ways to run all-against-all blast



All-against-all blast searches a set of sequences against itself using blast. It is useful for finding paralogous genes and alternative splicing isoforms within a dataset. Basically, it involves two simple steps: 1) create the database; 2) blast the sequence file against its own database. You can specify which blast program and algorithm to use. It is similar to a normal blast except that the database is built from the query file itself. An example of the commands is as follows:

./makeblastdb -in file1.fa -dbtype nucl -out database1
./blastn -task megablast -db database1 -query file1.fa -out results1.txt -evalue 1E-10 -outfmt 6

However, the process takes a long time and produces a large output file because it reports a huge number of alignments of each sequence against itself ("Seq1 Seq1"). What we are really interested in are the alignments such as "Seq1 Seq2" or "Seq1 Seq3". When I ran blastn (-task blastn) on a 20 MB file, I estimated that it would take 16 hours and produce several GB of output. When I realized how long it would run and how large the output would be, I immediately stopped the analysis. When I googled this problem, I found very few resources, namely one discussion on Biostar, one Python script and an old program, NBLAST, published in 2002.

Here is the step-by-step guide on how I do all-against-all blast more efficiently:

1) Reduce the running time by splitting the sequence file into several files and running blast simultaneously on them (as suggested in the Biostar discussion).

Using AWK, I can split the file by setting the record separator to ">" and specifying the record numbers for each smaller file. This is an example for a file containing 66902 sequences. The script skips the first record, which is empty. It puts 16724 sequences into each of the first three files and 16730 sequences into the last file.

awk 'BEGIN{RS=">"}NR>=2&&NR<=16725{print ">"$0}' file1.fa > file1.fa1
awk 'BEGIN{RS=">"}NR>=16726&&NR<=33449{print ">"$0}' file1.fa > file1.fa2
awk 'BEGIN{RS=">"}NR>=33450&&NR<=50173{print ">"$0}' file1.fa  > file1.fa3
awk 'BEGIN{RS=">"}NR>=50174{print ">"$0}' file1.fa > file1.fa4
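Hard-coding the record boundaries works, but for other file sizes the same split can be scripted. Here is a sketch of a generic splitter; the split_fasta helper name is mine, and it assumes a plain multi-FASTA file. It writes chunks named <input>.1, <input>.2, ... with roughly equal sequence counts:

```shell
# Sketch: split a multi-FASTA into n chunks of roughly equal sequence counts.
# The split_fasta name is hypothetical; pass it the input file and chunk count.
split_fasta() {
  in=$1; n=$2
  total=$(grep -c '^>' "$in")
  per=$(( (total + n - 1) / n ))   # ceiling division: sequences per chunk
  awk -v per="$per" -v base="$in" '
    BEGIN { RS = ">"; ORS = "" }
    NR > 1 {                       # NR==1 is the empty record before the first ">"
      chunk = int((NR - 2) / per) + 1
      print ">" $0 > (base "." chunk)
    }' "$in"
}
```

For the 66902-sequence file above, `split_fasta file1.fa 4` would produce file1.fa.1 to file1.fa.4 without computing the boundaries by hand.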

2) Remove the alignments of each sequence against itself. Instead of specifying an output file in the blast command, I pipe the output to AWK, which removes the alignments whose subject and query are the same sequence. I also filter for alignments with at least 40% identity and at least 300 bp of alignment length.

blastn -task megablast -db database1 -query file1.fa1 -evalue 1E-10 -outfmt 6 | awk '$1!=$2 && $3>=40 && $4>=300' > file1_1.blast

Note: If you are dealing with sequences that contain several alternative splice isoforms, you might consider the following command instead. It also removes alignments between isoforms of the same contig, e.g. contig001.1 and contig001.2.

blastn -task megablast -db database1 -query file1.fa1 -evalue 1E-10 -outfmt 6 | awk '{split($1,a,"."); split($2,b,"."); if (a[1]!=b[1] && $3>=40 && $4>=300) print}' > file1_1.blast
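The chunks from step 1 can be blasted simultaneously by backgrounding each job and waiting for all of them. A sketch, assuming database1 and the chunk files file1.fa1 to file1.faN already exist (the run_all_chunks wrapper name is mine; it applies the same awk filter as above to each result):

```shell
# Sketch: one background blast job per chunk, then wait for all to finish.
# Assumes database1 and file1.fa1..file1.faN exist; run_all_chunks is a
# hypothetical helper name.
run_all_chunks() {
  n=$1
  for i in $(seq 1 "$n"); do
    blastn -task megablast -db database1 -query "file1.fa$i" \
           -evalue 1E-10 -outfmt 6 \
      | awk '$1 != $2 && $3 >= 40 && $4 >= 300' > "file1_$i.blast" &
  done
  wait   # block until every background blast job has finished
}
```

For four chunks, `run_all_chunks 4` produces file1_1.blast to file1_4.blast in roughly the time of the slowest chunk.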

3) Lastly, we need to remove the redundant alignments between any two sequences. For Seq1 and Seq2, you will find two highly similar alignments, namely "Seq1 Seq2" and "Seq2 Seq1". However, these two alignments can differ slightly in alignment length etc. Therefore, the best way to match them up is by the bit score (column 12).

awk '{c=$1"\t"$2"\t"$12; b=$2"\t"$1"\t"$12; if (!(c in a) && !(b in a)) {a[c]; print}}' file1_1.blast > file1_1fil.blast
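Since step 1 produced several per-chunk result files, the same deduplication can be run once over all of them. A sketch (the dedup_pairs helper name is mine; it keys on query, subject and bit score, exactly as above):

```shell
# Sketch: merge the per-chunk blast outputs and keep one line per reciprocal
# pair, keyed on columns 1, 2 and 12. The dedup_pairs name is hypothetical.
dedup_pairs() {
  cat "$@" | awk '{
    fwd = $1 SUBSEP $2 SUBSEP $12    # key for this line
    rev = $2 SUBSEP $1 SUBSEP $12    # key for the reciprocal hit
    if (!(fwd in seen) && !(rev in seen)) { seen[fwd]; print }
  }'
}
```

Usage: `dedup_pairs file1_*.blast > file1_fil.blast`. Note that because the bit score is part of the key, reciprocal hits whose scores differ will both be kept, as in the command above.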


Thursday, October 3, 2013

Applying 80/20 rule in bioinformatics analysis



I have always been a firm believer in quality before speed. Ever since I read about the 80/20 rule last month, I have started questioning how productive the things I do really are. It allows me to reflect on my strengths and weaknesses. Just two weeks ago, I recorded my personal fastest time to complete an analysis from scratch. It took me less than a day to assemble some 454 sequences I downloaded from NCBI SRA! I spent half a day on installation and half an hour on the assembly (well, it was a small dataset). I thought I would need more than a week to trim and filter reads, install and try different software, try different parameters, etc. Instead, I settled for a "good enough" assembly and saved the time for more important tasks.

What is the 80/20 rule? It is also known as the Pareto principle, named after an Italian economist, Vilfredo Pareto. In 1906, he observed that 80% of the wealth was owned by 20% of the population. This concept has been widely applied in business. Although the real ratio often deviates from 80/20, the rule generally implies that most results come from a small amount of effort, while a large proportion of effort gives very little impact or result. By investing our effort in the 20% of work that is really important and reducing effort on the 80% that matters least, we can increase our output tremendously.

Applying the 80/20 rule to bioinformatics analysis. First, I need to break my workflow down into steps and identify which areas consume the most time and how to improve my efficiency. The flowchart below shows my typical bioinformatics analysis, which involves four basic steps: 1) identification of appropriate bioinformatics tools; 2) reading manuals and publications; 3) installation; 4) running the analysis.



Running the analysis is the 20% of the work that gives me 80% of the results. The most exhausting stages are choosing the appropriate tools, reading publications and manuals, and installation. So naturally, I want to reduce the time I spend in these areas. But how?

  • Do I really need to install the software locally? Is a web server available? Are there scripts written by good Samaritans available online? Is my dataset too big for a web server?
  • Choosing the appropriate software. I usually have several options. I start with the most popular, well-supported and well-documented software. Is the software up to date, with a recent release? If not, there is a good chance I will not be able to install it on my system.
  • Reading the manual. Some manuals can be 200 pages long, while some are as short as a README file. By skimming the whole manual and reading only what I need, I can save a lot of time. I always read the introduction, the installation section and the parts covering the analysis I need.
  • Installation. I always make sure I have installed the required packages and dependencies before installing the software. Some developers do not mention these in the manual, so be sure to google them.
  • The software may not work properly for various reasons, such as incomplete installation or wrong input files. To identify the problem, I usually do a test run on a small dataset. My latest practice is to save the screen output to a log file so I can look for errors.
  • It's time to call for help if the steps above still can't solve the problem. The fastest way is, of course, googling the problem. I have found that I reach a solution faster if I know the specific problem, error message and the right syntax to search for. Posting on a forum or writing to the developer is the last resort, as it may take a few days to get a reply.
  • Do something productive while waiting for the analysis to complete. I'm drafting this blog post as I'm running Genscan locally. Estimating the running time is very helpful in managing my time, but I often don't know how long an analysis will take. I have a little Unix trick that plays a sound when a command finishes running, so I get notified when the analysis completes. Whatever you plan to do during the waiting time, be prepared to be interrupted once the analysis has completed.
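One way to implement that sound trick is to ring the terminal bell after the command returns. A minimal sketch (the notify_done wrapper name is mine; \a is the ASCII bell, which most terminals render as a sound or a visual flash):

```shell
# Sketch: run a command, then ring the terminal bell when it finishes.
# The notify_done name is hypothetical.
notify_done() {
  "$@"                 # run the wrapped command with all its arguments
  status=$?
  printf '\a' >&2      # bell goes to stderr so stdout stays clean for pipes
  return "$status"
}
```

Usage: `notify_done blastn -task megablast -db database1 -query file1.fa -outfmt 6 > results1.txt`, which rings the bell whether the job succeeds or fails, and preserves the command's exit status.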
Feel free to post any suggestions in the comment box.


