Friday, June 18, 2010

Preprocessing of NGS reads: Trimming and filtering



The last few MGRC seminars I attended has been emphasizing the importance of pre-processing of NGS reads. SNPs detection relies heavily on read quality. Softwares like Maq and Samtools use quality score. But it's a different story when it comes to assembly because assembly doesn't use quality score. Some think that trimming is only necessary in very high coverage data.

Only the past few months, more tools on read quality assessment are available for the wider community. A recent IlluminaGAII 1.3+ Pipeline documentation reported that a run of bases with a quality score 2 or symbol 'B' is unreliable and should be removed (View discussion here). With increasing understanding and the tools available, I think the community can make a better decision to trim or not to trim.

Here is a list of trimming tools/scripts I found online:
1. HT Sequence Analysis with R and bioconductor
2. HTSeq by Simon Anders.
3. Softtrim.R by Jeremy Leipzig
4. TrimBWAstyle.pl by Joe Fass
5. Solexa_Sig2
6. Biopython

Bioconductor ShortRead package has been available since 2008. It requires a bit of R knowledge. It's only recently I realized that it has a function to trim reads to desired length. It only can perform right trimming.

HTSeq is able to trim adaptors but not removing low quality reads. The ht-seq-qa script is particularly useful. It generate a nice plot to show distribution of quality score over read position.

Here's how I perform trimming:
First, I installed Bioconductor ShortRead package. Make sure you use the latest version of R. Please note that trimming using R is RAM intensive. In my case, it took an hour to process a 1Gb file using 12Gb RAM computer. The program sometimes get killed or return an error message saying "Error: cannot allocate vector of size 1.5 Gb".

Then, convert the quality score of your fastq file to Sanger Phred score using IllQ2SanQ.pl script from UC Davis. Next, I used Leipzig's Softtrim R script. I can choose the min quality score, min read length of trimmed reads and enable left trimming. This script works better for me. Click here to see how to use his script. Lastly, I separated the trimmed reads into paired and single reads using a Python script.

Here's how my trimmed reads look like:

Before trimming

After trimming

*Plots are generated using htseq-qa script by Simon Anders

0 comments:

Post a Comment

  © Free Blogger Templates Spain by Ourblogtemplates.com 2008

Back to TOP