Thursday, June 24, 2010

Can't believe half year gone!



July 2010 marks the beginning of a new semester. Time flies. Lately, I noticed two questions that I was regularly asked: 1) How's your project going?; 2) When do you finish your PhD? *Sigh. Things that I should have done 3 months ago is still hanging in my list. Looking back, it's not a bad first half of the year!

A few days ago, a new desktop arrived in my lab. Finally! I had to install it twice because I couldn't partition the hard disk correctly. There is no such thing as root and swap space the last time I remember! Then, I spent two days updating the files, fixing the screen resolution and installing software. Problems never fail to arise when I use Linux. Next, I'm planning a Linux introduction workshop for my labmates.

Today it's the last day of Kolokium FST. It's an annual event where the final year postgrad student present their work. Coincidently, there is a seminar on marine biology and also one held in Inbiosis. Less people attended it compared to last year. This is a great opportunity to shamelessly helped myself to the food.

Read more...

Friday, June 18, 2010

Preprocessing of NGS reads: Trimming and filtering



The last few MGRC seminars I attended has been emphasizing the importance of pre-processing of NGS reads. SNPs detection relies heavily on read quality. Softwares like Maq and Samtools use quality score. But it's a different story when it comes to assembly because assembly doesn't use quality score. Some think that trimming is only necessary in very high coverage data.

Only the past few months, more tools on read quality assessment are available for the wider community. A recent IlluminaGAII 1.3+ Pipeline documentation reported that a run of bases with a quality score 2 or symbol 'B' is unreliable and should be removed (View discussion here). With increasing understanding and the tools available, I think the community can make a better decision to trim or not to trim.

Here is a list of trimming tools/scripts I found online:
1. HT Sequence Analysis with R and bioconductor
2. HTSeq by Simon Anders.
3. Softtrim.R by Jeremy Leipzig
4. TrimBWAstyle.pl by Joe Fass
5. Solexa_Sig2
6. Biopython

Bioconductor ShortRead package has been available since 2008. It requires a bit of R knowledge. It's only recently I realized that it has a function to trim reads to desired length. It only can perform right trimming.

HTSeq is able to trim adaptors but not removing low quality reads. The ht-seq-qa script is particularly useful. It generate a nice plot to show distribution of quality score over read position.

Here's how I perform trimming:
First, I installed Bioconductor ShortRead package. Make sure you use the latest version of R. Please note that trimming using R is RAM intensive. In my case, it took an hour to process a 1Gb file using 12Gb RAM computer. The program sometimes get killed or return an error message saying "Error: cannot allocate vector of size 1.5 Gb".

Then, convert the quality score of your fastq file to Sanger Phred score using IllQ2SanQ.pl script from UC Davis. Next, I used Leipzig's Softtrim R script. I can choose the min quality score, min read length of trimmed reads and enable left trimming. This script works better for me. Click here to see how to use his script. Lastly, I separated the trimmed reads into paired and single reads using a Python script.

Here's how my trimmed reads look like:

Before trimming

After trimming

*Plots are generated using htseq-qa script by Simon Anders

Read more...

  © Free Blogger Templates Spain by Ourblogtemplates.com 2008

Back to TOP