Friday, April 24, 2009

De novo NGS Assembly: SOLiD Bioinformatics workshop part II

Sorry it took so long to post Part II of the SOLiD bioinformatics workshop. Things that were puzzling to me two months ago all make sense now. I'm indebted to Dr. Sun JianDong for his kind explanation.

Dr. Sun spoke about de novo assembly. As we know, de novo assembly is fragment assembly without a reference. When a reference is used, genome sequences and existing EST databases can both serve that role, although the former has more advantages. Most assemblers designed for Sanger sequencing cannot handle the short reads generated by Next Generation Sequencing (NGS). Because of their length, short reads must be produced in large quantities and at greater coverage depths (Zerbino & Birney 2008). Since the introduction of NGS, the scientific community has been very focused on developing algorithms suited to these reads. Generally, there are two types of de novo NGS assemblers.

Hamiltonian path
  • also known as the overlap-layout-consensus approach (Batzoglou 2005).
  • each read is represented by a node, and each detected overlap between reads by an arc between the corresponding nodes (Zerbino & Birney 2008)
  • not suitable for short reads
  • Examples: SSAKE, SHARCGS, SHORTY, Edena
Eulerian approach
  • De Bruijn graph
  • less complex and more accurate
  • very sensitive to errors and low quality reads
  • reads are mapped as paths through the graph based on their k-mers.
  • Examples: Velvet, Euler-SR
Programs that use a De Bruijn graph are faster and more accurate than those based on the Hamiltonian path. However, these programs cannot tolerate even slight error rates such as 0.3% (Chaisson et al 2009). Therefore, erroneous read ends must be trimmed and error correction must be performed. The latest version of Euler-SR, Euler-USR, can assemble error-prone reads. Among all NGS assemblers, Velvet is still the most widely used program thanks to its fast and efficient assembly.
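
To make the De Bruijn idea a bit more concrete, here is a minimal Python sketch of how reads are broken into k-mers and linked into a graph. This is only an illustration, not how Velvet or Euler-SR are actually implemented; the function names `count_kmers` and `build_de_bruijn` and the toy reads are my own assumptions.

```python
from collections import Counter, defaultdict


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) seen across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_de_bruijn(reads, k):
    """Build a toy De Bruijn graph.

    Nodes are (k-1)-mers; each distinct k-mer adds one arc from its
    prefix (first k-1 bases) to its suffix (last k-1 bases).
    """
    graph = defaultdict(set)
    for kmer in count_kmers(reads, k):
        graph[kmer[:-1]].add(kmer[1:])
    # Sort successors so the output is deterministic.
    return {node: sorted(successors) for node, successors in graph.items()}


# Three made-up overlapping reads (hypothetical data).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
for node, successors in graph.items():
    print(node, "->", successors)
```

Overlapping reads share k-mers, so they collapse onto the same arcs without any pairwise read comparison. The same sketch also hints at why errors hurt so much: a single wrong base in the middle of a read can create up to k k-mers that exist nowhere in the genome, each adding a spurious arc. Low counts from `count_kmers` are one simple way to spot such k-mers.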

Next, he explained several metrics used in assembly evaluation:
  • N50 contig length - longer is better
  • Number of contigs - fewer contigs are desirable
  • Length of contigs - longer contigs are better
  • Coverage - the higher the coverage, the better the assembly.
(Note: N50 contig length = the contig size such that 50% of the assembly is contained in contigs of that size or greater)

Here's another definition of N50. If we sort the contigs from largest to smallest and start covering the genome in that order, N50 is the length of the contig that just reaches the 50% mark. Although longer contigs are better, N50 deteriorates rapidly as the low-coverage regions in the contigs increase in the attempt to obtain longer contigs. For microbial genome sequencing, the ideal assembly would give only one contig because the genome is typically a single circular chromosome. But in reality, that never happens.
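
The sorted-contig definition of N50 is easy to turn into code. Below is a small sketch; the function name `n50` and the contig lengths are made up for illustration, and I'm assuming the 50% is measured against the total assembled length rather than the true genome size.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths.

    Contigs are sorted from largest to smallest and added up until the
    running total reaches half of the total assembly length; the length
    of the contig that crosses that mark is the N50.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly: total length 300, so the 50% mark is 150.
contigs = [100, 80, 50, 30, 20, 10, 10]
print(n50(contigs))  # 100 + 80 = 180 >= 150, so N50 is 80
```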

Recent papers have shown that paired-end reads are extremely useful for improving assembly and resolving some repeat problems. Longer reads also help assembly, but only up to a barrier, according to Chaisson et al (2009). The barrier for E. coli is 35 nt, while assembly quality for the yeast genome doesn't improve much beyond 60 nt. This came as a total surprise. The same goes for coverage: assembly doesn't improve once coverage reaches a certain threshold. Some repeats and sequence complexity cannot be resolved by high coverage.
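
For anyone who wants to put numbers on "coverage", here is a rough sketch using the usual average-depth estimate (number of reads x read length / genome size). The function names are hypothetical, and the 4.6 Mb genome size is only an approximate figure for E. coli that I'm assuming for the example.

```python
import math


def coverage_depth(num_reads, read_length, genome_size):
    """Average coverage depth: total sequenced bases divided by genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


GENOME_SIZE = 4_600_000  # approximate E. coli genome size in bp (assumption)

# One million 35 nt reads give only a modest average depth.
print(round(coverage_depth(1_000_000, 35, GENOME_SIZE), 1))

# Reads required to reach 50x average coverage with 35 nt reads.
print(reads_needed(50, 35, GENOME_SIZE))
```

This is only an average: it says nothing about how evenly the reads are spread, which is part of why piling on more coverage cannot fix every repeat.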

And what exactly does all of this have to do with SOLiD bioinformatics? Nothing... because most of the assemblers mentioned above (except Velvet) cannot read colour space data. Hehe... I think I have to write another post to get to the point of this SOLiD Bioinformatics Workshop.

Related posts:
SOLiD Bioinformatics Part I : Introduction
SOLiD Bioinformatics Part III: SOLiD softwares

References:
Zerbino & Birney 2008. Velvet: Algorithms for de novo short read assembly using De Bruijn graphs
Chaisson et al 2009. De novo fragment assembly with short mate-paired reads: Does the read length matter?
