Sunday, October 13, 2013

Bioinformatics for Biologist: More efficient ways to run all-against-all blast



All-against-all blast means searching a set of sequences against itself. It is useful for finding paralogous genes and alternative splicing isoforms within a dataset. It involves two simple steps: 1) create a database from the sequence file; 2) blast the sequence file against its own database. You can choose which blast program and algorithm to use. It works like a normal blast search, except that the database is built from the query file itself. An example of the commands is as follows:

./makeblastdb -in file1.fa -dbtype nucl -out database1
./blastn -task megablast -db database1 -query file1.fa -out results1.txt -evalue 1E-10 -outfmt 6

However, the process takes a long time and produces a large output file, because it reports a huge number of alignments of each sequence against itself ("Seq1 Seq1"). What we are really interested in are the alignments such as "Seq1 Seq2" or "Seq1 Seq3". When I ran blastn (with the blastn algorithm) on a 20 MB file, I estimated that it would take 16 hours and produce several GB of output, so I stopped the analysis immediately. When I googled this problem, I found very few resources: one discussion on Biostar, one Python script and an old program, NBLAST, which was published in 2002.

Here is the step-by-step guide on how I do all-against-all blast more efficiently:

1) Reduce the time by splitting the sequence file into several files and running blast simultaneously on these files (as suggested in the Biostar discussion).

Using AWK, I split the file into records at each ">" and give each smaller file a range of record numbers. This is an example for a file which contains 66902 sequences. With ">" as the record separator, record 1 is the empty text before the first ">", so the first range starts at record 2 to skip it. The first three files get 16724 sequences each and the last file gets 16730. Using printf rather than print avoids adding a blank line after every record.

awk 'BEGIN{RS=">"}NR>=2&&NR<=16725{printf ">%s", $0}' file1.fa > file1.fa1
awk 'BEGIN{RS=">"}NR>=16726&&NR<=33449{printf ">%s", $0}' file1.fa > file1.fa2
awk 'BEGIN{RS=">"}NR>=33450&&NR<=50173{printf ">%s", $0}' file1.fa > file1.fa3
awk 'BEGIN{RS=">"}NR>=50174{printf ">%s", $0}' file1.fa > file1.fa4
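The record ranges above only fit a file with 66902 sequences. As a rough sketch, the chunk size can be computed instead of typed in by hand. The function name split_fasta is hypothetical (it is not part of blast or any standard tool), and the sketch assumes every sequence header is a line starting with ">".

```shell
# Hypothetical helper: split a FASTA file into N chunks with roughly equal
# numbers of sequences. Output files are named <input>1, <input>2, ...
split_fasta() {
    fasta=$1
    chunks=$2
    # Count the sequences by counting header lines.
    total=$(grep -c '^>' "$fasta")
    # Sequences per chunk, rounded up so that no sequence is left over.
    per_chunk=$(( (total + chunks - 1) / chunks ))
    awk -v size="$per_chunk" -v prefix="$fasta" '
        /^>/ {
            # Start a new output file every "size" sequences.
            if (n % size == 0) {
                if (out != "") close(out)
                part++
                out = prefix part
            }
            n++
        }
        # Copy every line (header or sequence) to the current chunk.
        out != "" { print > out }
    ' "$fasta"
}

# Usage: split_fasta file1.fa 4   (writes file1.fa1 ... file1.fa4)
```

Because the chunk size is rounded up, the last file may hold slightly fewer sequences than the others, so the counts will not match the hand-picked ranges exactly. Each chunk can then be searched against database1 as a separate blastn job, for example by ending each command with & and calling wait afterwards.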

2) Remove the alignments of each sequence against itself. Instead of specifying an output file in the blast command, I pipe the output to AWK, which drops the alignments that have the same query and subject sequence. I also filter for alignments with at least 40% identity and an alignment length of at least 300 bp.

blastn -task megablast -db database1 -query file1.fa1 -evalue 1E-10 -outfmt 6 | awk '$1!=$2 && $3>=40 && $4>=300' > file1_1.blast

Note: If you're dealing with sequences which include several alternative splice forms, you might consider using the following command instead. It compares the part of the name before the first "." in the query (column 1) and the subject (column 2), so it also removes the alignments between contig001.1 and contig001.2.

blastn -task megablast -db database1 -query file1.fa1 -evalue 1E-10 -outfmt 6 | awk '{split($1,a,"."); split($2,b,"."); if (a[1]!=b[1] && $3>=40 && $4>=300) print }' > file1_1.blast

3) Lastly, we need to remove the redundant alignments between any two sequences. For seq1 and seq2, you will find two nearly identical alignments, namely "Seq1 Seq2" and "Seq2 Seq1". However, these two alignments can differ slightly in alignment length etc. Therefore, the best way to match them up is by the bit score (column 12 in -outfmt 6).

awk '{c=$1"\t"$2"\t"$12; b=$2"\t"$1"\t"$12; if (!(c in a) && !(b in a)) a[c]=$0} END{for (i in a) print a[i]}' file1_1.blast > file1_1fil.blast
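The command above only treats "Seq1 Seq2" and "Seq2 Seq1" as duplicates when their bit scores are identical. A different approach, sketched below, is to keep only the highest-scoring alignment for each unordered pair. This is not the method used in this post, and it discards any additional alignments between the same two sequences. The function name best_hit_per_pair is hypothetical, and the input is assumed to be tab-separated -outfmt 6 with the bit score in column 12.

```shell
# Hypothetical helper: keep one alignment per unordered pair of sequences,
# choosing the one with the highest bit score (column 12 of -outfmt 6).
best_hit_per_pair() {
    awk -F'\t' '
        {
            # Order the two IDs so "Seq1 Seq2" and "Seq2 Seq1" share one key.
            # Appending "" forces a string comparison even for numeric IDs.
            key = ($1 "" < $2 "") ? $1 SUBSEP $2 : $2 SUBSEP $1
            if (!(key in best) || $12 + 0 > score[key]) {
                best[key] = $0
                score[key] = $12 + 0
            }
        }
        # The output order is not defined; pipe through sort if it matters.
        END { for (key in best) print best[key] }
    ' "$@"
}

# Usage: best_hit_per_pair file1_1.blast > file1_1fil.blast
```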


Thursday, October 3, 2013

Applying 80/20 rule in bioinformatics analysis



I have always been a firm believer in quality before speed. Ever since I read about the 80/20 rule last month, I have been questioning how productive the things I do really are. It has let me reflect on my strengths and weaknesses. Just two weeks ago, I recorded my personal fastest time to complete an analysis from scratch. It took me less than a day to assemble some 454 sequences I downloaded from NCBI SRA! I spent half a day on installation and half an hour on the assembly (well, it was a small dataset). I had thought I would need more than a week to trim and filter reads, install and try different software, try different parameters, etc. Instead, I settled for a "good enough" assembly and saved the time for more important tasks.

What is the 80/20 rule? It is also known as the Pareto principle, named after the Italian economist Vilfredo Pareto. In 1906, he observed that 80% of the wealth was owned by 20% of the population. This concept has been widely applied in business. Although the real ratio often deviates from 80/20, the rule generally implies that most results come from a small share of the effort, while a large share of the effort produces very little. By investing our effort in the 20% of the work that really matters and reducing effort on the 80% that matters least, we can increase our output tremendously.

Identifying the 80/20 rule in bioinformatics analysis. First, I need to break down my workflow into steps, identify which steps consume the most time and work out how to improve my efficiency. The flowchart below shows my typical bioinformatics analysis, which involves four basic steps: 1) Identification of appropriate bioinformatics tools; 2) Reading manuals and publications; 3) Installation; 4) Running analysis.



Running the analysis is the 20% of the work that gives me 80% of the results. The most exhausting stages are choosing the appropriate tools, reading publications and manuals, and installation. So naturally, I want to reduce the time I spend in these areas. But how?

  • Do I really need to install the software locally? Is a web server available? Are there any scripts written by good Samaritans available online? Is the dataset too big for those options?
  • Choosing the appropriate software. Mostly I have several software options. I start with the most popular, well-supported and well-documented software. Is the software up to date, with a recent release? If not, there is a good chance that I will not be able to install it on my system.
  • Reading the manual. Some manuals can be 200 pages long while some can be as short as a README text file. By skimming through the whole manual and reading only what I need, I can save a lot of time. I always read the parts on introduction, installation and the analysis I need.
  • Installation. I always make sure I have installed the required packages and dependencies before installing the software. Some software developers will not tell you this in the manual, so be sure to google it.
  • The software may not work properly for various reasons, such as an incomplete installation or wrong input files. To identify the problems, I usually do a test run on a small dataset. My latest practice is to save the screen output into a log file and search it for errors.
  • It's time to call for help if the step above still can't solve the problem. The fastest way is of course googling the problem. I have found that I get to a solution faster if I know the specific problem, the error message and the right terms to search for. Posting on a forum or writing to the developer is the last resort, as it may take a few days before I get a reply.
  • Do something productive while waiting for the analysis to complete. I'm drafting this blog post as I'm running Genscan locally. Estimating the running time is very helpful in managing my time, but I often don't know how long it will take. I have a little Unix trick that plays a sound when the command finishes running. That way, I get notified when the analysis completes. Whatever you plan to do during the waiting time, be prepared to be interrupted once the analysis has completed.
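As an illustration of the log file and sound notification ideas in the list above, here is a minimal sketch. It is only one possible way to do it and not necessarily the trick I use; the function name run_logged is hypothetical, and whether the terminal bell actually makes a sound depends on your terminal settings.

```shell
# Hypothetical wrapper: run a command, keep a copy of everything it prints
# in a log file, and ring the terminal bell when it finishes.
run_logged() {
    log=$1
    shift
    # 2>&1 sends error messages to the same log as the normal output.
    "$@" 2>&1 | tee "$log"
    # \a is the ASCII bell character; many terminals beep or flash on it.
    printf '\a'
}

# Usage: run_logged run.log <command and its arguments>
# Afterwards, search the log for problems: grep -i error run.log
```

Note that because the output goes through a pipe, the exit status of the original command is not passed on, so the log is the place to check whether the run succeeded.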
Feel free to post any suggestions in the comment box.



Sunday, June 9, 2013

Ten botanical gardens in Europe




Recently I spent six weeks in Europe as my graduation vacation. It was a great time to be in Europe because it was spring, although it arrived late this year. It's interesting to know that there is a botanical garden in almost every major city. When the weather is sunny and warm, locals and tourists flock to parks and gardens to have a picnic and relax. I enjoy going to these botanical gardens because I like to learn more about plants, especially temperate plants. A visit to a botanical garden always makes me feel inspired and connected to nature, and sets me dreaming of having my own garden. Unfortunately, many of these botanical gardens charge an entrance fee. I only visited a few due to time and cost constraints.

According to Wikipedia, a botanical garden is a well-tended area displaying a wide range of plants labelled with their botanical names. Botanical gardens first started as collections of spices and plants of medicinal and economic importance. There were many small gardens in monasteries to provide food for monks. It was in one of these gardens that the fundamentals of genetics, Mendel's laws, were discovered. These gardens continued to be maintained over the years, and the few left were turned into the botanical gardens of universities. A detailed history of botanical gardens can be found in this link. Besides botany, botanical gardens play a big role in plant conservation, education, research and general public interest.

My online search for a comprehensive list of botanical gardens in Europe did not yield the expected results. There is a list of botanical gardens from the European Botanical Garden Consortium, but it is not complete. Most lists and reviews are posted by travel websites (see the following). Therefore, these articles are written according to the interests of tourists and can be limited by the writer's experience.

So where are the lists and reviews for plant enthusiasts like me? I guess I have to come up with my own list (sorted in the order of my visits). Scroll down to see reviews and photos.

1. Kew Garden (London, UK)
2. Jardin Des Plantes (Paris, France)
3. Keukenhof garden (The Netherlands)
4. Botanical garden, University of Utrecht (Utrecht, The Netherlands)
5. Hortus Botanicus, University of Leiden (Leiden, The Netherlands)
6. Botanical garden, University of Potsdam (near Berlin, Germany)
7. Palm Garden of Hofburg palace (Vienna, Austria)
8. Schonbrunn palace (Vienna, Austria)
9. Botanical Garden, University of Vienna (Vienna, Austria)
10. Garden of Pitti Palace (Florence, Italy)

1. Kew Garden
Location: London, England
Admission: GBP 16
Review:
Famous for the world's oldest and largest herbarium. One can easily spend the whole day here exploring the conservatory, greenhouses and many other attractions in the garden. This place is more than just flowers and trees. Birds, lake, pond, arts, English countryside, river, plant market, you name it! When I was there in mid-April, the cherry blossom, glory of snow, magnolia, daffodil and narcissus were in full bloom. I was delighted to see a few Eucalyptus and redwoods such as Sequoiadendron giganteum. The weather was warm and sunny in the morning before turning gloomy and cold in the late afternoon. The admission fee is pricey considering London is famous for free entrance to museums. Lucky for me, a friend doing a PhD there took me in for free!

 Clockwise from top left: Main entrance and souvenir shop, Princess of Wales Conservatory, research building and herbarium.

 Clockwise from top left: Pink cherry blossom, blue flowers known as Glory Of Snow, an old and lazy tree which grows sideways, spring flowers in the greenhouse, Bonsai apple tree.

Clockwise from top left: Chinese pagoda, English cottage, tree platform, Japanese garden. 

2. Jardin Des Plantes
Location : Paris, France
Admission: Free. The greenhouse requires an entrance fee.
Review:
On a sunny day, Jardin Des Plantes is less crowded than the parks near the Eiffel Tower. I was delighted to learn that the botanical garden is open to all for free. A fee is required to enter the greenhouse. There is no way I'm paying money to see banana and palm trees! The botanical garden is well-maintained and organized. The plants are sorted into different sections according to their families. All the plants are well-labeled and descriptions are sometimes provided, though only in French. I spent about two hours there. Before leaving, I sat down on a bench to wipe my shoes because the walkway was sandy and dusty. A highly recommended place.

Clockwise from top left: Garden overview, sandy walkway and green house.


Clockwise from top left: Cherry blossom, poppy flowers and Medicago sativa

3. Keukenhof Garden
Location: Lisse, The Netherlands.
Admission: EUR 22.50 (Entrance + bus ride from Leiden train station)
Review:
Although Keukenhof is not a botanical garden, it is a place not to be missed in spring if you ever go to The Netherlands. The overwhelming recommendations for this place made me skeptical at first. Honestly, this place is not overrated. It is the MOST BEAUTIFUL & COLOURFUL garden I have ever been to! There is no other place like this on earth. I'll let the photos do all the talking.

Besides tulips, there are colourful arrangements of hyacinth, daffodil and narcissus and some other flowers. I also visited the two exhibitions on arts, orchids, kalanchoe and hippeastrum. I spent about two hours there (which is enough if you're not taking photos crazily). Do bring company because you wanna take many nice photos of yourself. If you have more time, you can rent a bike and explore the tulip farms around the garden. 

 Tulips in Keukenhof garden.

Keukenhof garden, pond and exhibitions.

4. Botanical garden by The University of Utrecht
Location: Utrecht, The Netherlands
Admission: Free admission
Review: A small garden maintained by the university. It's worth having a look if you ever go to Utrecht.

 Botanical garden by The University of Utrecht 

5. Hortus Botanicus, University of Leiden
Location: Leiden, The Netherlands
Admission: EUR 7
Review:
The oldest botanical garden in The Netherlands. I didn't get the chance to go inside because I had very little time to spend in Leiden. I could see a garden and a greenhouse from the main entrance. There were several types of tulips at the main entrance, as the theme was tulips at the time of my visit.



6. Botanical garden, University of Potsdam
Location: Potsdam, Germany
Admission: Unknown. According to one website, it's EUR 2 which is too good to be true.
Review:
Potsdam is a UNESCO heritage site that you can't miss when you go to Berlin. The botanical garden is located within the beautiful Park Sanssouci. I think tourists need to spend two full days to explore all that Park Sanssouci has to offer. I didn't visit this botanical garden due to time constraints; however, I passed it several times.

Botanical garden, University of Potsdam (Image source: Wikipedia)

7. Palm Garden at Hofburg palace
Location: Vienna, Austria
Admission: Unknown
Review:
It was a rainy morning when I went to Hofburg palace. I was surprised to find a palm house in the city centre. The palm house is located right next to the butterfly house. It looks like the Austrian royal families loved palm houses! I didn't go in because I have never been a big fan of greenhouses and palms (I see them every day).

Palm Garden at Hofburg palace

8. Gardens at Schonbrunn palace
Location: Vienna, Austria
Admission: Depends
Review:
There are many gardens in Schonbrunn palace, such as the botanical garden, palm house, Crown Prince garden (Admission 3) and Maze & Labyrinth (Admission EUR 4.5), and each charges a fee to go in. I think all the gardens were blocked from the visitors' view by tall trees and shrubs. According to the official site, the botanical garden was turned into an English-style garden. Since there are many beautiful sights around the palace, such as the fountain and buildings, I wouldn't recommend spending extra to get into these gardens if you only spend one day at the palace.

 Schonbrunn palace garden
Other sights in Schonbrunn palace garden

9. Botanical Garden, University of Vienna
Location: Vienna, Austria
Admission: Unknown
Review:
This botanical garden is located right next to Belvedere palace. It was constructed in 1754 under the order of Empress Maria Theresa. The entrance can be hard to spot as it is a bit hidden. From the main entrance, I could see two paths leading into the botanical garden. A few joggers were spotted. The plants near the ticket counter looked healthy and well-labeled. According to Wikipedia and Gardenvisit, the botanical garden is quite big and has several greenhouses, which are not open to the public. Unfortunately, I didn't have time to visit this place because I had to catch a train to Florence.

Top: palace garden view from Upper Belvedere; Bottom: Botanical garden, University of Vienna

10. Boboli Garden at Pitti Palace
Location: Florence, Italy
Admission: EUR 13
Review:
This garden in Florence used to belong to the rich and powerful Medici family. The Italian garden seems to consist of fountains and many sculptures. I caught a few glimpses of it from the Pitti Palace window. The ticket costs the same as the palace ticket, and the combo ticket is not much cheaper.


Views of the garden from Pitti Palace



Wednesday, February 13, 2013

Concerns on the graduate student dropouts in Malaysia




This topic has been lingering in my mind for a while now. For the past few months, I have talked to several students about their choice to drop out of their graduate programs. My 4-year PhD experience tells me that at least half of the graduate students in life sciences never complete a degree. I tried to make a list of students I know who dropped out of graduate school. I stopped counting at ten because the list just got longer and longer. The majority of them quit during the first year of involvement. I use the word “involvement” rather than referring to a formal graduate program because some students stopped before starting a research program. These can be students who are preparing research proposals or working as research assistants for 6-12 months with the intention of doing a postgraduate study.


Let’s look at some online facts
  • The dropout rate for postgraduate students is generally around 50%, and lower in science subjects.
  • According to the Chronicle of Higher Education (2004), 37% of students who begin a PhD never obtain the degree.
  • Women drop out at a higher rate than men (Chronicle of Higher Education, 2004).
  • Around 25-29% of PhD students in life sciences drop out, according to Elearners.com.
  • No comprehensive national statistics are available in Malaysia. In 2011, the New Straits Times reported that 3 out of 10 part-time PhD students never complete their degree due to work and family problems.
The number of years before you think you can graduate

Why do students hesitate?

1. Timely completion. I think slow progress or not getting the expected results is the main reason why many students give up. They might think that a specific task is too daunting or impossible to complete. They do not have confidence in overcoming the problems or getting help from others. In addition, some students hold unrealistic expectations of completing their degrees within a short period of time, say three years sharp for a PhD. They might be afraid of financial or visa problems if they do not finish in time. This eventually gives rise to other problems because more importance is placed on speed than on the quality of the research.

2. Commitment and interest. Most students want to join a postgraduate program because they are interested in science and research, but that interest can change over time. Some might want to improve their chances of finding a job, some dream of getting a higher degree and some hope to work in academia. They might find that doing a postgraduate program is no longer suitable. They did not obtain the results or satisfaction they expected. They might join a program without fully understanding it and leave once they do.

3. Money. Money is the problem in some cases. I think that stipends and debts such as study loans pose very little problem, because postgraduate students in Malaysia usually receive a monthly stipend more than sufficient to cover their living expenses and tuition fees. A lot of students worry about grants and research funding. Is it going to run out soon? Will I be able to buy the kits and consumables needed? Will the grant be able to support me until I graduate? I think it is the supervisor’s responsibility to obtain grants and to assure the students that they will have sufficient materials, resources and financial support to complete a degree. When students think that there is nothing they can do, they choose to leave.

4. Interaction with supervisor. The quality of the relationship between student and supervisor is often cited as the heart of the problem. It takes several years to foster a good relationship and trust between the two parties. This relationship is usually weak during the first year, when most dropouts occur. One common problem I have spotted is that students are often too timid to discuss problems with their supervisors and worry too much about what their supervisors might think.

5. Pressure to publish. Nowadays students face increasing pressure to publish their research because publishing has been made compulsory for graduation. In UKM, Master and PhD students are required to publish one and two papers respectively in scientific journals. Some of their worries are related to timely completion, getting enough data for a publication and fear of being rejected by a journal. I strongly support this policy because most students who graduate without publishing never publish. In addition, I think publication gives students greater satisfaction and motivation to thrive in their research.

6. Family. Family commitments, getting married and starting a new family are some of the reasons students drop out of graduate school. Some students, especially young ones, find it difficult to cope with family and studies simultaneously. A few might need to sacrifice their studies for their spouses. Only determined students who receive great family support can successfully complete their studies. Based on my personal observation, students with family commitments who are sponsored by an institution or company to pursue a higher degree are less likely to drop out but more likely to extend their semesters.


My advice for the hesitating
  • Identify what makes you hesitate.
  • Think about why you want to get a Master or PhD degree. Do you still want it now?
  • Seek advice from your supervisor early on. Once you have identified the problems you face, I strongly recommend that you let your supervisor know so he/she can help you.
  • Seek advice from other graduate students. They are often more than willing to share their personal experiences and give technical advice and tips on how to improve communication with your supervisor and others.
  • Take a holiday if you find that work is stressing you out a lot. It’s time to take a break if you keep repeating a certain experiment and are unable to get any result. It is important to have a well-balanced work and life.
  • Get emotional support from friends and family. Let them understand that doing a postgraduate study can be a long and stressful process.

Possible future solutions

Some solutions to reduce the high dropout rates among PhD students have been proposed in the articles I read online. First, there is a need for proper statistics on the actual number of PhD dropouts. A problem well-defined is a problem half solved. Another suggestion is to choose only the best students for PhD programs. The percentage of students completing a PhD rises to 70-75% after rigorous selection by a scholarship committee. In China, students are required to take an entrance exam to enroll in postgraduate programs at the Chinese Academy of Sciences. This is similar to the Graduate Record Examinations (GRE), which international students must pass before joining universities in Singapore and the United States. One author suggested that sufficient training in time management and manuscript/thesis writing would curb the problem. One of my suggestions is to strengthen bonds and relationships between postgraduate students through activities organized by the graduate student association. To prevent dropouts under the Graduate Research Assistant (GRA) scheme, perhaps students should sign a contract which requires compensation if they do not complete the study.

The life sciences research scene in Malaysia has been changing rapidly over the past few years. Nowadays principal investigators are more concerned about meeting milestone deadlines, the pressure to publish and competition in grant writing. A few years ago, it was very rare to hear that a postgraduate student had been fired due to poor performance, but I expect things to change pretty soon.


