De Novo Genome Sequence Assembly
An organism's genome consists of base pairs (bp) formed by two strands of complementary bases. Reading a sequence of these bases or base pairs is termed genome sequencing, a process central to the study of genomes in bioinformatics. No current sequencing technology can read the code of life in its entirety in one go. Instead, sequencing machines output a large number of genome fragments called reads, typically millions or even billions of them. Sequence assembly refers to arranging and merging the reads into longer contiguous subsequences (contigs), with the goal of reconstructing the original sequence. In de novo sequence assembly, no reference sequence is used to aid the reconstruction process.
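To make the idea of merging reads into contigs concrete, the following toy sketch greedily joins the two sequences with the longest exact suffix-prefix overlap until no overlap remains. It is an illustration only and not the method of any particular assembler: the function names, the `min_len` threshold, and the example reads are assumptions made for this sketch, and real reads would also contain sequencing errors and reverse-complement orientations that are ignored here.

```python
def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Return the length of the longest suffix of `a` that is also a prefix of `b`.

    Overlaps shorter than `min_len` are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep input order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = suffix_prefix_overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads collapse into a single contig.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))     # ['ATGGCGTACCTGA']
# A read that overlaps nothing stays separate, so two contigs remain.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TTTTTAAAAA"]))  # ['TTTTTAAAAA', 'ATGGCGTACC']
```

The all-pairs search in this sketch is quadratic in the number of sequences per merge step, which is workable for a handful of reads but not for millions; that gap is why practical assemblers depend on specialized algorithms and data structures.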
Recent Next Generation Sequencing (NGS) technologies produce a very large number of reads in a short amount of time, and their high throughput has significantly reduced the experimental cost per base. In this way, they have opened up opportunities to study organisms at the genome level, promising a deeper understanding of genome regulation and biological mechanisms. A thorough study can assist in designing more effective drugs to cure diseases. Moreover, NGS technologies allow researchers to study the evolution of viruses and bacteria at an unprecedented pace, as was done during a recent E. coli outbreak in Europe. Such studies can, for instance, help to accelerate vaccine development.
The large volume of data produced by sequencing machines calls for an assembly process that is efficient in both running time and memory consumption. Our assembler PASQUAL, short for PArallel SeQUence AssembLer, is designed for shared memory parallelism and uses OpenMP because of its good tradeoff between performance and programmer productivity. Shared memory parallelism has become mainstream with the widespread production of multicore commodity processors. For PASQUAL we follow the overlap-layout-consensus (OLC) approach and use a careful combination of tailored algorithms and data structures to obtain high-quality solutions.
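The three phases that give OLC (overlap, layout, consensus) its name can be sketched on a toy input as follows. This is a minimal illustration under stated assumptions and not PASQUAL's implementation, whose tailored algorithms and data structures are not described in this text: the function names `tolerant_overlap` and `olc_assemble`, the parameters `min_len` and `max_mismatches`, the single-path layout, and the example reads are all hypothetical. The sketch assumes a non-empty set of reads that all come from the same strand and form one chain.

```python
from collections import Counter


def tolerant_overlap(a, b, min_len, max_mismatches):
    """Longest k such that the last k bases of `a` match the first k bases
    of `b` with at most `max_mismatches` substitutions (0 if k < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        mismatches = sum(x != y for x, y in zip(a[-k:], b[:k]))
        if mismatches <= max_mismatches:
            return k
    return 0


def olc_assemble(reads, min_len=5, max_mismatches=1):
    n = len(reads)

    # Overlap: score every ordered pair of reads. Each pair is independent
    # of the others, which is the kind of loop that shared memory
    # parallelism can distribute across cores.
    ov = {(i, j): tolerant_overlap(reads[i], reads[j], min_len, max_mismatches)
          for i in range(n) for j in range(n) if i != j}

    # Layout: start at a read that nothing overlaps into, then keep
    # following the unused successor with the longest overlap.
    has_incoming = {j for (i, j), k in ov.items() if k > 0}
    start = next((i for i in range(n) if i not in has_incoming), 0)
    offsets = {start: 0}  # read index -> position in the layout
    cur = start
    while True:
        candidates = [(ov[(cur, j)], j) for j in range(n)
                      if j not in offsets and ov[(cur, j)] > 0]
        if not candidates:
            break
        k, nxt = max(candidates)
        offsets[nxt] = offsets[cur] + len(reads[cur]) - k
        cur = nxt

    # Consensus: majority vote over the bases stacked in each column.
    length = max(off + len(reads[i]) for i, off in offsets.items())
    columns = [Counter() for _ in range(length)]
    for i, off in offsets.items():
        for p, base in enumerate(reads[i]):
            columns[off + p][base] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)


# Four reads of length 8 sampled every 2 bases from "ACGTTGCAAGCTTA";
# the read "GTTGCTAG" carries one substitution error (A -> T).
reads = ["TGCAAGCT", "ACGTTGCA", "CAAGCTTA", "GTTGCTAG"]
print(olc_assemble(reads))  # ACGTTGCAAGCTTA
```

In this example the erroneous base is outvoted three to one in its column, so the consensus phase recovers the original sequence even though one read disagrees with it.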
Our experimental results show that, given enough CPU cores, our multi-threaded PASQUAL implementation is faster than any other tool we could run on our test platforms. PASQUAL is capable of handling data with billions of bases, thus enabling biologists to assemble larger data sets in less time. Unlike SOAPdenovo, which is the only tool with comparable speed (or, in several experiments with fewer CPU cores, even better speed), PASQUAL is not restricted to k-mer (or overlap) lengths smaller than 128, and it produces significantly fewer assembly errors.