Parallelizing Big De Bruijn Graph Construction on Heterogeneous Processors
Shuang Qiu and Qiong Luo
The Hong Kong University of Science and Technology, The Hong Kong University of Science and Technology

De bruijn graph construction is the first step in de novo assemblers to connect input reads into a complete sequence without a reference genome. This step is both time and memory space consuming. To address this problem, we develop ParaHash, a system that partitions the input data in a compact format, parallelizes the computation on both the CPUs and the GPUs in a single computer, and performs hash-based de bruijin graph construction. This way, ParaHash utilizes all available processors to assemble big genomes that cannot fit into memory. Furthermore, we analyze the characteristics of genome data to set the hash table size, design concurrent hashing algorithms to handle the inherent multiplicity, and pipeline the data transfer and the computation for further efficiency. Our experiments on real-world genome datasets show that the workload was balanced across heterogeneous processors, and that ParaHash was able to construct billion-node graphs on a single machine with an overall performance up to 20 times faster than the state-of-theart shared-memory assemblers.