GPU-accelerated protein family identification for metagenomics
Xerox Innovation Group, Xerox Research Center, Webster, NY, USA
IEEE 27th International Symposium on Parallel & Distributed Processing Workshops and PhD Forum, 2013
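@inproceedings{wu2013gpu,
title={GPU-accelerated protein family identification for metagenomics},
author={Wu, Changjun and Kalyanaraman, Ananth},
booktitle={IEEE 27th International Symposium on Parallel \& Distributed Processing Workshops and PhD Forum},
year={2013}
}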
The clustering of putative protein/Open Reading Frame (ORF) sequences available from large-scale metagenomics survey projects is a core analytical function that has led to the identification and characterization of novel protein families of environmental microbial communities. The implementation of this function, however, is currently challenged not only by data size but also by data complexity. In this paper, we present a CPU-GPU implementation of a randomized graph clustering heuristic called Shingling, which was originally developed by Gibson et al. Our implementation uses the CPU and GPU for different stages of computation, offloading the most time-consuming steps to the GPU. Experimental results on a 2M ocean metagenomics data set obtained from the Sorcerer II Global Ocean Sampling project show that our new implementation achieves a ~7X speedup over our serial implementation without using asynchronous CPU-GPU communication, with the GPU part alone contributing a ~374X speedup in the accelerated part. Qualitative evaluation of the 2M data set shows that our method improves clustering sensitivity over existing methods and recruits more sequences into clusters without reducing overall specificity. As a demonstration of a large-scale run, we clustered a real-world homology graph containing 11M vertices and 640M edges, constructed from sequences of an ongoing Pacific Ocean metagenomics survey project, in about 94 minutes.
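To make the Shingling idea mentioned in the abstract more concrete, here is a minimal, CPU-only sketch of a single shingling pass over a graph. It is not the authors' CPU-GPU implementation. It assumes the commonly described form of the heuristic: for each of c random hash functions, a vertex's shingle is the set of its s neighbors with the smallest hash values, and vertices that produce the same shingle are grouped together. The function names (`make_hashes`, `shingles`, `first_pass`), the linear hash family, the default values of s and c, and the choice to key groups by trial index are all illustrative assumptions.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # Mersenne prime 2^31 - 1, used as the hash modulus


def make_hashes(c, seed=0):
    """Draw c random linear hash functions h(v) = (a*v + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(c)]


def shingles(neighbors, s, hashes):
    """Return one shingle per hash function for a single vertex.

    A shingle is the tuple of the s neighbors with the smallest hash
    values. Vertices with fewer than s neighbors produce no shingles.
    """
    if len(neighbors) < s:
        return []
    result = []
    for a, b in hashes:
        # Rank neighbors by hash value; break ties by vertex id.
        ranked = sorted(neighbors, key=lambda v: ((a * v + b) % PRIME, v))
        result.append(tuple(sorted(ranked[:s])))
    return result


def first_pass(adjacency, s=2, c=4, seed=0):
    """Group vertices by the shingles they generate.

    adjacency maps each vertex id (a non-negative int) to a set of
    neighbor ids. Returns a dict from (trial index, shingle) to the set
    of vertices that produced that shingle in that trial.
    """
    hashes = make_hashes(c, seed)
    by_shingle = defaultdict(set)
    for vertex, neighbors in adjacency.items():
        for trial, shingle in enumerate(shingles(neighbors, s, hashes)):
            by_shingle[(trial, shingle)].add(vertex)
    return by_shingle


if __name__ == "__main__":
    graph = {0: {10, 11, 12}, 1: {10, 11, 12}, 2: {20, 21, 22}}
    for key, members in sorted(first_pass(graph).items()):
        print(key, sorted(members))
```

In this toy graph, vertices 0 and 1 have identical neighbor sets and therefore always fall into the same groups, while vertex 2 shares no neighbors with them and never does. Vertices with heavily overlapping but not identical neighborhoods would share a shingle only in some trials, which is where the randomization matters. The per-vertex hashing and ranking work is independent across vertices, which is plausibly why such a step suits GPU offloading, though the sketch makes no claim about how the paper partitions work between CPU and GPU.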
May 25, 2013 by hgpu