high performance computing on graphics processing units: hgpu.org

On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit

Shucai Xiao, A.M. Aji, Wu-chun Feng

Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA

15th International Conference on Parallel and Distributed Systems (ICPADS), 2009

DOI:10.1109/ICPADS.2009.110

Graphics processing units (GPUs) have been widely used to accelerate algorithms that exhibit massive data parallelism or task parallelism. When such parallelism is not inherent in an algorithm, computational scientists resort to simply replicating the algorithm on every multiprocessor of, for example, an NVIDIA GPU to create such parallelism, resulting in embarrassingly parallel ensemble runs that deliver significant aggregate speed-up. However, the fundamental issue with such ensemble runs is that the problem size that can achieve this speed-up is limited by the available shared memory and cache of a GPU multiprocessor. An example of the above is dynamic programming (DP), one of the Berkeley 13 dwarfs. All known DP implementations to date use the coarse-grained approach of embarrassingly parallel ensemble runs because a fine-grained parallelization on the GPU would require extensive communication between the multiprocessors of a GPU, which could easily cripple performance, as communication between multiprocessors is not natively supported in a GPU. Consequently, we address the above by proposing a fine-grained parallelization of a single instance of the DP algorithm that is mapped to the GPU. Our parallelization incorporates a set of techniques aimed at substantially improving GPU performance: matrix re-alignment, coalesced memory access, tiling, and GPU (rather than CPU) synchronization. The specific DP algorithm that we parallelize is called Smith-Waterman (SWat), which is an optimal local sequence alignment algorithm. We then use this SWat algorithm as a baseline to compare our GPU implementation, i.e., CUDA-SWat, to our implementation on the Cell Broadband Engine, i.e., Cell-SWat.
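
To illustrate why a single Smith-Waterman instance admits fine-grained parallelism at all, the following is a minimal Python sketch, not the authors' CUDA-SWat code. It fills the scoring matrix one anti-diagonal at a time: every cell on an anti-diagonal depends only on the two preceding anti-diagonals, so those cells could in principle be computed concurrently, with a synchronization point between diagonals. The function name, the linear gap penalty, and the scoring values (`match=2`, `mismatch=-1`, `gap=-1`) are assumptions chosen for illustration; the abstract does not specify the scoring scheme used in the paper.

```python
def smith_waterman_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Best local-alignment score of strings a and b.

    Illustrative sketch: linear gap penalty and scoring values are
    assumptions, not taken from the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # Row 0 and column 0 stay at zero (local-alignment boundary).
    H = [[0] * cols for _ in range(rows)]
    best = 0

    # Anti-diagonal d contains every interior cell (i, j) with i + j == d.
    for d in range(2, rows + cols - 1):
        i_lo = max(1, d - (cols - 1))
        i_hi = min(rows - 1, d - 1)
        # Cells on this diagonal read only diagonals d-1 and d-2, so this
        # inner loop is the part a GPU could run in parallel; moving on to
        # the next d corresponds to a synchronization step.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # match or mismatch
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
    return best
```

For example, `smith_waterman_wavefront("ACGT", "ACGT")` scores four matches at 2 each. The sequential loop over `d` is where the inter-multiprocessor communication cost mentioned in the abstract would arise on a GPU, which is what techniques such as tiling and GPU-side synchronization aim to reduce.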

Tags: Algorithms, Bioinformatics, Biology, Cell processor, CUDA, Data parallelism, nVidia, nVidia GeForce GTX 280, Sequence alignment, Smith-Waterman algorithm

July 23, 2011 by hgpu
