high performance computing on graphics processing units: hgpu.org

Kernel Specialization for Improved Adaptability and Performance on Graphics Processing Units (GPUs)

Nicholas John Moore

The Department of Electrical and Computer Engineering, Northeastern University, Boston, Massachusetts

Northeastern University, 2012

Graphics processing units (GPUs) offer significant speedups over CPUs for certain classes of applications. However, maximizing GPU performance can be difficult because of the relatively high programming complexity and frequent hardware changes. Important performance optimizations are applied by the GPU compiler ahead of time and require parameter values to be fixed at compile time. As a result, many GPU codes offer only minimal adaptability to variations among problem instances and hardware configurations. These factors limit code reuse and the applicability of GPU computing to a wider variety of problems. This dissertation introduces GPGPU kernel specialization, a technique for describing highly adaptable kernels that run with high performance across different generations of GPUs. With kernel specialization, customized GPU kernels incorporating both problem- and implementation-specific parameters are compiled for each combination of problem and hardware instance. This dissertation explores the implementation and parameterization of three real-world applications that use kernel specialization and target two generations of NVIDIA CUDA-enabled GPUs: large template matching, particle image velocimetry, and cone-beam image reconstruction via backprojection. Starting with high-performance, adaptable GPU kernels that compare favorably to multi-threaded and FPGA-based reference implementations, kernel specialization is shown to maintain adaptability while improving performance, both in speedup and in reduced per-thread register usage. The proposed technique offers productivity benefits, the ability to adjust parameters that otherwise must be static, and a means to increase the complexity and parameterizability of GPGPU implementations beyond what would otherwise be feasible on current GPU hardware.

Tags: Computer science, CUDA, FPGA, Image reconstruction, nVidia, Optimization, Thesis, Tesla C1060, Tesla C2070

October 26, 2012 by hgpu
