high performance computing on graphics processing units: hgpu.org

Optimizing Data Warehousing Applications for GPUs Using Kernel Fusion/Fission

Haicheng Wu, Gregory Diamos, Ashwin Lele, Jin Wang, Srihari Cadambi, Sudhakar Yalamanchili, Srimat Chakradhar

School of ECE, Georgia Institute of Technology, Atlanta, GA

Workshop on Multicore and GPU Programming Models, Languages and Compilers, 2012

Data warehousing applications represent an emergent application arena that requires the processing of relational queries and computations over massive amounts of data. Modern general purpose GPUs are high core count architectures that potentially offer substantial improvements in throughput for these applications. However, there are significant challenges that arise due to the overheads of data movement through the memory hierarchy and between the GPU and host CPU. This paper proposes a set of compiler optimizations to address these challenges. Inspired in part by loop fusion/fission optimizations in the scientific computing community, we propose kernel fusion and kernel fission. Kernel fusion fuses the code bodies of two GPU kernels to i) eliminate redundant operations across dependent kernels, ii) reduce data movement between GPU registers and GPU memory, iii) reduce data movement between GPU memory and CPU memory, and iv) improve spatial and temporal locality of memory references. Kernel fission partitions a kernel into segments such that segment computations and data transfers between the GPU and host CPU can be overlapped. Fusion and fission can also be applied concurrently to a set of kernels. We empirically evaluate the benefits of fusion/fission on relational algebra operators drawn from the TPC-H benchmark suite. All kernels are implemented in CUDA and the experiments are performed with NVIDIA Fermi GPUs. In general, we observed data throughput improvements ranging from 13.1% to 41.4% for the SELECT operator and queries Q1 and Q21 in the TPC-H benchmark suite. We present key insights, lessons learned, and opportunities for further improvements.
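
The fusion and fission ideas described in the abstract can be illustrated with a small CPU-side analogy. This is only a sketch under stated assumptions, not the authors' CUDA implementation: plain Python lists stand in for GPU buffers, and the names `select`, `unfused_pipeline`, `fused_pipeline`, and `fission` are hypothetical. The unfused version runs two SELECT passes and materializes an intermediate buffer between them. The fused version evaluates both predicates in one pass, so the intermediate buffer is never written. The fission helper splits the input into contiguous segments, which on a real GPU could let one segment's transfer overlap with another segment's computation.

```python
from typing import Callable, List

Predicate = Callable[[int], bool]


def select(rows: List[int], pred: Predicate) -> List[int]:
    """One unfused SELECT pass: reads every input row, writes a new output buffer."""
    return [r for r in rows if pred(r)]


def unfused_pipeline(rows: List[int], p1: Predicate, p2: Predicate) -> List[int]:
    # Two dependent passes. The intermediate list plays the role of a
    # buffer that the first kernel writes and the second kernel reads back.
    intermediate = select(rows, p1)
    return select(intermediate, p2)


def fused_pipeline(rows: List[int], p1: Predicate, p2: Predicate) -> List[int]:
    # One pass. Both predicates are applied while the row is held in a
    # local variable, so no intermediate buffer is materialized.
    return [r for r in rows if p1(r) and p2(r)]


def fission(rows: List[int], num_segments: int) -> List[List[int]]:
    """Split rows into at most num_segments contiguous segments (hypothetical helper)."""
    size = max(1, -(-len(rows) // num_segments))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def fused_over_segments(rows: List[int], p1: Predicate, p2: Predicate,
                        num_segments: int) -> List[int]:
    # Fusion and fission combined: run the fused pass on each segment in turn.
    # This sketch runs the segments sequentially; overlapping transfers with
    # compute would need asynchronous streams on real hardware.
    out: List[int] = []
    for segment in fission(rows, num_segments):
        out.extend(fused_pipeline(segment, p1, p2))
    return out
```

Because SELECT filters each row independently, the fused and segmented versions return the same rows in the same order as the unfused pipeline. Only the number of passes and the intermediate storage differ.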

Tags: Compilers, Computer science, CUDA, Databases, nVidia, Optimization, Tesla C2070

August 9, 2012 by hgpu
