high performance computing on graphics processing units: hgpu.org


Scalable Kernel Fusion for Memory-Bound GPU Applications

Mohamed Wahib and Naoya Maruyama

RIKEN Advanced Institute for Computational Science, Kobe, Japan

Proceedings of the 2014 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14), November 16-21, 2014, New Orleans


GPU implementations of HPC applications relying on finite difference methods can include tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing data traffic to off-chip memory: kernels that share data arrays are fused into larger kernels, where on-chip cache holds the data reused by instructions originating from different kernels. The main challenges are (a) searching for the optimal kernel fusions under the constraints of data dependencies and kernel precedence, and (b) applying kernel fusion effectively to achieve speedup. This paper introduces a problem definition and proposes a scalable method for searching the space of possible kernel fusions to identify optimal kernel fusions for large problems. The paper also proposes a codeless performance upper-bound projection model to achieve effective fusions. Results show that the proposed scalable method for kernel fusion improved the performance of two real-world applications containing tens of kernels by 1.35x and 1.2x.
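To make the traffic argument in the abstract concrete, here is a minimal Python sketch that counts array-sized off-chip transfers before and after fusing two kernels that share arrays. It is an illustration only and not the paper's search method or projection model. The `Kernel` class, the function names, and the array names `a` to `d` are all hypothetical. The sketch assumes an idealised fusion in which every input array is loaded once, intermediates stay on-chip, and only live-out arrays are stored; it ignores stencil halos, on-chip capacity, and occupancy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    """Hypothetical kernel description: whole arrays read and written."""
    name: str
    reads: frozenset
    writes: frozenset


def unfused_traffic(kernels):
    """Array-sized off-chip transfers when each kernel runs on its own."""
    return sum(len(k.reads) + len(k.writes) for k in kernels)


def fused_traffic(kernels, live_out):
    """Transfers for a single fused kernel (idealised assumption):
    each array is loaded at most once, arrays produced by an earlier
    kernel are reused on-chip, and only live-out arrays are stored."""
    loaded, produced = set(), set()
    for k in kernels:  # kernels are listed in execution order
        loaded |= k.reads - produced
        produced |= k.writes
    return len(loaded) + len(produced & set(live_out))


# Two kernels sharing arrays: k2 consumes c (produced by k1) and re-reads a.
k1 = Kernel("k1", frozenset({"a", "b"}), frozenset({"c"}))
k2 = Kernel("k2", frozenset({"c", "a"}), frozenset({"d"}))
kernels = [k1, k2]

before = unfused_traffic(kernels)       # 6 transfers: (a, b, c) + (c, a, d)
after = fused_traffic(kernels, {"d"})   # 3 transfers: load a, b; store d
ratio = before / after                  # 2.0
```

For a purely memory-bound pair of kernels, the traffic ratio gives a rough ceiling on the speedup that fusion could deliver. The gains reported in the abstract (1.35x and 1.2x) are for whole applications and come from the paper's own method, not from this count.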

Tags: CUDA, Memory level parallelism, nVidia GeForce GTX 750 Ti, Tesla K20, Tesla K40

September 1, 2014 by wahibium
