high performance computing on graphics processing units: hgpu.org

Architectural Considerations for Compiler-guided Unroll-and-Jam of CUDA Kernels

Apan Qasem

Department of Computer Science, Texas State University, San Marcos, Texas, USA

American Journal of Computer Architecture, 1(2), 12-20, 2012

DOI:10.5923/j.ajca.20120102.01

Hundreds of cores per chip and support for fine-grain multithreading have made GPUs a central player in today's HPC world. Much of the responsibility for achieving high performance on these complex systems lies with software such as the compiler. This paper describes a compiler-based strategy for automatic and profitable application of the unroll-and-jam transformation to CUDA kernels. The framework supports specification of unroll factors through source-code annotation and also implements a heuristic, based on register pressure and occupancy, that recommends unroll factors for improved memory performance. We present experimental results for four CUDA kernels on a GeForce 9800 GT. The results show that the proposed strategy is generally able to select profitable unroll factors. The results also indicate that the selected unroll amounts strike the right balance between register pressure and occupancy.
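
For readers unfamiliar with the transformation named in the abstract, the following is a minimal sketch of unroll-and-jam with an unroll factor of 2, written as plain host-side C++ rather than as one of the paper's CUDA kernels. The function names `matvec_baseline` and `matvec_unroll_jam2` are hypothetical and chosen only for illustration; the paper's framework, annotations, and heuristic are not reproduced here. The outer loop is unrolled by two and the two resulting copies of the inner loop are fused ("jammed"), so each `x[j]` is loaded once and reused for two rows.

```cpp
#include <cstddef>
#include <vector>

// Baseline: y = A * x for a row-major n-by-m matrix (illustrative only).
std::vector<double> matvec_baseline(const std::vector<double>& a,
                                    const std::vector<double>& x,
                                    std::size_t n, std::size_t m) {
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < m; ++j) {
            y[i] += a[i * m + j] * x[j];
        }
    }
    return y;
}

// Unroll-and-jam, factor 2: the outer loop advances two rows at a time and
// the two inner-loop bodies are fused into a single inner loop.
std::vector<double> matvec_unroll_jam2(const std::vector<double>& a,
                                       const std::vector<double>& x,
                                       std::size_t n, std::size_t m) {
    std::vector<double> y(n, 0.0);
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        // Two accumulators: the extra live values are the added register cost.
        double acc0 = 0.0;
        double acc1 = 0.0;
        for (std::size_t j = 0; j < m; ++j) {
            const double xj = x[j];  // loaded once, reused for both rows
            acc0 += a[i * m + j] * xj;
            acc1 += a[(i + 1) * m + j] * xj;
        }
        y[i] = acc0;
        y[i + 1] = acc1;
    }
    // Remainder loop: handles the last row when n is odd.
    for (; i < n; ++i) {
        double acc = 0.0;
        for (std::size_t j = 0; j < m; ++j) {
            acc += a[i * m + j] * x[j];
        }
        y[i] = acc;
    }
    return y;
}
```

The same trade-off appears when the transformation is applied inside a GPU kernel: a larger unroll factor increases data reuse but also the number of live values per thread, and higher register use per thread can reduce the number of threads resident on a multiprocessor. That tension is what a heuristic based on register pressure and occupancy has to weigh.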

Tags: Compilers, Computer science, CUDA, Memory model, nVidia, nVidia GeForce 9800 GT, Tesla C2050

November 8, 2012 by hgpu
