high performance computing on graphics processing units: hgpu.org

Improved Programming of GPU Architectures through Automated Data Allocation and Loop Restructuring

Andrea Di Biagio, Giovanni Agosta

Dipartimento di Elettronica e Informazione, Politecnico di Milano, Italy

23rd International Conference on Architecture of Computing Systems (ARCS), 2010

@article{biagio2010improved,
  title={Improved Programming of GPU Architectures through Automated Data Allocation and Loop Restructuring},
  author={Biagio, A.D. and Agosta, G.},
  journal={ARCS 2010},
  year={2010},
  publisher={VDE VERLAG GmbH}
}

The programmability of recent graphics processing unit (GPU) architectures has been the main factor driving the dramatic increase in interest in this class of architectures as low-cost accelerators for a wide range of high-performance applications. Current GPU programming models, such as OpenCL and CUDA, still expose too many architectural features, such as the memory hierarchy, to the programmer. We propose to raise the abstraction level of code by mapping some constructs of the well-known OpenMP parallel programming model onto the dominant CUDA GPU programming model. To this end, we are studying solutions for two main issues: the automated allocation of data on the GPU device memory hierarchy, and the translation of OpenMP parallel loops to CUDA kernels. We report some initial experimental results showing that the transformations are indeed promising.
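
To make the loop-to-kernel idea concrete, here is a minimal sketch in plain C. It is not the authors' translation scheme, which the abstract does not describe. It assumes a hypothetical SAXPY loop and emulates on the host the usual CUDA index mapping, in which each thread handles iteration `blockIdx.x * blockDim.x + threadIdx.x` and a bounds check guards the last, partially filled block. All function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* OpenMP form: the programmer only marks the loop as parallel.
 * (Without OpenMP support the pragma is ignored and the loop runs serially.) */
void saxpy_openmp(size_t n, float a, const float *x, float *y)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; ++i)
        y[i] = a * x[i] + y[i];
}

/* Work that one (block, thread) pair would perform in a kernel derived
 * from the loop above: one loop iteration, selected by the global index. */
static void saxpy_thread(size_t block_idx, size_t block_dim, size_t thread_idx,
                         size_t n, float a, const float *x, float *y)
{
    size_t i = block_idx * block_dim + thread_idx; /* global iteration index */
    if (i < n)                                     /* guard the last block */
        y[i] = a * x[i] + y[i];
}

/* Host-side emulation of a 1-D grid launch: enough blocks to cover n
 * iterations, each block running block_dim threads. On a GPU these two
 * loops would be replaced by the hardware scheduling of blocks and threads. */
void saxpy_grid_emulated(size_t n, size_t block_dim, float a,
                         const float *x, float *y)
{
    assert(block_dim > 0);
    size_t grid_dim = (n + block_dim - 1) / block_dim; /* ceil(n / block_dim) */
    for (size_t b = 0; b < grid_dim; ++b)
        for (size_t t = 0; t < block_dim; ++t)
            saxpy_thread(b, block_dim, t, n, a, x, y);
}

/* Returns 1 when the loop form and the emulated grid form agree. */
int saxpy_forms_agree(size_t block_dim)
{
    enum { N = 1000 };
    float x[N], y_loop[N], y_grid[N];
    for (int i = 0; i < N; ++i) {
        x[i] = (float)i;
        y_loop[i] = 1.0f;
        y_grid[i] = 1.0f;
    }
    saxpy_openmp(N, 2.0f, x, y_loop);
    saxpy_grid_emulated(N, block_dim, 2.0f, x, y_grid);
    for (int i = 0; i < N; ++i)
        if (y_loop[i] != y_grid[i])
            return 0;
    return 1;
}
```

The sketch covers only the iteration-space mapping. The other issue named in the abstract, placing data in the GPU memory hierarchy, is not modeled here.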

Tags: Computer science, CUDA, nVidia, OpenMP, Programming techniques

June 21, 2011 by hgpu
