
Implementing a GPU Programming Model on a non-GPU Accelerator Architecture

Stephen M. Kofsky, Daniel R. Johnson, John A. Stratton, Wen-Mei W. Hwu, Sanjay J. Patel, Steven S. Lumetta
University of Illinois at Urbana-Champaign, Urbana IL 61801, USA
1st Workshop on Applications for Multi and Many Core Processors (A4MMC 2010), Saint Malo, France, 2010. inria-00493905

@inproceedings{kofsky2010implementing,
   title     = {Implementing a GPU Programming Model on a non-GPU Accelerator Architecture},
   author    = {Kofsky, Stephen M. and Johnson, Daniel R. and Stratton, John A. and Hwu, Wen-Mei W. and Patel, Sanjay J. and Lumetta, Steven S.},
   booktitle = {1st Workshop on Applications for Multi and Many Core Processors (A4MMC 2010)},
   address   = {Saint Malo, France},
   year      = {2010}
}

Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing called Rigel. We demonstrate performance improvements with these optimizations over naive translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.
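As a rough illustration (not taken from the paper), the kernel below sketches the style of CUDA code the abstract refers to: coalesced global memory accesses, shared-memory staging, and __syncthreads() barriers are the kinds of GPU-oriented constraints that a translation to a MIMD accelerator such as Rigel must handle. The kernel name, tile size, and launch configuration are hypothetical.

// Hypothetical sketch: a block-level reduction written in a GPU-friendly style.
// Assumes a launch with blockDim.x == 256.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];               // one element per thread in the block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;  // unit-stride, coalesced global load
    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                          // block-wide barrier before the tree reduction

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();                      // barrier between reduction steps
    }
    if (tid == 0) out[blockIdx.x] = tile[0];  // one partial sum per thread block
}

The barrier-delimited phases and per-block shared data in code like this are what a CUDA-to-MIMD translation would map onto Rigel's cores and memory hierarchy; the paper's optimizations target exactly such patterns.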
