high performance computing on graphics processing units: hgpu.org

hgpu.org » Applications » Computer science » Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

Jing Wu

University of Maryland

University of Maryland, 2014

DOI:10.13016/M22C83

@article{wu2014optimization,

title={Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms},

author={Wu, Jing},

year={2014}

}

Download (PDF)

View

Source

2865

views

An emerging trend in processor architecture seems to indicate the doubling of the number of cores per chip every two years with same or decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. The main goal of this dissertation is to develop optimization techniques for mapping algorithms and applications onto CUDA GPUs and CPU-GPU heterogeneous platforms. The Fast Fourier transform (FFT) constitutes a fundamental tool in computational science and engineering, and hence a GPU-optimized implementation is of paramount importance. We first study the mapping of the 3D FFT onto the recent, CUDA GPUs and develop a new approach that minimizes the number of global memory accesses and overlaps the computations along the different dimensions. We obtain some of the fastest known implementations for the computation of multi-dimensional FFT. We then present a highly multithreaded FFT-based direct Poisson solver that is optimized for the recent NVIDIA GPUs. In addition to the massive multithreading, our algorithm carefully manages the multiple layers of the memory hierarchy so that all global memory accesses are coalesced into 128-bytes device memory transactions. As a result, we have achieved up to 375GFLOPS with a bandwidth of 120GB/s on the GTX 480. We further extend our methodology to deal with CPU-GPU based heterogeneous platforms for the case when the input is too large to fit on the GPU global memory. We develop optimization techniques for memory-bound, and computation-bound application. The main challenge here is to minimize data transfer between the CPU memory and the device memory and to overlap as much as possible these transfers with kernel execution. For memory-bounded applications, we achieve a near-peak effective PCIe bus bandwidth, 9-10GB/s and performance as high as 145 GFLOPS for multi-dimensional FFT computations and for solving the Poisson equation. We extend our CPU-GPU based software pipeline to a computation-bound application-DGEMM, and achieve the illusion of a memory of the CPU memory size and a computation throughput similar to a pure GPU.

Tags: Computer science, CUDA, FFT, Heterogeneous systems, Laplace and Poisson equation, nVidia, nVidia GeForce GTX 280, nVidia GeForce GTX 480, Poisson equation, Tesla C1060, Tesla C2050, Thesis

October 22, 2014 by hgpu

No votes yet.

Please wait...

Your response

You must be logged in to post a comment.

* * *

high performance computing on graphics processing units: hgpu.org

Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

Your response

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

Most viewed papers (last 30 days)

Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

Share this:

Your response

Recent source codes

Most viewed papers (last 30 days)