high performance computing on graphics processing units: hgpu.org


Unlocking Bandwidth for GPUs in CC-NUMA Systems

Neha Agarwal, David Nellans, Mike O’Connor, Stephen W. Keckler, Thomas F. Wenisch

University of Michigan

2015 International Symposium on High Performance Computer Architecture (HPCA), 2015

@article{agarwal2015unlocking,
  title={Unlocking Bandwidth for GPUs in CC-NUMA Systems},
  author={Agarwal, Neha and Nellans, David and O’Connor, Mike and Keckler, Stephen W and Wenisch, Thomas F},
  year={2015}
}


Historically, GPU-based HPC applications have had a substantial memory bandwidth advantage over CPU-based workloads because GPUs use GDDR rather than DDR memory. However, past GPUs required a restricted programming model in which application data was allocated up front and explicitly copied into GPU memory by the programmer before launching a GPU kernel. Recently, GPUs have eased this requirement and can now employ on-demand software page migration between CPU and GPU memory to obviate explicit copying. In the near future, CC-NUMA GPU-CPU systems will appear in which software page migration is optional and hardware cache coherence can also support the GPU accessing CPU memory directly. In this work, we describe the trade-offs and considerations in relying on hardware cache-coherence mechanisms versus using software page migration to optimize the performance of memory-intensive GPU workloads. We show that page migration decisions based on page access frequency alone are a poor solution, and that a broader solution is required to maximize performance: one that uses virtual address-based program locality to enable aggressive memory prefetching, combined with bandwidth balancing. We present a software runtime system requiring minimal hardware support that, on average, outperforms CC-NUMA-based accesses by 1.95x, performs 6% better than the legacy CPU-to-GPU memcpy regime by intelligently using both CPU and GPU memory bandwidth, and comes within 28% of oracular page placement, all while maintaining the relaxed memory semantics of modern GPUs.
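The intuition behind bandwidth balancing can be illustrated with a small back-of-the-envelope model. The sketch below is not the authors' runtime system; it is a simplified illustration that assumes accesses are spread uniformly over all pages and that GPU and CPU memory can be streamed fully in parallel. The function names and the bandwidth figures (200 GB/s and 80 GB/s) are hypothetical values chosen for the example, not numbers from the paper. Under those assumptions, streaming time is minimized when the fraction of data placed in GPU memory equals the GPU's share of the total bandwidth.

```python
def transfer_time(total_gb, gpu_fraction, gpu_bw, cpu_bw):
    """Seconds to stream total_gb when both memories are read in parallel.

    Assumes uniform access over all data, so the slower side sets the time.
    """
    gpu_time = total_gb * gpu_fraction / gpu_bw
    cpu_time = total_gb * (1.0 - gpu_fraction) / cpu_bw
    return max(gpu_time, cpu_time)


def balanced_gpu_fraction(gpu_bw, cpu_bw):
    """Fraction of data to place in GPU memory so both sides finish together."""
    return gpu_bw / (gpu_bw + cpu_bw)


# Hypothetical bandwidths in GB/s (illustrative assumptions only).
GPU_BW = 200.0
CPU_BW = 80.0
DATA_GB = 28.0

fraction = balanced_gpu_fraction(GPU_BW, CPU_BW)
all_in_gpu = transfer_time(DATA_GB, 1.0, GPU_BW, CPU_BW)
balanced = transfer_time(DATA_GB, fraction, GPU_BW, CPU_BW)

print(f"GPU share of pages: {fraction:.3f}")
print(f"all data in GPU memory: {all_in_gpu:.3f} s")
print(f"bandwidth-balanced:     {balanced:.3f} s")
```

In this toy model, leaving part of the data in CPU memory is faster than migrating everything to the GPU, because the CPU memory bandwidth would otherwise sit idle. Real workloads have non-uniform access patterns and migration costs, which this sketch ignores.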

Tags: Computer science, CUDA, GPGPU-sim, nVidia, nVidia GeForce GTX 480, Prefetch

February 6, 2015 by hgpu
