
Runtime Support toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems

Feng Ji
North Carolina State University
North Carolina State University, 2013

@phdthesis{ji2013runtime,
   title={Runtime Support Toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems},
   author={Ji, Feng},
   school={North Carolina State University},
   year={2013}
}

The GPU has become a popular parallel accelerator in modern heterogeneous systems thanks to its massive parallelism and superior energy efficiency. However, it also greatly complicates programming the memory system of such heterogeneous systems, due to the disjoint memory spaces of the CPU and the GPU and the two-level memory hierarchy within the GPU itself. The complexity of this memory system is fully exposed to programmers, who must manually move data to the right place in memory for correctness, and must tune data layout and locality to match the access patterns of multi-threaded parallel code for performance. In this Ph.D. thesis study, we approach this problem by providing runtime system support that aims at both easier programming and better performance. Specifically, we present two such runtime software approaches, one within the scope of a programming model and one as general system software.

With the first approach, we focus on two popular programming models, MapReduce and MPI. For GPU-based MapReduce, we provide a transparent GPU memory hierarchy to MapReduce developers and improve performance by buffering data in the GPU's shared memory, a small on-chip scratch-pad memory. On a system powered by an Nvidia GTX 280 GPU, our MapReduce outperforms a previous shared-memory-oblivious MapReduce, with a Map-phase speedup of 2.67x on average. For MPI, we extend its interface so that GPU memory buffers can be used directly in communication, and we optimize such GPU-involved intra-node communication by pipelining CPU-GPU data movement with inter-process communication and by GPU DMA-assisted data movement. Compared to manually mixing GPU data movement with MPI communication, on a multi-core system equipped with three Nvidia Tesla Fermi GPUs, we show up to 2x bandwidth speedup through pipelining and an average 4.3% improvement in the total execution time of a halo-exchange benchmark; our DMA-assisted intra-node data communication further shows up to 1.4x bandwidth speedup between nearby GPUs and a further 4.7% improvement on the benchmark.

With the second approach, we present the design of Region-based Software Virtual Memory (RSVM), a software virtual memory running on both the CPU and the GPU in an asynchronous and cooperative way. In addition to automatic GPU memory management and GPU-CPU data transfer, RSVM offers two novel features: 1) GPU kernel-issued on-demand data fetching from the host into GPU memory, and 2) intra-kernel transparent swapping of GPU memory into main memory. Our study reveals important insights into the challenges and opportunities of building unified virtual memory systems for heterogeneous computing. Experimental results on real GPU benchmarks demonstrate that, though it incurs a small overhead, RSVM can transparently scale GPU kernels to large problem sizes exceeding the device memory limit. It allows developers to write the same code for different problem sizes and to further optimize data layout accordingly. Our evaluation also identifies GPU architecture features whose absence limits system software efficiency.
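To illustrate the style of interface the MPI portion of this work targets, the sketch below contrasts manually staging a GPU buffer through host memory with handing the device pointer to MPI directly. It is a minimal illustration written against a generic CUDA-aware MPI build (for example, MVAPICH2 or Open MPI with CUDA support); the thesis's own extended interface and its pipelined and DMA-assisted implementation are not shown, and the buffer size and message tags are arbitrary.

```c
/* Sketch: manual host staging vs. passing a GPU buffer directly to MPI.
 * Assumes a CUDA-aware MPI implementation; not the thesis's own runtime. */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdlib.h>

#define N (1 << 20)   /* illustrative element count */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *d_buf;                        /* buffer resident in GPU memory */
    cudaMalloc((void **)&d_buf, N * sizeof(double));

    /* Manual approach: explicitly stage through a host buffer. */
    double *h_buf = (double *)malloc(N * sizeof(double));
    if (rank == 0) {
        cudaMemcpy(h_buf, d_buf, N * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(h_buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        cudaMemcpy(d_buf, h_buf, N * sizeof(double), cudaMemcpyHostToDevice);
    }

    /* GPU-buffer approach: the device pointer is handed to MPI directly,
     * leaving staging, pipelining, or DMA to the runtime. */
    if (rank == 0)
        MPI_Send(d_buf, N, MPI_DOUBLE, 1, 1, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, N, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(h_buf);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```

In the second form the runtime, rather than the application, decides how to move the data between GPU and host, which is where optimizations such as pipelining CPU-GPU copies with inter-process communication or DMA-assisted transfers between nearby GPUs can be applied transparently.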