Runtime Support toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems
Feng Ji
North Carolina State University
North Carolina State University, 2013
@phdthesis{ji2013runtime,
  title={Runtime Support Toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems},
  author={Ji, Feng},
  school={North Carolina State University},
  year={2013}
}
GPUs have become popular parallel accelerators in modern heterogeneous systems thanks to their massive parallelism and superior energy efficiency. However, they also greatly complicate programming the memory system of such heterogeneous systems: the CPU and the GPU have disjoint memory spaces, and the GPU itself adds a two-level memory hierarchy. This complexity is fully exposed to programmers, who must manually move data to the right place in memory for correctness, and must reason about data layout and locality to match the access patterns of multi-threaded parallel code for performance. In this Ph.D. thesis, we approach this problem by providing runtime system support that aims at both easier programming and better performance. Specifically, we present two such runtime software approaches: one within the scope of a programming model, and one as general system software.

With the first approach, we focus on two popular programming models, MapReduce and MPI. For GPU-based MapReduce, we provide developers with a transparent GPU memory hierarchy and improve performance by buffering data in the GPU's shared memory, a small on-chip scratch-pad memory. On a system powered by an Nvidia GTX 280 GPU, our MapReduce outperforms a previous shared-memory-oblivious MapReduce, with a Map-phase speedup of 2.67x on average. For MPI, we extend the interface so that GPU memory buffers can be passed directly to communication calls, and we optimize GPU-involved intra-node communication by pipelining CPU-GPU data movement with inter-process communication and by using GPU DMA-assisted data movement. Compared with manually mixing GPU data movement and MPI communication on a multi-core system equipped with three Nvidia Tesla Fermi GPUs, pipelining achieves up to 2x bandwidth speedup and an average 4.3% improvement in the total execution time of a halo-exchange benchmark; DMA-assisted intra-node communication adds up to 1.4x bandwidth speedup between nearby GPUs and a further 4.7% improvement on the benchmark.

With the second approach, we present the design of Region-based Software Virtual Memory (RSVM), a software virtual memory that runs on both the CPU and the GPU in an asynchronous, cooperative way. In addition to automatic GPU memory management and GPU-CPU data transfer, RSVM offers two novel features: 1) on-demand fetching of data from the host into GPU memory, issued by running GPU kernels, and 2) transparent intra-kernel swapping of GPU memory out to main memory. Our study reveals important insights into the challenges and opportunities of building unified virtual memory systems for heterogeneous computing. Experimental results on real GPU benchmarks demonstrate that, although it incurs a small overhead, RSVM transparently scales GPU kernels to problem sizes exceeding the device memory limit. It lets developers write the same code for different problem sizes and then optimize the data layout definition accordingly. Our evaluation also identifies GPU architecture features whose absence limits system software efficiency.
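To illustrate the shared-memory buffering idea, here is a minimal CUDA sketch, not the thesis's implementation: each thread block stages its Map outputs in on-chip shared memory and flushes them to global memory in one coalesced batch, so most emits cost a shared-memory atomic instead of a global one. The map_square kernel, its trivial "emit the squared value" Map function, and the buffer sizing are all invented for illustration.

// Minimal sketch of shared-memory output buffering for a Map phase.
// All names here are illustrative, not the thesis's actual code.
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256   // one staging slot per thread is enough here

__global__ void map_square(const int *in, int n, int *out, int *out_count) {
    __shared__ int buf[THREADS];   // on-chip staging buffer
    __shared__ int buf_n;          // staged-result count
    __shared__ int base;           // this block's slice of the output
    if (threadIdx.x == 0) buf_n = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int slot = atomicAdd(&buf_n, 1);   // cheap shared-memory atomic
        buf[slot] = in[i] * in[i];         // the "Map" computation
    }
    __syncthreads();

    // One global atomic per block reserves room; the flush is coalesced.
    if (threadIdx.x == 0) base = atomicAdd(out_count, buf_n);
    __syncthreads();
    for (int j = threadIdx.x; j < buf_n; j += blockDim.x)
        out[base + j] = buf[j];
}

int main() {
    const int n = 1000;
    int h_in[n], h_cnt = 0;
    for (int i = 0; i < n; ++i) h_in[i] = i;
    int *d_in, *d_out, *d_cnt;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_cnt, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cnt, &h_cnt, sizeof(int), cudaMemcpyHostToDevice);
    map_square<<<(n + THREADS - 1) / THREADS, THREADS>>>(d_in, n, d_out, d_cnt);
    cudaMemcpy(&h_cnt, d_cnt, sizeof(int), cudaMemcpyDeviceToHost);
    printf("emitted %d values\n", h_cnt);   // expect 1000
    return 0;
}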
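The pipelined intra-node communication can likewise be sketched with standard MPI and CUDA calls (the thesis's extended MPI interface, which accepts device pointers directly, is not reproduced here). In this assumed double-buffering scheme, the device-to-host copy of chunk k overlaps with the MPI send of chunk k-1; the chunk size and the pipelined_send helper are illustrative choices.

// Sketch: overlap CPU-GPU staging copies with MPI sends via double buffering.
// Assumes MPI_Init has been called and d_buf is valid device memory.
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   // 1 MiB pipeline chunk (illustrative)

void pipelined_send(const char *d_buf, size_t bytes, int dst, MPI_Comm comm) {
    char *stage[2];
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost(&stage[0], CHUNK);   // pinned for asynchronous copies
    cudaMallocHost(&stage[1], CHUNK);

    MPI_Request req = MPI_REQUEST_NULL;
    int buf = 0;
    for (size_t off = 0; off < bytes; off += CHUNK) {
        size_t len = (bytes - off < CHUNK) ? bytes - off : CHUNK;
        // Stage chunk k on the GPU copy engine...
        cudaMemcpyAsync(stage[buf], d_buf + off, len,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        // ...while the previous chunk's send is still in flight.
        if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Isend(stage[buf], (int)len, MPI_CHAR, dst, 0, comm, &req);
        buf ^= 1;   // switch staging buffers
    }
    if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);
    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
    cudaStreamDestroy(stream);
}

A receiver would mirror this with MPI_Irecv plus host-to-device cudaMemcpyAsync; device buffer setup and MPI initialization are omitted for brevity.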
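RSVM itself runs cooperatively on the CPU and GPU and can swap memory from inside a running kernel, which a short sketch cannot reproduce. The following host-side caricature, with invented rsvm_* names that are not RSVM's published API, only conveys the region bookkeeping idea: device memory for a region is committed and filled on demand, and a region can be evicted back to host memory so that the total working set may exceed the device memory size.

// Hypothetical host-side caricature of region-based memory bookkeeping.
#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct {
    void  *host;      /* host copy of the region's data */
    void  *dev;       /* device copy, allocated lazily */
    size_t size;
    int    on_device; /* 1 if the device copy is current */
} rsvm_region;        /* invented type, for illustration only */

/* Create a region backed by host memory; no GPU memory is committed yet. */
rsvm_region rsvm_create(size_t size) {
    rsvm_region r = { malloc(size), NULL, size, 0 };
    return r;
}

/* On-demand fetch before a kernel uses the region: allocate device memory
 * and copy only if the device copy is missing or stale. */
void *rsvm_acquire_device(rsvm_region *r) {
    if (!r->dev) cudaMalloc(&r->dev, r->size);
    if (!r->on_device) {
        cudaMemcpy(r->dev, r->host, r->size, cudaMemcpyHostToDevice);
        r->on_device = 1;
    }
    return r->dev;
}

/* Evict the region to make room on the GPU: write it back to host memory
 * and release the device copy (RSVM does this transparently, even within
 * a running kernel; here it can only happen between kernel launches). */
void rsvm_evict(rsvm_region *r) {
    if (!r->on_device) return;
    cudaMemcpy(r->host, r->dev, r->size, cudaMemcpyDeviceToHost);
    cudaFree(r->dev);
    r->dev = NULL;
    r->on_device = 0;
}

A program would acquire each region a kernel touches, launch, and evict cold regions when device allocation fails, roughly approximating the swapping behavior that RSVM provides transparently.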
December 11, 2013 by hgpu