Runtime Support toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems
Feng Ji
North Carolina State University
North Carolina State University, 2013
@phdthesis{ji2013runtime,
  title={Runtime Support Toward Transparent Memory Access in GPU-accelerated Heterogeneous Systems},
  author={Ji, Feng},
  school={North Carolina State University},
  year={2013}
}
GPUs have become popular parallel accelerators in modern heterogeneous systems thanks to their massive parallelism and superior energy efficiency. However, they also greatly complicate programming the memory system of such heterogeneous systems: the CPU and the GPU have disjoint memory spaces, and the GPU itself adds a two-level memory hierarchy. This complexity is fully exposed to programmers, who must manually move data to the right place in memory for correctness, and must reason about data layout and locality to match the access patterns of multi-threaded parallel code for performance. In this Ph.D. thesis, we approach this problem by providing runtime system support that aims at both easier programming and better performance. Specifically, we present two such runtime software approaches: one within the scope of a programming model, and one as general system software.

With the first approach, we focus on two popular programming models, MapReduce and MPI. For GPU-based MapReduce, we provide developers with a transparent GPU memory hierarchy and improve performance by buffering data in the GPU's shared memory, a small on-chip scratch-pad memory. On a system powered by an Nvidia GTX 280 GPU, our MapReduce outperforms a previous shared-memory-oblivious MapReduce, with a Map-phase speedup of 2.67x on average. For MPI, we extend the interface so that GPU memory buffers can be passed directly to communication calls, and we optimize GPU-involved intra-node communication by pipelining CPU-GPU data movement with inter-process communication and by using GPU DMA-assisted data movement. Compared with manually mixing GPU data movement and MPI communication on a multi-core system equipped with three Nvidia Tesla Fermi GPUs, pipelining achieves up to 2x bandwidth speedup and an average 4.3% improvement in the total execution time of a halo-exchange benchmark; DMA-assisted intra-node communication adds up to 1.4x bandwidth speedup between nearby GPUs and a further 4.7% improvement on the benchmark.

With the second approach, we present the design of Region-based Software Virtual Memory (RSVM), a software virtual memory that runs on both the CPU and the GPU in an asynchronous, cooperative way. In addition to automatic GPU memory management and GPU-CPU data transfer, RSVM offers two novel features: 1) on-demand fetching of data from the host into GPU memory, issued by running GPU kernels, and 2) transparent intra-kernel swapping of GPU memory out to main memory. Our study reveals important insights into the challenges and opportunities of building unified virtual memory systems for heterogeneous computing. Experimental results on real GPU benchmarks demonstrate that, although it incurs a small overhead, RSVM transparently scales GPU kernels to problem sizes exceeding the device memory limit. It lets developers write the same code for different problem sizes and then optimize the data layout definition accordingly. Our evaluation also identifies GPU architecture features whose absence limits system software efficiency.
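To illustrate the shared-memory buffering idea, here is a minimal CUDA sketch, not the thesis's implementation: each thread block stages its Map outputs in on-chip shared memory and flushes them to global memory in one coalesced batch, so most emits cost a shared-memory atomic instead of a global one. The map_square kernel, its trivial "emit the squared value" Map function, and the buffer sizing are all invented for illustration.

// Minimal sketch of shared-memory output buffering for a Map phase.
// All names here are illustrative, not the thesis's actual code.
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS 256   // one staging slot per thread is enough here

__global__ void map_square(const int *in, int n, int *out, int *out_count) {
    __shared__ int buf[THREADS];   // on-chip staging buffer
    __shared__ int buf_n;          // staged-result count
    __shared__ int base;           // this block's slice of the output
    if (threadIdx.x == 0) buf_n = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int slot = atomicAdd(&buf_n, 1);   // cheap shared-memory atomic
        buf[slot] = in[i] * in[i];         // the "Map" computation
    }
    __syncthreads();

    // One global atomic per block reserves room; the flush is coalesced.
    if (threadIdx.x == 0) base = atomicAdd(out_count, buf_n);
    __syncthreads();
    for (int j = threadIdx.x; j < buf_n; j += blockDim.x)
        out[base + j] = buf[j];
}

int main() {
    const int n = 1000;
    int h_in[n], h_cnt = 0;
    for (int i = 0; i < n; ++i) h_in[i] = i;
    int *d_in, *d_out, *d_cnt;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMalloc(&d_cnt, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cnt, &h_cnt, sizeof(int), cudaMemcpyHostToDevice);
    map_square<<<(n + THREADS - 1) / THREADS, THREADS>>>(d_in, n, d_out, d_cnt);
    cudaMemcpy(&h_cnt, d_cnt, sizeof(int), cudaMemcpyDeviceToHost);
    printf("emitted %d values\n", h_cnt);   // expect 1000
    return 0;
}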
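The pipelined intra-node communication can likewise be sketched with standard MPI and CUDA calls (the thesis's extended MPI interface, which accepts device pointers directly, is not reproduced here). In this assumed double-buffering scheme, the device-to-host copy of chunk k overlaps with the MPI send of chunk k-1; the chunk size and the pipelined_send helper are illustrative choices.

// Sketch: overlap CPU-GPU staging copies with MPI sends via double buffering.
// Assumes MPI_Init has been called and d_buf is valid device memory.
#include <mpi.h>
#include <cuda_runtime.h>

#define CHUNK (1 << 20)   // 1 MiB pipeline chunk (illustrative)

void pipelined_send(const char *d_buf, size_t bytes, int dst, MPI_Comm comm) {
    char *stage[2];
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaMallocHost(&stage[0], CHUNK);   // pinned for asynchronous copies
    cudaMallocHost(&stage[1], CHUNK);

    MPI_Request req = MPI_REQUEST_NULL;
    int buf = 0;
    for (size_t off = 0; off < bytes; off += CHUNK) {
        size_t len = (bytes - off < CHUNK) ? bytes - off : CHUNK;
        // Stage chunk k on the GPU copy engine...
        cudaMemcpyAsync(stage[buf], d_buf + off, len,
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
        // ...while the previous chunk's send is still in flight.
        if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);
        MPI_Isend(stage[buf], (int)len, MPI_CHAR, dst, 0, comm, &req);
        buf ^= 1;   // switch staging buffers
    }
    if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);
    cudaFreeHost(stage[0]);
    cudaFreeHost(stage[1]);
    cudaStreamDestroy(stream);
}

A receiver would mirror this with MPI_Irecv plus host-to-device cudaMemcpyAsync; device buffer setup and MPI initialization are omitted for brevity.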
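RSVM itself runs cooperatively on the CPU and GPU and can swap memory from inside a running kernel, which a short sketch cannot reproduce. The following host-side caricature, with invented rsvm_* names that are not RSVM's published API, only conveys the region bookkeeping idea: device memory for a region is committed and filled on demand, and a region can be evicted back to host memory so that the total working set may exceed the device memory size.

// Hypothetical host-side caricature of region-based memory bookkeeping.
#include <stdlib.h>
#include <cuda_runtime.h>

typedef struct {
    void  *host;      /* host copy of the region's data */
    void  *dev;       /* device copy, allocated lazily */
    size_t size;
    int    on_device; /* 1 if the device copy is current */
} rsvm_region;        /* invented type, for illustration only */

/* Create a region backed by host memory; no GPU memory is committed yet. */
rsvm_region rsvm_create(size_t size) {
    rsvm_region r = { malloc(size), NULL, size, 0 };
    return r;
}

/* On-demand fetch before a kernel uses the region: allocate device memory
 * and copy only if the device copy is missing or stale. */
void *rsvm_acquire_device(rsvm_region *r) {
    if (!r->dev) cudaMalloc(&r->dev, r->size);
    if (!r->on_device) {
        cudaMemcpy(r->dev, r->host, r->size, cudaMemcpyHostToDevice);
        r->on_device = 1;
    }
    return r->dev;
}

/* Evict the region to make room on the GPU: write it back to host memory
 * and release the device copy (RSVM does this transparently, even within
 * a running kernel; here it can only happen between kernel launches). */
void rsvm_evict(rsvm_region *r) {
    if (!r->on_device) return;
    cudaMemcpy(r->host, r->dev, r->size, cudaMemcpyDeviceToHost);
    cudaFree(r->dev);
    r->dev = NULL;
    r->on_device = 0;
}

A program would acquire each region a kernel touches, launch, and evict cold regions when device allocation fails, roughly approximating the swapping behavior that RSVM provides transparently.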
December 11, 2013 by hgpu