An Investigation of Unified Memory Access Performance in CUDA
Electrical and Computer Engineering Department, Boston University, Boston, MA, USA
IEEE High Performance Extreme Computing Conference (HPEC), 2014
@inproceedings{landaverde2014investigation,
title={An Investigation of Unified Memory Access Performance in CUDA},
author={Landaverde, Raphael and Zhang, Tiansheng and Coskun, Ayse K. and Herbordt, Martin},
booktitle={IEEE High Performance Extreme Computing Conference (HPEC)},
year={2014}
}
Managing memory between the CPU and GPU is a major challenge in GPU computing. NVIDIA recently introduced a programming model, Unified Memory Access (UMA), to simplify memory management while claiming good overall performance. In this paper, we investigate this programming model and evaluate both its performance and its programming-model simplifications based on our experimental results. We find that, beyond on-demand data transfers to the CPU, the GPU is also able to request on demand only the subsets of data it requires. This feature allows UMA to outperform full data transfer methods for certain parallel applications and small data sizes. We also find, however, that for the majority of applications and memory access patterns, the performance overheads associated with UMA are significant, while the simplifications to the programming model restrict the flexibility to add future optimizations.
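As context (this sketch is not from the paper itself), the two approaches the abstract contrasts can be illustrated with a minimal CUDA program: explicit management allocates separate host and device buffers and copies the full array with `cudaMemcpy`, while UMA uses a single `cudaMallocManaged` allocation that both CPU and GPU access through one pointer, with the runtime migrating pages on demand. The kernel and sizes here are illustrative only.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Toy kernel: scale each element of x by a.
__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // --- Explicit management: separate buffers, full copies each way. ---
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);   // bulk transfer in
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);   // bulk transfer out
    cudaFree(d);
    free(h);

    // --- UMA: one managed allocation shared by CPU and GPU. ---
    float *m;
    cudaMallocManaged(&m, bytes);
    for (int i = 0; i < n; i++) m[i] = 1.0f;  // CPU writes through the same pointer
    scale<<<(n + 255) / 256, 256>>>(m, 2.0f, n);
    cudaDeviceSynchronize();                  // must synchronize before CPU reads
    printf("m[0] = %f\n", m[0]);              // CPU reads; pages migrate on demand
    cudaFree(m);
    return 0;
}
```

The managed version removes both `cudaMemcpy` calls and the duplicate host buffer, which is the simplification the paper evaluates; the paper's finding is that this convenience usually carries a measurable performance overhead.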