Architectural Support for Virtual Memory in GPUs

Bharath Subramanian Pichai
Rutgers University, Graduate School – New Brunswick
Rutgers University, 2013


@phdthesis{pichai2013,
   title={Architectural support for virtual memory in GPUs},
   author={Pichai, Bharath},
   year={2013},
   school={Rutgers University-Graduate School-New Brunswick}
}





The proliferation of heterogeneous compute platforms, of which CPU/GPU is a prevalent example, necessitates a manageable programming model to ensure widespread adoption. A key component of this is a shared unified address space between the heterogeneous units, which provides the programmability benefits of virtual memory. Indeed, processor vendors have already begun embracing heterogeneous systems with unified address spaces (e.g., Intel's Haswell, AMD's Berlin processor, and ARM's Mali and Cortex cores).

We are the first to explore GPU Translation Lookaside Buffers (TLBs) and page table walkers for address translation in the context of shared virtual memory for heterogeneous systems. To exploit the programmability benefits of shared virtual memory, it is natural to mirror CPUs and place TLBs prior (or parallel) to cache accesses, making caches physically addressed. We show the performance challenges of such an approach and propose modest hardware augmentations that recover much of the lost performance. We then consider the impact of this approach on the design of general-purpose GPU performance improvement schemes, examining (1) warp scheduling to increase cache hit rates, and (2) dynamic warp formation to mitigate control-flow divergence overheads.

In the CPU world, the programmability benefits of address translation and physically addressed caches have outweighed their performance overheads. We find that while cache-parallel address translation on GPUs does introduce non-trivial performance overheads, modestly TLB-aware designs can move performance losses into a range deemed acceptable in the CPU world. We presume this stake-in-the-ground design leaves room for improvement, but hope the larger result, that a little TLB-awareness goes a long way in GPUs, sets the stage for future work in this fruitful area.
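The translation path the abstract refers to, a TLB probed before (or in parallel with) the cache access, with a page table walk on a miss, can be sketched in software. The sketch below is a minimal, illustrative model, not the thesis's design: the class and names are our own assumptions, the TLB is direct-mapped, and a simple walk counter stands in for the latency cost of the page table walker.

```python
PAGE_SHIFT = 12  # assume 4 KiB pages

class TLB:
    """Tiny direct-mapped TLB model (illustrative sketch only)."""

    def __init__(self, entries, page_table):
        self.entries = entries
        self.slots = [None] * entries  # each slot holds (vpn, pfn)
        self.page_table = page_table   # stand-in page table: dict vpn -> pfn
        self.hits = 0
        self.walks = 0                 # each walk models a costly page table walk

    def translate(self, vaddr):
        """Return the physical address for vaddr, walking on a TLB miss."""
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        idx = vpn % self.entries
        slot = self.slots[idx]
        if slot is not None and slot[0] == vpn:
            self.hits += 1
            pfn = slot[1]
        else:
            # TLB miss: invoke the (modeled) page table walker and refill.
            self.walks += 1
            pfn = self.page_table[vpn]
            self.slots[idx] = (vpn, pfn)
        return (pfn << PAGE_SHIFT) | offset

# A warp of 32 lanes issuing stride-4 accesses touches a single page,
# so one page table walk services all 32 translations.
page_table = {0: 7, 1: 9}
tlb = TLB(entries=16, page_table=page_table)
phys = [tlb.translate(lane * 4) for lane in range(32)]
print(tlb.walks, tlb.hits)  # 1 walk, 31 hits
```

The coalesced-warp case at the end hints at why modest TLB-awareness can go a long way on GPUs: when a warp's accesses fall on one page, a single walk covers every lane, whereas divergent accesses can trigger many concurrent walks.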

* * *


HGPU group © 2010-2021 hgpu.org

All rights belong to the respective authors
