Random Address Permute-Shift Technique for the Shared Memory on GPUs
Department of Information Engineering, Hiroshima University
International Conference on Parallel Processing Workshops, pp. 429-438, 2014
@inproceedings{nakano2014random,
title={Random Address Permute-Shift Technique for the Shared Memory on GPUs},
author={Nakano, Koji and Matsumae, Susumu and Ito, Yasuaki},
booktitle={International Conference on Parallel Processing Workshops},
pages={429--438},
year={2014}
}
The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access to the shared memory of a streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and the w threads in a warp try to access them at the same time. However, memory access requests destined for the same memory bank are processed sequentially. Hence, when developing efficient algorithms, it is very important to reduce the memory access congestion, i.e., the maximum number of memory access requests destined for the same bank. The main contribution of this paper is to present a novel algorithmic technique called the random address permute-shift (RAP) technique that reduces the memory access congestion. We show that the RAP technique reduces the memory access congestion to O(log w/log log w) for any memory access requests, including malicious ones, by a warp of w threads. Also, we can guarantee that the congestion is 1 both for contiguous access and for stride access. The simulation results for w=32 show that the expected congestion for any memory access is only 3.53. Since malicious memory access requests destined for the same bank incur a congestion of 32, our RAP technique substantially reduces the memory access congestion. We have also applied the RAP technique to matrix transpose algorithms. The experimental results on a GeForce GTX TITAN show that the RAP technique is practical and can accelerate a direct matrix transpose algorithm by a factor of 10.
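The idea of redirecting each address to a bank through a randomized permute-shift mapping can be sketched in a short host-side simulation. This is a minimal illustration under our own simplifying assumptions, not the paper's exact construction: we map address a to bank (p[a mod w] + s[(a div w) mod w]) mod w, where p is a random permutation of the column index and s is a random per-row shift that is itself a permutation, so that both a contiguous access and a stride-w access by a warp of w threads hit w distinct banks.

```python
import random

W = 32  # number of memory banks / threads per warp, as in the paper

# Hypothetical permute-shift mapping (our assumption for illustration):
# p permutes the within-row column index, s gives each row a random shift.
p = list(range(W)); random.shuffle(p)
s = list(range(W)); random.shuffle(s)

def bank(addr):
    """Bank assigned to an address under the illustrative permute-shift mapping."""
    return (p[addr % W] + s[(addr // W) % W]) % W

def congestion(addrs):
    """Maximum number of the warp's requests destined for the same bank."""
    counts = {}
    for a in addrs:
        b = bank(a)
        counts[b] = counts.get(b, 0) + 1
    return max(counts.values())

contiguous = list(range(W))             # addresses 0..31, one row
stride_w   = [i * W for i in range(W)]  # one matrix column, stride W
print(congestion(contiguous), congestion(stride_w))  # both 1 under this mapping
```

For the contiguous pattern all addresses share a row, so the banks are p[c] + s[0] mod W for distinct c, and for the stride-W pattern they are p[0] + s[r] mod W for distinct r; in both cases the W banks are distinct, giving congestion 1, while the randomness of p and s breaks adversarial patterns only probabilistically, which is where the O(log w/log log w) bound of the paper applies.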
October 14, 2014 by hgpu