high performance computing on graphics processing units: hgpu.org

Thorough Evaluation of GPU Shared Memory Load and Store Instructions

Satoshi Okamoto, Yasuaki Ito, Koji Nakano, Jacir L. Bordim

Department of Information Engineering, Hiroshima University, Kagamiyama 1-4-1, Higashi-Hiroshima, 739-8527 Japan

International Symposium on Computing and Networking, pp. 614-616, 2014

DOI:10.1109/CANDAR.2014.42

This work measures the number of GPU clock cycles needed to execute load/store instructions for shared memory access patterns both with and without bank conflicts. To this end, the experiments vary several parameters: the number of warps (w), the number of memory bank conflicts (k), and the number of load/store instructions (l) per warp. From the analysis of the experimental results, we obtained an estimate (E) of the number of clock cycles needed to execute l load/store instructions. The estimate is given by E = w * l * k * c1 + c2, where the constants c1 and c2 take the values 1.047 and 337.7, respectively. We believe this estimate can be used as an approximation of the number of clock cycles needed to execute load and store instructions.
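The estimate in the abstract can be evaluated directly. The following is a minimal sketch, not the authors' code: the function name `estimate_cycles` and the input validation are illustrative assumptions, and it further assumes that k = 1 corresponds to conflict-free access, which the abstract does not state explicitly. The constants are the ones quoted in the abstract and presumably apply only to the hardware used in the experiments.

```python
# Constants quoted in the abstract (fitted by the authors from measurements).
C1 = 1.047   # approximate cycles per warp, per instruction, per conflict degree
C2 = 337.7   # fixed overhead in clock cycles


def estimate_cycles(w: int, l: int, k: int, c1: float = C1, c2: float = C2) -> float:
    """Evaluate E = w * l * k * c1 + c2 from the abstract.

    w: number of warps
    l: number of load/store instructions per warp
    k: number of memory bank conflicts (assumed here: k = 1 is conflict-free)
    """
    if min(w, l, k) < 1:
        raise ValueError("w, l and k are assumed to be positive integers")
    return w * l * k * c1 + c2


# Show how the estimate grows with the conflict degree
# for 32 warps issuing 16 instructions each.
for k in (1, 2, 4, 8):
    print(f"k={k}: about {estimate_cycles(32, 16, k):.1f} cycles")
```

Because the formula is linear in w, l, and k, doubling the conflict degree doubles the variable part of the estimate while the fixed overhead c2 stays the same.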

Tags: Benchmarking, Computer science, CUDA, nVidia, nVidia GeForce GTX 780 Ti, Performance, PTX

January 13, 2015 by hgpu
