Automatic Optimization of Thread Mapping for a GPGPU Programming Framework

Kazuhiko Ohno, Tomoharu Kamiya, Takanori Maruyama, Masaki Matsumoto

Department of Information Engineering, Mie University, 1577 Kurimamachiya-cho, Tsu, Mie, 514-8507, Japan

International Journal of Networking and Computing, Volume 5, Number 2, pages 253-271, 2015

Although General Purpose computation on Graphics Processing Units (GPGPU) is widely used for high-performance computing, standard programming frameworks such as CUDA and OpenCL are still difficult to use. They require low-level specifications, and hand optimization is a heavy burden. We are therefore developing an easier framework named MESI-CUDA. Based on a virtual shared memory model, MESI-CUDA hides low-level memory management and data transfer from the user. The compiler generates the low-level code and also optimizes memory accesses by applying conventional hand-optimization techniques. However, GPU threads are created in the same way as in CUDA: the user specifies the thread mapping, i.e., the thread indexing and the size of the thread blocks that run on each streaming multiprocessor (SM). The mapping strongly affects execution performance and may obstruct automatic optimization by the MESI-CUDA compiler, so the user must find an optimal specification that takes physical parameters into account. In this paper, we propose a new thread mapping scheme. We introduce a new thread creation syntax that specifies a hardware-independent logical mapping, which is converted into an optimized physical mapping at compile time. By statically analyzing array index expressions, we obtain groups of threads that access the same or neighboring array elements. By mapping such threads into the same thread block and assigning them consecutive thread indices, the physical mapping is determined so as to maximize the effect of memory access optimization. In our evaluation, the scheme found optimal mapping strategies for five benchmark programs. Memory access transactions were reduced to approximately 1/4, and speedups of 1.4 to 76 times were achieved compared with the worst mapping.
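To illustrate why the thread mapping matters for memory access, the following is a minimal sketch in Python, not the paper's algorithm or MESI-CUDA syntax. It uses a deliberately simplified model in which the 32 threads of a warp each read one 4-byte element of a row-major 2D array, and every distinct 128-byte segment touched by a warp costs one memory transaction. The warp size, segment size, array dimensions, and all function names are assumptions chosen for illustration.

```python
WARP_SIZE = 32        # threads per warp (assumed; typical for NVIDIA GPUs)
SEGMENT_BYTES = 128   # bytes covered by one memory transaction (assumed, simplified)
ELEM_BYTES = 4        # size of one float element


def transactions_per_warp(addresses):
    """Count the distinct memory segments touched by one warp's byte addresses."""
    return len({addr // SEGMENT_BYTES for addr in addresses})


def count_transactions(width, height, index_of_thread):
    """Total transactions when thread t reads a[y][x], with (x, y) = index_of_thread(t)."""
    total = 0
    n_threads = width * height
    for warp_start in range(0, n_threads, WARP_SIZE):
        addrs = []
        for t in range(warp_start, min(warp_start + WARP_SIZE, n_threads)):
            x, y = index_of_thread(t)
            addrs.append((y * width + x) * ELEM_BYTES)  # row-major byte address
        total += transactions_per_warp(addrs)
    return total


WIDTH, HEIGHT = 256, 256  # hypothetical array size


def row_wise(t):
    """Consecutive thread indices access neighboring elements of the same row."""
    return t % WIDTH, t // WIDTH


def column_wise(t):
    """Consecutive thread indices access the same column in successive rows."""
    return t // HEIGHT, t % HEIGHT


good = count_transactions(WIDTH, HEIGHT, row_wise)
bad = count_transactions(WIDTH, HEIGHT, column_wise)
print("row-wise mapping:   ", good, "transactions")
print("column-wise mapping:", bad, "transactions")
```

In this toy model, the row-wise mapping lets each warp read one contiguous segment, while the column-wise mapping makes every thread of a warp touch a different segment. The gap between the two counts is a property of the model's assumptions and is not comparable to the measurements reported in the paper.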

Tags: Benchmarking, Compilers, Computer science, CUDA, Memory model, nVidia, nVidia GeForce GTX 980

July 15, 2015 by hgpu
