high performance computing on graphics processing units: hgpu.org

Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm

Ayaz ul Hassan Khan, Mayez Al-Mouhamed, Allam Fatayer, Anas Almousa, Abdulrahman Baqais, Mohammed Assayony

Computer Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran, 31261, Saudi Arabia

International Journal of Networked and Distributed Computing, Vol. 2, No. 3, 124-134, 2014

DOI:10.1109/SNPD.2014.6888709

@article{assayony2015padding,
  title={Padding Free Bank Conflict Resolution for CUDA-Based Matrix Transpose Algorithm},
  author={Hassan Khan, Ayaz ul and Al-Mouhamed, Mayez and Fatayer, Allam and Almousa, Anas and Baqais, Abdulrahman and Assayony, Mohammed},
  year={2015}
}

Advances in Graphics Processing Unit (GPU) technology and the introduction of the CUDA programming model facilitate the development of new solutions for sparse and dense linear algebra solvers. Matrix transpose is an important linear algebra procedure that has a deep impact on various computational science and engineering applications. Several factors hinder the expected performance of large matrix transposes on GPU devices. The performance degradation stems from memory access patterns, namely non-coalesced access to global memory and bank conflicts in the shared memory of the streaming multiprocessors within the GPU. In this paper, two matrix transpose algorithms are proposed that address both issues by ensuring coalesced global memory access and conflict-free shared memory bank access. The proposed algorithms have execution times comparable to those of the NVIDIA SDK bank-conflict-free matrix transpose implementation. Their main advantage is that they eliminate bank conflicts while allocating shared memory exactly equal to the tile size (T x T) of the problem space. By contrast, to the best of our knowledge, previously published approaches need to allocate a padded tile of T x (T+1). We have also applied the proposed transpose algorithm to the recursive Gaussian implementation in the NVIDIA SDK and achieved about a 6% improvement in performance.
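The trade-off between the T x T and T x (T+1) allocations mentioned in the abstract can be illustrated with a small host-side model of shared memory banking. This is a minimal sketch under stated assumptions, not the authors' algorithms, which the abstract does not describe: it assumes 32 banks that are each one 4-byte word wide, a tile edge T = 32 equal to the warp size, and one thread per tile element along the accessed row or column. The `skewed` layout is a generic diagonal index rotation used here only to show that a padding-free conflict-free layout can exist; all function names are hypothetical.

```python
NUM_BANKS = 32  # assumption: 32 shared memory banks, each one 4-byte word wide
T = 32          # assumption: tile edge equal to the warp size


def unpadded(row, col):
    """Word offset in a plain T x T tile."""
    return row * T + col


def padded(row, col):
    """Word offset in a T x (T+1) tile (one unused word per row)."""
    return row * (T + 1) + col


def skewed(row, col):
    """Word offset in a T x T tile whose rows are rotated by the row index.

    Illustrative only; not taken from the paper.
    """
    return row * T + (col + row) % T


def worst_conflict(index, by_column):
    """Largest number of lanes in one warp that map to the same bank.

    Each warp touches one full row (by_column=False) or one full column
    (by_column=True) of the tile; a result of 1 means conflict free.
    """
    worst = 0
    for fixed in range(T):
        hits = [0] * NUM_BANKS
        for lane in range(T):
            row, col = (lane, fixed) if by_column else (fixed, lane)
            hits[index(row, col) % NUM_BANKS] += 1
        worst = max(worst, max(hits))
    return worst


if __name__ == "__main__":
    layouts = [
        ("unpadded T x T", unpadded, T * T),
        ("padded T x (T+1)", padded, T * (T + 1)),
        ("skewed T x T", skewed, T * T),
    ]
    for name, index, words in layouts:
        print(name, "words:", words,
              "row access:", worst_conflict(index, False),
              "column access:", worst_conflict(index, True))
```

Under these assumptions the plain T x T tile serializes column accesses, while both the padded tile and the rotated tile keep every lane on a distinct bank; only the padded tile pays for it with T extra words of shared memory per tile.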

Tags: Algorithms, Computer science, CUDA, Linear Algebra, nVidia, nVidia Quadro FX 7000, Performance

October 29, 2015 by hgpu
