high performance computing on graphics processing units: hgpu.org


Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU

Kazuya Matsumoto, Naohito Nakasato, Stanislav G. Sedukhin

Graduate School of Computer Science and Engineering, The University of Aizu, Tsuruga, Ikki-Machi, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan

6th IEEE International Symposium on Embedded Multicore SoCs (MCSoC-12), 2012


This paper presents the results of implementing a code generator for fast general matrix multiply (GEMM) kernels. Given a set of parameters, the code generator produces the corresponding GEMM kernel written in OpenCL. The generated kernels are optimized for high performance on AMD GPUs. Access latency to GPU global memory is the main obstacle to high performance. This study shows that storing matrix data in a block-major layout increases the performance and stability of GEMM kernels. On the Tahiti GPU (Radeon HD 7970), our DGEMM (double-precision GEMM) and SGEMM (single-precision GEMM) kernels achieve up to 848 GFlop/s (90% of peak) and 2646 GFlop/s (70% of peak), respectively.
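To make the block-major layout idea concrete, here is a minimal host-side sketch in Python. It is not the authors' implementation: the abstract does not specify the block shape, the order of blocks, or the element order inside a block, so the row-major choices below are assumptions, as are the function names `to_block_major` and `block_major_index` and the parameters `bm` and `bn`. The sketch only shows how a matrix can be repacked so that each block occupies one contiguous stretch of memory.

```python
def to_block_major(a, rows, cols, bm, bn):
    """Repack a row-major matrix (flat list) into a block-major flat list.

    Assumed layout (not taken from the paper): blocks of size bm x bn are
    stored one after another, block rows first, and the elements inside
    each block are stored row-major.
    """
    if rows % bm or cols % bn:
        raise ValueError("this sketch assumes dimensions divisible by the block size")
    out = []
    for bi in range(0, rows, bm):            # first row of the current block
        for bj in range(0, cols, bn):        # first column of the current block
            for i in range(bi, bi + bm):     # rows inside the block
                start = i * cols + bj
                out.extend(a[start:start + bn])
    return out


def block_major_index(i, j, cols, bm, bn):
    """Flat offset of element (i, j) in the buffer built by to_block_major."""
    blocks_per_row = cols // bn
    block_id = (i // bm) * blocks_per_row + (j // bn)
    return block_id * (bm * bn) + (i % bm) * bn + (j % bn)


# Example: a 4 x 4 matrix split into 2 x 2 blocks.
matrix = list(range(16))
packed = to_block_major(matrix, rows=4, cols=4, bm=2, bn=2)
```

With this packing, all elements of one block are adjacent in memory, so a kernel that loads a block reads one contiguous range instead of `bm` separate row segments. `block_major_index` gives the offset at which the original element `(i, j)` ends up.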

Tags: ATI, ATI Radeon HD 7970, Code generation, Computer science, Matrix multiplication, OpenCL

July 15, 2012 by hgpu


