Search Results amd gemm

Tags Results

Authors Results

Namdev Sawant

Jort Gemmeke

Posts

Apr, 21

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., AMD/Xilinx Versal ACAP and Intel Stratix 10 […]

Jun, 23

IA-SpGEMM: An Input-aware Auto-tuning Framework for Parallel Sparse Matrix-Matrix Multiplication

Sparse matrix-matrix multiplication (SpGEMM) is a sparse kernel that is used in a number of scientific applications. Although several SpGEMM algorithms have been proposed, almost all of them are restricted to the compressed sparse row (CSR) format, and the possible performance gain from exploiting other formats has not been well studied. The particular format and […]

CUDA

Apr, 25

Optimized HPL for AMD GPU and multi-core CPU usage

The installation of the LOEWE-CSC (http://csc.uni-frankfurt.de/csc/?51) supercomputer at the Goethe University in Frankfurt lead to the development of a Linpack which can fully utilize the installed AMD Cypress GPUs. At its core, a fast DGEMM for combined GPU and CPU usage was created. The DGEMM library is tuned to hide all DMA transfer times and […]

Sep, 17

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built upon our previous work […]

CUDA

Jul, 2

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

We evaluate AI-assisted generative capabilities on fundamental numerical kernels in high-performance computing (HPC), including AXPY, GEMV, GEMM, SpMV, Jacobi Stencil, and CG. We test the generated kernel codes for a variety of language-supported programming models, including (1) C++ (e.g., OpenMP [including offload], OpenACC, Kokkos, SyCL, CUDA, and HIP), (2) Fortran (e.g., OpenMP [including offload] and […]

CUDA

Dec, 25

Extending MAGMA Portability with OneAPI

As the architectures of super-computing systems are continually changing, it is important to maintain efficient code portability in order to continue to take advantage of the computing capabilities of the diverse and evolving hardware in these systems. Intel has adopted an open standard programming interface for heterogeneous systems called oneAPI, designed to allow code portability […]

CUDA

Aug, 19

Kernel Tuner: A search-optimizing GPU code auto-tuner

A very common problem in GPU programming is that some combination of thread block dimensions and other code optimization parameters, like tiling or unrolling factors, results in dramatically better performance than other kernel configurations. To obtain highly-efficient kernels it is often required to search vast and discontinuous search spaces that consist of all possible combinations […]

CUDA

•

OpenCL

Jun, 21

cltorch: a Hardware-Agnostic Backend for the Torch Deep Neural Network Library, Based on OpenCL

This paper presents cltorch, a hardware-agnostic backend for the Torch neural network framework. cltorch enables training of deep neural networks on GPUs from diverse hardware vendors, including AMD, NVIDIA, and Intel. cltorch contains sufficient implementation to run models such as AlexNet, VGG, Overfeat, and GoogleNet. It is written using the OpenCL language, a portable compute […]

OpenCL

Feb, 16

Parallel and Scalable Sparse Basic Linear Algebra Subprograms

Sparse basic linear algebra subprograms (BLAS) are fundamental building blocks for numerous scientific computations and graph applications. Compared with Dense BLAS, parallelization of Sparse BLAS routines entails extra challenges due to the irregularity of sparse data structures. This thesis proposes new fundamental algorithms and data structures that accelerate Sparse BLAS routines on modern massively parallel […]

OpenCL

Sep, 17

CLTune: A Generic Auto-Tuner for OpenCL Kernels

This work presents CLTune, an auto-tuner for OpenCL kernels. It evaluates and tunes kernel performance of a generic, user-defined search space of possible parametervalue combinations. Example parameters include the OpenCL workgroup size, vector data-types, tile sizes, and loop unrolling factors. CLTune can be used in the following scenarios: 1) when there are too many tunable […]

OpenCL

Oct, 29

Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units

This paper presents an implementation of different matrix-matrix multiplication routines in OpenCL. We utilize the high-performance GEMM (GEneral Matrix-Matrix Multiply) implementation from our previous work for the present implementation of other matrix-matrix multiply routines in Level-3 BLAS (Basic Linear Algebra Subprograms). The other routines include SYMM (Symmetric Matrix-Matrix Multiply), SYRK (Symmetric Rank-K Update), SYR2K (Symmetric […]

OpenCL

May, 11

A portable and high-performance matrix operations library for CPUs, GPUs and beyond

High-performance computing systems today include a variety of compute devices such as multi-core CPUs, GPUs and many-core accelerators. OpenCL allows programming different types of compute devices using a single API and kernel language. However, there is no standard matrix operations library in OpenCL for operations such as matrix multiplication that works well on a variety […]

OpenCL

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

high performance computing on graphics processing units: hgpu.org

Tags Results

Authors Results

Posts

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

IA-SpGEMM: An Input-aware Auto-tuning Framework for Parallel Sparse Matrix-Matrix Multiplication

Optimized HPL for AMD GPU and multi-core CPU usage

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

Evaluation of OpenAI Codex for HPC Parallel Programming Models Kernel Generation

Extending MAGMA Portability with OneAPI

Kernel Tuner: A search-optimizing GPU code auto-tuner

cltorch: a Hardware-Agnostic Backend for the Torch Deep Neural Network Library, Based on OpenCL

Parallel and Scalable Sparse Basic Linear Algebra Subprograms

CLTune: A Generic Auto-Tuner for OpenCL Kernels

Implementing Level-3 BLAS Routines in OpenCL on Different Processing Units

A portable and high-performance matrix operations library for CPUs, GPUs and beyond

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)