high performance computing on graphics processing units: hgpu.org

Posts

Apr, 9

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the […]

CUDA

Apr, 8

pVOCL: Power-Aware Dynamic Placement and Migration in Virtualized GPU Environments

Power-hungry graphics processing unit (GPU) accelerators are ubiquitous in high performance computing data centers today. GPU virtualization frameworks introduce new opportunities for effective management of GPU resources by decoupling them from application execution. However, power management of GPU-enabled server clusters faces significant challenges. The underlying system infrastructure shows complex power consumption characteristics depending on the […]

OpenCL

Apr, 8

A Hardware Multithreaded SpMV Kernel for the Convey HC-2ex

Applications exhibiting irregular behavior through poor memory locality have been a constant challenge for high-performance computing. Architectures supporting hardware multithreading (e.g. Tera MTA and Cray XMT) have been shown to deliver superior performance on such applications by masking memory latency. FPGAs have outperformed traditional architectures on applications that exhibit very large spatial locality and where […]

CUDA

Apr, 8

Load Balancing in a Changing World: Dealing with Heterogeneity and Performance Variability

Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to respond to fluctuations in device performance that occur at run time. We present a novel dynamic approach to work partitioning that […]

OpenCL

Apr, 8

Development of methods for the processing of mining images using genetic algorithms

In this paper we describe the extension of the FOTOM system's capabilities with respect to the segmentation of specific mining images. We focus on methods that are inherently resistant to the noise present in the experimental pit at VSB Technical University. Here, we describe procedures employing proven active contours and evolutionary algorithms for recognizing points of interest in the […]

CUDA

Apr, 8

Highly Scalable Multiplication for Distributed Sparse Multivariate Polynomials on Many-core Systems

We present a highly scalable algorithm for multiplying sparse multivariate polynomials represented in a distributed format. This algorithm targets not only shared memory multicore computers, but also computer clusters or specialized hardware attached to a host computer, such as graphics processing units or many-core coprocessors. The scalability on the large number of […]

CUDA

Apr, 7

Atomic-free Irregular Computations on GPUs

Atomic instructions are a key ingredient of codes that operate on irregular data structures like trees and graphs. It is well known that atomics can be expensive, especially on massively parallel GPUs, and are often on the critical path of a program. In this paper, we present two high-level methods to eliminate atomics in irregular […]

CUDA

Apr, 7

Exploring complex quantum systems with a hybrid CPU-GPU computing platform

One of the most striking features of quantum mechanics is the exponential growth, with the size of the system, of the resources required to find the states of a composite system. This is also the origin of the two main bottlenecks in numerical studies of complex quantum systems, namely (i) diagonalizations of big matrices and […]

CUDA

Apr, 7

Speed up Large Integer Multiplication Using Fourier Transforms and CUDA Technology

Multiplying large integers is an operation that has many applications in computational science. Many cryptographic algorithms require operations on very large subsets of the integers. Using Fast Fourier Transforms (FFT) and a Graphics Processing Unit (GPU), we can speed up integer multiplication and obtain an effective multiplication algorithm. CUDA technology used to perform FFT on […]

CUDA

Apr, 7

Optimizing Sparse Matrix-Matrix Multiplication for the GPU

Sparse matrix-matrix multiplication (SpMM) is a key operation in numerous areas from information to the physical sciences. Implementing SpMM efficiently on throughput-oriented processors, such as the graphics processing unit (GPU), requires the programmer to expose substantial fine-grained parallelism while conserving the limited off-chip memory bandwidth. Balancing these concerns, we decompose the SpMM operation into three, […]

CUDA

Apr, 7

A new CUDA-based GPU implementation of the two-dimensional Athena code

We present a new version of the Athena code, which solves magnetohydrodynamic equations in two-dimensional space. This new implementation, which we have named Athena-GPU, uses the CUDA architecture to allow code execution on a graphics processing unit (GPU). The Athena-GPU code is an unofficial, modified version of the Athena code which was originally designed for Central […]

CUDA

Apr, 6

23rd Annual International Conference on Computer Science and Software Engineering, CASCON 2013

CASCON 2013 is the 23rd annual international conference hosted by CAS Research, IBM Canada Software Lab. Using the motto, “Innovation that matters”, this conference provides an exciting forum for exchanging ideas and experience in the ever-expanding and critical fields of software engineering and computing. The theme of this year, “Ecosystem of Engagement”, highlights the confluence […]



Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository
