high performance computing on graphics processing units: hgpu.org

Posts

Jul, 16

Sparse Matrix-Vector Multiplication on NVIDIA GPU

In this paper, we present our work on developing a new matrix format and a new sparse matrix-vector multiplication algorithm. The matrix format, HEC, is a hybrid format that is efficient for sparse matrix-vector multiplication and friendly to preconditioners. Numerical experiments show that our sparse matrix-vector multiplication algorithm is efficient […]

CUDA

Jul, 16

Sparse Matrix Matrix Multiplication on Hybrid CPU+GPU Platforms

Sparse matrix-sparse/dense matrix multiplications, spgemm and csrmm, are used in various matrix formulations of graph problems, among other applications. GPU-based supercomputers are presently experiencing severe performance issues on the Graph-500 benchmarks, a new HPC benchmark suite focusing on graph algorithms. Considering the difficulties in executing graph problems and the duality between graphs and matrices, […]

CUDA

Jul, 16

A Yoke of Oxen and a Thousand Chickens for Heavy Lifting Graph Processing

Large, real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but most graph processing algorithms entail memory access patterns with poor locality, data-dependent parallelism, and a low compute-to-memory access ratio. Additionally, most real-world graphs have a low diameter and a highly heterogeneous node degree distribution. Partitioning these graphs […]

CUDA

Jul, 16

Distributed OpenCL: a platform for distributed, heterogeneous computing for domain scientists

It is possible to purchase, for as little as $10,000, a cluster of computers with the capability to rival the supercomputers of only a few years ago. Now, users that have little to no experience developing distributed applications or managing a cluster are in a position to do so. To allow domain scientists to effectively […]

OpenCL

Jul, 15

Coupling between Meshless FEM Modeling and Rendering on GPU for Real-time Physically-based Volumetric Deformation

For real-time rendering of physically-based volumetric deformation, a meshless finite element method (FEM) is proposed and implemented on the new-generation Graphics Processing Unit (GPU). A tightly coupled deformation and rendering pipeline is defined for seamless modeling and rendering: First, the meshless FEM model exploits the vertex shader stage and the transform feedback mechanism of the […]

OpenGL

Jul, 15

ab-Stream: A Framework for programming Many-core

The common approach to programming many-core processors is to write processor-specific code with low-level APIs for different processors, which can achieve good performance but results in serious portability issues: programmers are required to write a specific version of the code for each target architecture. Therefore, we present ab-Stream, an extensible framework for programming many-threaded processor based […]

CUDA

Jul, 15

Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU

This paper presents results of an implementation of a code generator for fast general matrix multiply (GEMM) kernels. When a set of parameters is given, the code generator produces the corresponding GEMM kernel written in OpenCL. The produced kernels are optimized for high-performance implementation on GPUs from AMD. Access latency to GPU global memory is the […]

OpenCL

Jul, 15

Distributed OpenCL Distributing OpenCL Platform on Network Scale

This paper presents a framework that extends OpenCL by distributing the computing process to many computing resources connected via a network and enabling those resources to run in parallel. Using JSON RPC (a Remote Procedure Call technique relying on JavaScript Object Notation) in the communication layer, the Distributed OpenCL framework provides platform and operating system independence. Using this framework, […]

OpenCL

Jul, 15

A Performance Model for Memory Bandwidth Constrained Applications on Graphics Engines

Graphics engines are excellent execution platforms for high-throughput computations that exploit a large degree of available parallelism. The achieved performance is, however, highly dependent on the access patterns that the application imposes on the memory subsystem. Here, we propose an analytic model that helps improve the understanding of the performance of memory-limited kernels that employ […]

CUDA

Jul, 14

Optimizing All-to-All and Allgather Communications on GPGPU Clusters

High Performance Computing (HPC) is rapidly becoming an integral part of Science, Engineering and Business. Scientists and engineers are leveraging HPC solutions to run their applications that require high bandwidth, low latency, and very high compute capabilities. General Purpose Graphics Processing Units (GPGPUs) are becoming more popular within the HPC community because of their highly parallel structure, […]

CUDA

Jul, 14

New Techniques for Spectral Image Acquisition and Analysis

This thesis describes typical spectral imaging techniques and spectral image analysis algorithms that are in general use. Three developed spectral imaging systems are proposed. The first imaging system consists of two line scanning based spectral cameras. These cameras are combined in one simultaneous measuring process, which can be used for capturing a wide range of […]

CUDA

Jul, 14

Implementing the Approximate Message Passing (AMP) Algorithm on a GPU

We consider the recovery of sparse signals from a limited number of noisy observations using the AMP algorithm. In this paper, we present two fast implementations of this algorithm that run on a CPU and on a GPU and which can either be used for arbitrary unstructured measurement matrices or take advantage of the structure […]

CUDA
