high performance computing on graphics processing units: hgpu.org

Posts

Aug, 3

Performance Analysis of GPU Accelerators with Realizable Utilization of Computational Density

With the rising number of application accelerators, developers are looking for ways to evaluate new and competing platforms quickly, fairly, and early in the development cycle. As high-performance computing (HPC) applications increase their demands on application acceleration platforms, graphics processing units (GPUs) provide a potential solution for many developers looking for increased performance. Device performance […]

Aug, 2

Parallelizing flow-accumulation calculations on graphics processing units – From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm

As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm […]

CUDA

Aug, 2

Automated Tool to Generate Parallel CUDA code from a Serial C Code

With the introduction of GPGPUs, parallel programming has become simple and affordable. APIs such as NVIDIA’s CUDA have attracted many programmers to port their applications to GPGPUs. But writing CUDA codes still remains a challenging task. Moreover, the vast repositories of legacy serial C codes, which are still in wide use in the industry, are […]

CUDA

Aug, 2

C to Cellular Automata and Execution on CPU, GPU and FPGA

Over the last decades Cellular Automata (CA) have become more and more present in solving general-purpose problems, but the main issue is how to map a problem to a Cellular Automata model. Special languages were developed for programming such models, but learning a new programming language is very time consuming. Furthermore software developers have to […]

CUDA

Aug, 2

Accelerating Pathology Image Data Cross-Comparison on CPU-GPU Hybrid Systems

As an important application of spatial databases in pathology imaging analysis, cross-comparing the spatial boundaries of a huge amount of segmented micro-anatomic objects demands extremely data- and compute-intensive operations, requiring high throughput at an affordable cost. However, the performance of spatial database systems has not been satisfactory since their implementations of spatial operations cannot fully […]

CUDA

Aug, 2

A GPU-Computing Approach to Solar Stokes Profile Inversion

We present a new computational approach to the inversion of solar photospheric Stokes polarization profiles, under the Milne-Eddington model, for vector magnetography. Our code, named GENESIS (GENEtic Stokes Inversion Strategy), employs multi-threaded parallel-processing techniques to harness the computing power of graphics processing units (GPUs), along with algorithms designed to exploit the inherent parallelism of the […]

CUDA

Aug, 1

Interference-driven resource management for GPU-based heterogeneous clusters

GPU-based clusters are increasingly being deployed in HPC environments to accelerate a variety of scientific applications. Despite their growing popularity, the GPU devices themselves are under-utilized even for many computationally-intensive jobs. This stems from the fact that the typical GPU usage model is one in which a host processor periodically offloads computationally intensive portions of […]

CUDA

Aug, 1

GPU merge path: a GPU merging algorithm

Graphics Processing Units (GPUs) have become ideal candidates for the development of fine-grain parallel algorithms as the number of processing elements per GPU increases. In addition to the increase in cores per system, new memory hierarchies and increased bandwidth have been developed that allow for significant performance improvement when computation is performed using certain types […]

CUDA

Aug, 1

New Sparse Matrix Storage Format to Improve The Performance of Total SPMV Time

Graphics Processing Units (GPUs) are massive data parallel processors. High performance comes only at the cost of identifying data parallelism in the applications while using data parallel processors like GPU. This is an easy effort for applications that have regular memory access and high computation intensity. GPUs are equally attractive for sparse matrix vector multiplications […]

CUDA

Aug, 1

High-Level Manipulation of OpenCL-Based Subvectors and Submatrices

High-level C++ proxies for the convenient manipulation of subvectors and submatrices on OpenCL-enabled devices are introduced. It is demonstrated that the programming convenience of standard host-based code can be retained using native C++ language features only, even if massively parallel computing architectures such as graphics processing units are employed. The required modifications of the underlying […]

OpenCL

Aug, 1

GPU-Accelerated Non-negative Matrix Factorization for Text Mining

An implementation of the non-negative matrix factorization algorithm for the purpose of text mining on graphics processing units is presented. Performance gains of more than one order of magnitude are obtained.

OpenCL

Jul, 31

accULL: An User-directed Approach to Heterogeneous Programming

The world of HPC is undergoing rapid changes, and the range of computer architectures capable of achieving high performance has broadened. The arrival of computational accelerators, like GPUs, is increasing performance while maintaining low cost per GFLOP, thus expanding the popularity of HPC. However, it is still difficult to exploit the new complex processor hierarchies. […]

CUDA

OpenCL

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
