high performance computing on graphics processing units: hgpu.org

Posts

Aug, 5

GPU schedulers: how fair is fair enough?

Blocking synchronisation idioms, e.g. mutexes and barriers, play an important role in concurrent programming. However, systems with semi-fair schedulers, e.g. graphics processing units (GPUs), are becoming increasingly common. Such schedulers provide varying degrees of fairness, guaranteeing enough to allow some, but not all, blocking idioms. While a number of applications that use blocking idioms do […]

OpenCL

Aug, 5

OpenCLIPER: an OpenCL-based C++ Framework for Overhead-Reduced Medical Image Processing and Reconstruction on Heterogeneous Devices

Medical image processing is often limited by the computational cost of the involved algorithms. Whereas dedicated computing devices (GPUs in particular) exist and do provide significant efficiency boosts, they have an extra cost of use in terms of housekeeping tasks (device selection and initialization, data streaming, synchronization with the CPU and others), which may hinder […]

OpenCL

Aug, 5

CRUM: Checkpoint-Restart Support for CUDA’s Unified Memory

Unified Virtual Memory (UVM) was recently introduced on NVIDIA GPUs. Through software and hardware support, UVM provides a coherent shared memory across the entire heterogeneous node, migrating data as appropriate. The older CUDA programming style is akin to older large-memory UNIX applications which used to directly load and unload memory segments. Newer CUDA programs […]

CUDA

Jul, 28

Elementary functions: towards automatically generated, efficient, and vectorizable implementations

Elementary mathematical functions are pervasive in many high performance computing programs. However, although the mathematical libraries (libms), on which these programs rely, generally provide several flavors of the same function, these are fixed at implementation time. Hence this monolithic characteristic of libms is an obstacle for the performance of programs relying on them, because they […]

Jul, 28

Optimization of OpenCL applications on FPGA

Since Moore’s Law is over, specialized accelerators have become increasingly popular over the years. FPGAs are one such accelerator, and their "reconfigurable hardware" capabilities make them really promising. FPGAs are programmed with HDLs, which is hard and time-consuming, so many high-level alternatives (such as HLS, OpenCL, SystemC, …) have emerged to provide […]

OpenCL

Jul, 28

Smoothed-Particle Hydrodynamics Models: Implementation Features on GPUs

Parallel implementation features of self-gravitating gas dynamics modeling on multiple GPUs are considered, applying the GPU-Direct technology. The parallel algorithm for solving the self-gravitating gas dynamics problem, based on a hybrid OpenMP-CUDA parallel programming model, is described in detail. The gas-dynamic forces are calculated by the modified SPH-method (Smoothed Particle Hydrodynamics) while the N-body […]

CUDA

Jul, 28

gSMat: A Scalable Sparse Matrix-based Join for SPARQL Query Processing

The Resource Description Framework (RDF) has been widely used to represent information on the web, while SPARQL is a standard query language for manipulating RDF data. A SPARQL query often contains many joins, which are the efficiency bottleneck of query processing. Moreover, real RDF datasets often exhibit strong data sparsity, which indicates […]

CUDA

Jul, 28

Block-Size Independence for GPU Programs

Optimizing GPU programs by tuning execution parameters is essential to realizing the full performance potential of GPU hardware. However, many of these optimizations do not ensure correctness and subtle errors can enter while optimizing a GPU program. Further, lack of formal models and the presence of non-trivial transformations prevent verification of optimizations. In this work, […]

CUDA

Jul, 21

Spatial: A Language and Compiler for Application Accelerators

Industry is increasingly turning to reconfigurable architectures like FPGAs and CGRAs for improved performance and energy efficiency. Unfortunately, adoption of these architectures has been limited by their programming models. HDLs lack abstractions for productivity and are difficult to target from higher level languages. HLS tools are more productive, but offer an ad-hoc mix of software […]

Jul, 21

cuPentBatch – A batched pentadiagonal solver for NVIDIA GPUs

We introduce cuPentBatch – our own pentadiagonal solver for NVIDIA GPUs. The development of cuPentBatch has been motivated by applications involving numerical solutions of parabolic partial differential equations, which we describe. Our solver is written with batch processing in mind (as necessitated by parameter studies of various physical models). In particular, our solver is directed […]

CUDA

Jul, 21

Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms

The trend towards processor heterogeneity and distributed-memory has significantly increased the complexity of parallel programming. In addition, the mix of applications that need to run on parallel platforms today is very diverse, and includes graph applications that typically have irregular memory accesses and unpredictable control-flow. To simplify the programming of graph applications on such platforms, […]

CUDA

Jul, 21

ARC: Adaptive Ray-tracing with CUDA, a New Ray Tracing Code for Parallel GPUs

We present the methodology of a photon-conserving, spatially-adaptive, ray-tracing radiative transfer algorithm, designed to run on multiple parallel Graphics Processing Units (GPUs). Each GPU has thousands of computing cores, making them ideally suited to the task of tracing independent rays. This ray-tracing implementation has speed competitive with approximate momentum methods, even with thousands of ionization sources, […]

CUDA


