
Posts

Sep, 3

Performance Portability Study of Linear Algebra Kernels in OpenCL

The performance portability of OpenCL kernel implementations of common memory-bandwidth-limited linear algebra operations is studied across different hardware generations of the same vendor, as well as across vendors. Certain combinations of kernel implementations and work sizes are found to exhibit good performance across compute kernels, hardware generations, and, to a lesser degree, vendors. […]
Sep, 2

Directive-Based Compilers for GPUs

General-purpose graphics processing units (GPGPUs) can be used effectively to enhance the performance of many contemporary scientific applications. However, programming GPUs using machine-specific notations like CUDA or OpenCL can be complex and time-consuming. In addition, the resulting programs are typically fine-tuned for a particular target device. A promising alternative is to program in a […]
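For contrast, a minimal sketch of the directive-based style the post refers to is shown below: an ordinary C++ SAXPY loop annotated with an OpenACC directive (a hypothetical example, not taken from the paper). The compiler is asked to offload the loop and manage data movement, and the same source still builds as a plain CPU program when the directive is ignored.

#include <vector>

// Hypothetical SAXPY: the OpenACC directive asks the compiler to offload the loop
// and manage data movement; without an OpenACC compiler the pragma is simply ignored.
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    saxpy(n, 3.0f, x.data(), y.data());
    return 0;
}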
Sep, 2

LightPlay: Efficient Replay with GPUs

Previous deterministic replay systems reduce runtime overhead either by relying on hardware support or by relaxing the determinism requirements of the replay. We propose LightPlay, which fulfills stricter determinism requirements with low overhead, without requiring hardware or OS support. LightPlay guarantees that the memory state after each instruction instance in a replay run is the […]
Sep, 2

Determining the difficulty of accelerating problems on a GPU

General-purpose computation on graphics processing units (GPGPU) has great potential to accelerate many scientific models and algorithms. However, some problems are considerably more difficult to accelerate than others, and it may be challenging for those new to GPGPU to ascertain the difficulty of accelerating a particular problem. Through what was learned in the acceleration of […]
Sep, 2

Optimistic Parallelism on GPUs

We present speculative parallelization techniques that can exploit parallelism in loops even in the presence of dynamic irregularities that may give rise to cross-iteration dependences. The execution of a speculatively parallelized loop consists of five phases: scheduling, computation, misspeculation check, result committing, and misspeculation recovery. While the first two phases enable exploitation of data parallelism, […]
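A minimal host-side sketch of that five-phase structure is given below, assuming a loop whose only cross-iteration dependences arise through a runtime index array; the conservative prefix-commit scheme and all names are illustrative, not the authors' implementation.

#include <cstdio>
#include <vector>

int main() {
    const int n = 1000;
    std::vector<float> a(n, 1.0f);
    std::vector<int> idx(n);
    for (int i = 0; i < n; ++i) idx[i] = (i + n / 2) % n;  // only known at run time

    // Phase 1: scheduling - here simply one iteration per parallel work item.
    std::vector<float> res(n);          // speculative result of each iteration
    std::vector<int> firstWrite(n, n);  // earliest iteration writing each location

    // Phase 2: computation - run all iterations speculatively on the original data.
    // Sequential semantics of the loop body: a[idx[i]] = 2 * a[i];
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        res[i] = 2.0f * a[i];

    // Phase 3: misspeculation check - iteration i is wrong if an earlier iteration
    // writes the location it read (a cross-iteration flow dependence).
    for (int i = 0; i < n; ++i)
        if (i < firstWrite[idx[i]]) firstWrite[idx[i]] = i;
    int m = n;                          // first misspeculated iteration
    for (int i = 0; i < n && m == n; ++i)
        if (firstWrite[i] < i) m = i;

    // Phase 4: result committing - commit the provably correct prefix in order.
    for (int i = 0; i < m; ++i) a[idx[i]] = res[i];

    // Phase 5: misspeculation recovery - re-execute the remainder sequentially.
    for (int i = m; i < n; ++i) a[idx[i]] = 2.0f * a[i];

    std::printf("committed %d of %d iterations speculatively\n", m, n);
    return 0;
}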
Sep, 2

Heterogeneous Computing on Mixed Unstructured Grids with PyFR

PyFR is an open-source high-order accurate computational fluid dynamics solver for mixed unstructured grids that can target a range of hardware platforms from a single codebase. In this paper we demonstrate the ability of PyFR to perform high-order accurate unsteady simulations of flow on mixed unstructured grids using heterogeneous multi-node hardware. Specifically, after benchmarking single-node […]
Sep, 1

Performance Evaluations of Graph Database using CUDA and OpenMP-Compatible Libraries

Graph databases use graph structures to store data sets as nodes, edges, and properties. They are used to store and search the relationships among large numbers of nodes, as in social networking services and recommendation engines that use customer social graphs. Since the computation cost of graph search queries increases as the graph becomes large, […]
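As a rough illustration of that data model (not of the libraries evaluated in the paper), a toy property graph in C++ might look like this:

#include <string>
#include <unordered_map>
#include <vector>

// Toy property-graph layout: nodes and edges each carry key/value properties.
struct Node { std::unordered_map<std::string, std::string> props; };
struct Edge { int src, dst; std::unordered_map<std::string, std::string> props; };

struct PropertyGraph {
    std::vector<Node> nodes;
    std::vector<Edge> edges;
    std::vector<std::vector<int>> adj;   // node -> indices of its outgoing edges

    int addNode() { nodes.push_back({}); adj.emplace_back(); return (int)nodes.size() - 1; }
    void addEdge(int s, int d) { edges.push_back({s, d, {}}); adj[s].push_back((int)edges.size() - 1); }
};

int main() {
    PropertyGraph g;
    int alice = g.addNode(), bob = g.addNode();
    g.nodes[alice].props["name"] = "Alice";
    g.nodes[bob].props["name"]   = "Bob";
    g.addEdge(alice, bob);
    g.edges.back().props["type"] = "follows";   // e.g. a social-graph relationship
    return 0;
}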
Sep, 1

Fast reconstruction of 3D volumes from 2D CT projection data with GPUs

Biomedical image reconstruction applications require producing high-fidelity images in or close to real time. We have implemented reconstruction of three-dimensional cone-beam computed tomography (CBCT) from two-dimensional projections. The algorithm takes slices of the target, weights and filters them to backproject the data, and then creates the final 3D volume. We have implemented the algorithm using […]
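A heavily simplified, voxel-driven CUDA sketch of the backprojection step is shown below. The circular-scan flat-detector geometry, the constants, the 1/U^2 weighting, and the nearest-neighbor sampling are all illustrative assumptions, and the projections are assumed to be weighted and ramp-filtered already; none of this is the authors' implementation.

#include <cuda_runtime.h>
#include <cstdio>

// Illustrative geometry constants (not from the paper).
#define NPROJ 64        // number of projection angles
#define NU 128          // detector columns
#define NV 128          // detector rows
#define NX 64           // volume dimensions (voxels)
#define NY 64
#define NZ 64
#define DSO 500.0f      // source-to-isocenter distance (mm)
#define DSD 1000.0f     // source-to-detector distance (mm)
#define DU 1.0f         // detector pixel pitch (mm, both directions)
#define DX 1.0f         // voxel pitch (mm)
#define PI 3.14159265f

// One thread per voxel: accumulate contributions from every (pre-weighted,
// pre-filtered) projection, with a simple 1/U^2 distance weight.
__global__ void backproject(const float *proj, float *vol) {
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int iz = blockIdx.z * blockDim.z + threadIdx.z;
    if (ix >= NX || iy >= NY || iz >= NZ) return;

    // Voxel position relative to the rotation axis.
    float x = (ix - NX / 2 + 0.5f) * DX;
    float y = (iy - NY / 2 + 0.5f) * DX;
    float z = (iz - NZ / 2 + 0.5f) * DX;

    float acc = 0.0f;
    for (int p = 0; p < NPROJ; ++p) {
        float th = 2.0f * PI * p / NPROJ;
        float s = sinf(th), c = cosf(th);
        float xr = c * x + s * y;                    // towards the source for this view
        float yr = -s * x + c * y;                   // along the detector u axis
        float U = DSO - xr;                          // source-plane-to-voxel distance
        float u = (DSD * yr / U) / DU + NU / 2.0f;   // perspective projection
        float v = (DSD * z / U) / DU + NV / 2.0f;
        int iu = (int)floorf(u), iv = (int)floorf(v);
        if (iu < 0 || iu >= NU || iv < 0 || iv >= NV) continue;
        acc += proj[(p * NV + iv) * NU + iu] * (DSO * DSO) / (U * U);
    }
    vol[(iz * NY + iy) * NX + ix] = acc * (2.0f * PI / NPROJ);
}

int main() {
    float *d_proj, *d_vol;
    cudaMalloc(&d_proj, sizeof(float) * NPROJ * NU * NV);
    cudaMalloc(&d_vol, sizeof(float) * NX * NY * NZ);
    cudaMemset(d_proj, 0, sizeof(float) * NPROJ * NU * NV);   // placeholder data

    dim3 block(8, 8, 4);
    dim3 grid(NX / 8, NY / 8, NZ / 4);
    backproject<<<grid, block>>>(d_proj, d_vol);
    cudaDeviceSynchronize();

    cudaFree(d_proj); cudaFree(d_vol);
    return 0;
}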
Sep, 1

Scalable Kernel Fusion for Memory-Bound GPU Applications

GPU implementations of HPC applications relying on finite difference methods can include tens of kernels that are memory-bound. Kernel fusion can improve performance by reducing data traffic to off-chip memory; kernels that share data arrays are fused into larger kernels where on-chip cache is used to hold the data reused by instructions originating from different […]
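In its simplest form the idea looks like the CUDA sketch below: two memory-bound kernels that communicate through an intermediate array are replaced by one kernel in which the shared value stays on chip, removing a full round trip to off-chip memory. The example is illustrative; for the finite-difference kernels targeted in the paper the reused data is held in on-chip cache or shared memory rather than a single register.

#include <cuda_runtime.h>

// Unfused version: the intermediate array b makes a round trip through off-chip memory.
__global__ void scale(const float *a, float *b, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 2.0f * a[i];
}
__global__ void shift(const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = b[i] + 1.0f;
}

// Fused version: the shared value stays in a register, so the traffic for b disappears.
__global__ void scale_shift_fused(const float *a, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = 2.0f * a[i];   // body of the first kernel
        c[i] = t + 1.0f;         // body of the second kernel
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));

    const int threads = 256, blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(a, b, n);              // two launches, extra traffic
    shift<<<blocks, threads>>>(b, c, n);
    scale_shift_fused<<<blocks, threads>>>(a, c, n);  // one launch, less traffic
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}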
Aug, 27

Optimization of Data-Parallel Scientific Applications on Highly Heterogeneous Modern HPC Platforms

Over the past decade, microprocessor design has been shifting to a new model in which the processor contains multiple homogeneous processing units, known as cores, as a result of heat-dissipation and energy-consumption concerns. Meanwhile, the demand for heterogeneity in computing systems has increased in recent years due to the need for high-performance computing. […]
Aug, 27

Surface Normal Integration for Convex Space-time Multi-view Reconstruction

We show that surface normal information makes it possible to significantly improve the accuracy of a spatio-temporal multi-view reconstruction. On the one hand, normal information can improve the quality of photometric matching scores. On the other hand, the same normal information can be employed to drive an adaptive anisotropic surface regularization process that better preserves fine details and […]
Aug, 27

High Performance Financial Simulation Using Randomized Quasi-Monte Carlo Methods

GPU computing has become popular in computational finance, and many financial institutions are moving their CPU-based applications to the GPU platform. Since most Monte Carlo algorithms are embarrassingly parallel, they benefit greatly from parallel implementations, and consequently Monte Carlo has become a focal point in GPU computing. GPU speed-up examples reported in the literature […]
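As a reminder of why the embarrassingly parallel structure matters, the CUDA sketch below prices a European call by letting each thread simulate one independent geometric-Brownian-motion path; it uses plain pseudo-random numbers, whereas the paper substitutes randomized quasi-Monte Carlo sequences. All names and parameter values are illustrative.

#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <cstdio>

// Each thread simulates one terminal stock price and adds its discounted payoff
// to a global accumulator (an atomic add is fine for a sketch; a reduction scales better).
__global__ void mc_call(float S0, float K, float r, float sigma, float T,
                        unsigned long long seed, int npaths, float *sum) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= npaths) return;
    curandState st;
    curand_init(seed, i, 0, &st);                    // independent stream per thread
    float z = curand_normal(&st);
    float ST = S0 * expf((r - 0.5f * sigma * sigma) * T + sigma * sqrtf(T) * z);
    float payoff = fmaxf(ST - K, 0.0f);
    atomicAdd(sum, expf(-r * T) * payoff / npaths);
}

int main() {
    const int npaths = 1 << 20;
    float h_sum = 0.0f, *d_sum;
    cudaMalloc(&d_sum, sizeof(float));
    cudaMemcpy(d_sum, &h_sum, sizeof(float), cudaMemcpyHostToDevice);

    mc_call<<<(npaths + 255) / 256, 256>>>(100.0f, 100.0f, 0.05f, 0.2f, 1.0f,
                                           1234ULL, npaths, d_sum);
    cudaMemcpy(&h_sum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("estimated call price: %f\n", h_sum);
    cudaFree(d_sum);
    return 0;
}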
