high performance computing on graphics processing units: hgpu.org

Posts

Sep, 6

Leveraging Memory Copy Overlap for Efficient Sparse Matrix-Vector Multiplication on GPUs

Sparse matrix-vector multiplication (SpMV) is central to many scientific, engineering, and other applications, including machine learning. Compressed Sparse Row (CSR) is a widely used sparse matrix storage format. SpMV using the CSR format on GPU computing platforms is widely studied, where the access behavior of GPU is often the performance bottleneck. The Ampere GPU architecture […]

CUDA

Sep, 6

Fortran High-Level Synthesis: Reducing the barriers to accelerating HPC codes on FPGAs

In recent years the use of FPGAs to accelerate scientific applications has grown, with numerous applications demonstrating the benefit of FPGAs for high performance workloads. However, whilst High Level Synthesis (HLS) has significantly lowered the barrier to entry in programming FPGAs by enabling programmers to use C++, a major challenge is that most often these […]

OpenCL

Sep, 6

PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability

We propose a novel computing runtime that exposes remote compute devices via the cross-vendor open heterogeneous computing standard OpenCL and can execute compute tasks on the MEC cluster side across multiple servers in a scalable manner. Intermittent UE connection loss is handled gracefully even if the device’s IP address changes on the way. Network-induced latency […]

OpenCL

Sep, 6

HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU

The end of Dennard scaling and the slowdown of Moore’s law led to a shift in technology trends toward parallel architectures, particularly in HPC systems. To continue providing performance benefits, HPC should embrace Approximate Computing (AC), which trades application quality loss for improved performance. However, existing AC techniques have not been extensively applied and evaluated […]

Aug, 28

Compute units in OpenMP: Extensions for heterogeneous parallel programming

This article evaluates the current support for heterogeneous OpenMP 5.2 applications regarding the simultaneous activation of host and device computing units (e.g., CPUs, GPUs, or FPGAs). The article identifies limitations in the current OpenMP specification and describes the design and implementation of novel OpenMP extensions and runtime support for heterogeneous parallel programming. The Compute Unit […]

Aug, 28

Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations

The path to the efficient exploitation of molecular dynamics simulators is strongly driven by the increasingly intensive use of accelerators. However, they suffer performance portability issues, making it necessary both to achieve technological combinations that allow taking advantage of each programming model and device, and to define more effective load distribution strategies that consider the […]

OpenCL

Aug, 28

Novel insights on atomic synchronization for sort-based group-by on GPUs

Using heterogeneous processing devices, like GPUs, to accelerate relational database operations is a well-known strategy. In this context, the group by operation is highly interesting for two reasons. Firstly, it incurs large processing costs. Secondly, its results (i.e., aggregates) are usually small, reducing data movement costs whose compensation is a major challenge for heterogeneous computing. […]

OpenCL

Aug, 28

Performant low-order matrix-free finite element kernels on GPU architectures

Numerical methods such as the Finite Element Method (FEM) have been successfully adapted to utilize the computational power of GPU accelerators. However, much of the effort around applying FEM to GPUs has been focused on high-order FEM due to higher arithmetic intensity and order of accuracy. For applications such as the simulation of subsurface processes, […]

Aug, 28

Sieve: Stratified GPU-Compute Workload Sampling

To exploit the ever increasing compute capabilities offered by GPU hardware, GPU-compute workloads have evolved from simple computational kernels to large-scale programs with complex software stacks and numerous kernels. Driving architecture exploration using real workloads hence becomes increasingly challenging, up to the point of becoming intractable because of extremely long simulation times using existing architecture […]

CUDA

Aug, 20

Porting Batched Iterative Solvers onto Intel GPUs with SYCL

Batched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures. In this paper, […]

CUDA

Aug, 20

APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics

The prediction of protein 3D structure from amino acid sequence is a computational grand challenge in biophysics, and plays a key role in robust protein structure prediction algorithms, from drug discovery to genome interpretation. The advent of AI models, such as AlphaFold, is revolutionizing applications that depend on robust protein structure prediction algorithms. To maximize […]

Aug, 20

Increased reliability on Intel GPUs via software diverse redundancy

During the past decade, the industry revolutionized its processes by including Artificial Intelligence. Nowadays, this revolutionary process extends from the manufacturing industry to more critical sectors, such as the avionics, automotive, or health industry, where errors are unacceptable. One clear example of this process is the automotive industry, where the installation of Advanced Driver Assistance […]

OpenCL

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)