high performance computing on graphics processing units: hgpu.org

Posts

Sep, 6

HPAC-Offload: Accelerating HPC Applications with Portable Approximate Computing on the GPU

The end of Dennard scaling and the slowdown of Moore’s law led to a shift in technology trends toward parallel architectures, particularly in HPC systems. To continue providing performance benefits, HPC should embrace Approximate Computing (AC), which trades application quality loss for improved performance. However, existing AC techniques have not been extensively applied and evaluated […]

Aug, 28

Compute units in OpenMP: Extensions for heterogeneous parallel programming

This article evaluates the current support for heterogeneous OpenMP 5.2 applications regarding the simultaneous activation of host and device computing units (e.g., CPUs, GPUs, or FPGAs). The article identifies limitations in the current OpenMP specification and describes the design and implementation of novel OpenMP extensions and runtime support for heterogeneous parallel programming. The Compute Unit […]

Aug, 28

Novel insights on atomic synchronization for sort-based group-by on GPUs

Using heterogeneous processing devices, like GPUs, to accelerate relational database operations is a well-known strategy. In this context, the group-by operation is highly interesting for two reasons. Firstly, it incurs large processing costs. Secondly, its results (i.e., aggregates) are usually small, reducing data movement costs whose compensation is a major challenge for heterogeneous computing. […]

OpenCL

Aug, 28

Performant low-order matrix-free finite element kernels on GPU architectures

Numerical methods such as the Finite Element Method (FEM) have been successfully adapted to utilize the computational power of GPU accelerators. However, much of the effort around applying FEM to GPUs has been focused on high-order FEM due to higher arithmetic intensity and order of accuracy. For applications such as the simulation of subsurface processes, […]

Aug, 28

Sieve: Stratified GPU-Compute Workload Sampling

To exploit the ever increasing compute capabilities offered by GPU hardware, GPU-compute workloads have evolved from simple computational kernels to large-scale programs with complex software stacks and numerous kernels. Driving architecture exploration using real workloads hence becomes increasingly challenging, up to the point of becoming intractable because of extremely long simulation times using existing architecture […]

CUDA

Aug, 28

Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations

The path to the efficient exploitation of molecular dynamics simulators is strongly driven by the increasingly intensive use of accelerators. However, they suffer performance portability issues, making it necessary both to achieve technological combinations that allow taking advantage of each programming model and device, and to define more effective load distribution strategies that consider the […]

OpenCL

Aug, 20

Porting Batched Iterative Solvers onto Intel GPUs with SYCL

Batched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures. In this paper, […]

CUDA

Aug, 20

APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics

The prediction of protein 3D structure from amino acid sequence is a computational grand challenge in biophysics, and plays a key role in robust protein structure prediction algorithms, from drug discovery to genome interpretation. The advent of AI models, such as AlphaFold, is revolutionizing applications that depend on robust protein structure prediction algorithms. To maximize […]

Aug, 20

Increased reliability on Intel GPUs via software diverse redundancy

During the past decade, the industry revolutionized its processes by including Artificial Intelligence. Nowadays, this revolutionary process extends from the manufacturing industry to more critical sectors, such as the avionics, automotive, or health industry, where errors are unacceptable. One clear example of this process is the automotive industry, where the installation of Advanced Driver Assistance […]

OpenCL

Aug, 20

Quantifying OpenMP: Statistical Insights into Usage and Adoption

In high-performance computing (HPC), the demand for efficient parallel programming models has grown dramatically since the end of Dennard Scaling and the subsequent move to multi-core CPUs. OpenMP stands out as a popular choice due to its simplicity and portability, offering a directive-driven approach for shared-memory parallel programming. Despite its wide adoption, however, there is […]

CUDA • OpenCL

Aug, 20

Generating Parallel OpenCL and OpenMP Programs from Dataflow Graphs

This thesis describes and analyzes the automatic generation of threads from a sequential MiniC program by translating the program to an equivalent dataflow graph and partitioning this dataflow graph. These threads are generated through different graph partitionings, including splitting the graph into its single nodes and calculating a minimum vertex-disjoint cover. The threads can be […]

OpenCL

Aug, 13

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, which still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose gZCCL, a general framework […]

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
