high performance computing on graphics processing units: hgpu.org

Posts

Aug, 28

Performant low-order matrix-free finite element kernels on GPU architectures

Numerical methods such as the Finite Element Method (FEM) have been successfully adapted to utilize the computational power of GPU accelerators. However, much of the effort around applying FEM to GPUs has been focused on high-order FEM due to higher arithmetic intensity and order of accuracy. For applications such as the simulation of subsurface processes, […]

Aug, 28

Sieve: Stratified GPU-Compute Workload Sampling

To exploit the ever increasing compute capabilities offered by GPU hardware, GPU-compute workloads have evolved from simple computational kernels to large-scale programs with complex software stacks and numerous kernels. Driving architecture exploration using real workloads hence becomes increasingly challenging, up to the point of becoming intractable because of extremely long simulation times using existing architecture […]

CUDA

Aug, 28

Mashing load balancing algorithm to boost hybrid kernels in molecular dynamics simulations

The path to the efficient exploitation of molecular dynamics simulators is strongly driven by the increasingly intensive use of accelerators. However, they suffer performance portability issues, making it necessary both to achieve technological combinations that allow taking advantage of each programming model and device, and to define more effective load distribution strategies that consider the […]

OpenCL

Aug, 28

Novel insights on atomic synchronization for sort-based group-by on GPUs

Using heterogeneous processing devices, like GPUs, to accelerate relational database operations is a well-known strategy. In this context, the group by operation is highly interesting for two reasons. Firstly, it incurs large processing costs. Secondly, its results (i.e., aggregates) are usually small, reducing data movement costs whose compensation is a major challenge for heterogeneous computing. […]

OpenCL

Aug, 20

Porting Batched Iterative Solvers onto Intel GPUs with SYCL

Batched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures. In this paper, […]

CUDA

Aug, 20

APACE: AlphaFold2 and advanced computing as a service for accelerated discovery in biophysics

The prediction of protein 3D structure from amino acid sequence is a computational grand challenge in biophysics, and plays a key role in robust protein structure prediction algorithms, from drug discovery to genome interpretation. The advent of AI models, such as AlphaFold, is revolutionizing applications that depend on robust protein structure prediction algorithms. To maximize […]

Aug, 20

Increased reliability on Intel GPUs via software diverse redundancy

During the past decade, the industry revolutionized its processes by including Artificial Intelligence. Nowadays, this revolutionary process extends from the manufacturing industry to more critical sectors, such as the avionics, automotive, or health industry, where errors are unacceptable. One clear example of this process is the automotive industry, where the installation of Advanced Driver Assistance […]

OpenCL

Aug, 20

Quantifying OpenMP: Statistical Insights into Usage and Adoption

In high-performance computing (HPC), the demand for efficient parallel programming models has grown dramatically since the end of Dennard Scaling and the subsequent move to multi-core CPUs. OpenMP stands out as a popular choice due to its simplicity and portability, offering a directive-driven approach for shared-memory parallel programming. Despite its wide adoption, however, there is […]

CUDA • OpenCL

Aug, 20

Generating Parallel OpenCL and OpenMP Programs from Dataflow Graphs

This thesis describes and analyzes the automatic generation of threads from a sequential MiniC program by translating the program to an equivalent dataflow graph and partitioning this dataflow graph. These threads are generated through different graph partitionings, including splitting the graph into its single nodes and calculating a minimum vertex-disjoint cover. The threads can be […]

OpenCL

Aug, 13

gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters

GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, which still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose gZCCL, a general framework […]

Aug, 13

A Model Extraction Attack on Deep Neural Networks Running on GPUs

Deep Neural Networks (DNNs) have become ubiquitous due to their performance on prediction and classification problems. However, they face a variety of threats as their usage spreads. Model extraction attacks, which steal DNN models, endanger intellectual property, data privacy, and security. Previous research has shown that system-level side channels can be used to leak the […]

Aug, 13

SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving

Energy-efficient computing uses power management techniques such as frequency scaling to save energy. Implementing energy-efficient techniques on large-scale computing systems is challenging for several reasons. While most modern architectures, including GPUs, are capable of frequency scaling, these features are often not available on large systems. In addition, achieving higher energy savings requires precise energy tuning […]


Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container
