Posts
Aug, 21
Optimization of GPU workloads using natural language processing based on deep learning techniques
Setting program parameters is challenging due to the abstract relationship between hardware and software. Accurate automatic optimization algorithms are required to cope with the complexity and variety of current hardware and software. Autotuning has always relied on time-consuming trial-and-error approaches. Machine learning (ML) and Natural Language Processing (NLP) have flourished over […]
Aug, 7
A Container-Based Workflow for Distributed Training of Deep Learning Algorithms in HPC Clusters
Deep learning has been postulated as a solution for numerous problems in different branches of science. Given the resource-intensive nature of these models, they often need to be executed on specialized hardware such as graphics processing units (GPUs) in a distributed manner. In the academic field, researchers get access to this kind of resource through High […]
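The distributed-training piece of such a workflow commonly reduces to a data-parallel script launched, e.g. via torchrun, inside the container on each node. A generic sketch with placeholder model, data, and loss (this is standard PyTorch DDP plumbing, not the paper's specific workflow):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launched on each node, e.g.: torchrun --nnodes=2 --nproc_per_node=4 train.py
# (torchrun sets RANK/LOCAL_RANK/WORLD_SIZE). Model, data, and loss below are
# placeholders; only the DDP plumbing is the point.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(128, 10).to(f"cuda:{local_rank}"),
            device_ids=[local_rank])
opt = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(100):                       # stand-in training loop
    x = torch.randn(32, 128, device=f"cuda:{local_rank}")
    loss = model(x).square().mean()           # dummy objective
    opt.zero_grad()
    loss.backward()                           # DDP all-reduces gradients here
    opt.step()

dist.destroy_process_group()
```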
Aug, 7
Deep Learning Approaches to Source Code Analysis for Optimization of Heterogeneous Systems: Recent Results, Challenges and Opportunities
To cope with the increasing complexity of digital systems programming, deep learning techniques have recently been proposed to enhance software deployment by analysing source code for different purposes, ranging from performance and energy improvement to debugging and security assessment. As embedded platforms for cyber-physical systems are characterised by increasing heterogeneity and parallelism, one of the […]
Aug, 7
Real-Time High-Performance Computing for Embedded Control Systems
Critical real-time systems include a wide spectrum of computer systems whose correct behavior is dictated not only by correct functionality but also by their timely execution with respect to predefined deadlines. The increasing demand for higher performance in these systems has led the industry to recently include embedded Graphics Processing Units (GPUs), mainly for machine […]
Aug, 7
COX: Exposing CUDA Warp-Level Functions to CPUs
As CUDA becomes the de facto programming language for data-parallel applications such as high-performance computing and machine learning, running CUDA on other platforms becomes a compelling option. Although several efforts have attempted to support CUDA on devices other than NVIDIA GPUs, due to extra steps in the translation, the support is always a […]
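Conceptually, supporting warp-level functions on a CPU means making a warp's 32 implicit lanes explicit. The Python sketch below illustrates the semantics of __shfl_down_sync in a warp-wide sum reduction; it shows the behaviour being translated, not COX's actual translation scheme:

```python
# The 32 implicit lanes of a warp become an explicit list, and
# __shfl_down_sync becomes an index shift over that list.
WARP_SIZE = 32

def shfl_down(lanes, delta):
    # Lane i reads the value held by lane i + delta; out-of-range lanes
    # keep their own value, matching __shfl_down_sync's behaviour.
    return [lanes[i + delta] if i + delta < WARP_SIZE else lanes[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(lanes):
    # Classic warp tree reduction: after log2(32) = 5 shuffle steps,
    # lane 0 holds the sum of all 32 lanes.
    delta = WARP_SIZE // 2
    while delta > 0:
        lanes = [a + b for a, b in zip(lanes, shfl_down(lanes, delta))]
        delta //= 2
    return lanes[0]

print(warp_reduce_sum(list(range(WARP_SIZE))))  # 496 == sum(range(32))
```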
Aug, 7
Design and Implementation of ShenWei Universal C/C++
The ShenWei many-core series processors powering multiple cutting-edge supercomputers are equipped with a unique on-chip heterogeneous architecture. They have long required programmers to write separate code for the control part on the Management Processing Element (MPE) and the accelerated part on the Compute Processing Element (CPE), which is similar to open standards like OpenCL. Such a programming model […]
Jul, 24
Demystifying Dependency Bugs in Deep Learning Stack
Recent breakthroughs in deep learning (DL) techniques have stimulated significant growth in developing DL-enabled applications. These DL applications, built upon a heterogeneous and complex DL stack (e.g., Nvidia GPU, Linux, CUDA driver, Python runtime, and TensorFlow), are subject to software and hardware dependencies across the DL stack. A persistent challenge in dependency management across the […]
Jul, 24
CPU-GPU Layer-Switched Low Latency CNN Inference
Convolutional Neural Network (CNN) inference on Heterogeneous Multi-Processor System-on-Chips (HMPSoCs) in edge devices represents cutting-edge embedded machine learning. The embedded CPU and GPU within an HMPSoC can both perform inference using CNNs. However, common practice is to run a given CNN on whichever HMPSoC component (CPU or GPU) provides the best performance (lowest latency) for that CNN. […]
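Mechanically, switching layers between processors amounts to splitting the network at some layer and moving the activation across devices once. A minimal PyTorch sketch, with an arbitrary split point standing in for whatever a latency-aware scheduler would choose:

```python
import torch
import torch.nn as nn

# Toy CNN split into two stages; the switch point here is arbitrary, purely
# for illustration. A real system would pick it from per-layer latency data.
stage1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
stage2 = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 10))

gpu = "cuda" if torch.cuda.is_available() else "cpu"
stage1.to(gpu)      # early, convolution-heavy layers on the GPU
stage2.to("cpu")    # remaining layers on the CPU

x = torch.randn(1, 3, 224, 224, device=gpu)
with torch.no_grad():
    y = stage2(stage1(x).cpu())   # one activation transfer at the switch point
print(y.shape)                    # torch.Size([1, 10])
```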
Jul, 24
FPGA Accelerators on Heterogeneous Systems: An Approach Using High Level Synthesis
FPGAs are emerging in the High-Performance Computing domain thanks to their promise of better energy efficiency and lower control latency compared with other devices such as CPUs or GPUs. Despite these benefits, their full inclusion into HPC systems still faces several challenges. First, FPGA complexity makes them more difficult to program compared to […]
Jul, 24
Theseus: A Library for Differentiable Nonlinear Optimization
We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several […]
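The essence of DNLS is that the solver itself is differentiable, so gradients of an outer, learned objective can flow through the inner optimization. A rough sketch in plain PyTorch of unrolled Gauss-Newton on an invented exponential model (a conceptual illustration only, not Theseus's API):

```python
import torch

# Gauss-Newton steps are unrolled with create_graph=True, so an outer loss
# could backpropagate through the inner solver. The model, data, and initial
# guess are invented for illustration.
def residual(theta, x, y):
    return theta[0] * torch.exp(theta[1] * x) - y   # model: y ≈ a * exp(b * x)

def gauss_newton(x, y, theta0, steps=10):
    theta = theta0
    for _ in range(steps):
        J = torch.autograd.functional.jacobian(
            lambda t: residual(t, x, y), theta, create_graph=True)
        r = residual(theta, x, y)
        theta = theta + torch.linalg.solve(J.T @ J, -J.T @ r)  # normal equations
    return theta

x = torch.linspace(0, 1, 20)
y = 2.0 * torch.exp(0.5 * x)                         # noiseless "measurements"
print(gauss_newton(x, y, torch.tensor([1.5, 0.3])))  # -> approx. [2.0, 0.5]
```

A production solver adds damping (Levenberg-Marquardt), sparsity, and batching; the point here is only that each step is an ordinary differentiable tensor operation.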
Jul, 24
On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
Powered by advances in deep learning (DL) techniques, machine learning and artificial intelligence have achieved astonishing successes. However, the rapidly growing demand for DL has also led to communication- and resource-intensive distributed training jobs for large-scale models, which are typically deployed over GPU clusters. To sustain the ever-increasing demand for DL training, the so-called "ring-all-reduce" […]
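For context, the ring-all-reduce pattern lets N workers compute an element-wise sum of their gradient vectors in 2*(N-1) steps, with each ring link carrying only one chunk per step. A toy single-process simulation (indices follow the textbook formulation; nothing here is specific to the paper's scheduling problem):

```python
import numpy as np

# N workers, each holding a vector split into N chunks, complete an
# all-reduce in 2*(N-1) steps.
N = 4
data = [np.random.rand(N, 8) for _ in range(N)]  # data[w][c] = chunk c on worker w
expected = np.sum(data, axis=0)                  # element-wise sum over all workers

# Reduce-scatter: at step s, worker w sends chunk (w - s) mod N to worker
# (w + 1) mod N, which accumulates it. Afterwards worker w owns the fully
# reduced chunk (w + 1) mod N.
for s in range(N - 1):
    for w in range(N):
        c = (w - s) % N
        data[(w + 1) % N][c] += data[w][c]

# All-gather: circulate each reduced chunk once around the ring, overwriting
# the stale partial copies.
for s in range(N - 1):
    for w in range(N):
        c = (w + 1 - s) % N
        data[(w + 1) % N][c] = data[w][c]

assert all(np.allclose(d, expected) for d in data)
print("all workers hold the full sum")
```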
Jul, 17
Reducing Synchronous GPU Memory Transfers: Design and implementation of a Futhark compiler optimisation
We present a series of dataflow-dependent program transformations that reduce memory transfers between a GPU and its host, and show how the problem of minimising memory transfers to the host amounts to finding minimum vertex cuts in a series of data dependency graphs. We provide a specialised algorithm to solve these minimisation problems, based […]
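As a toy illustration of that reduction, a minimum vertex cut can be computed with the classic node-splitting trick and off-the-shelf max-flow. The dependency graph, costs, and names below are invented for illustration, not taken from the paper:

```python
import networkx as nx

# Split each vertex v into v_in -> v_out with capacity = the cost of cutting
# v; dependency edges get infinite capacity, so only vertices can be cut and
# an ordinary s-t min cut yields a minimum vertex cut.
dep = nx.DiGraph([("src", "a"), ("src", "b"), ("a", "c"), ("b", "c"), ("c", "sink")])
cost = {"a": 3, "b": 2, "c": 4}   # e.g. host-transfer cost of each intermediate

flow = nx.DiGraph()
for v in dep.nodes:
    if v in cost:
        flow.add_edge((v, "in"), (v, "out"), capacity=cost[v])
    else:
        flow.add_edge((v, "in"), (v, "out"))   # no capacity attr = infinite
for u, v in dep.edges:
    flow.add_edge((u, "out"), (v, "in"))       # dependencies are uncuttable

cut_value, (S, T) = nx.minimum_cut(flow, ("src", "in"), ("sink", "out"))
print(cut_value, [v for v in cost if (v, "in") in S and (v, "out") in T])
# -> 4 ['c']: cutting vertex "c" alone (cost 4) beats cutting {a, b} (cost 5)
```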