high performance computing on graphics processing units: hgpu.org

Posts

Aug, 22

perf4sight: A toolflow to model CNN training performance on Edge GPUs

The increased memory and processing capabilities of today’s edge devices create opportunities for greater edge intelligence. In the domain of vision, the ability to adapt a Convolutional Neural Network’s (CNN) structure and parameters to the input data distribution leads to systems with lower memory footprint, latency and power consumption. However, due to the limited compute […]

CUDA

Aug, 22

EXA2PRO: A Framework for High Development Productivity on Heterogeneous Computing Systems

Programming upcoming exascale computing systems is expected to be a major challenge. New programming models are required to improve programmability, by hiding the complexity of these systems from application developers. The EXA2PRO programming framework aims at improving developer productivity for applications that target heterogeneous computing systems. It is based on advanced programming models and abstractions […]

CUDA, OpenCL

Aug, 22

Parallel time integration using Batched BLAS (Basic Linear Algebra Subprograms) routines

We present an approach for integrating the time evolution of quantum systems. We leverage the computation power of graphics processing units (GPUs) to perform the integration of all time steps in parallel. The performance boost is especially prominent for small to medium-sized quantum systems. The devised algorithm can largely be implemented using the recently-specified batched […]

CUDA

Aug, 22

Performance comparison of CFD-DEM solver MFiX-Exa, on GPUs and CPUs

We present computational performance comparisons of gas-solid simulations performed on current CPU and GPU architectures using MFiX-Exa, a CFD-DEM solver that leverages hybrid CPU+GPU parallelism. A representative fluidized bed simulation with varying particle numbers from 2 to 67 million is used to compare serial and parallel performance. A single GPU was observed to be […]

CUDA

Aug, 22

Better GPU Hash Tables

We revisit the problem of building static hash tables on the GPU and design and build three bucketed hash tables that use different probing schemes. Our implementations are lock-free and offer efficient memory access patterns; thus, only the probing scheme is the factor affecting the performance of the hash table’s different operations. Our results show […]

CUDA

Aug, 8

ndzip-gpu: Efficient Lossless Compression of Scientific Floating-Point Data on GPUs

Lossless data compression is a promising software approach for reducing the bandwidth requirements of scientific applications on accelerator clusters without introducing approximation errors. Suitable compressors must be able to effectively compact floating-point data while saturating the system interconnect to avoid introducing unnecessary latencies. We present ndzip-gpu, a novel, highly-efficient GPU parallelization scheme for the block […]

CUDA

Aug, 8

Performance assessment of CUDA and OpenACC in large scale combustion simulations

GPUs have climbed to the top of supercomputer systems, making life harder for many legacy scientific codes. Nowadays, many recipes are used to port such codes, with no clarity about which is the best option. We present a comparative analysis of the two most common approaches, CUDA and OpenACC, into the multi-physics CFD […]

CUDA

Aug, 8

On Efficient GPGPU Computing for Integrated Heterogeneous CPU-GPU Microprocessors

Heterogeneous microprocessors which integrate a CPU and GPU on a single chip provide low-overhead CPU-GPU communication and permit sharing of on-chip resources that a traditional discrete GPU would not have direct access to. These features allow for the optimization of codes that heretofore would be suitable only for multi-core CPUs or discrete GPUs to be […]

OpenCL

Aug, 8

ScaleHLS: Scalable High-Level Synthesis through MLIR

High-level Synthesis (HLS) has been widely adopted as it significantly improves the hardware design productivity and enables efficient design space exploration (DSE). HLS tools can be used to deliver solutions for many different kinds of design problems, which are often better solved with different levels of abstraction. While existing HLS tools are built using compiler […]

Aug, 8

PoCL-R: A Scalable Low Latency Distributed OpenCL Runtime

Offloading the most demanding parts of applications to an edge GPU server cluster to save power or improve the result quality is a solution that becomes increasingly realistic with new networking technologies. In order to make such a computing scheme feasible, an application programming layer that can provide both low latency and scalable utilization of […]

OpenCL

Jul, 25

Effective GPU Sharing Under Compiler Guidance

Modern computing platforms tend to deploy multiple GPUs (2, 4, or more) on a single node to boost system performance, with each GPU having a large capacity of global memory and streaming multiprocessors (SMs). GPUs are an expensive resource, and boosting utilization of GPUs without causing performance degradation of individual workloads is an important and […]

CUDA

Jul, 25

Face.evoLVe: A High-Performance Face Recognition Library

In this paper, we develop face.evoLVe – a comprehensive library that collects and implements a wide range of popular deep learning-based methods for face recognition. First of all, face.evoLVe is composed of key components that cover the full process of face analytics, including face alignment, data processing, various backbones, losses, and alternatives with bags of […]

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?
