
Posts

Oct, 29

Performance Tuning for GPU-Embedded Systems: Machine-Learning-based and Analytical Model-driven Tuning Methodologies

GPU-embedded systems have gained popularity across various domains due to their efficient power consumption. However, to meet the demands of real-time or time-consuming applications running on these systems, it is crucial that they be tuned for high performance. This paper addresses the issue by developing and comparing two tuning methodologies on […]
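A minimal sketch of the general idea behind ML-based tuning, not the paper's specific methodology: benchmark a small sample of launch configurations, fit a surrogate model on the measured runtimes, and use it to pick a promising configuration without exhaustively benchmarking everything. The `measure_runtime` function, the search-space parameters, and scikit-learn as the model library are all illustrative assumptions.

```python
# Sketch of ML-based autotuning (illustrative only, not the paper's method).
import itertools
import random

from sklearn.ensemble import RandomForestRegressor

def measure_runtime(block_x, block_y, unroll):
    # Hypothetical placeholder: in practice this would launch the kernel on
    # the GPU-embedded board and return the measured execution time.
    return 1.0 / (block_x * block_y) + 0.01 * unroll

search_space = list(itertools.product([8, 16, 32, 64],   # block_x
                                      [1, 2, 4, 8],      # block_y
                                      [1, 2, 4]))        # unroll factor

# Benchmark only a small random subset of the search space.
sampled = random.sample(search_space, 12)
times = [measure_runtime(*cfg) for cfg in sampled]

# Train a surrogate model and rank all configurations by predicted runtime.
model = RandomForestRegressor(n_estimators=100).fit(sampled, times)
best = min(search_space, key=lambda cfg: model.predict([cfg])[0])
print("predicted-best configuration:", best)
```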
Oct, 29

Performance portability evaluation of blocked stencil computations on GPUs

In this new era where multiple GPU vendors are leading the supercomputing landscape, and multiple programming models are available to users, the drive to achieve performance portability across platforms faces new challenges. Consider stencil algorithms, where architecture-specific solutions are required to optimize for the parallelism hierarchy and memory hierarchy of emerging systems. In this work, […]
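As a rough illustration of the blocking idea the title refers to, here is a tiled 5-point Jacobi stencil in NumPy; a real GPU implementation would map each tile to a thread block or work-group and stage it in shared/local memory. The tile size and the Jacobi update are illustrative choices, not taken from the paper.

```python
# Blocked (tiled) 5-point Jacobi stencil; only illustrates the traversal order.
import numpy as np

def jacobi_blocked(u, tile=64):
    n, m = u.shape
    out = u.copy()
    for bi in range(1, n - 1, tile):        # tiles along one axis
        for bj in range(1, m - 1, tile):    # tiles along the other axis
            i1 = min(bi + tile, n - 1)
            j1 = min(bj + tile, m - 1)
            out[bi:i1, bj:j1] = 0.25 * (u[bi-1:i1-1, bj:j1] +
                                        u[bi+1:i1+1, bj:j1] +
                                        u[bi:i1, bj-1:j1-1] +
                                        u[bi:i1, bj+1:j1+1])
    return out

u = np.random.rand(512, 512)
u_next = jacobi_blocked(u)
```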
Oct, 29

Dynamic autotuning of SpMV kernel in CUSP library

Sparse matrix-vector product (SpMV) is a central operation in many iterative methods for solving linear systems and as such is an attractive candidate for acceleration on the GPU. However, the performance of the SpMV kernel can vary depending both on the target architecture and on the sparsity pattern of the matrix. Thus, to […]
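For reference, this is the CSR-format SpMV computation that such kernels implement; the irregular row lengths encoded in `row_ptr` are why performance depends on the sparsity pattern. This sequential NumPy version is only a sketch of the arithmetic, not the CUSP kernel.

```python
# CSR sparse matrix-vector product: y = A @ x.
import numpy as np

def spmv_csr(row_ptr, col_idx, values, x):
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):                      # one row per "thread"
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

# 3x3 example matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = np.array([0, 2, 3, 5])
col_idx = np.array([0, 2, 1, 0, 2])
values  = np.array([4.0, 1.0, 2.0, 3.0, 5.0])
print(spmv_csr(row_ptr, col_idx, values, np.ones(3)))   # [5. 2. 8.]
```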
Oct, 29

GEVO-ML: Optimizing Machine Learning Code with Evolutionary Computation

Parallel accelerators, such as GPUs, are key enablers for large-scale Machine Learning (ML) applications. However, ML model developers often lack detailed knowledge of the underlying system architectures, while system programmers usually do not have a high-level understanding of the ML model that runs on the specific system. To mitigate this gap between two relevant aspects […]
Oct, 29

A Performance-Portable SYCL Implementation of CRK-HACC for Exascale

The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages […]
Oct, 22

Performance/power assessment of CNN packages on embedded automotive platforms

The rise of power-efficient embedded computers based on highly parallel accelerators opens a number of opportunities and challenges for researchers and engineers, and has paved the way to the era of edge computing. At the same time, advances in embedded AI for object detection and categorization, such as YOLO, GoogLeNet, and AlexNet, have reached an unprecedented level of […]
Oct, 22

Performance portability analysis of SYCL with a classical CG on CPU, GPU, and FPGA

In this work, the capability of SYCL™ to execute code on different hardware devices is investigated, which motivates a performance portability analysis. The architectures investigated are the CPU, GPU, and FPGA. As the benchmark algorithm, the conjugate gradient (CG) method is used, as it is widely applicable to many fields and is more complex than simple […]
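As a point of reference, a minimal NumPy version of the CG iteration used as the benchmark; a SYCL implementation offloads the matrix-vector product, dot products, and vector updates to the device. The small test system here is an illustrative placeholder.

```python
# Conjugate gradient for a symmetric positive definite system Ax = b.
import numpy as np

def cg(A, b, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(cg(A, b))   # ~[0.0909, 0.6364]
```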
Oct, 22

Predicting the Execution Time of a kernel on a specific GPU using PTX code

During the last couple of decades, there has been an exponential growth in the amount of time and energy required to run workloads on high-performance computing systems, which nowadays rely heavily upon GPUs. In order to reduce the resources required by these systems, one clear approach is to avoid inefficient applications by using prediction models […]
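A hedged sketch of the general approach, not the paper's model: extract instruction-mix features from a kernel's PTX and feed them to a regression model trained on measured runtimes. The file name `kernels.ptx`, the chosen mnemonics, and the training data are hypothetical placeholders.

```python
# Predict kernel runtime from PTX instruction-mix features (illustrative only).
import re
from collections import Counter

from sklearn.linear_model import LinearRegression

MNEMONICS = ("ld.global", "st.global", "fma", "add", "mul", "bar.sync")

def ptx_features(ptx_text):
    counts = Counter()
    for line in ptx_text.splitlines():
        line = line.strip()
        for m in MNEMONICS:
            if line.startswith(m):
                counts[m] += 1
    return [counts[m] for m in MNEMONICS]

# Hypothetical training set: feature vectors and measured runtimes (ms).
X_train = [[120, 30, 400, 250, 200, 4],
           [40, 10, 100, 90, 60, 1],
           [300, 80, 900, 600, 500, 8]]
y_train = [2.1, 0.6, 5.4]

model = LinearRegression().fit(X_train, y_train)

with open("kernels.ptx") as f:          # hypothetical PTX dump of the kernel
    feats = ptx_features(f.read())
print("predicted runtime (ms):", model.predict([feats])[0])
```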
Oct, 22

SYCL in the Edge: Performance Evaluation for Heterogeneous Acceleration

Edge computing is essential to handle increasing data volumes and processing demands. It provides real-time, secure data processing near data sources, such as smart devices, reducing cloud computing energy use and saving network bandwidth. Specialized accelerators, like GPUs and FPGAs, are vital for low-latency edge computing, but the need to customize code for different hardware and […]
Oct, 22

LeXInt: GPU-accelerated Exponential Integrators package

We present an open-source CUDA-based package that comprises a collection of exponential integrators in which the action of the matrix exponential or of the φ_l functions on a vector is approximated using polynomial interpolation at Leja points. Using a couple of test examples on an NVIDIA A100 GPU, we show that one can […]
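A minimal NumPy sketch of the underlying idea, approximating exp(dt·A)·v with a Newton-form polynomial interpolated at Leja points; the interpolation interval, polynomial degree, and test matrix are illustrative choices, and the package's GPU kernels are of course far more elaborate (adaptivity, φ_l functions, tuned CUDA).

```python
# Action of the matrix exponential via Newton interpolation at Leja points.
import numpy as np

def leja_points(n, a=-2.0, b=2.0, candidates=2000):
    xs = np.linspace(a, b, candidates)
    pts = [xs[np.argmax(np.abs(xs))]]
    for _ in range(n - 1):
        # Greedily pick the candidate maximizing the product of distances
        # to the points chosen so far (the Leja ordering).
        prod = np.ones_like(xs)
        for p in pts:
            prod *= np.abs(xs - p)
        pts.append(xs[np.argmax(prod)])
    return np.array(pts)

def divided_differences(f, x):
    d = f(x).astype(float)
    for j in range(1, len(x)):
        d[j:] = (d[j:] - d[j - 1:-1]) / (x[j:] - x[:-j])
    return d

def expm_action_leja(A, v, dt=1.0, degree=30):
    x = leja_points(degree + 1)
    d = divided_differences(lambda z: np.exp(dt * z), x)
    y = d[0] * v
    w = v.copy()
    for i in range(1, len(x)):          # accumulate the Newton-form terms
        w = A @ w - x[i - 1] * w
        y = y + d[i] * w
    return y

A = np.diag([-1.0, -0.5, -0.1])         # toy matrix, spectrum inside [-2, 2]
v = np.ones(3)
print(expm_action_leja(A, v), np.exp(np.diag(A)))   # should agree closely
```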
Oct, 15

OpenMM 8: Molecular Dynamics Simulation with Machine Learning Potentials

Machine learning plays an important and growing role in molecular simulation. The newest version of the OpenMM molecular dynamics toolkit introduces new features to support the use of machine learning potentials. Arbitrary PyTorch models can be added to a simulation and used to compute forces and energy. A higher-level interface allows users to easily model […]
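A minimal sketch of that workflow as exposed by the openmm-torch plugin: a TorchScript model maps particle positions to a potential energy and is attached to a System via TorchForce, with forces obtained by autograd. The toy potential below is illustrative, and OpenMM 8's higher-level interface mentioned in the abstract may wrap these steps differently.

```python
# Attach a PyTorch potential to an OpenMM System (sketch via openmm-torch).
import torch
import openmm
from openmmtorch import TorchForce

class HarmonicTrap(torch.nn.Module):
    """Toy ML-style potential: 0.5 * k * sum(|r|^2); energy in kJ/mol."""
    def forward(self, positions):
        return 0.5 * 100.0 * torch.sum(positions ** 2)

# Serialize the model so the plugin can load it inside the simulation.
torch.jit.script(HarmonicTrap()).save("model.pt")

system = openmm.System()
system.addParticle(39.9)                 # one argon-like particle, mass in amu
system.addForce(TorchForce("model.pt"))  # forces come from autograd on the model
```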
Oct, 15

Reverse-Mode AD of Reduce-by-Index and Scan in Futhark

We present and evaluate the Futhark implementation of reverse-mode automatic differentiation (AD) for the basic blocks of parallel programming: reduce, prefix sum (scan), and reduce by index. We first present derivations of general-case algorithms and then discuss several specializations that result in efficient differentiation of most cases of practical interest. We report an experiment that […]
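To make the adjoint rules concrete, here is a small NumPy illustration for the additive case: the adjoint of reduce(+) broadcasts the output adjoint to every input, and the adjoint of scan(+) (prefix sum) is a reversed prefix sum of the output adjoints. Futhark derives analogous rules for general operators and for reduce-by-index; this is only the simplest instance, written in Python rather than Futhark.

```python
# Hand-written reverse-mode rules for reduce(+) and scan(+).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_bar = np.array([0.1, 0.2, 0.3, 0.4])   # adjoint dL/dy flowing in from downstream

# y = scan(+)(x) = cumulative sum; y_j depends on x_i for all j >= i, so
# dL/dx_i = sum_{j >= i} dL/dy_j, i.e. a reversed cumulative sum of y_bar.
x_bar_scan = np.cumsum(y_bar[::-1])[::-1]
print(x_bar_scan)                         # [1.0, 0.9, 0.7, 0.4]

# y = reduce(+)(x): dL/dx_i = dL/dy for every i.
y_bar_scalar = 2.0
x_bar_reduce = np.full_like(x, y_bar_scalar)
print(x_bar_reduce)                       # [2.0, 2.0, 2.0, 2.0]
```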
