high performance computing on graphics processing units: hgpu.org

Posts

May, 21

Optimization and Portability of a Fusion OpenACC-based FORTRAN HPC Code from NVIDIA to AMD GPUs

NVIDIA has been the main provider of GPU hardware in HPC systems for over a decade. Most applications that benefit from GPUs have thus been developed and optimized for the NVIDIA software stack. Recent exascale HPC systems are, however, introducing GPUs from other vendors, e.g. with the AMD GPU-based OLCF Frontier system just becoming available. […]

May, 21

Experiences in Building a Composable and Functional API for Runtime SPIR-V Code Generation

This paper presents the Beehive SPIR-V Toolkit, a framework that can automatically generate a composable and functional Java library for dynamically building SPIR-V binary modules. The Beehive SPIR-V Toolkit can be used by optimizing compilers and runtime systems to generate and validate SPIR-V binary modules from managed runtime systems, such as the Java Virtual Machine […]

OpenCL

May, 14

Towards Alignment of Parallelism in SYCL and ISO C++

SYCL began as a C++ abstraction for OpenCL concepts, whereas parallelism in ISO C++ evolved from the algorithms in the standard library. This history has resulted in the two specifications using different terminology to describe parallelism, which is confusing to developers and hinders the SYCL community’s efforts to influence the direction of C++ through experiments […]

OpenCL

May, 14

TorchBench: Benchmarking PyTorch with High API Surface Coverage

Deep learning (DL) has been a revolutionary technique in various domains. To facilitate model development and deployment, many deep learning frameworks have been proposed, among which PyTorch is one of the most popular. The performance of the ecosystem around PyTorch is critically important, because it lowers the cost of training models and reduces the response time […]

May, 14

Performance Optimization using Multimodal Modeling and Heterogeneous GNN

Growing heterogeneity and configurability in HPC architectures have made auto-tuning applications and runtime parameters on these systems very complex. Users are presented with a multitude of options to configure parameters. In addition to application-specific solutions, a common approach is to use general-purpose search strategies, which often might not identify the best configurations or […]

OpenCL

May, 14

Descend: A Safe GPU Systems Programming Language

Graphics Processing Units (GPUs) offer tremendous computational power by following a throughput-oriented computing paradigm in which many thousands of computational units operate in parallel. Programming this massively parallel hardware is challenging. Programmers must correctly and efficiently coordinate thousands of threads and their accesses to various shared memory spaces. Existing mainstream GPU programming languages, such as CUDA […]

CUDA • OpenCL

May, 14

Prediction of Performance and Power Consumption of GPGPU Applications

Graphics Processing Units (GPUs) have become an integral part of High-Performance Computing for achieving exascale performance. The main goal of GPU application developers is to tune their code extensively to obtain optimal performance, making efficient use of the different resources available. While extracting optimal performance of applications on an HPC infrastructure, developers should also […]

CUDA

May, 7

Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs

In this thesis, we present KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a freely available tool to dynamically determine the values of kernel launch parameters of a CUDA kernel. We describe a technique for building a helper program, at the compile-time of a CUDA program, that is used at run-time to determine near-optimal kernel launch […]

CUDA

May, 7

Redwood: Flexible and Portable Heterogeneous Tree Traversal Workloads

Shared memory heterogeneous systems are now mainstream, with nearly every mobile phone and tablet containing integrated processing units. However, developing applications for such devices is difficult as workloads must be decomposed across different processing units, and the decomposition must be flexible to account for the growing diversity of devices, each with different relative processing unit […]

CUDA

May, 7

Optimizing Deep Learning Models For Raspberry Pi

Deep learning models have become increasingly popular for a wide range of applications, including computer vision, natural language processing, and speech recognition. However, these models typically require large amounts of computational resources, making them challenging to run on low-power devices such as the Raspberry Pi. One approach to addressing this challenge is to use pruning […]

May, 7

Anatomy of High-Performance GEMM with Online Fault Tolerance on GPUs

General Matrix Multiplication (GEMM) is a crucial algorithm for various applications such as machine learning and scientific computing, and an efficient GEMM implementation is essential for the performance of these systems. While researchers often strive for faster performance by using large compute platforms, the increased scale of these systems can raise concerns about hardware and […]

CUDA

May, 7

FZ-GPU: A Fast and High-Ratio Lossy Compressor for Scientific Computing Applications on GPUs

Today’s large-scale scientific applications running on high-performance computing (HPC) systems generate vast data volumes. Thus, data compression is becoming a critical technique to mitigate the storage burden and data-movement cost. However, existing lossy compressors for scientific data cannot achieve a high compression ratio and throughput simultaneously, hindering their adoption in many applications requiring fast compression, […]

CUDA

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)