Posts
Oct 13
Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation
In recent years, the need for high-performance computing solutions has increased due to the growing complexity of computational tasks. The use of parallel processing techniques has become essential to address this demand. In this study, an Open Computing Language (OpenCL)-based parallelization algorithm is implemented for the Constant Neighbors (CNe) and CNe with Predictor–Corrector (CpC) numerical […]
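The CNe and CpC schemes are the paper's own constructions, but the underlying idea of an explicit time-stepping method for the heat equation can be sketched with the standard textbook FTCS (forward-time, centered-space) update; this is a generic illustration, not the paper's algorithm, and `heat_step` is a hypothetical helper name:

```python
# Generic explicit (FTCS) update for the 1D heat equation u_t = alpha * u_xx.
# Not the paper's CNe/CpC method -- just the classic explicit scheme that
# such methods build on. Each interior point is updated from its neighbors.

def heat_step(u, r):
    """One explicit time step on a rod with fixed (Dirichlet) boundaries.

    r = alpha * dt / dx**2 is the mesh ratio; the classic stability
    condition for this scheme is r <= 0.5.
    """
    return [u[0]] + [
        u[i] + r * (u[i - 1] - 2 * u[i] + u[i + 1])
        for i in range(1, len(u) - 1)
    ] + [u[-1]]

# Example: a single hot spot diffusing outward over ten steps.
u = [0.0, 0.0, 100.0, 0.0, 0.0]
for _ in range(10):
    u = heat_step(u, 0.25)
```

Because every point's update reads only a fixed stencil of neighbors, each time step is embarrassingly parallel, which is what makes such schemes natural targets for OpenCL parallelization.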
Oct 6
Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric
Modern GPU systems are constantly evolving to meet the needs of compute-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a […]
Oct 6
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision […]
Oct 6
Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems
The throughput-centric design of GPUs poses challenges when integrating them into time-sensitive applications. Nevertheless, modern GPU architectures and software have recently evolved, making it possible to minimize overheads and interference along the critical path through advanced mechanisms, such as GPU graphs, while sustaining high throughput. However, GPU vendors provide programming ecosystems specific to their products, […]
Oct 6
Benchmarking Thread Block Cluster
Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to their CUDA programming model: […]
Oct 6
Intel(R) SHMEM: GPU-initiated OpenSHMEM using SYCL
Modern high-end systems are increasingly becoming heterogeneous, providing users options to use general-purpose Graphics Processing Units (GPUs) and other accelerators for additional performance. High Performance Computing (HPC) and Artificial Intelligence (AI) applications are often carefully arranged to overlap communication and computation for increased efficiency on such platforms. This has led to efforts to extend […]
Sep 29
Automatic Generation of OpenCL Code through Polyhedral Compilation with LLM
In recent years, a multitude of AI solutions based on large language models (LLMs) has emerged to facilitate code generation. These tools empower programmers to automate their work. Automatic programming also falls within the domain of optimizing compilers, primarily based on the polyhedral model, which targets the loop nests that concentrate most of a program's computation. This article focuses […]
Sep 29
HPC acceleration of large (min, +) matrix products to compute domination-type parameters in graphs
Computing domination-type parameters in Cartesian product graphs is a challenging problem. We present an algorithmic method to compute the 2-domination number of the Cartesian product of a path of small order and any cycle, involving the (min,+) matrix product. We establish some theoretical results that provide the algorithms necessary to compute that […]
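The (min,+) product mentioned above is the standard tropical-algebra operation: entry (i, j) of the result is the minimum over k of A[i][k] + B[k][j]. A minimal sketch of the definition (the `minplus` helper is illustrative, not the paper's HPC kernel):

```python
import math

# (min,+) matrix product: C[i][j] = min over k of A[i][k] + B[k][j].
# math.inf plays the role of "no entry" / "no path".

def minplus(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [
        [min(A[i][k] + B[k][j] for k in range(m)) for j in range(p)]
        for i in range(n)
    ]

# With A a weighted adjacency matrix, repeated (min,+) multiplication
# accumulates shortest walk costs, which is why these products appear in
# dynamic-programming computations over product graphs.
A = [[0,        1,        math.inf],
     [math.inf, 0,        2       ],
     [math.inf, math.inf, 0       ]]
D = minplus(A, A)  # two-hop costs: D[0][2] combines the edges 0->1 and 1->2
```

The triple loop has the same O(n^3) shape as an ordinary matrix product, with (min, +) replacing (+, x), so the same GPU blocking and tiling strategies apply.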
Sep 29
miniLB: A Performance Portability Study of Lattice-Boltzmann Simulations
The Lattice Boltzmann Method (LBM) is a Computational Fluid Dynamics (CFD) technique that has gained popularity due to its high parallelism and ability to handle complex geometries with minimal effort. Although LBM frameworks are increasingly important in various industries and research fields, their complexity makes them difficult to modify and can lead to […]
Sep 29
Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL
Field-programmable gate array (FPGA) vendors provide high-level synthesis (HLS) compilers with accompanying OpenCL runtimes to enable easier use of their devices by non-hardware experts. However, the current runtimes provided by the vendors are not OpenCL-compliant, limiting the application portability and making it difficult to integrate FPGA devices in heterogeneous computing platforms. We propose an automated […]
Sep 29
OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs
GPUs are the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses ‘gang vector’ and ‘collapse’. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays […]
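The `collapse` clause fuses a nest of loops into one flat iteration space so the GPU can distribute all iterations across gangs and vector lanes, rather than only the outer loop's. A language-agnostic sketch of that index mapping (in Python for illustration rather than the OpenACC C/Fortran it applies to; this is not the MFC solver's code):

```python
# What an OpenACC 'collapse(2)' does conceptually: turn a 2-deep loop nest
# over (i, j) into a single loop over NI*NJ flat indices, exposing all
# NI*NJ iterations for parallel distribution instead of only the outer NI.

def collapsed_indices(ni, nj):
    for flat in range(ni * nj):
        i, j = divmod(flat, nj)   # recover the original (i, j) pair
        yield i, j

# The flattened order visits exactly the same (i, j) pairs as the nest:
NI, NJ = 4, 3
nested = [(i, j) for i in range(NI) for j in range(NJ)]
assert list(collapsed_indices(NI, NJ)) == nested
```

On small outer trip counts (common in structured multiphase solvers), this flattening is what allows the iteration space to be large enough to saturate a GPU.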
Sep 22
The Landscape of GPU-Centric Communication
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their massive parallelism and high memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but advancements in GPU-centric communication now […]