high performance computing on graphics processing units: hgpu.org

Posts

Oct, 13

A domain-specific language for geospatial computations on the GPU

This thesis explores how a domain-specific language (DSL) for simple geospatial operators on the GPU can be developed, and evaluates the level of functionality and performance of such a DSL. The purpose of such a DSL is to simplify implementation of geospatial operators on the GPU, in order to increase productivity and performance. An embedded […]

CUDA

Oct, 13

Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation

In recent years, the need for high-performance computing solutions has increased due to the growing complexity of computational tasks. The use of parallel processing techniques has become essential to address this demand. In this study, an Open Computing Language (OpenCL)-based parallelization algorithm is implemented for the Constant Neighbors (CNe) and CNe with Predictor–Corrector (CpC) numerical […]

OpenCL

Oct, 6

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a […]

Oct, 6

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision […]

CUDA

Oct, 6

Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems

The throughput-centric design of GPUs poses challenges when integrating them into time-sensitive applications. Nevertheless, modern GPU architectures and software have recently evolved, making it possible to minimize overheads and interference along the critical path through advanced mechanisms, such as GPU graphs, while sustaining high throughput. However, GPU vendors provide programming ecosystems specific to their products, […]

CUDA

Oct, 6

Benchmarking Thread Block Cluster

Graphics processing units (GPUs) have become essential accelerators in the fields of artificial intelligence (AI), high-performance computing (HPC), and data analytics, offering substantial performance improvements over traditional computing resources. In 2022, NVIDIA’s release of the Hopper architecture marked a significant advancement in GPU design by adding a new hierarchical level to their CUDA programming model: […]

CUDA

Oct, 6

Intel(R) SHMEM: GPU-initiated OpenSHMEM using SYCL

Modern high-end systems are increasingly becoming heterogeneous, providing users options to use general purpose Graphics Processing Units (GPU) and other accelerators for additional performance. High Performance Computing (HPC) and Artificial Intelligence (AI) applications are often carefully arranged to overlap communications and computation for increased efficiency on such platforms. This has led to efforts to extend […]

Sep, 29

HPC acceleration of large (min, +) matrix products to compute domination-type parameters in graphs

The computation of the domination-type parameters is a challenging problem in Cartesian product graphs. We present an algorithmic method to compute the 2-domination number of the Cartesian product of a path with small order and any cycle, involving the (min,+) matrix product. We establish some theoretical results that provide the algorithms necessary to compute that […]

CUDA

Sep, 29

miniLB: A Performance Portability Study of Lattice-Boltzmann Simulations

The Lattice Boltzmann Method (LBM) is a computational technique of Computational Fluid Dynamics (CFD) that has gained popularity due to its high parallelism and ability to handle complex geometries with minimal effort. Although LBM frameworks are increasingly important in various industries and research fields, their complexity makes them difficult to modify and can lead to […]

Sep, 29

Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL

Field-programmable gate array (FPGA) vendors provide high-level synthesis (HLS) compilers with accompanying OpenCL runtimes to enable easier use of their devices by non-hardware experts. However, the current runtimes provided by the vendors are not OpenCL-compliant, limiting the application portability and making it difficult to integrate FPGA devices in heterogeneous computing platforms. We propose an automated […]

OpenCL

Sep, 29

Automatic Generation of OpenCL Code through Polyhedral Compilation with LLM

In recent years, a multitude of AI solutions has emerged to facilitate code generation, commonly known as Language Model-based Programming (LLM). These tools empower programmers to automate their work. Automatic programming also falls within the domain of optimizing compilers, primarily based on the polyhedral model, which processes loop nests concentrating most computations. This article focuses […]

CUDA • OpenCL

Sep, 29

OpenACC offloading of the MFC compressible multiphase flow solver on AMD and NVIDIA GPUs

GPUs are the heart of the latest generations of supercomputers. We efficiently accelerate a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses ‘gang vector’ and ‘collapse’. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays […]

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
