high performance computing on graphics processing units: hgpu.org

Posts

Oct, 20

Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs

As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and identify numerical differences. Compiler-induced numerical differences occur when a program is compiled and run on different GPUs, and the numerical outcomes are different for the same input. We present a study of compiler-induced numerical differences between NVIDIA and […]

CUDA

Oct, 20

Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach

Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous high […]

Oct, 20

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach involves […]

CUDA

Oct, 20

Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores

In drug discovery, molecular docking aims at characterizing the binding of a drug-like molecule to a macromolecule. AutoDock-GPU, a state-of-the-art docking software, estimates the geometrical conformation of a docked ligand-protein complex by minimizing a scoring function. Our profiling results indicate that the current reduction operation that is heavily used in the scoring function is sub-optimal. […]

CUDA

Oct, 20

Efficient Configuration of Heterogeneous Resources and Task Scheduling Strategies in Deep Learning Auto-Tuning Systems

Deep Learning Automatic Hyperparameter Tuning plays a crucial role in advancing Artificial Intelligence applications, eliminating the need for complex expertise and costly manual operations. Ray Tune, developed by the University of California, Berkeley, has gained widespread adoption among notable companies like Amazon and Uber. In contrast to large enterprises, the hardware commonly used by the […]

Oct, 13

Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR

In this thesis, we show the benefits of the novel MLIR compiler technology for generating code from a DSL, namely EasyML, which is used in openCARP, a widely used simulator in the cardiac electrophysiology community. Building on existing work, we deeply modified openCARP’s native code generator to enable efficient vectorized CPU and GPU code […]

CUDA • OpenCL

Oct, 13

Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing

This book presents a comprehensive exploration of GPGPU (General Purpose Graphics Processing Unit) and its applications in deep learning and machine learning. It focuses on how parallel computing, particularly through the use of CUDA (Compute Unified Device Architecture), can unlock unprecedented computational power for complex tasks. The book provides detailed discussions on CPU and GPU […]

CUDA • OpenCL • OpenGL

Oct, 13

Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs

GPUs are progressively being integrated into modern society, playing a pivotal role in Artificial Intelligence and High-Performance Computing. Programmers need a deep understanding of the GPU programming model to avoid subtle data-races in their codes. Static verification that is sound and incomplete can guarantee data-race freedom, but the alarms it raises may be spurious and […]

CUDA

Oct, 13

A domain-specific language for geospatial computations on the GPU

This thesis explores how a domain-specific language (DSL) for simple geospatial operators on the GPU can be developed, and evaluates the level of functionality and performance of such a DSL. The purpose of such a DSL is to simplify implementation of geospatial operators on the GPU, in order to increase productivity and performance. An embedded […]

CUDA

Oct, 13

Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation

In recent years, the need for high-performance computing solutions has increased due to the growing complexity of computational tasks. The use of parallel processing techniques has become essential to address this demand. In this study, an Open Computing Language (OpenCL)-based parallelization algorithm is implemented for the Constant Neighbors (CNe) and CNe with Predictor–Corrector (CpC) numerical […]

OpenCL

Oct, 6

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision […]

CUDA

Oct, 6

Event-Based OpenMP Tasks for Time-Sensitive GPU-Accelerated Systems

The throughput-centric design of GPUs poses challenges when integrating them into time-sensitive applications. Nevertheless, modern GPU architectures and software have recently evolved, making it possible to minimize overheads and interference along the critical path through advanced mechanisms, such as GPU graphs, while sustaining high throughput. However, GPU vendors provide programming ecosystems specific to their products, […]

CUDA

* * *

Recent source codes

EnergyUCB-Bandit

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

Faial: finds bugs in CUDA kernels

Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation

Intel® SHMEM: Device initiated shared memory based communication library

miniLB: Lattice Boltzmann miniapp w/SYCL

AFOCL

2domination

MFC: Exascale simulation of multiphase/physics fluid dynamics

UVaFTLE: Lagrangian finite time Lyapunov exponent extraction for fluid dynamic applications

Most viewed papers (last 30 days)