high performance computing on graphics processing units: hgpu.org

Posts

Oct, 20

Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach

Energy consumption has become a critical design metric and a limiting factor in the development of future computing architectures, from small wearable devices to large-scale leadership computing facilities. The predominant methods in energy management optimization are focused on CPUs. However, GPUs are increasingly significant and account for the majority of energy consumption in heterogeneous high […]

Oct, 20

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

The rapid growth in machine learning models, especially in natural language processing and computer vision, has led to challenges when running these models on hardware with limited resources. This paper introduces Superpipeline, a new framework designed to optimize the execution of large AI models on constrained hardware during both training and inference. Our approach involves […]

CUDA

Oct, 20

Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores

In drug discovery, molecular docking aims at characterizing the binding of a drug-like molecule to a macromolecule. AutoDock-GPU, a state-of-the-art docking software, estimates the geometrical conformation of a docked ligand-protein complex by minimizing a scoring function. Our profiling results indicate that the current reduction operation that is heavily used in the scoring function is sub-optimal. […]

CUDA

Oct, 20

Efficient Configuration of Heterogeneous Resources and Task Scheduling Strategies in Deep Learning Auto-Tuning Systems

Deep Learning Automatic Hyperparameter Tuning plays a crucial role in advancing Artificial Intelligence applications, eliminating the need for complex expertise and costly manual operations. Ray Tune, developed by the University of California, Berkeley, has gained widespread adoption among notable companies like Amazon and Uber. In contrast to large enterprises, the hardware commonly used by the […]

Oct, 20

Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs

As scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and identify numerical differences. Compiler-induced numerical differences occur when a program is compiled and run on different GPUs, and the numerical outcomes are different for the same input. We present a study of compiler-induced numerical differences between NVIDIA and […]

CUDA

Oct, 13

Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR

In this thesis we show the benefits of the novel MLIR compiler technology to the generation of code from a DSL, namely EasyML used in openCARP, a widely used simulator in the cardiac electrophysiology community. Building on an existing work we deeply modified openCARP’s native code generator to enable efficient vectorized CPU and GPU code […]

CUDA

•

OpenCL

Oct, 13

Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing

This book presents a comprehensive exploration of GPGPU (General Purpose Graphics Processing Unit) and its applications in deep learning and machine learning. It focuses on how parallel computing, particularly through the use of CUDA (Compute Unified Device Architecture), can unlock unprecedented computational power for complex tasks. The book provides detailed discussions on CPU and GPU […]

CUDA

•

OpenCL

•

OpenGL

Oct, 13

Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs

GPUs are progressively being integrated into modern society, playing a pivotal role in Artificial Intelligence and High-Performance Computing. Programmers need a deep understanding of the GPU programming model to avoid subtle data-races in their codes. Static verification that is sound and incomplete can guarantee data-race freedom, but the alarms it raises may be spurious and […]

CUDA

Oct, 13

A domain-specific language for geospatial computations on the GPU

This thesis explores how a domain-specific language (DSL) for simple geospatial operators on the GPU can be developed, and evaluates the level of functionality and performance of such a DSL. The purpose of such a DSL is to simplify implementation of geospatial operators on the GPU, in order to increase productivity and performance. An embedded […]

CUDA

Oct, 13

Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation

In recent years, the need for high-performance computing solutions has increased due to the growing complexity of computational tasks. The use of parallel processing techniques has become essential to address this demand. In this study, an Open Computing Language (OpenCL)-based parallelization algorithm is implemented for the Constant Neighbors (CNe) and CNe with Predictor–Corrector (CpC) numerical […]

OpenCL

Oct, 6

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Modern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between the hardware capacity and the achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a […]

Oct, 6

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Large language models (LLMs) have been widely applied but face challenges in efficient inference. While quantization methods reduce computational demands, ultra-low bit quantization with arbitrary precision is hindered by limited GPU Tensor Core support and inefficient memory management, leading to suboptimal acceleration. To address these challenges, we propose a comprehensive acceleration scheme for arbitrary precision […]

CUDA

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Online Energy Optimization in GPUs: A Multi-Armed Bandit Approach

Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models

Accelerating Drug Discovery in AutoDock-GPU with Tensor Cores

Efficient Configuration of Heterogeneous Resources and Task Scheduling Strategies in Deep Learning Auto-Tuning Systems

Testing GPU Numerics: Finding Numerical Differences Between NVIDIA and AMD GPUs

Optimized Code Generation for Parallel and Polyhedral Loop Nests using MLIR

Deep Learning and Machine Learning with GPGPU and CUDA: Unlocking the Power of Parallel Computing

Sound and Partially-Complete Static Analysis of Data-Races in GPU Programs

A domain-specific language for geospatial computations on the GPU

Effects of OpenCL-Based Parallelization Methods on Explicit Numerical Methods to Solve the Heat Equation

Understanding Data Movement in AMD Multi-GPU Systems with Infinity Fabric

Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)