high performance computing on graphics processing units: hgpu.org

Posts

Apr, 16

Energy-Efficient GPU Clusters Scheduling for Deep Learning

Training deep neural networks (DNNs) is a major workload in datacenters today, resulting in tremendously fast growth in energy consumption. It is important to reduce energy consumption while completing DL training jobs early in datacenters. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average Job Completion […]

Apr, 2

ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales

As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints has become critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime […]

Apr, 2

Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge

In the world of artificial intelligence (AI) at the edge, we need to focus primarily on the energy efficiency with which we approach deep neural network (DNN) applications. In many applications, the speed of obtaining an inference can be critical; but many applications easily meet their time requirements, and the energy needed to calculate the […]

OpenCL

Apr, 2

Managing heterogeneous device memory using C++17 memory resources

Programmers using the C++ programming language are increasingly taught to manage memory implicitly through containers provided by the C++ standard library. However, heterogeneous programming platforms often require explicit allocation and deallocation of memory. This discrepancy in memory management strategies can be daunting and problematic for C++ developers who are not already familiar with heterogeneous programming. […]

CUDA

Apr, 2

PopSparse: Accelerated block sparse matrix multiplication on IPU

Reducing the computational cost of running large scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general purpose accelerators (GPAs) such […]

CUDA

Apr, 2

Pgx: Hardware-accelerated parallel game simulation for reinforcement learning

We propose Pgx, a collection of board game simulators written in JAX. Thanks to JAX's auto-vectorization and just-in-time compilation, Pgx scales easily to thousands of parallel executions on GPU/TPU accelerators. We found that the simulation of Pgx on a single A100 GPU is 10x faster than that of existing reinforcement learning libraries. Pgx implements […]
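The abstract credits JAX auto-vectorization and just-in-time compilation for the scaling. The sketch below illustrates that general pattern only; it is not Pgx's API. The game (single-pile Nim), the state layout, and the names `init_batch`, `step`, and `batched_step` are all hypothetical, chosen for illustration. A pure single-game `step` function is written once, then `jax.vmap` lifts it over a batch of games and `jax.jit` compiles the batched version.

```python
import jax
import jax.numpy as jnp

# Hypothetical toy game (an assumption, NOT taken from Pgx): single-pile Nim.
# A state is a tuple (stones, player, done). An action removes 1..3 stones,
# and the player who removes the last stone receives reward 1.


def init_batch(stones):
    """Build a batch of initial states from an array of pile sizes."""
    stones = jnp.asarray(stones, dtype=jnp.int32)
    player = jnp.zeros_like(stones)                # player 0 moves first
    done = jnp.zeros(stones.shape, dtype=bool)     # no game finished yet
    return stones, player, done


def step(state, action):
    """Advance ONE game by one move. Pure function, no Python branching."""
    stones, player, done = state
    take = jnp.minimum(jnp.clip(action, 1, 3), stones)
    # Finished games are frozen: their pile does not change.
    new_stones = jnp.where(done, stones, stones - take)
    finished = new_stones == 0
    # Reward is paid only on the move that ends the game.
    reward = jnp.where(finished & ~done, 1.0, 0.0)
    # The turn passes only while the game is still running.
    new_player = jnp.where(done | finished, player, 1 - player)
    return (new_stones, new_player, done | finished), reward


# vmap maps `step` over the leading axis of every state leaf and of the
# actions; jit compiles the resulting batched function.
batched_step = jax.jit(jax.vmap(step))


if __name__ == "__main__":
    batch_size = 4096
    state = init_batch(jnp.full((batch_size,), 10))
    key = jax.random.PRNGKey(0)
    actions = jax.random.randint(key, (batch_size,), 1, 4)  # values in 1..3
    state, reward = batched_step(state, actions)
    print(state[0].shape, reward.shape)
```

Because `step` uses `jnp.where` instead of Python `if` statements, the same code handles running and finished games in one batch, which is what allows it to be vectorized and compiled as a single kernel.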

Mar, 26

Comparing SYCL data transfer strategies for tracking use cases

The aim of this work is to compare the performance and ease of programming of the various data transfer strategies provided by SYCL 2020: buffers/accessors on one hand and the different storage types exposed by Unified Shared Memory (USM) on the other hand. We measured the relative performance of USM exclusively located either on the […]

Mar, 26

E2C: A Visual Simulator to Reinforce Education of Heterogeneous Computing Systems

With the increasing popularity of accelerator technologies (e.g., GPUs and TPUs) and the emergence of domain-specific computing via ASICs and FPGAs, the matter of heterogeneity and understanding its ramifications for performance has become more critical than ever before. However, it is challenging to effectively educate students about the potential impacts of heterogeneity on the […]

Mar, 26

Reinforcement Learning Strategies for Compiler Optimization in High level Synthesis

High Level Synthesis (HLS) offers a possible programmability solution for FPGAs by automatically compiling CPU codes to custom hardware configurations, but currently delivers far lower hardware quality than circuits written using Hardware Description Languages (HDLs). One reason is that the standard set of code optimizations used by CPU compilers, such as LLVM, are not well […]

Mar, 26

Kernel Launcher: C++ Library for Optimal-Performance Portable CUDA Applications

Graphics Processing Units (GPUs) have become ubiquitous in scientific computing. However, writing efficient GPU kernels can be challenging due to the need for careful code tuning. To automatically explore the kernel optimization space, several auto-tuning tools – like Kernel Tuner – have been proposed. Unfortunately, these existing auto-tuning tools often do not concern themselves with […]

CUDA

Mar, 26

DSDP: A Blind Docking Strategy Accelerated by GPUs

Virtual screening, including molecular docking, plays an essential role in drug discovery. Many traditional and machine-learning based methods are available to fulfil the docking task. Traditional docking methods are typically very time-consuming, and their performance in blind docking remains to be improved. Although the runtime of docking based on machine learning is significantly decreased, […]

CUDA

Mar, 19

Challenges and Opportunities in C/C++ Source-To-Source Compilation

The C/C++ compilation stack (Intermediate Representations (IRs), compilation passes and backends) is encumbered by a steep learning curve, which we believe can be lowered by complementing it with approaches such as source-to-source compilation. Source-to-source compilation is a technology that is widely used and quite mature in certain programming environments, such as JavaScript, but that faces […]

CUDA • OpenCL

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
