high performance computing on graphics processing units: hgpu.org

Posts

Mar, 10

WgPy: GPU-accelerated NumPy-like array library for web browsers

To execute scientific computing programs such as deep learning at high speed, GPU acceleration is a powerful option. With the recent advancements in web technologies, interfaces like WebGL and WebGPU, which utilize GPUs on the client side of web applications, have become available. On the other hand, Pyodide, a Python runtime that operates on web […]

CUDA

Mar, 10

A Microbenchmark Framework for Performance Evaluation of OpenMP Target Offloading

We present a framework based on Catch2 to evaluate performance of OpenMP’s target offload model via micro-benchmarks. The compilers supporting OpenMP’s target offload model for heterogeneous architectures are currently undergoing rapid development. These developments influence performance of various complex applications in different ways. This framework can be employed to track the impact of compiler upgrades […]

Mar, 10

Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs

Graphics Processing Units (GPUs) are essential for general-purpose applications and are commonly leveraging multi-level caches to alleviate memory access pressure. However, the default cache management may lose opportunities for optimal performance in different applications. Although existing cache bypassing techniques tend to address this challenge, these methods predominantly concentrate on single-level cache, thus restricting their potential […]

Mar, 10

SUperman: Efficient Permanent Computation on GPUs

The permanent is a function, defined for a square matrix, with applications in various domains including quantum computing, statistical physics, complexity theory, combinatorics, and graph theory. Its formula is similar to that of the determinant, however unlike the determinant, its exact computation is #P-complete, i.e., there is no algorithm to compute the permanent in polynomial […]

CUDA

Mar, 10

Can Tensor Cores Benefit Memory-Bound Kernels? (No!)

Tensor cores are specialized processing units within GPUs that have demonstrated significant efficiency gains in compute-bound applications such as Deep Learning Training by accelerating dense matrix operations. Given their success, researchers have attempted to extend tensor core capabilities beyond dense matrix computations to other computational patterns, including memory-bound kernels. Recent studies have reported that tensor […]

CUDA

Mar, 3

TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework

With the rising demand for computational power and the increasing variety of computational scenarios, considerable interest has emerged in transforming existing CUDA programs into more general-purpose OpenCL programs, enabling them to run across diverse hardware platforms. However, manual methods, typically designed for specific applications, lack flexibility. Current automated conversion techniques also face considerable challenges, particularly […]

CUDA

•

OpenCL

Mar, 3

pyATF: Constraint-Based Auto-Tuning in Python

We introduce pyATF – a new, language-independent, open-source auto-tuning tool that fully automatically determines optimized values of performance-critical program parameters. A major feature of pyATF is its support for constrained parameters, e.g., the value of one parameter has to divide the value of another parameter. A further major feature of pyATF is its user interface […]

CUDA

•

OpenCL

Mar, 3

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate […]

CUDA

Mar, 3

CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

Deep learning training at scale is resource-intensive and time-consuming, often running across hundreds or thousands of GPUs for weeks or months. Efficient checkpointing is crucial for running these workloads, especially in multi-tenant environments where compute resources are shared, and job preemptions or interruptions are common. However, transparent and unified GPU snapshots are particularly challenging because […]

CUDA

Mar, 3

Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability

The evolution of Graphics Processing Unit (GPU) compilers has facilitated the support for general-purpose programming languages across various architectures. The NVIDIA CUDA Compiler (NVCC) employs multiple compilation levels prior to generating machine code, implementing intricate optimizations to enhance performance. These optimizations influence the manner in which software is mapped to the underlying hardware, which can […]

CUDA

Feb, 24

KernelBench: Can LLMs Write Efficient GPU Kernels?

Efficient GPU kernels are crucial for building performant machine learning architectures, but writing them is a time-consuming challenge that requires significant expertise; therefore, we explore using language models (LMs) to automate kernel generation. We introduce KernelBench, an open-source framework for evaluating LMs’ ability to write fast and correct kernels on a suite of 250 carefully […]

CUDA

Feb, 24

Seamless acceleration of Fortran intrinsics via AMD AI engines

A major challenge that the HPC community faces is how to continue delivering the performance demanded by scientific programmers, whilst meeting an increased emphasis on sustainable operations. Specialised architectures, such as FPGAs and AMD’s AI Engines (AIEs), have been demonstrated to provide significant energy efficiency advantages, however a major challenge is that to most effectively […]

high performance computing on graphics processing units: hgpu.org

Posts

WgPy: GPU-accelerated NumPy-like array library for web browsers

A Microbenchmark Framework for Performance Evaluation of OpenMP Target Offloading

Mpache: Interaction Aware Multi-level Cache Bypassing on GPUs

SUperman: Efficient Permanent Computation on GPUs

Can Tensor Cores Benefit Memory-Bound Kernels? (No!)

TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework

pyATF: Constraint-Based Auto-Tuning in Python

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

CRIUgpu: Transparent Checkpointing of GPU-Accelerated Workloads

Towards Studying the Effect of Compiler Optimizations and Software Randomization on GPU Reliability

KernelBench: Can LLMs Write Efficient GPU Kernels?

Seamless acceleration of Fortran intrinsics via AMD AI engines

Recent source codes

wgpy: WebGL accelerated numpy-compatible array library for web browser

Microbenchmarking OpenMP target offload with Catch2

SUperman: Highly Efficient Permanent Computation Library

TransCL: An Automatic CUDA-to-OpenCL Programs Transformation Framework

pyATF: The Auto-Tuning Framework (ATF) in Python

TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators

Checkpoint/Restore tool

KernelBench: Can LLMs Write GPU Kernels?

An MLIR-based toolchain for AMD AI Engine-enabled devices

Forecasting time series with constraints

Most viewed papers (last 30 days)