high performance computing on graphics processing units: hgpu.org

Posts

Dec, 19

On the accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit and novel 16-bit number formats

Fluid dynamics simulations with the lattice Boltzmann method (LBM) are very memory-intensive. Alongside a reduction in memory footprint, significant performance benefits can be achieved by using FP32 (single) precision compared to FP64 (double) precision, especially on GPUs. Here, we evaluate the possibility of using even FP16 and Posit16 (half) precision for storing fluid populations, while still […]

OpenCL
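To make the storage trade-off in the excerpt above concrete, here is a minimal sketch using only Python's standard `struct` module. It compares the bytes needed to store fluid populations in FP64, FP32 and FP16, and the round-off from storing a value in each format. The 19 populations per cell (a D3Q19-style lattice), the grid size and the sample value are assumptions for illustration and do not come from the paper. Posit16 is not covered because the standard library has no posit type.

```python
import struct

# struct format codes for IEEE 754 binary64, binary32 and binary16
FORMATS = {"FP64": "d", "FP32": "f", "FP16": "e"}


def bytes_per_value(name):
    """Size in bytes of one value stored in the named format."""
    return struct.calcsize("<" + FORMATS[name])


def population_bytes(name, cells, q=19):
    """Bytes needed for q populations per cell (q=19 is an assumed lattice)."""
    return cells * q * bytes_per_value(name)


def roundtrip(x, name):
    """Store x in the named format and load it back as a Python float."""
    code = "<" + FORMATS[name]
    return struct.unpack(code, struct.pack(code, x))[0]


if __name__ == "__main__":
    cells = 256 ** 3  # hypothetical grid size
    sample = 0.1      # made-up population value
    for name in FORMATS:
        mib = population_bytes(name, cells) / 2 ** 20
        err = abs(roundtrip(sample, name) - sample)
        print(f"{name}: {mib:.0f} MiB, round-off on {sample}: {err:.3e}")
```

The sketch only models storage: values are converted when written and read, and all arithmetic stays in Python's native double precision.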

Dec, 12

Manas: Mining Software Repositories to Assist AutoML

Today, deep learning is widely used for building software. A software engineering problem with deep learning is that finding an appropriate convolutional neural network (CNN) model for the task can be a challenge for developers. Recent work on AutoML, more precisely neural architecture search (NAS), embodied by tools like Auto-Keras, aims to solve this problem […]

Dec, 12

CitiusSynapse: A Deep Learning Framework for Embedded Systems

As embedded systems, such as smartphones with limited resources, have become increasingly popular, active research has recently been conducted on performing on-device deep learning in such systems. Therefore, in this study, we propose a deep learning framework that is specialized for embedded systems with limited resources, the operation processing structure of which differs from that […]

OpenCL

Dec, 12

Fast Neural Representations for Direct Volume Rendering

Despite the potential of neural scene representations to effectively compress 3D scalar fields at high reconstruction quality, the computational complexity of the training and data reconstruction step using scene representation networks limits their use in practical applications. In this paper, we analyze whether scene representation networks can be modified to reduce these limitations and whether […]

CUDA

Dec, 12

High performance computing on Android devices – a case study

High performance computing for low power devices can be useful to speed up calculations on processors that use a lower clock rate than computers for which energy efficiency is not an issue. In this trial, different high performance techniques for Android devices have been compared, with a special focus on the use of the GPU. […]

OpenCL

Dec, 12

GPU backed Data Mining on Android Devices

Choosing an appropriate programming paradigm for high-performance computing on low-power devices can be useful to speed up calculations. Many Android devices have an integrated GPU and – although not officially supported – the OpenCL framework can be used on Android devices for addressing these GPUs. OpenCL supports thread and data parallelism. Applications that use the […]

OpenCL

Dec, 5

Analysis and Comparison of Performance and Power Consumption of Neural Networks on CPU, GPU, TPU and FPGA

In this work, we analyze the performance of neural networks on a variety of heterogeneous platforms. We strive to find the best platform in terms of raw benchmark performance, performance per watt and performance per Euro. To reach this goal, we focused on convolutional neural networks and created several micro- and macrobenchmark applications and used […]

Dec, 5

Autotuning CUDA: Applying NLP Techniques to LS-CAT

The abstract relation between hardware parameters and program performance makes setting program parameters a difficult task. Without autotuning, software can miss low-level optimizations, resulting in lower performance. Traditionally, time-consuming trial-and-error search methods have been the staple of autotuning. Applying natural language processing (NLP) based machine learning (ML) methods to source code as a […]

CUDA

Dec, 5

Bayesian Optimization for auto-tuning GPU kernels

Finding optimal parameter configurations for tunable GPU kernels is a non-trivial exercise for large search spaces, even when automated. This poses an optimization task on a non-convex search space, using an expensive-to-evaluate function with unknown derivative. These characteristics make it a good candidate for Bayesian Optimization, which has not been applied to this problem […]

CUDA

OpenCL
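The tuning problem described in the excerpt above can be sketched as a search over discrete kernel parameters. The example below is not Bayesian Optimization. It only sets up the problem and shows two simple baselines, exhaustive search and random search under an evaluation budget. The parameter names, the 1024-threads-per-block restriction and the `mock_runtime_ms` cost function are all hypothetical stand-ins for compiling and timing a real kernel.

```python
import itertools
import random

# Hypothetical tunable parameters for an imaginary GPU kernel.
SEARCH_SPACE = {
    "block_size_x": [16, 32, 64, 128, 256],
    "block_size_y": [1, 2, 4, 8],
    "tile_size": [1, 2, 4],
}


def is_valid(cfg):
    """Assumed restriction: at most 1024 threads per block."""
    return cfg["block_size_x"] * cfg["block_size_y"] <= 1024


def mock_runtime_ms(cfg):
    """Made-up, non-convex cost standing in for a real kernel timing."""
    threads = cfg["block_size_x"] * cfg["block_size_y"]
    return abs(threads - 256) / 64 + (cfg["tile_size"] - 2) ** 2 + (threads % 96) / 48


def configurations(space):
    """Yield every valid configuration in the Cartesian product of the space."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        if is_valid(cfg):
            yield cfg


def exhaustive_search(space):
    """Evaluate every valid configuration and return the fastest one."""
    return min(configurations(space), key=mock_runtime_ms)


def random_search(space, budget, seed=0):
    """Evaluate only `budget` randomly chosen configurations."""
    rng = random.Random(seed)
    candidates = list(configurations(space))
    sample = rng.sample(candidates, min(budget, len(candidates)))
    return min(sample, key=mock_runtime_ms)


if __name__ == "__main__":
    best = exhaustive_search(SEARCH_SPACE)
    guess = random_search(SEARCH_SPACE, budget=10)
    print("exhaustive:", best, mock_runtime_ms(best))
    print("random(10):", guess, mock_runtime_ms(guess))
```

With a real kernel each evaluation means a compile and a benchmark run, so the budget matters. A model-based method would aim to pick the next configuration more carefully than the uniform sampling used here.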

Dec, 5

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations

We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems with concurrency bottlenecks. Atos also offers […]

CUDA

Dec, 5

An Auto-Programming Approach to Vulkan

We propose a novel high-level approach for software development on GPUs using the Vulkan API. Our goal is to speed up development and performance studies for complex algorithms on GPUs, which is quite difficult and laborious for Vulkan due to the large number of HW features and low-level details. The proposed approach uses auto programming to translate ordinary […]

Nov, 28

Concurrency Mapping to FPGAs with OpenCL: A Case Study with a Shallow Water Kernel

FPGAs have been around for over 30 years and are a viable accelerator for compute-intensive workloads on HPC systems. The adoption of FPGAs for scientific applications has been stimulated recently by the emergence of better programming environments such as High-Level Synthesis (HLS) and OpenCL available through the Xilinx SDSoC design tool. The mapping of the […]

OpenCL

Recent source codes

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits
