high performance computing on graphics processing units: hgpu.org

Posts

May, 2

cuda-kat: The CUDA Kernel Author’s Toolkit

An install-less, header-only library: a loosely coupled collection of utility functions and classes for writing device-side CUDA code (kernels and non-kernel functions). These let us:

* Write templated device-side code without constantly coming up against not-trivially-templatable bits.
* Use standard-library(-like) containers in device-side code (but not have to use them).
* Not repeat ourselves as […]

CUDA

Apr, 26

Automatic Parallelization for Heterogeneous Embedded Systems

Recent years have seen an increase in heterogeneous architectures combining multi-core CPUs with accelerators such as GPUs, FPGAs, and Intel Xeon Phi. GPUs can achieve significant performance for certain categories of applications. Nevertheless, achieving this performance with low-level APIs (e.g. CUDA, OpenCL) requires rewriting the sequential code, having a good knowledge of GPU […]

CUDA • OpenCL

Apr, 26

Accelerating Winograd Convolutions using Symbolic Computation and Meta-programming

Convolution operations are essential constituents of convolutional neural networks. Their efficient and performance-portable implementation demands tremendous programming effort and fine-tuning. Winograd’s minimal filtering algorithm is a well-known method to reduce the computational complexity of convolution operations. Unfortunately, existing implementations of this algorithm are either vendor-specific or hard-coded to support a small subset of convolutions, thus […]

CUDA • OpenCL
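To make the complexity reduction mentioned in the abstract above concrete, here is a minimal sketch of the textbook one-dimensional case F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of 6. It is not taken from the listed paper; the function names `direct_f23` and `winograd_f23` are hypothetical, and the paper's own implementation is generated through symbolic computation and meta-programming rather than written by hand like this.

```python
# Illustrative sketch of Winograd's minimal filtering F(2,3) in 1-D.
# Function names are hypothetical; this is not the listed paper's code.

def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs: 6 multiplications."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Same two outputs using 4 data-by-filter multiplications."""
    # Filter-side terms; these can be precomputed once per filter.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]
    # The four products between transformed data and transformed filter.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3
    # Output transform: only additions and subtractions.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [1.0, 0.0, -1.0]
    print(direct_f23(d, g))
    print(winograd_f23(d, g))
```

The saving comes from moving work into additions and into the filter-side terms, which are reused across every input tile that shares the same filter.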

Apr, 26

GEVO: GPU Code Optimization using Evolutionary Computation

GPUs are a key enabler of the revolution in machine learning and high performance computing, functioning as de facto co-processors to accelerate large-scale computation. As the programming stack and tool support have matured, GPUs have also become accessible to programmers, who may lack detailed knowledge of the underlying architecture and fail to fully leverage the […]

CUDA

Apr, 26

Evaluating FPGA Accelerator Performance with a Parameterized OpenCL Adaptation of the HPCChallenge Benchmark Suite

FPGAs have found increasing adoption in data center applications since a new generation of high-level tools has become available that noticeably reduces development time for FPGA accelerators while still providing high quality of results. There is, however, no high-level benchmark suite available which specifically enables a comparison of FPGA architectures, programming tools and libraries for […]

OpenCL

Apr, 26

Cpp-Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System at Scale

The Cpp-Taskflow project addresses the long-standing question: How can we make it easier for developers to write parallel and heterogeneous programs with high performance and simultaneous high productivity? Cpp-Taskflow develops a simple and powerful task programming model to enable efficient implementations of heterogeneous decomposition strategies. Our programming model empowers users with both static and dynamic […]

CUDA

Apr, 19

OpenCL-Darknet: implementation and optimization of OpenCL-based deep learning object detection framework

Object detection is a technology that deals with recognizing classes of objects and their location. It is used in many different areas, such as in face-detecting systems [16, 34, 37], surveillance tools [9], human-machine interfaces [17], and self-driving cars [18, 23, 25, 26, 30]. These days, deep learning object detection approaches have achieved significantly better […]

CUDA • OpenCL

Apr, 19

Design Space Exploration of an OpenCL Based SAXPY Kernel Implementation on FPGAs

High-performance computing researchers are trying to find new options and tools to satisfy the performance criteria of a hardware design. The FPGA (Field Programmable Gate Array) is an accelerator widely used for power-efficient applications due to its reconfigurability and high performance. Traditionally, FPGAs are programmed using a Hardware Description Language (HDL). Using HDL, […]

OpenCL
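For readers unfamiliar with the kernel named in the title above, SAXPY is the standard BLAS level-1 operation y ← a·x + y on single-precision vectors. The sketch below is a plain reference version for orientation only; the function name `saxpy` and its list-based signature are assumptions for illustration, it uses Python floats (double precision) rather than single precision, and it says nothing about how the listed paper maps the loop onto an FPGA.

```python
# Reference sketch of the SAXPY operation (y <- a*x + y).
# Hypothetical helper for illustration; not the listed paper's OpenCL kernel.

def saxpy(a, x, y):
    """Return a new list with a*x[i] + y[i] for each index i."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Each element is independent, which is why the loop parallelizes well.
    return [a * xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

Because every output element depends on only one element of each input, the design choices on an accelerator mostly concern how many elements to process per cycle and how to feed them from memory.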

Apr, 19

FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System

Tensor computation plays a paramount role in a broad range of domains, including machine learning, data analytics, and scientific computing. The wide adoption of tensor computation and its huge computation cost have led to high demand for flexible, portable, and high-performance library implementations on heterogeneous hardware accelerators such as GPUs and FPGAs. However, the current […]

CUDA • OpenCL

Apr, 19

Deep-Edge: An Efficient Framework for Deep Learning Model Update on Heterogeneous Edge

Deep Learning (DL) model-based AI services are increasingly offered in a variety of predictive analytics services such as computer vision, natural language processing, and speech recognition. However, the quality of the DL models can degrade over time due to changes in the input data distribution, thereby requiring periodic model updates. Although cloud data-centers can meet the […]

CUDA

Apr, 19

A Study of Single and Multi-device Synchronization Methods in Nvidia GPUs

GPUs are playing an increasingly important role in general-purpose computing. Many algorithms require synchronization at different levels of granularity in a single GPU. Additionally, the emergence of dense GPU nodes also calls for multi-GPU synchronization. Nvidia’s latest CUDA provides a variety of synchronization methods. Until now, there has been no full understanding of the characteristics of […]

CUDA

Apr, 12

MNN: A Universal and Efficient Inference Engine

Deploying deep learning models on mobile devices has drawn increasing attention recently. However, designing an efficient on-device inference engine faces the major challenges of model compatibility, device diversity, and resource limitation. To deal with these challenges, we propose Mobile Neural Network (MNN), a universal and efficient inference engine tailored to mobile applications. […]

OpenCL

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)