high performance computing on graphics processing units: hgpu.org

Posts

Nov, 29

AZP: Automatic Specialization for Zero Values in Gaming Applications

Recent research has shown that dynamic zeros in shader programs of gaming applications can be effectively leveraged with a profile-guided, code-versioning transform. This transform duplicates code, specializes one path assuming certain key program operands, called versioning variables, are zero, and leaves the other path unspecialized. Dynamically, depending on the versioning variable’s value, either the specialized […]

OpenGL
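The code-versioning idea in the AZP excerpt above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the shading formula, and the choice of `specular_weight` as the versioning variable are all hypothetical, and a real transform would operate on shader code inside a compiler rather than on Python functions.

```python
def shade_generic(albedo, light, specular_weight, specular):
    """Unspecialized path: valid for any operand values."""
    return albedo * light + specular_weight * specular


def shade_zero_specialized(albedo, light):
    """Specialized path: assumes specular_weight == 0, so the
    specular term and the work needed to compute it are dropped."""
    return albedo * light


def shade_versioned(albedo, light, specular_weight, specular_fn):
    """Versioned entry point (hypothetical): pick a path at run time
    based on the value of the versioning variable."""
    if specular_weight == 0.0:
        # Zero case: specular_fn is never called.
        return shade_zero_specialized(albedo, light)
    return shade_generic(albedo, light, specular_weight, specular_fn())
```

The saving in this sketch comes from skipping `specular_fn` entirely when the versioning variable is zero. Note that dropping `0 * x` is only equivalent to the generic path when `x` is finite, which a real transform would have to account for.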

Nov, 22

Data Parallel C++: Mastering DPC++ for Programming of Heterogeneous Systems using C++ and SYCL

This book is about programming for data parallelism using C++. If you are new to parallel programming, that is okay. If you have never heard of SYCL or the DPC++ compiler, that is also okay. SYCL is an industry-driven Khronos standard adding data parallelism to C++ for heterogeneous systems. DPC++ is an open source compiler […]

Nov, 22

A Survey of System Architectures and Techniques for FPGA Virtualization

FPGA accelerators are gaining increasing attention in both cloud and edge computing because of their hardware flexibility, high computational throughput, and low power consumption. However, the design flow of FPGAs often requires specific knowledge of the underlying hardware, which hinders the wide adoption of FPGAs by application developers. Therefore, the virtualization of FPGAs becomes extremely […]

OpenCL

Nov, 22

A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression

Deep neural networks (DNNs) are becoming increasingly deeper, wider, and non-linear due to the growing demands on prediction accuracy and analysis quality. When training a DNN model, the intermediate activation data must be saved in the memory during forward propagation and then restored for backward propagation. However, state-of-the-art accelerators such as GPUs are only equipped […]

CUDA

Nov, 22

Ginkgo – A Math Library designed for Platform Portability

The first associations to software sustainability might be the existence of a continuous integration (CI) framework; the existence of a testing framework composed of unit tests, integration tests, and end-to-end tests; and also the existence of software documentation. However, when asking what is a common deathblow for a scientific software product, it is often the […]

CUDA

Nov, 22

GPURepair: Automated Repair of GPU Kernels

This paper presents a tool for repairing errors in GPU kernels written in CUDA or OpenCL due to data races and barrier divergence. Our novel extension to prior work can also remove barriers that are deemed unnecessary for correctness. We implement these ideas in our tool called GPURepair, which uses GPUVerify as the verification oracle […]

CUDA, OpenCL

Nov, 15

Adaptive Data Migration in Load-Imbalanced HPC Applications

Distributed parallel applications need to maximize and maintain computer resource utilization and be portable across different machines. Balanced execution of some applications requires more effort than others because their data distribution changes over time. Data re-distribution at runtime requires elaborate schemes that are expensive and may benefit particular applications. This dissertation discusses a solution for […]

CUDA

Nov, 15

Runtime Performances Benchmark for Knowledge Graph Embedding Methods

This paper wants to focus on providing a characterization of the runtime performances of state-of-the-art implementations of KGE algorithms, in terms of memory footprint and execution time. Despite the rapidly growing interest in KGE methods, so far little attention has been devoted to their comparison and evaluation; in particular, previous work mainly focused on performance […]

CUDA

Nov, 15

Exploring the acceleration of Nekbone on reconfigurable architectures

Hardware technological advances are struggling to match scientific ambition, and a key question is how we can use the transistors that we already have more effectively. This is especially true for HPC, where the tendency is often to throw computation at a problem whereas codes themselves are commonly bound, at least to some extent, by other […]

OpenCL

Nov, 15

FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation

This paper presents FastSVC, a light-weight cross-domain singing voice conversion (SVC) system, which is able to achieve high conversion performance, with inference speed 4x faster than real time on CPUs. FastSVC uses a Conformer-based phoneme recognizer to extract singer-agnostic linguistic features from singing signals. A feature-wise linear modulation based generator is used to synthesize waveform […]

Nov, 15

Automatic GPU optimization through higher-order functions in functional languages

Over recent years, graphics processing units (GPUs) have become popular devices to use in procedures that exhibit data-parallelism. Due to high parallel capability, running procedures on a GPU can result in an execution time speedup ranging from a couple times faster to several orders of magnitude faster, compared to executing serially on a central processing […]

CUDA

Nov, 8

Design and Performance Evaluation of Optimizations for OpenCL FPGA Kernels

The use of FPGAs in heterogeneous systems is valuable because they can be used to architect custom hardware to accelerate a particular application or domain. However, they are notoriously difficult to program. The development of high-level synthesis tools like OpenCL makes FPGA development more accessible, but not without its own challenges. The synthesized hardware […]

OpenCL


Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
