high performance computing on graphics processing units: hgpu.org

Posts

Dec, 4

Three Contributions to the Theory and Practice of Optimizing Compilers

The theory and practice of optimizing compilers gather techniques that, from input computer programs, aim at generating code making best use of modern computer hardware. On the theory side, this thesis contributes new results and algorithms in polyhedral geometry. On the practical side, this thesis contributes techniques for the tuning of parameters of programs targeting […]

CUDA

Dec, 4

SkyFlow: Heterogeneous streaming for skyline computation using FlowGraph and SYCL

The skyline is an optimization operator widely used for multi-criteria decision making. It allows minimizing an n-dimensional dataset into its smallest subset. In this work we present SkyFlow, the first heterogeneous CPU+GPU graph-based engine for skyline computation on a stream of data queries. Two data flow approaches, Coarse-grained and Fine-grained, have been proposed for different […]

Dec, 4

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Transformer models have achieved state-of-the-art performance on various domains of applications and gradually becomes the foundations of the advanced large deep learning (DL) models. However, how to train these models over multiple GPUs efficiently is still challenging due to a large number of parallelism choices. Existing DL systems either rely on manual efforts to make […]

CUDA

Dec, 4

Efficient Incremental Text-to-Speech on GPUs

Incremental text-to-speech, also known as streaming TTS, has been increasingly applied to online speech applications that require ultra-low response latency to provide an optimal user experience. However, most of the existing speech synthesis pipelines deployed on GPU are still non-incremental, which uncovers limitations in high-concurrency scenarios, especially when the pipeline is built with end-to-end neural […]

Nov, 27

User-Driven Online Kernel Fusion for SYCL

Heterogeneous programming models are becoming increasingly popular to support the ever-evolving hardware architectures, especially for new and emerging specialized accelerators optimizing speciic tasks. While such programs provide performance portability of the existing applications across various heterogeneous architectures to some extent, short-running device kernels can affect an application performance due to overheads of data transfer, synchronization […]

OpenCL

Nov, 27

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

The simplex algorithm has been successfully used for many years in solving linear programming (LP) problems. Due to the intensive computations required (especially for the solution of large LP problems), parallel approaches have also extensively been studied. The computational power provided by the modern GPUs as well as the rapid development of multicore CPU systems […]

CUDA

Nov, 27

SciAI4Industry – Solving PDEs for industry-scale problems with deep learning

Solving partial differential equations with deep learning makes it possible to reduce simulation times by multiple orders of magnitude and unlock scientific methods that typically rely on large numbers of sequential simulations, such as optimization and uncertainty quantification. Two of the largest challenges of adopting scientific AI for industrial problem settings is that training datasets […]

Nov, 27

Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad

Recent years have seen increased interest from the HPC community in Field Programmable Gate Arrays (FPGAs) as an alternative/additional accelerator. This has been largely due to the slowdown in the transistor scaling and the difficulty of gaining performance improvement and energy efficiency from the current processing solutions. General (scientific) software programmers have shied away from […]

OpenCL

Nov, 27

Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment

Background and objectives. The computational biology area is growing up over the years. The interest in researching and developing computational tools for the acquisition, storage, organization, analysis, and visualization of biological data generates the need to create new hardware architectures and new software tools that allow processing big data in acceptable times. In this sense, […]

CUDA

•

OpenCL

Nov, 20

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

Transformers have become central to recent advances in computer vision. However, training a vision Transformer (ViT) model from scratch can be resource intensive and time consuming. In this paper, we aim to explore approaches to reduce the training costs of ViT models. We introduce some algorithmic improvements to enable training a ViT model from scratch […]

Nov, 20

Hardware Checkpointing and Productive Debugging Flows for FPGAs

As FPGAs become larger and more complex, productive debugging is becoming more challenging. In this work, we detail a new debugging flow based on hardware checkpointing that provides full visibility and controllability while maintaining reasonable execution speed. Hardware checkpointing is useful not only for debugging but also enables several other capabilities such as live migration, […]

Nov, 20

Challenges and Techniques for Transparent Acceleration of Unmodified Big Data Applications

The ever-increasing demand for high-performance Big Data analytics and data processing has paved the way for heterogeneous hardware accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to be integrated into modern Big Data platforms. Currently, this integration comes at the cost of programmability, as the end-user Application Programming Interface (API) […]

CUDA

•

OpenCL

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Three Contributions to the Theory and Practice of Optimizing Compilers

SkyFlow: Heterogeneous streaming for skyline computation using FlowGraph and SYCL

Galvatron: Efficient Transformer Training over Multiple GPUs Using Automatic Parallelism

Efficient Incremental Text-to-Speech on GPUs

User-Driven Online Kernel Fusion for SYCL

A Hybrid Multi-GPU Implementation of Simplex Algorithm with CPU Collaboration

SciAI4Industry – Solving PDEs for industry-scale problems with deep learning

Design Space Exploration of Concurrency Mapping to FPGAs in Weather and Climate Applications with Xilinx SDSoC OpenCL, SDSoC C++ and Vivad

Assessing Opportunities of SYCL and Intel oneAPI for Biological Sequence Alignment

Training a Vision Transformer from scratch in less than 24 hours with 1 GPU

Hardware Checkpointing and Productive Debugging Flows for FPGAs

Challenges and Techniques for Transparent Acceleration of Unmodified Big Data Applications

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)