high performance computing on graphics processing units: hgpu.org

Posts

Apr, 11

Large Scale GPU Based Simulations of Turbulent Bubbly Flow in a Square Duct

In this paper, we present the results of a numerical study of air-water turbulent bubbly flow in a periodic vertical square duct. The study is conducted using a novel numerical technique which leverages the Volume of Fluid method for interface capturing and the Sharp Surface Force method for accurate representation of surface tension forces. A three-dimensional […]

CUDA

Apr, 11

Efficient Video Compression via Content-Adaptive Super-Resolution

Video compression is a critical component of Internet video delivery. Recent work has shown that deep learning techniques can rival or outperform human-designed algorithms, but these methods are significantly less compute- and power-efficient than existing codecs. This paper presents a new approach that augments existing codecs with a small, content-adaptive super-resolution model that significantly boosts […]

Apr, 5

An Investigation of Atomic Synchronization for Sort-Based Group-By Aggregation on GPUs

Using heterogeneous processing devices, like GPUs, to accelerate relational database operations is a well-known strategy. In this context, the group-by operation is highly interesting for two reasons. Firstly, it incurs large processing costs. Secondly, its results (i.e., aggregates) are usually small, which reduces data movement costs, whose compensation is a major challenge for heterogeneous computing. […]

OpenCL

Apr, 5

Parallel Arbitrary-precision Integer Arithmetic

Arbitrary-precision integer arithmetic computations are driven by applications in solving systems of polynomial equations and public-key cryptography. Such computations arise when high precision is required (with large input values that fit into multiple machine words), or to avoid coefficient overflow due to intermediate expression swell. Meanwhile, the growing demand for faster computation alongside the recent […]

CUDA • OpenCL

Apr, 5

Daisen: A Framework for Visualizing Detailed GPU Execution

Graphics Processing Units (GPUs) have been widely used to accelerate artificial intelligence, physics simulation, medical imaging, and information visualization applications. To improve GPU performance, GPU hardware designers need to identify performance issues by inspecting a huge amount of simulator-generated traces. Visualizing the execution traces can reduce the cognitive burden of users and facilitate making sense […]

CUDA • OpenCL

Apr, 5

LS-CAT: A Large-Scale CUDA AutoTuning Dataset

The effectiveness of Machine Learning (ML) methods depends on access to large, suitable datasets. In this article, we present how we built the LS-CAT (Large-Scale CUDA AutoTuning) dataset, sourced from GitHub, for the purpose of training NLP-based ML models. Our dataset includes 19 683 CUDA kernels focused on linear algebra. In addition to the CUDA […]

CUDA

Apr, 5

Energy-aware Task Scheduling with Deadline Constraint in DVFS-enabled Heterogeneous Clusters

Energy conservation of large data centers for high-performance computing workloads, such as deep learning with big data, is of critical significance, where cutting down a few percent of electricity translates into million-dollar savings. This work studies energy conservation on emerging CPU-GPU hybrid clusters through dynamic voltage and frequency scaling (DVFS). We aim at minimizing the […]

Mar, 28

Enabling OpenMP Task Parallelism on Multi-FPGAs

FPGA-based hardware accelerators have received increasing attention mainly due to their ability to accelerate deep pipelined applications, thus resulting in higher computational performance and energy efficiency. Nevertheless, the amount of resources available on even the most powerful FPGA is still not enough to speed up very large modern workloads. To achieve that, FPGAs need to […]

Mar, 28

Accelerating Deep Neural Networks implementation: A survey

Deep Learning (DL) applications are increasingly used across different fields. Deploying such Deep Neural Networks (DNN) on embedded devices is still a challenging task, considering their massive computation and storage requirements. Given that the number of operations and parameters increases with the complexity of the model architecture, the performance will […]

OpenCL

Mar, 28

Accelerating Sparse Approximate Matrix Multiplication on GPUs

Although matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for multiplying near-sparse matrices. The Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms that fill the performance gap neglected by traditional optimizations for dense/sparse matrix multiplication. However, existing SpAMM algorithms fail to exploit […]

CUDA

Mar, 28

De-specializing an HLS library for Deep Neural Networks: improvements upon hls4ml

Custom hardware accelerators for Deep Neural Networks are increasingly popular: in fact, the flexibility and performance offered by FPGAs are well-suited to the computational effort and low latency constraints required by many image recognition and natural language processing tasks. The gap between high-level Machine Learning frameworks (e.g., TensorFlow, PyTorch) and low-level hardware design in Verilog/VHDL […]

Mar, 28

CUDA Tutorial – Cryptanalysis of Classical Ciphers Using Modern GPUs and CUDA

CUDA (formerly an abbreviation of Compute Unified Device Architecture) is a parallel computing platform and API model created by Nvidia allowing software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing. Throughout this tutorial, we introduce the CUDA concepts in an easy-to-grasp interactive way. Starting from scratch, we implement a complete […]

CUDA

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
