high performance computing on graphics processing units: hgpu.org

Posts

Feb, 7

Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?

Graphics Processing Units (GPUs) are currently the dominating programmable architecture for Deep Learning (DL) accelerators. The adoption of Field Programmable Gate Arrays (FPGAs) in DL accelerators is however getting momentum. In this paper, we demonstrate that Direct Hardware Mapping (DHM) of a Convolutional Neural Network (CNN) on an embedded FPGA substantially outperforms a GPU implementation […]

Jan, 31

C-for-Metal: High Performance SIMD Programming on Intel GPUs

The SIMT execution model is commonly used for general GPU development. CUDA and OpenCL developers write scalar code that is implicitly parallelized by compiler and hardware. On Intel GPUs, however, this abstraction has profound performance implications as the underlying ISA is SIMD and important hardware capabilities cannot be fully utilized. To close this performance gap […]

CUDA

•

OpenCL

Jan, 31

Performance of CPU and GPU HPC Architectures for off-design aircraft simulation

This paper presents a detailed analysis of the relative performance and cost of GPU and CPU architectures for a full aircraft RANS simulation using the CFD code zCFD. Using Amazon Web Services as the platform, several generations of NVIDIA GPUs are assessed (T4, V100, and A100) and compared to x86 Intel Broadwell and Skylake CPUs. […]

CUDA

Jan, 31

Efficient MPI-based Communication for GPU-Accelerated Dask Applications

Dask is a popular parallel and distributed computing framework, which rivals Apache Spark to enable task-based scalable processing of big data. The Dask Distributed library forms the basis of this computing engine and provides support for adding new communication devices. It currently has two communication devices: one for TCP and the other for high-speed networks […]

CUDA

Jan, 31

CPU/GPU Code Acceleration on Heterogeneous Systems and Code Verification for CFD Applications

Computational Fluid Dynamics (CFD) applications usually involve intensive computations, which can be accelerated through using open accelerators, especially GPUs due to their common use in the scientific computing community. In addition to code acceleration, it is important to ensure that the code and algorithm are implemented numerically correctly, which is called code verification. This dissertation […]

CUDA

Jan, 31

Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents

We present Text2Gestures, a transformer-based learning method to interactively generate emotive full-body gestures for virtual agents aligned with natural language text inputs. Our method generates emotionally expressive gestures by utilizing the relevant biomechanical features for body expressions, also known as affective features. We also consider the intended task corresponding to the text and the target […]

Jan, 24

Easy and Efficient Agent-based Simulations with the OpenABL Language and Compiler

Agent-based simulations represent an effective scientific tool, with numerous applications from social sciences to biology, which aims to emulate or predict complex phenomena through a set of simple rules performed by multiple agents. To simulate a large number of agents with complex models, practitioners have developed high-performance parallel implementations, often specialized for particular scenarios and […]

CUDA

Jan, 24

Performance Analysis and Improvement of Parallel Differential Evolution

Differential evolution (DE) is an effective global evolutionary optimization algorithm using to solve global optimization problems mainly in a continuous domain. In this field, researchers pay more attention to improving the capability of DE to find better global solutions, however, the computational performance of DE is also a very interesting aspect especially when the problem […]

CUDA

Jan, 24

Non-Parametric Adaptive Network Pruning

Popular network pruning algorithms reduce redundant information by optimizing hand-crafted parametric models, and may cause suboptimal performance and long time in selecting filters. We innovatively introduce non-parametric modeling to simplify the algorithm design, resulting in an automatic and efficient pruning approach called EPruner. Inspired by the face recognition community, we use a message passing algorithm […]

CUDA

Jan, 24

Learning Massive Graph Embeddings on a Single Machine

We propose a new framework for computing the embeddings of large-scale graphs on a single machine. A graph embedding is a fixed length vector representation for each node (and/or edge-type) in a graph and has emerged as the de-facto approach to apply modern machine learning on graphs. We identify that current systems for learning the […]

CUDA

Jan, 24

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an […]

CUDA

•

OpenCL

Jan, 17

Instruments of Productivity for High Performance Computing

High performance computing (HPC) is now well established as the cornerstone for building and conducting software simulations in numerous scientific and industrial fields. The hardware complexity of supercomputers is steadily increasing, however, to deliver ever improved computing performance, causing the complexity of HPC application development to increase as well. As a result, the need for […]

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Why is FPGA-GPU Heterogeneity the Best Option for Embedded Deep Neural Networks?

C-for-Metal: High Performance SIMD Programming on Intel GPUs

Performance of CPU and GPU HPC Architectures for off-design aircraft simulation

Efficient MPI-based Communication for GPU-Accelerated Dask Applications

CPU/GPU Code Acceleration on Heterogeneous Systems and Code Verification for CFD Applications

Text2Gestures: A Transformer-Based Network for Generating Emotive Body Gestures for Virtual Agents

Easy and Efficient Agent-based Simulations with the OpenABL Language and Compiler

Performance Analysis and Improvement of Parallel Differential Evolution

Non-Parametric Adaptive Network Pruning

Learning Massive Graph Embeddings on a Single Machine

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

Instruments of Productivity for High Performance Computing

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)