
Posts

Oct, 1

Experience Migrating OpenCL to SYCL: A Case Study on Searches for Potential Off-Target Sites of Cas9 RNA-Guided Endonucleases on AMD GPUs

Cas-OFFinder is a popular application written in OpenCL for searching potential off-target sites in parallel on a GPU. In this work, we describe our experience of migrating the application from OpenCL to SYCL. Evaluating the performance of the OpenCL and SYCL applications using human genome sequences shows that the SYCL program could achieve performance portability […]
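
To make the migration concrete, here is a minimal sketch of how a typical OpenCL NDRange kernel maps onto a SYCL parallel_for; the kernel body and names are illustrative, not taken from Cas-OFFinder:

```cpp
// Hypothetical sketch: an OpenCL-style data-parallel kernel expressed in SYCL.
// OpenCL C's get_global_id(0) becomes the sycl::id<1> parameter here.
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    std::vector<int> data(1024, 1);
    sycl::queue q;  // selects a default device, e.g. an AMD GPU under hipSYCL
    {
        sycl::buffer<int> buf(data.data(), sycl::range<1>(data.size()));
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc{buf, h, sycl::read_write};
            h.parallel_for(sycl::range<1>(data.size()), [=](sycl::id<1> idx) {
                acc[idx] *= 2;  // stand-in for the real per-site matching work
            });
        });
    }  // buffer destruction copies results back into data
}
```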
Oct, 1

OpenMP Kernel Language Extensions for Performance Portable GPU Codes

In contemporary high-performance computing architectures, the integration of GPU accelerators has become increasingly prevalent. To harness the full potential of these accelerators, developers often resort to vendor-specific kernel languages, such as CUDA. While this approach ensures optimal efficiency, it inherently compromises portability and engenders vendor dependency. Existing portable programming models, such as OpenMP, while promising, […]
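
As a rough illustration of the portable baseline such work starts from, a standard OpenMP target-offload loop looks like the sketch below; the array size and values are illustrative:

```cpp
// Minimal sketch of GPU offload with standard OpenMP directives: the portable
// style whose limitations kernel-language extensions aim to address.
#include <cstdio>

int main() {
    const int n = 1 << 20;
    float* x = new float[n];
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    // Offload to an attached GPU; falls back to the host if no device exists.
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= 2.0f;

    std::printf("x[0] = %f\n", x[0]);
    delete[] x;
}
```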
Oct, 1

Beehive SPIR-V Toolkit: A Composable and Functional API for Runtime SPIR-V Code Generation

The Standard Portable Intermediate Representation (SPIR-V) is a low-level binary format designed for representing shaders and compute kernels that can be consumed by OpenCL for compute kernels and by Vulkan for graphics rendering. As a binary representation, SPIR-V is meant to be used by compilers and runtime systems, and its generation is usually performed by C/C++ programs and […]
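
For context on how a runtime consumes SPIR-V, the sketch below shows standard OpenCL (2.1+) host code loading a precompiled module; the file name is illustrative, and this is not the toolkit's own API:

```cpp
// Sketch: feeding a SPIR-V binary to an OpenCL driver with clCreateProgramWithIL.
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    cl_platform_id platform; clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device; clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

    // Read a SPIR-V module produced ahead of time (e.g. by a compiler or toolkit).
    std::ifstream f("kernel.spv", std::ios::binary);
    std::vector<char> il((std::istreambuf_iterator<char>(f)),
                         std::istreambuf_iterator<char>());

    // Hand the intermediate representation directly to the runtime.
    cl_program prog = clCreateProgramWithIL(ctx, il.data(), il.size(), &err);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
}
```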
Oct, 1

Novel Parallelization Strategies for High-Performance DNN Training on HPC Systems

Deep Learning has achieved state-of-the-art performance in several artificial intelligence tasks like object recognition, speech recognition, machine translation, and summarization. Deep learning is a subset of machine learning that learns multiple levels of data representation using Neural Networks (NNs). The rise of deep learning can be attributed to the presence of large datasets and computation […]
Sep, 24

Julia as a unifying end-to-end workflow language on the Frontier exascale system

We evaluate using Julia as a single language and ecosystem paradigm powered by LLVM to develop workflow components for high-performance computing. We run a Gray-Scott, 2-variable diffusion-reaction application using a memory-bound, 7-point stencil kernel on Frontier, the US Department of Energy’s first exascale supercomputer. We evaluate the feasibility, performance, scaling, and trade-offs of (i) the […]
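
The memory-bound character of the kernel comes from the 7-point stencil itself. The paper's implementation is in Julia; a plain C++ sketch of one such update, with grid layout and coefficient as assumptions, is:

```cpp
// Illustrative 7-point stencil update of the kind used in a Gray-Scott
// diffusion-reaction kernel.
#include <vector>

void stencil7(const std::vector<double>& u, std::vector<double>& out,
              int nx, int ny, int nz, double c) {
    auto idx = [=](int i, int j, int k) { return (k * ny + j) * nx + i; };
    for (int k = 1; k < nz - 1; ++k)
        for (int j = 1; j < ny - 1; ++j)
            for (int i = 1; i < nx - 1; ++i)
                // Center point plus its six axis-aligned neighbours: 7 points,
                // little arithmetic per load, hence memory-bound.
                out[idx(i, j, k)] = u[idx(i, j, k)]
                    + c * (u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                         + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                         + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]
                         - 6.0 * u[idx(i, j, k)]);
}
```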
Sep, 24

Evaluating the performance portability of SYCL across CPUs and GPUs on bandwidth-bound applications

In this paper, we evaluate the portability of the SYCL programming model on some of the latest CPUs and GPUs from a wide range of vendors, utilizing the two main compilers: DPC++ and hipSYCL/OpenSYCL. Both compilers currently support GPUs from all three major vendors; we evaluate performance on the Intel(R) Data Center GPU Max 1100, […]
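
A typical bandwidth-bound kernel of the kind such studies measure is the STREAM-style triad; a SYCL sketch (array size and scalar are illustrative) that both DPC++ and hipSYCL/OpenSYCL can compile unchanged:

```cpp
// STREAM-style triad in SYCL: the shape of kernel whose achievable rate is
// bounded by memory bandwidth rather than compute throughput.
#include <sycl/sycl.hpp>

int main() {
    constexpr size_t n = 1 << 25;
    constexpr float scalar = 0.4f;
    sycl::queue q;
    float* a = sycl::malloc_device<float>(n, q);
    float* b = sycl::malloc_device<float>(n, q);
    float* c = sycl::malloc_device<float>(n, q);

    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        b[i] = 1.0f; c[i] = 2.0f;
    }).wait();

    // Triad: two loads and one store per element.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        a[i] = b[i] + scalar * c[i];
    }).wait();

    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}
```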
Sep, 24

Comparing Performance and Portability between CUDA and SYCL for Protein Database Search on NVIDIA, AMD, and Intel GPUs

The heterogeneous computing paradigm has led to the need for portable and efficient programming solutions that can leverage the capabilities of various hardware devices, such as NVIDIA, Intel, and AMD GPUs. This study evaluates the portability and performance of the SYCL and CUDA languages for one fundamental bioinformatics application (Smith-Waterman protein database search) across different […]
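
The kernels under comparison are built around the Smith-Waterman cell recurrence; a scalar C++ sketch of one cell update, with scoring parameters as illustrative assumptions rather than the paper's, is:

```cpp
// One Smith-Waterman cell: the local-alignment recurrence
// H(i,j) = max(0, diag + score, up + gap, left + gap).
#include <algorithm>

inline int sw_cell(int diag, int up, int left, bool match) {
    const int match_score = 2, mismatch = -1, gap = -2;  // assumed parameters
    int s = diag + (match ? match_score : mismatch);     // extend an alignment
    s = std::max(s, up + gap);                           // gap in one sequence
    s = std::max(s, left + gap);                         // gap in the other
    return std::max(s, 0);                               // local alignment floor
}
```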
Sep, 24

Compressed Real Numbers for AI: a case-study using a RISC-V CPU

As recently demonstrated, Deep Neural Networks (DNN), usually trained using single-precision IEEE 754 floating point numbers (binary32), can also work using lower precision. Therefore, 16-bit and 8-bit compressed formats have attracted considerable attention. In this paper, we focus on two families of formats that have already achieved interesting results in compressing binary32 numbers in […]
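
One widely used 16-bit compressed format is bfloat16, which keeps the sign bit, all 8 exponent bits, and the top 7 mantissa bits of a binary32 value. A minimal round-to-nearest-even conversion sketch (NaN handling omitted):

```cpp
// bfloat16 <-> binary32: compression drops the low 16 mantissa bits.
#include <cstdint>
#include <cstring>

uint16_t float_to_bfloat16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));            // reinterpret the binary32 pattern
    uint32_t rounding = 0x7FFF + ((bits >> 16) & 1); // round to nearest, ties to even
    return static_cast<uint16_t>((bits + rounding) >> 16);
}

float bfloat16_to_float(uint16_t h) {
    uint32_t bits = static_cast<uint32_t>(h) << 16;  // zero-fill the dropped mantissa
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}
```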
Sep, 24

Compiler-assisted distribution of OpenMP code for improved scalability

High performance computing is a complex field, with many homogeneous and heterogeneous hardware architectures, and numerous programming paradigms, libraries and compilers. OpenMP and netCDF are relatively widely used in Earth system research because they are comparatively easy to learn and yet can exploit the potential of a single compute node. However, Earth system scientists without […]
Sep, 17

Improving the Efficiency of OpenCL Kernels through Pipes

Over the past few years, there has been an increased interest in using FPGAs alongside CPUs and GPUs in high-performance computing systems and data centers. This trend has led to a push toward the use of high-level programming models and libraries, such as OpenCL, both to lower the barriers to the adoption of FPGAs by […]
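
For readers unfamiliar with the feature, OpenCL 2.0 pipes connect kernels through an on-chip FIFO instead of round-tripping through global memory, which is what makes them attractive on FPGAs. A minimal producer/consumer sketch with illustrative kernel bodies, held here as a C++ string constant:

```cpp
// Sketch of the producer/consumer pattern that OpenCL 2.0 pipes enable:
// two kernels connected by a FIFO rather than by global-memory buffers.
const char* kernels = R"CLC(
kernel void producer(global const int* in, write_only pipe int p) {
    int v = in[get_global_id(0)];
    write_pipe(p, &v);              // push into the FIFO
}
kernel void consumer(read_only pipe int p, global int* out) {
    int v;
    read_pipe(p, &v);               // pop from the FIFO
    out[get_global_id(0)] = v * 2;
}
)CLC";
// On the host, the FIFO itself is created with
// clCreatePipe(ctx, 0, sizeof(int), max_packets, NULL, &err)
// and passed to both kernels as the pipe argument.
```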
Sep, 17

Comparing Llama-2 and GPT-3 LLMs for HPC kernels generation

We evaluate the use of the open-source Llama-2 model for generating well-known, high-performance computing kernels (e.g., AXPY, GEMV, GEMM) on different parallel programming models and languages (e.g., C++: OpenMP, OpenMP Offload, OpenACC, CUDA, HIP; Fortran: OpenMP, OpenMP Offload, OpenACC; Python: numpy, Numba, pyCUDA, cuPy; and Julia: Threads, CUDA.jl, AMDGPU.jl). We built upon our previous work […]
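
For reference, AXPY (y = a*x + y) is the simplest kernel in the evaluated set; one of the target variants, C++ with OpenMP, is only a few lines:

```cpp
// AXPY in the C++/OpenMP style from the study's list of target models.
#include <cstddef>
#include <vector>

void axpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    #pragma omp parallel for
    for (std::size_t i = 0; i < y.size(); ++i)
        y[i] += a * x[i];
}
```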
Sep, 17

Many Cores, Many Models: GPU Programming Model vs. Vendor Compatibility Overview

In recent history, GPUs became a key driver of compute performance in HPC. With the installation of the Frontier supercomputer, they became the enablers of the Exascale era; further large-scale installations are in progress (Aurora, El Capitan, JUPITER). But the early dominance of NVIDIA and their CUDA programming model has changed: The current HPC GPU […]
