high performance computing on graphics processing units: hgpu.org

Posts

Oct, 30

A systematic performance study of the parallel programming framework SkePU 3 using HPC-benchmarks

With hardware performance no longer following Moore’s law, software optimization becomes more important. In this paper, we discuss parallel programming, which is one way to optimize software. However, writing parallel code is considered more difficult than writing sequential code. There is often a specific framework to be used to write parallel code for each type […]

CUDA • OpenCL

Oct, 30

Benchmarking GPU and TPU Performance with Graph Neural Networks

Many artificial intelligence (AI) devices have been developed to accelerate the training and inference of neural network models. The most common ones are the Graphics Processing Unit (GPU) and Tensor Processing Unit (TPU). They are highly optimized for dense data representations. However, sparse representations such as graphs are prevalent in many domains, including science. It […]

CUDA

Oct, 30

gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs

As interest in Graph Neural Networks (GNNs) grows, the importance of benchmarking and performance characterization studies of GNNs is increasing. So far, we have seen many studies that investigate and present the performance and computational efficiency of GNNs. However, the work done so far has been carried out using a few high-level GNN […]

CUDA

Oct, 30

Providing performance portable numerics for Intel GPUs

With discrete Intel GPUs entering the high-performance computing landscape, there is an urgent need for production-ready software stacks for these platforms. In this article, we report how we enable the Ginkgo math library to execute on Intel GPUs by developing a kernel backend based on the DPC++ programming environment. We discuss conceptual differences between the […]

CUDA • OpenCL

Oct, 30

torchode: A Parallel ODE Solver for PyTorch

We introduce an ODE solver for the PyTorch ecosystem that can solve multiple ODEs in parallel independently from each other while achieving significant performance gains. Our implementation tracks each ODE’s progress separately and is carefully optimized for GPUs and compatibility with PyTorch’s JIT compiler. Its design lets researchers easily augment any aspect of the solver […]

Oct, 23

A Ray Tracing Implementation Performance Comparison between the CPU and the GPU

Ray tracing has gained recent popularity due to the advancement of computer hardware capabilities. The algorithm is used as a rendering technique for computer graphics by tracing rays of light to determine the color of a single pixel, thus simulating the physical behavior of light. This study explores the performance differences between the ray tracing […]

CUDA

Oct, 23

Tausch: A halo exchange library for large heterogeneous computing systems using MPI, OpenCL, and CUDA

Exchanging halo data is a common task in modern scientific computing applications, and efficient handling of this operation is critical for the performance of the overall simulation. Tausch is a novel header-only library that provides a simple API for efficiently handling these types of data movements. Tausch supports simple CPU-only systems, but also more […]

CUDA • OpenCL

Oct, 23

Thwarting Piracy: Anti-debugging Using GPU-assisted Self-healing Codes

Software piracy is one of the concerns in the IT sector. Pirates leverage debugger tools to reverse engineer the logic that verifies license keys or to bypass the entire verification process. Anti-debugging techniques are used to defeat piracy using self-healing codes. However, anti-debugging methods can be defeated when the licensing protections are limited to […]

CUDA

Oct, 23

Behavioral graph fraud detection in E-commerce

In the e-commerce industry, graph neural network methods are a new trend for transaction risk modeling. The power of graph algorithms lies in their capability to capture transaction linking network information, which is very hard to capture with other algorithms. However, in most existing approaches, transaction or user connections are defined by hard link strategies on shared […]

Oct, 23

From Task-Based GPU Work Aggregation to Stellar Mergers: Turning Fine-Grained CPU Tasks into Portable GPU Kernels

Meeting both scalability and performance portability requirements is a challenge for any HPC application, especially for adaptively refined ones. In Octo-Tiger, an astrophysics application for the simulation of stellar mergers, we approach this with existing solutions: We employ HPX to obtain fine-grained tasks to easily distribute work and finely overlap communication and computation. For the […]

CUDA

Oct, 16

Distributed, combined CPU and GPU profiling within HPX using APEX

Benchmarking and comparing performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is constructed with an asynchronous, many-task (AMT) runtime offloading work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture […]

CUDA

Oct, 16

Dataloader Parameter Tuner: An Automated Dataloader Parameter Tuner for Deep Learning Models

Deep learning has recently become one of the most compute/data-intensive methods and is widely used in many research areas and businesses. One of the critical challenges of deep learning is that it has many parameters that can be adjusted, and the optimal value may need to be determined for faster operation and high accuracy. The […]

CUDA

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)