high performance computing on graphics processing units: hgpu.org

Posts

Oct, 18

Evaluating the Performance and Portability of OpenCL

Recent developments in processor architecture have established a shift from sequential processing to parallel processing. This shift was not based on a breakthrough in processor design, but was an alternative design trajectory that avoids the limits reached in single-core development. Along with the shift towards parallel architectures, a gap arose between […]

CUDA • OpenCL

Oct, 18

Odeint – Solving ordinary differential equations in C++

Many physical, biological or chemical systems are modeled by ordinary differential equations (ODEs), and finding their solution is an everyday task for many scientists. Here, we introduce a new C++ library dedicated to finding numerical solutions of initial value problems of ODEs: odeint (www.odeint.com). odeint is implemented in a highly generic way and provides extensive interoperability […]

CUDA

Oct, 18

Optimization strategies for parallel CPU and GPU implementations of a meshfree particle method

Much of the current focus in high performance computing (HPC) for computational fluid dynamics (CFD) is on grid-based methods. However, parallel implementations of newer meshfree particle methods such as Smoothed Particle Hydrodynamics (SPH) are less studied. In this work, we present optimizations for both central processing unit (CPU) and graphics processing unit (GPU) of […]

CUDA

Oct, 18

Fast N-body Simulations on GPUs

With the current hybridization of treecodes and FMMs, combined with auto-tuning capabilities on heterogeneous architectures, the flexibility of fast N-body methods has been greatly enhanced. These features are a requirement for developing a black-box software library for fast N-body algorithms on heterogeneous systems, which is our immediate goal.

CUDA

Oct, 17

Fast Implementation of DGEMM on Fermi GPU

In this paper we present a thorough account of our experience tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance model based on micro-architecture benchmarks. […]

CUDA

Oct, 17

Numerical Accuracy Analysis Based on the Discrete Stochastic Arithmetic on Multiprocessor Platforms

Simulating the real world has become one of the most widely used techniques in engineering today. Multiprocessor platforms play a key role in this development since bigger and bigger problems need more computing power to be solved. When the floating point standard was adopted in the early eighties of the 20th century, the amount of […]

CUDA

Oct, 17

Data-Driven Programming Abstractions and Optimization for Multi-Core Platforms

Multi-core platforms have spread to all corners of the computing industry, and trends in design and power indicate that the shift to multi-core will become even more widespread in the future. As the number of cores on a chip rises, the complexity of memory systems and on-chip interconnects increases drastically. The programmer inherits this complexity in […]

CUDA

Oct, 17

Implementing Stereo Vision of GPU-Accelerated Scientific Simulations using Commodity Hardware

Stereo vision technology is becoming more and more commonplace in the movie and gaming industries. It has applications in many other fields as well, one of which is viewing scientific data. We develop a stereo vision system using commodity-priced hardware and portable graphics software. Hardware and software details are described, as well as some […]

OpenGL

Oct, 17

Magneto-hydrodynamics simulation in astrophysics

Magnetohydrodynamics (MHD) studies the dynamics of an electrically conducting fluid under the influence of a magnetic field. Many astrophysical phenomena are related to MHD, and computer simulations are used to model these dynamics. In this thesis, we conduct MHD simulations of non-radiative black hole accretion as well as fast magnetic reconnection. By performing large scale […]

CUDA • OpenCL

Oct, 17

The Lattice Boltzmann Simulation on Multi-GPU Systems

The Lattice Boltzmann Method (LBM) is widely used to simulate different types of flow, such as water, oil and gas in porous reservoirs. In the oil industry it is commonly used to estimate petrophysical properties of porous rocks, such as the permeability. To achieve the required accuracy it is necessary to use big simulation models […]

OpenCL

Oct, 17

An Optimization for Fast Generation of Digital Hologram

Digital hologram generation methods commonly use a computer-generated hologram (CGH) algorithm. However, this algorithm is computationally expensive. Thus, this paper proposes an optimization method for fast generation of digital holograms. The proposed method uses CUDA and OpenMP for multi-GPU execution. It also applies various optimization methods (variable fixation, vectorization, and loop unrolling) to a CGH algorithm. […]

CUDA

Oct, 17

Dynamic Fine-Grain Scheduling of Pipeline Parallelism

Scheduling pipeline-parallel programs, defined as a graph of stages that communicate explicitly through queues, is challenging. When the application is regular and the underlying architecture can guarantee predictable execution times, several techniques exist to compute highly optimized static schedules. However, these schedules do not admit run-time load balancing, so variability introduced by the application or […]

CUDA

