Posts
Feb, 11
SPIRE, a Sequential to Parallel Intermediate Representation Extension
SPIRE is a new, generic parallel extension for the intermediate representations used in the compilation frameworks of sequential languages; it aims to leverage their existing infrastructure to address both control- and data-parallel languages. Since the efficiency and power of the transformations and optimizations performed by compilers are closely related to the presence of a […]
Feb, 11
Task Parallelism and Synchronization: An Overview of Explicit Parallel Programming Languages
Programming parallel machines as effectively as sequential ones would ideally require a language that provides high-level programming constructs in order to avoid the programming errors that frequently arise when expressing parallelism. Since task parallelism is often considered more error-prone than data parallelism, we survey six popular and efficient parallel programming languages that tackle this difficult issue: Cilk, […]
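The spawn/sync fork-join style that Cilk popularized can be sketched in a few lines. This is a minimal Python illustration (not from the surveyed paper); the function names and the depth cutoff are illustrative choices:

```python
from concurrent.futures import ThreadPoolExecutor

def fib_serial(n):
    """Plain sequential recursion, used below the parallelism cutoff."""
    return n if n < 2 else fib_serial(n - 1) + fib_serial(n - 2)

def fib(n, pool, depth=0):
    """Fork-join task parallelism in the style of Cilk's spawn/sync:
    submit the first recursive call as a task ("spawn"), compute the
    second inline, then wait on the task's result ("sync")."""
    if n < 2:
        return n
    if depth >= 3:                # cut off near the leaves to limit task count
        return fib_serial(n)
    left = pool.submit(fib, n - 1, pool, depth + 1)   # spawn
    right = fib(n - 2, pool, depth + 1)               # work in the parent
    return left.result() + right                      # sync

# At most 1 + 2 + 4 = 7 tasks are spawned with a cutoff of 3, so a pool
# of 8 workers can never deadlock on nested result() waits.
with ThreadPoolExecutor(max_workers=8) as pool:
    print(fib(10, pool))
```

The depth cutoff mirrors the grain-size control that real task-parallel runtimes apply automatically: without it, the cost of creating tasks would swamp the work near the leaves.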
Feb, 11
High-throughput protein crystallization on the World Community Grid and the GPU
We have developed CPU and GPU versions of an automated image analysis and classification system for protein crystallization trial images from the Hauptman-Woodward Institute’s High-Throughput Screening lab. The analysis step computes 12,375 numerical features per image. Using these features, we have trained a classifier that distinguishes 11 different crystallization outcomes, recognizing 80% of all […]
Feb, 11
Characterizing and Evaluating a Key-value Store Application on Heterogeneous CPU-GPU Systems
The recent use of graphics processing units (GPUs) in several top supercomputers demonstrates their ability to consistently deliver positive results in high-performance computing (HPC). GPU support for significant amounts of parallelism would seem to make them strong candidates for non-HPC applications as well. Server workloads are inherently parallel; however, at first glance they may not […]
Feb, 11
Scalability of Self-organizing Maps on a GPU cluster using OpenCL and CUDA
We evaluate a novel implementation of a Self-Organizing Map (SOM) on a Graphics Processing Unit (GPU) cluster. Using various combinations of OpenCL, CUDA, and two different graphics cards, we demonstrate the scalability of the SOM implementation on one to eight GPUs. Results indicate that while the algorithm scales well with the number of training samples […]
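The per-sample SOM training step that such GPU implementations parallelize can be sketched as follows. This is a minimal NumPy illustration of the standard algorithm, not the paper's implementation; the decay schedule and parameter names are assumptions:

```python
import numpy as np

def som_train_step(weights, sample, t, n_steps, lr0=0.5, sigma0=None):
    """One sequential SOM update: find the best-matching unit (BMU),
    then pull map units near the BMU toward the input sample.
    weights has shape (rows, cols, dim); sample has shape (dim,)."""
    rows, cols, dim = weights.shape
    if sigma0 is None:
        sigma0 = max(rows, cols) / 2.0
    # Exponentially decaying learning rate and neighborhood radius.
    lr = lr0 * np.exp(-t / n_steps)
    sigma = sigma0 * np.exp(-t / n_steps)
    # BMU search: the distance computation over all units is the
    # data-parallel part that maps naturally onto a GPU.
    dists = np.linalg.norm(weights - sample, axis=2)
    bmu = np.unravel_index(np.argmin(dists), (rows, cols))
    # Gaussian neighborhood function on the 2-D map grid.
    gy, gx = np.mgrid[0:rows, 0:cols]
    grid_d2 = (gy - bmu[0]) ** 2 + (gx - bmu[1]) ** 2
    h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
    # Update every unit, weighted by its distance to the BMU.
    weights += lr * h[:, :, None] * (sample - weights)
    return weights, bmu
```

Both the BMU search (a reduction over all map units) and the weight update (independent per unit) are embarrassingly parallel, which is why the algorithm scales with map size on GPUs.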
Feb, 10
Automatic Performance Optimization in ViennaCL for GPUs
Highly parallel computing architectures such as graphics processing units (GPUs) pose several new challenges for scientific computing that were absent on single-core CPUs. However, a transition from existing serial code to parallel code for GPUs often requires a considerable amount of effort. The Vienna Computing Library (ViennaCL) presented in the beginning of this […]
Feb, 10
Customizing Instruction Set Extensible Reconfigurable Processors using GPUs
Many reconfigurable processors allow their instruction sets to be tailored according to the performance requirements of target applications. They have gained immense popularity in recent years because of this flexibility of adding custom instructions. However, most design automation algorithms for instruction set customization (like enumerating and selecting the optimal set of custom instructions) are computationally […]
Feb, 10
Ensemble K-means on multi-core architectures
Ensemble methods use multiple models generated from a data set to improve accuracy and ensure faster convergence. The use of multiple models makes ensemble problems computationally intensive. In this paper, we explore the parallelization of ensemble problems on modern parallel hardware such as multi-core CPUs and GPUs. We use the K-means clustering algorithm as a case […]
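The structure being parallelized can be sketched with plain NumPy. This is a generic illustration of ensemble K-means (independent restarts, keep the best model), not the paper's implementation; the model-selection criterion and parameter names are assumptions:

```python
import numpy as np

def kmeans_iteration(points, centroids):
    """One K-means iteration. The assignment step is independent per
    point — the embarrassingly parallel kernel that maps to GPUs."""
    # Assignment: nearest centroid for every point, fully vectorized.
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    # Update: mean of the points assigned to each centroid
    # (keep the old centroid if its cluster went empty).
    new_centroids = np.array([
        points[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return new_centroids, labels

def ensemble_kmeans(points, k, n_models=4, n_iters=10, seed=0):
    """Run several independently seeded K-means models (the ensemble) —
    each model is itself parallel work — and keep the one with the
    lowest within-cluster sum of squared errors (SSE)."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_models):
        c = points[rng.choice(len(points), k, replace=False)]
        for _ in range(n_iters):
            c, labels = kmeans_iteration(points, c)
        sse = ((points - c[labels]) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, c, labels)
    return best[1], best[2]
```

The two levels of parallelism are visible here: across the independent models of the ensemble, and within each model across points in the assignment step.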
Feb, 10
Implementing Molecular Dynamics on Hybrid High Performance Computers – Particle-Particle Particle-Mesh
The use of accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high-performance computers, machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more […]
Feb, 10
Real-Time SAH BVH Construction for Ray Tracing Dynamic Scenes
This work develops effective algorithms for building full SAH BVH trees on the GPU in real time. It assumes that all scene objects are represented by a set of triangles (a so-called "triangle soup"), while arbitrary changes to the geometry are allowed […]
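The SAH cost that such builders minimize at every split can be written down directly. This is the standard Surface Area Heuristic formula as a minimal illustration, not the paper's code; the traversal and intersection cost constants are placeholder values:

```python
import numpy as np

def aabb_surface_area(lo, hi):
    """Surface area of an axis-aligned bounding box given its corners."""
    d = np.maximum(hi - lo, 0.0)
    return 2.0 * (d[0] * d[1] + d[1] * d[2] + d[0] * d[2])

def sah_cost(parent_lo, parent_hi, left, right, c_trav=1.0, c_isect=1.0):
    """SAH cost of splitting a node into `left` and `right` primitive
    sets (each an array of AABBs with shape (n, 2, 3)):
        C = C_trav + (A_L / A) * N_L * C_isect + (A_R / A) * N_R * C_isect
    where A is the parent's surface area, so each child's cost is
    weighted by the probability a random ray hitting the parent hits it."""
    area_p = aabb_surface_area(parent_lo, parent_hi)

    def group_cost(boxes):
        if len(boxes) == 0:
            return 0.0
        lo = boxes[:, 0, :].min(axis=0)   # tight bounds of the group
        hi = boxes[:, 1, :].max(axis=0)
        return aabb_surface_area(lo, hi) / area_p * len(boxes) * c_isect

    return c_trav + group_cost(left) + group_cost(right)
```

A full builder evaluates this cost for many candidate split planes per node and keeps the cheapest; since each candidate is independent, the evaluation parallelizes well on the GPU.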
Feb, 9
Accelerating H.264 Advanced Video Coding with GPU/CUDA Technology
With the rise of streaming media on the Internet and the YouTube revolution, the demand for online videos is costing companies a significant amount of bandwidth. To alleviate the resources needed for streaming media, video compression removes redundant information and minimizes the loss in quality experienced by a human audience. In response to the need […]
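A large share of an H.264 encoder's work is block motion estimation, whose inner loop can be sketched simply. This is a generic full-search SAD matcher for illustration, not the paper's CUDA implementation; the block size and search range are arbitrary choices:

```python
import numpy as np

def sad(block_a, block_b):
    """Sum of absolute differences between two equal-sized blocks."""
    return np.abs(block_a.astype(int) - block_b.astype(int)).sum()

def full_search(cur_block, ref_frame, x, y, search_range=4):
    """Full-search motion estimation: find the displacement (dy, dx)
    within +/- search_range of (y, x) in the reference frame that
    minimizes SAD against the current block. Every candidate
    displacement is independent — the data parallelism GPU encoders
    exploit by assigning candidates (or blocks) to threads."""
    n = cur_block.shape[0]
    h, w = ref_frame.shape
    best, best_sad = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy and yy + n <= h and 0 <= xx and xx + n <= w:
                s = sad(cur_block, ref_frame[yy:yy + n, xx:xx + n])
                if best_sad is None or s < best_sad:
                    best_sad, best = s, (dy, dx)
    return best, best_sad
```

The winning displacement becomes the block's motion vector, so only the (usually small) residual after motion compensation needs to be transformed and entropy-coded.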
Feb, 9
Parallel Semi-Implicit Time Integrators
In this paper, we further develop a family of parallel time integrators known as Revisionist Integral Deferred Correction methods (RIDC) to allow for the semi-implicit solution of time-dependent PDEs. Additionally, we show that our semi-implicit RIDC algorithm can harness the computational potential of multiple general-purpose graphics processing units (GPGPUs) by utilizing existing CUBLAS […]