high performance computing on graphics processing units: hgpu.org

Posts

Jun, 5

Efficient Embarrassingly Parallel on Graphics Processor Unit

Embarrassingly Parallel (EP) is one of the kernel benchmarks in the NAS Parallel Benchmarks (NPB), a set of programs designed to help evaluate the performance of parallel supercomputers. In the EP benchmark, two-dimensional statistics are accumulated from a large number of Gaussian pseudo-random numbers, which are produced by a Linear Congruential Generator (LCG). In this paper, we […]

CUDA

Jun, 5

Accelerating Unstructured Mesh Computational Fluid Dynamics on the NVidia Tesla GPU Architecture

This report presents steps towards accelerating Fluidity, a general-purpose computational fluid dynamics package. One portion of the code, an iterative solver, is targeted for optimisation by using Graphics Processing Units (GPUs) to perform computations. A literature survey which examines the performance issues of iterative solvers and optimisations which may overcome these issues on classical and […]

CUDA

Jun, 5

Performance Analysis of the OP2 Framework on Many-core Architectures

We present a performance analysis and benchmarking study of the OP2 "active" library, which provides an abstraction framework for the solution of parallel unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, achieving code longevity and near-optimal performance through re-targeting the back-end to different hardware. Runtime performance […]

CUDA

Jun, 5

A framework for parallel unstructured grid applications on GPUs

Partial differential equations (PDEs) are important in a wide variety of applications. The goal is a suitable level of abstraction that separates the user’s specification of the application from the details of the parallel implementation. The aim is to achieve code longevity and near-optimal performance through re-targeting the back-end to different hardware.

CUDA, OpenCL

Jun, 5

QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators

One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization […]

CUDA

Jun, 4

Accelerating GPU kernels for dense linear algebra

Implementations of the Basic Linear Algebra Subprograms (BLAS) interface are a major building block of dense linear algebra (DLA) libraries, and therefore have to be highly optimized. We present some techniques and implementations that significantly accelerate the corresponding routines from currently available libraries for GPUs. In particular, Pointer Redirecting – a set of GPU-specific optimization […]

CUDA

Jun, 4

A Scalable High Performant Cholesky Factorization for Multicore with GPU Accelerators

We present a Cholesky factorization for multicore systems with GPU accelerators. The challenges in developing scalable high-performance algorithms for these emerging systems stem from their heterogeneity, massive parallelism, and the huge gap between the GPUs’ compute power and the CPU-GPU communication speed. We show an approach that is largely based on software infrastructures that […]

CUDA

Jun, 4

Numerical simulation of 3D particulate flows based on GPU technology

This thesis deals with a particular problem from the research field of computational fluid dynamics: the numerical simulation of fluids containing soluted rigid particles. Such problems arise within a variety of applied sciences, such as medicine, ecology and engineering, and need to be studied in detail in three dimensions. So far most scientific publications on […]

CUDA

Jun, 4

Monte Carlo Radiative Transport on the GPU

This paper presents a fast parallel Monte Carlo method to solve the radiative transport equation in inhomogeneous participating media. The implementation is based on CUDA and runs on the GPU. In order to meet the requirements of the parallel GPU architecture and to reuse shooting paths, we follow a photon mapping approach where during gathering […]

CUDA

Jun, 4

High performance stream computing for particle beam transport simulations

Understanding modern particle accelerators requires simulating charged particle transport through the machine elements. These simulations can be very time-consuming due to the large number of particles and the need to consider many turns of a circular machine. Stream computing offers an attractive way to dramatically improve the performance of such simulations by calculating the […]

Jun, 3

A streaming narrow-band algorithm: interactive computation and visualization of level sets

Deformable isosurfaces, implemented with level-set methods, have demonstrated a great potential in visualization and computer graphics for applications such as segmentation, surface processing, and physically-based modeling. Their usefulness has been limited, however, by their high computational cost and reliance on significant parameter tuning. We present a solution to these challenges by describing graphics processor (GPU) […]

OpenGL

Jun, 3

Synthesizing Subdivision Meshes Using Real Time Tessellation

We propose a new GPU method for synthesizing subdivision meshes with exact adaptive geometry in real time. Our GPU kernel builds upon precomputed tables of basis functions for subdivision surfaces and therefore supports all subdivision schemes, either interpolating or approximating, for triangle or quad meshes. We designed our kernel so that it can be […]

OpenGL

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)