high performance computing on graphics processing units: hgpu.org

Posts

Nov, 23

Graph grammar based multi-frontal direct solver for isogeometric FEM simulations on GPU

We present a multi-frontal direct solver for two-dimensional isogeometric finite element method simulations with NVIDIA CUDA and perform numerical experiments for linear, quadratic and cubic B-splines. We compare the computational cost O(Np^2) for the 2D parallel shared-memory implementation with the corresponding estimate O(N^1.5p^3) for a standard 2D sequential implementation. We conclude the presentation with […]

CUDA

Nov, 23

Fast 4pi track reconstruction in nuclear emulsion detectors based on GPU technology

Fast 4pi solid angle particle track recognition has been a challenge in particle physics for a long time, especially when using nuclear emulsion detectors. Recent advances in computing technology have opened the way for its realization. A fast 4pi solid angle particle track reconstruction based on GPU technology combined with multithreaded programming is reported […]

CUDA

Nov, 23

Dynamic Partitioning-based JPEG Decompression on Heterogeneous Multicore Architectures

With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Despite the fact that these platforms are heterogeneous multicores, no approach exists yet that is capable […]

OpenCL

Nov, 22

Accelerating Sequential Computer Vision Algorithms Using Commodity Parallel Hardware

Since 2004, the clock frequency of CPUs has not increased significantly. Computer Vision applications have an increasing demand for more processing power and are limited by the performance capabilities of sequential processor architectures. The only way to get better performance using commodity hardware is to adopt parallel programming. Many other related research projects have considered […]

OpenCL

Nov, 22

An improved parallel contrast-aware halftoning

Digital image halftoning is a widely used technique. However, achieving high fidelity tone reproduction and structural preservation with low computational time-cost remains a challenging problem. This paper presents a highly parallel algorithm to boost the real-time application of the serial structure-preserving error diffusion. The contrast-aware halftoning approach is one such technique with superior structure preservation, […]

CUDA

Nov, 22

An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. The offline permutation is a task to copy numbers stored in an array a of size n to an array b of the same size along a permutation P given in advance. A conventional algorithm […]

CUDA
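The offline permutation task described in the excerpt above can be illustrated with a minimal sequential sketch. It assumes the convention b[P[i]] = a[i], which the excerpt does not spell out, and the function name `offline_permute` is hypothetical. This is only a plain reference version of the task itself, not the optimal HMM algorithm or the GPU implementation from the paper.

```python
def offline_permute(a, p):
    """Return a new list b with b[p[i]] = a[i] for every index i.

    Assumption: p is a permutation of range(len(a)) known in advance,
    and data moves from position i in a to position p[i] in b.
    """
    n = len(a)
    # Reject inputs that are not a valid permutation of 0..n-1.
    if len(p) != n or sorted(p) != list(range(n)):
        raise ValueError("p must be a permutation of range(len(a))")
    b = [None] * n
    for i in range(n):
        b[p[i]] = a[i]  # one read from a, one scattered write to b
    return b


if __name__ == "__main__":
    print(offline_permute([10, 20, 30, 40], [2, 0, 3, 1]))  # [20, 40, 10, 30]
```

The scattered writes to b are the part that makes this task interesting on a GPU, since their memory access pattern depends entirely on P.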

Nov, 22

Optimization of the Oktay-Kronfeld Action Conjugate Gradient Inverter

Improving the Fermilab action to third order in heavy quark effective theory yields the Oktay-Kronfeld action, a promising candidate for precise calculations of the spectra of heavy quark systems and weak matrix elements relevant to searches for new physics. We have optimized the bi-stabilized conjugate gradient inverter in the SciDAC QOPQDP library and are developing […]

CUDA

Nov, 22

Bohrium: Unmodified NumPy Code on CPU, GPU, and Cluster

In this paper we introduce Bohrium, a runtime system for mapping array operations onto a number of different hardware platforms, from multi-core systems to clusters and GPU-enabled systems. As a result, the Bohrium runtime system enables NumPy code to utilize CPUs, GPUs, and clusters. Bohrium integrates seamlessly into NumPy through the implicit data parallelization of array […]

OpenCL

Nov, 21

Experience with Intel’s Many Integrated Core architecture in ATLAS software

Intel recently released the first commercial boards of its Many Integrated Core (MIC) Architecture. MIC is Intel’s solution for the domain of throughput computing, currently dominated by general purpose programming on graphics processors (GPGPU). MIC allows the use of the more familiar x86 programming model and supports standard technologies such as OpenMP, MPI, and Intel’s […]

Nov, 21

Direct Numeric Simulation of Sheared Convective Boundary Layer Entrainment with GPUs

Sheared convective boundary layers (SCBL) are frequently observed in nature and industry. This paper presents work conducted to validate a numerical fluid model of sheared convective boundary layers implemented in Nvidia’s CUDA programming language for graphics processing units. The code is based on a finite difference implementation of the SIMPLE algorithm using the […]

CUDA

Nov, 21

Towards an interactive and automated script feature analysis of 3D scanned cuneiform tablets

Current digitalization projects of ancient artifacts in the field of cultural heritage produce large amounts of data that cannot be managed and analyzed in a reasonable amount of time by means of conventional philological methods. Therefore, this paper presents a novel approach to performing a fast and interactive 3D script feature extraction, analysis and […]

OpenGL

Nov, 20

Multi-GPU Support on the Marrow Algorithmic Skeleton Framework

With the proliferation of general purpose GPUs, workload parallelization and data-transfer optimization have become an increasing concern. The natural evolution from using a single GPU is to multiply the number of available processors, which presents new challenges, such as tuning workload decomposition and load balancing when dealing with heterogeneous systems. Higher-level programming is a very important asset in […]

CUDA, OpenCL

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs


Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

Most viewed papers (last 30 days)