high performance computing on graphics processing units: hgpu.org

Posts

Nov, 24

Real-time Building Airflow Simulation Aided by GPU and FFD

Two recent methods for fast simulation of building airflow are studied: the fast fluid dynamics (FFD) algorithm and the use of graphics processing units (GPUs) for scientific computing in building engineering. A Google SketchUp plug-in for the FFD program was also developed as a model-creating tool to enhance the accessibility of the operation […]

CUDA

Nov, 23

LoGV: Low-overhead GPGPU Virtualization

Over the last few years, running high performance computing applications in the cloud has become feasible. At the same time, GPGPUs are delivering unprecedented performance for HPC applications. Cloud providers thus face the challenge of integrating GPGPUs into their virtualized platforms, which has proven difficult for current virtualization stacks. In this paper, we present LoGV, […]

CUDA

Nov, 23

TESLA GPUs versus MPI with OpenMP for the Forward Modeling of Gravity and Gravity Gradient of Large Prisms Ensemble

An implementation using CUDA technology on one and on several graphics processing units (GPUs) is presented for the forward modeling of gravitational fields from a three-dimensional volumetric ensemble composed of unitary prisms of constant density. We compared the performance results obtained with the GPUs against a previous version coded in […]

CUDA

Nov, 23

Graph grammar based multi-frontal direct solver for isogeometric FEM simulations on GPU

We present a multi-frontal direct solver for two-dimensional isogeometric finite element method simulations with NVIDIA CUDA and perform numerical experiments for linear, quadratic and cubic B-splines. We compare the computational cost O(Np^2) for the 2D parallel shared-memory implementation with the corresponding estimate O(N^1.5p^3) for a standard 2D sequential implementation. We conclude the presentation with […]

CUDA
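The two cost estimates quoted in the abstract above can be compared directly: dividing O(N^1.5p^3) by O(Np^2) leaves a factor of N^0.5 p. The sketch below only evaluates that ratio. It assumes N is the problem size and p the B-spline order (the excerpt does not define them), it ignores all constant factors hidden by the O-notation, and the function names are hypothetical, so the result is a growth trend and not a predicted speedup.

```python
import math


def parallel_cost(n, p):
    """Cost estimate N * p^2 quoted for the 2D parallel shared-memory solver."""
    return n * p ** 2


def sequential_cost(n, p):
    """Cost estimate N^1.5 * p^3 quoted for the standard 2D sequential solver."""
    return n ** 1.5 * p ** 3


def cost_ratio(n, p):
    """Ratio of the two estimates; algebraically this is sqrt(N) * p."""
    return sequential_cost(n, p) / parallel_cost(n, p)


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only for illustration.
    for n in (10_000, 1_000_000):
        for p in (1, 2, 3):  # linear, quadratic, cubic B-splines
            ratio = cost_ratio(n, p)
            # The closed form sqrt(N) * p should match the computed ratio.
            assert math.isclose(ratio, math.sqrt(n) * p)
            print(f"N={n:>9} p={p} ratio={ratio:.1f}")
```

Under these assumptions, the gap between the two estimates widens with both the problem size and the spline order.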

Nov, 23

Fast 4pi track reconstruction in nuclear emulsion detectors based on GPU technology

Fast 4pi solid angle particle track recognition has long been a challenge in particle physics, especially with nuclear emulsion detectors. Recent advances in computing technology have opened the way for its realization. A fast 4pi solid angle particle track reconstruction based on GPU technology combined with multithreaded programming is reported […]

CUDA

Nov, 23

Dynamic Partitioning-based JPEG Decompression on Heterogeneous Multicore Architectures

With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Despite the fact that these platforms are heterogeneous multicores, no approach exists yet that is capable […]

OpenCL

Nov, 22

Accelerating Sequential Computer Vision Algorithms Using Commodity Parallel Hardware

Since 2004, the clock frequency of CPUs has not increased significantly. Computer Vision applications have an increasing demand for more processing power and are limited by the performance capabilities of sequential processor architectures. The only way to get better performance using commodity hardware is to adopt parallel programming. Many other related research projects have considered […]

OpenCL

Nov, 22

An improved parallel contrast-aware halftoning

Digital image halftoning is a widely used technique. However, achieving high fidelity tone reproduction and structural preservation with low computational time-cost remains a challenging problem. This paper presents a highly parallel algorithm to boost the real-time application of the serial structure-preserving error diffusion. The contrast-aware halftoning approach is one such technique with superior structure preservation, […]

CUDA

Nov, 22

An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. The offline permutation is a task to copy numbers stored in an array a of size n to an array b of the same size along a permutation P given in advance. A conventional algorithm […]

CUDA
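The offline permutation task described in the abstract above can be written down in a few lines. This is a minimal sequential sketch, not the paper's optimal HMM algorithm or its GPU implementation. It assumes the convention b[P[i]] = a[i]; the excerpt does not say which direction the permutation is applied, and the function name is hypothetical.

```python
def offline_permute(a, p):
    """Return a new list b with b[p[i]] = a[i] for every index i.

    Assumption: p maps source index i to destination index p[i].
    """
    n = len(a)
    # p must contain each index 0..n-1 exactly once.
    if len(p) != n or sorted(p) != list(range(n)):
        raise ValueError("p must be a permutation of 0..n-1")
    b = [None] * n
    for i, dest in enumerate(p):
        b[dest] = a[i]  # scattered write: destinations follow p
    return b


if __name__ == "__main__":
    a = [10, 20, 30, 40]
    p = [2, 0, 3, 1]
    print(offline_permute(a, p))  # prints [20, 40, 10, 30]
```

In this form the reads from a are sequential while the writes to b are scattered according to P. That kind of irregular access is typically what makes a straightforward permutation expensive on GPU memory.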

Nov, 22

Optimization of the Oktay-Kronfeld Action Conjugate Gradient Inverter

Improving the Fermilab action to third order in heavy quark effective theory yields the Oktay-Kronfeld action, a promising candidate for precise calculations of the spectra of heavy quark systems and weak matrix elements relevant to searches for new physics. We have optimized the bi-stabilized conjugate gradient inverter in the SciDAC QOPQDP library and are developing […]

CUDA

Nov, 22

Bohrium: Unmodified NumPy Code on CPU, GPU, and Cluster

In this paper we introduce Bohrium, a runtime system for mapping array operations onto a number of different hardware platforms, from multi-core systems to clusters and GPU-enabled systems. As a result, the Bohrium runtime system enables NumPy code to utilize CPUs, GPUs, and clusters. Bohrium integrates seamlessly into NumPy through the implicit data parallelization of array […]

OpenCL

Nov, 21

Experience with Intel’s Many Integrated Core architecture in ATLAS software

Intel recently released the first commercial boards of its Many Integrated Core (MIC) Architecture. MIC is Intel’s solution for the domain of throughput computing, currently dominated by general purpose programming on graphics processors (GPGPU). MIC allows the use of the more familiar x86 programming model and supports standard technologies such as OpenMP, MPI, and Intel’s […]

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems