Posts
Jan, 30
Many-threaded Differential Evolution on the GPU
Differential evolution (DE) is an efficient population-based meta-heuristic optimization algorithm that has been applied to many difficult real-world problems. Due to the relative simplicity of its operations and its real-encoded data structures, it is very well suited to parallel implementation on multicore systems and on GPUs, which nowadays reach peak performance of hundreds […]
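The per-candidate independence that makes DE parallelize well can be sketched with the classic DE/rand/1/bin variant below. This is a generic textbook form in plain NumPy, not the paper's GPU implementation; every individual's mutation, crossover, and selection depend only on the previous generation, so each loop iteration could become one GPU thread.

```python
import numpy as np

def de_step(pop, fitness, f=0.5, cr=0.9, rng=None):
    """One generation of DE/rand/1/bin. Each candidate i is updated
    independently of the others, which is what maps onto GPU threads."""
    if rng is None:
        rng = np.random.default_rng(0)
    n, d = pop.shape
    new_pop = pop.copy()
    for i in range(n):
        # pick three distinct individuals, all different from i
        a, b, c = rng.choice([j for j in range(n) if j != i], 3, replace=False)
        mutant = pop[a] + f * (pop[b] - pop[c])          # differential mutation
        cross = rng.random(d) < cr                        # binomial crossover mask
        cross[rng.integers(d)] = True                     # force at least one gene
        trial = np.where(cross, mutant, pop[i])
        if fitness(trial) <= fitness(pop[i]):             # greedy selection
            new_pop[i] = trial
    return new_pop

# minimize the sphere function as a smoke test
sphere = lambda x: float(np.sum(x * x))
rng = np.random.default_rng(1)
pop = rng.uniform(-5, 5, size=(20, 3))
for _ in range(100):
    pop = de_step(pop, sphere, rng=rng)
best = min(sphere(x) for x in pop)
```

On the GPU, the outer loop over candidates disappears: each thread evaluates one trial vector, and the whole population advances in lockstep per generation.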
Jan, 30
Scheduling (ir)regular applications on heterogeneous platforms
Computational platforms have grown steadily more heterogeneous and parallel in recent years, as a consequence of incorporating accelerators whose architectures are parallel and differ from the CPU's. As a result, several frameworks have been developed to aid in programming these platforms, mainly targeting better productivity. In this context, the GAMA framework is […]
Jan, 29
GPUDet: A Deterministic GPU Architecture
Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one’s ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent […]
Jan, 28
Efficient Implementation of MrBayes on multi-GPU
MrBayes, using Metropolis-coupled Markov chain Monte Carlo [MCMCMC, or (MC)^3 for short], is a popular program for Bayesian inference. Although it is a leading method of using DNA data to infer phylogeny, the (MC)^3 Bayesian algorithm and its improved and parallel versions are still not fast enough for biologists to analyze massive real-world DNA data. […]
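The (MC)^3 scheme the excerpt names can be illustrated with a minimal single-parameter sketch: several chains target tempered versions of the posterior, and periodic state swaps let hot chains ferry the cold chain out of local modes. The temperature ladder and swap rule below are generic textbook choices, not MrBayes' actual implementation.

```python
import math
import random

def mc3_sample(log_post, x0, n_chains=4, dt=0.5, steps=2000, seed=0):
    """Minimal Metropolis-coupled MCMC ((MC)^3): chain k targets
    log_post(x) / T_k with temperatures T_k = 1 + k*dt; after each sweep
    two random chains may swap states. Only the cold chain (T=1) is kept."""
    rnd = random.Random(seed)
    temps = [1.0 + k * dt for k in range(n_chains)]
    xs = [x0] * n_chains
    samples = []
    for _ in range(steps):
        for k in range(n_chains):                    # within-chain Metropolis step
            prop = xs[k] + rnd.gauss(0, 1)
            if math.log(rnd.random() + 1e-300) < (log_post(prop) - log_post(xs[k])) / temps[k]:
                xs[k] = prop
        i, j = rnd.sample(range(n_chains), 2)        # propose a chain swap
        log_r = (log_post(xs[j]) - log_post(xs[i])) * (1 / temps[i] - 1 / temps[j])
        if math.log(rnd.random() + 1e-300) < log_r:
            xs[i], xs[j] = xs[j], xs[i]
        samples.append(xs[0])
    return samples

# standard normal target; the cold chain should wander from x0=5 toward 0
log_norm = lambda x: -0.5 * x * x
s = mc3_sample(log_norm, x0=5.0)
mean = sum(s[500:]) / len(s[500:])
```

In phylogenetics the per-step likelihood is vastly more expensive than here, which is why parallelizing the chains (and the likelihood itself) across GPUs pays off.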
Jan, 28
A dataflow-like programming model for future hybrid clusters
It is expected that the first exascale supercomputer will be deployed within the next 10 years; however, neither its CPU architecture nor its programming model is known yet. Multicore CPUs are not expected to scale to the required number of cores per node, but hybrid multicore CPUs consisting of different kinds of processing elements are […]
Jan, 28
Exploring Different Automata Representations for Efficient Regular Expression Matching on GPUs
Regular expression matching is a central task in several networking (and search) applications and has been accelerated on a variety of parallel architectures. All solutions are based on finite automata (in either deterministic or non-deterministic form), and mostly focus on effective memory representations for such automata. Recently, a handful of works have proposed efficient regular […]
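A table-driven DFA, the deterministic form mentioned above, reduces matching to one transition lookup per input symbol; the memory layout of that transition table is exactly what GPU representations tune for. The tiny matcher below is a generic sketch, not taken from the paper.

```python
DEAD = -1  # explicit dead state: no suffix can ever match from here

def run_dfa(table, accepting, text, start=0):
    """Run a DFA given as {state: {symbol: next_state}}: one table lookup
    per input character, with missing entries falling into the dead state."""
    state = start
    for ch in text:
        if state == DEAD:
            return False
        state = table[state].get(ch, DEAD)
    return state in accepting

# DFA for the regex (ab)+ : 0 = start, 1 = just read 'a', 2 = accepting
table = {
    0: {'a': 1},
    1: {'b': 2},
    2: {'a': 1},
}
accepting = {2}
```

On a GPU, many input streams (or many automata) run this loop in parallel, so the transition table's size and access pattern, dense array versus compressed forms, dominates performance.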
Jan, 28
Warped Register File: A Power Efficient Register File for GPGPUs
General purpose graphics processing units (GPGPUs) have the ability to execute hundreds of concurrent threads. To support this massive parallelism, GPGPUs provide a very large register file, even larger than a cache, to hold the state of each thread. As technology scales, the leakage power consumption of the SRAM cells worsens, making the register […]
Jan, 28
Reaction-diffusion model Monte Carlo simulations on the GPU
We created an efficient algorithm suitable for graphics processing units (GPUs) to perform Monte Carlo simulations of a subset of reaction-diffusion models. The algorithm uses techniques that are specific to GPU programming, and combines these with the multispin technique known from CPU programming to create one of the fastest algorithms for reaction-diffusion models. As an […]
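The multispin technique mentioned above packs one spin per bit of a machine word, so a single bitwise operation updates many sites (or many replicas) at once. The toy 1D spreading model below is an illustrative sketch of that bit-packing idea, not the paper's reaction-diffusion algorithm.

```python
def spread_step(state, n):
    """One synchronous step of a toy spreading process on a ring of n sites,
    all packed one-per-bit into a single integer. A site becomes occupied if
    it or either neighbour is occupied; a shift-and-OR updates every site at
    once -- the essence of multispin coding."""
    mask = (1 << n) - 1
    left = ((state << 1) | (state >> (n - 1))) & mask    # neighbour with wrap-around
    right = ((state >> 1) | (state << (n - 1))) & mask
    return state | left | right

state = 1 << 8                     # a single occupied seed on a 64-site ring
for _ in range(3):
    state = spread_step(state, 64)
occupied = bin(state).count("1")   # the seed spreads one site per step each way
```

On a GPU the same trick multiplies throughput per thread: each 32- or 64-bit word a thread touches carries that many lattice sites or independent simulation replicas.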
Jan, 26
GPUfs: Integrating a File System with GPUs
As GPU hardware becomes increasingly general-purpose, it is quickly outgrowing the traditional, constrained GPU-as-coprocessor programming model. To make GPUs easier to program and improve their integration with operating systems, we propose making the host’s file system directly accessible to GPU code. GPUfs provides a POSIX-like API for GPU programs, exploits GPU parallelism for efficiency, and […]
Jan, 26
Advanced Trends of Heterogeneous Computing with CPU-GPU Integration: Comparative Study
Over the last decades, parallel and distributed computing has become more popular than traditional centralized computing. In distributed computing, performance improvements are achieved by distributing workloads across the participating nodes. One of the most important factors in improving the performance of this type of system is reducing the average and standard deviation of job response time. Runtime insertion […]
Jan, 26
Selection algorithm of graphic accelerators in heterogeneous cluster for optimization computing
The paper addresses the problem of selecting the optimal GPU for OpenCL kernels launched on heterogeneous clusters containing different types of GPUs. The authors propose an optimal GPU selection algorithm that achieves the best efficiency during program execution on GPUs.
Jan, 26
Autotuning, Code Generation and Optimizing Compiler Technology for GPUs
Graphics Processing Units (GPUs) have evolved into devices with teraflop-level performance potential. Application developers face a tedious task in developing GPU software: correctly identifying parallel computation and optimizing the placement of data for the parallel processors in such architectures. Further, code optimized for one architecture may not perform well on different generations of even the […]