high performance computing on graphics processing units: hgpu.org

Posts

Dec, 18

Efficient data structures for piecewise-smooth video processing

A number of useful image and video processing techniques, ranging from low level operations such as denoising and detail enhancement to higher level methods such as object manipulation and special effects, rely on piecewise-smooth functions computed from the input data. In this thesis, we present two computationally efficient data structures for representing piecewise-smooth visual information […]

Dec, 18

The MOSIX Cluster Operating System for High-Performance Computing on Linux Clusters, Multi-Clusters, GPU Clusters and Clouds

MOSIX is a cluster operating system targeted for High-Performance Computing (HPC) on Linux platforms, including clusters, multi-clusters, GPU clusters and Clouds. The unique features of MOSIX provide users and applications with the impression of running on a single computer with multiple processors, without changing the interface and the run-time environment of their respective login nodes. […]

OpenCL

Dec, 18

Parallel paradigms in optimal structural design

Modern-day processors are not getting any faster. Due to the power consumption limit of frequency scaling, parallel processing is increasingly being used to decrease computation time. In this thesis, several parallel paradigms are used to improve the performance of commonly serial SAO programs. Four novelties are discussed: First, replacing double precision solvers with single precision […]

CUDA • OpenCL

Dec, 17

Massively Parallel Logic Simulation with GPUs

In this article, we developed a massively parallel gate-level logical simulator to address the ever-increasing computing demand for VLSI verification. To the best of the authors’ knowledge, this work is the first one to leverage the power of modern GPUs to successfully unleash the massive parallelism of a conservative discrete event-driven algorithm, CMB algorithm. A […]

CUDA

Dec, 17

Extendable pattern-oriented optimization directives

Current programming models and compiler technologies for multi-core processors do not exploit well the performance benefits obtainable by applying algorithm-specific, i.e., semantic-specific optimizations to a particular application. In this work, we propose a pattern-making methodology that allows algorithm-specific optimizations to be encapsulated into "optimization patterns" that are expressed in terms of pre-processor directives so that […]

CUDA

Dec, 17

Quartile and Outlier Detection on Heterogeneous Clusters Using Distributed Radix Sort

In the past few years, performance improvements in CPUs and memory technologies have outpaced those of storage systems. When extrapolated to the exascale, this trend places strict limits on the amount of data that can be written to disk for full analysis, resulting in an increased reliance on characterizing in-memory data. Many of these characterizations […]

CUDA

Dec, 17

Parallel Mining of Neuronal Spike Streams on Graphics Processing Units

Multi-electrode arrays (MEAs) provide dynamic and spatial perspectives into brain function by capturing the temporal behavior of spikes recorded from cultures and living tissue. Understanding the firing patterns of neurons implicit in these spike trains is crucial to gaining insight into cellular activity. We present a solution involving a massively parallel graphics processing unit (GPU) […]

CUDA

Dec, 17

GPU implementation of JPEG2000 for hyperspectral image compression

Hyperspectral image compression has received considerable interest in recent years due to the enormous data volumes collected by imaging spectrometers for Earth Observation. JPEG2000 is an important technique for data compression which has been successfully used in the context of hyperspectral image compression, in both lossless and lossy fashion. Due to the increasing spatial, spectral […]

CUDA

Dec, 17

Code Optimization Techniques for Graphics Processing Units

Books on parallel programming theory often talk about such weird beasts as the PRAM model, a hypothetical hardware that would provide the programmer with a number of processors that is proportional to the input size of the problem at hand. Modern general purpose computers afford only a few processing units; four is currently a reasonable […]

CUDA

Dec, 17

Customizable Memory Schemes for Data Parallel Accelerators

Memory system efficiency is crucial for any processor to achieve high performance, especially in the case of data parallel machines. Processing capabilities of parallel lanes will be wasted, when data requests are not accomplished in a sustainable and timely manner. Irregular vector memory accesses can lead to inefficient use of the parallel banks/modules/channels and significantly […]

CUDA

Dec, 17

Parallel mesh adaptation and graph analysis using graphics processing units

In the field of Computational Fluid Dynamics, several types of mesh adaptation strategies are used to enhance a mesh’s quality, thereby improving simulation speed and accuracy. Mesh smoothing (r-refinement) is a simple and effective technique, where nodes are repositioned to increase or decrease local mesh resolution. Mesh partitioning divides a mesh into sections, for use […]

CUDA

Dec, 17

A Comparative Analysis of GPU Implementations of Spectral Unmixing Algorithms

Spectral unmixing is a very important task for remotely sensed hyperspectral data exploitation. It involves the separation of a mixed pixel spectrum into its pure component spectra (called endmembers) and the estimation of the proportion (abundance) of each endmember in the pixel. Over the last years, several algorithms have been proposed for: i) automatic extraction […]

CUDA

* * *


Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery

Most viewed papers (last 30 days)