high performance computing on graphics processing units: hgpu.org

Posts

Sep, 23

Image super-resolution by vectorizing edges

As the resolution of output devices increases, the demand for high-resolution content grows, and image super-resolution algorithms become more important. In digital images, edges strongly influence human perception. Because of this, most recent research tends to enhance image edges to achieve better […]

Sep, 23

Acceleration of Functional Validation Using GPGPU

Logic simulation of a VLSI chip is a computationally intensive process. There exists an urgent need to map functional validation algorithms onto parallel architectures to aid hardware designers in meeting time-to-market constraints. In this paper, we propose three novel methods for logic simulation of combinational circuits on GPGPUs. Initial experiments run on two methods using […]

Sep, 23

Simple optimizations for an applicative array language for graphics processors

Graphics processors (GPUs) are highly parallel devices that promise high performance, and they are now flexible enough to be used for general-purpose computing. A programming language based on implicitly data-parallel collective array operations can permit high-level, effective programming of GPUs. I describe three optimizations for such a language: automatic use of GPU shared memory cache, […]

CUDA

Sep, 23

Mathematical limits of parallel computation for embedded systems

Embedded systems are designed to perform a specific set of tasks, and are frequently found in mobile, power-constrained environments. There is growing interest in the use of parallel computation as a means to increase performance while reducing power consumption. In this paper, we highlight fundamental limits to what can and cannot be improved by parallel […]

Sep, 23

HHT-based time-frequency analysis method for biomedical signal applications

The Fourier transform, the wavelet transform, and the Hilbert-Huang transform (HHT) can be used to analyze, respectively, the frequency characteristics of linear and stationary signals, the time-frequency features of linear and non-stationary signals, and the time-frequency features of non-linear and non-stationary signals [1-6]. HHT is a combination of empirical mode decomposition (EMD) and Hilbert spectral analysis. EMD uses the […]

Sep, 23

The International Exascale Software Project roadmap

Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. However, although the investments in these separate software elements have been tremendously valuable, a […]

Sep, 23

Compact data structure and scalable algorithms for the sparse grid technique

The sparse grid discretization technique enables a compressed representation of higher-dimensional functions. In its original form, it relies heavily on recursion and complex data structures, thus being far from well-suited for GPUs. In this paper, we describe optimizations that enable us to implement compression and decompression, the crucial sparse grid algorithms for our application, on […]

CUDA

Sep, 23

Colored stochastic shadow maps

This paper extends the stochastic transparency algorithm that models partial coverage to also model wavelength-varying transmission. It then applies this to the problem of casting shadows between any combination of opaque, colored transmissive, and partially covered (i.e., α-matted) surfaces in a manner compatible with existing hardware shadow mapping techniques. Colored Stochastic Shadow Maps have a […]

Sep, 23

Unstructured grid applications on GPU: performance analysis and improvement

Performance of applications running on GPUs is mainly affected by hardware occupancy and global memory latency. Scientific applications that rely on analysis using unstructured grids could benefit from the high performance capabilities provided by GPUs; however, their memory access patterns and algorithms limit the potential benefits. In this paper we analyze the algorithm for unstructured […]

CUDA

Sep, 23

Orchestration by approximation: mapping stream programs onto multicore architectures

We present a novel 2-approximation algorithm for deploying stream graphs on multicore computers and a stream graph transformation that eliminates bottlenecks. The key technical insight is a data rate transfer model that enables the computation of a "closed form", i.e., the data rate transfer function of an actor depending on the arrival rate of the […]

Sep, 23

Quantifying NUMA and contention effects in multi-GPU systems

As system architects strive for increased density and power efficiency, the traditional compute node is being augmented with an increasing number of graphics processing units (GPUs). The integration of multiple GPUs per node introduces complex performance phenomena including non-uniform memory access (NUMA) and contention for shared system resources. Utilizing the Keeneland system, this paper quantifies […]

CUDA

Sep, 22

Register packing for cyclic reduction: a case study

We generalize a method for avoiding GPU shared communication when dealing with a downsweep pattern. We apply this generalization to Cyclic Reduction, a tridiagonal solver with this pattern. Previously, Cyclic Reduction suffered poor performance when compared to other tridiagonal solvers on the GPU due to performance issues stemming from shared-memory bandwidth bottlenecks and step-efficiency. We […]

CUDA

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
