high performance computing on graphics processing units: hgpu.org

Posts

Dec, 18

Shader Performance Analysis on a Modern GPU Architecture

This paper presents an analysis of the performance of the shader processing units in a modern graphics processing unit (GPU) architecture using real graphics applications. The architecture of a modern GPU is described, and a simulator and associated framework used to evaluate the architecture are introduced. The paper analyses the effects on performance of different […]

OpenGL

Dec, 18

GPU clusters for high-performance computing

Large-scale GPU clusters are gaining popularity in the scientific computing community. However, their deployment and production use are associated with a number of new challenges. In this paper, we present our efforts to address some of the challenges with building and running GPU clusters in HPC environments. We touch upon such issues as balanced cluster […]

CUDA

Dec, 18

Accelerating Template-Based Matching on the GPU for AR Applications

Recently researchers have shown that it is possible to use GPU hardware for image processing and computer vision algorithms. We have been exploring how to use GPU hardware to improve marker-based tracking for AR Applications. In this paper we describe our findings and explored issues in the context of a standard fiducial tracking pipeline. We […]

Dec, 18

Accelerating SQL Database Operations on a GPU with CUDA

Prior work has shown dramatic acceleration for various database operations on GPUs, but only using primitives that are not part of conventional database languages such as SQL. This paper implements a subset of the SQLite command processor directly on the GPU. This dramatically reduces the effort required to achieve GPU acceleration by avoiding the need […]

CUDA

Dec, 18

Efficient, High-Quality Bayer Demosaic Filtering on GPUs

This paper describes a series of optimizations for implementing the high-quality Malvar-He-Cutler Bayer demosaicing filter on a GPU in OpenGL. Applying this filter is the first step in most video-processing pipelines but is generally considered too slow for real time on a CPU. The optimized implementation contains 66% fewer ALU operations than a direct GPU […]

OpenGL

Dec, 18

GPU-based Island Model for Evolutionary Algorithms

The island model for evolutionary algorithms makes it possible to delay the global convergence of the evolution process and encourage diversity. However, solving large and time-intensive combinatorial optimization problems with the island model requires a large amount of computational resources. GPU computing has recently emerged as a powerful way to harness these resources. In this paper, […]

CUDA

Dec, 18

Accelerating K-Means on the Graphics Processor via CUDA

In this paper an optimized k-means implementation on the graphics processing unit (GPU) is presented. NVIDIA's compute unified device architecture (CUDA), available from the G80 GPU family onwards, is used as the programming environment. Emphasis is placed on optimizations directly targeted at this architecture to best exploit the computational capabilities available. Additionally drawbacks and limitations […]

CUDA

Dec, 18

A GPU based implementation of Center-Surround Distribution Distance for feature extraction and matching

The release of general-purpose GPU programming environments has provided universal access to computing performance that was once available only on supercomputers. The availability of such computational power has fostered the creation and re-deployment of algorithms, new and old, creating entirely new classes of applications. In this paper, a GPU implementation of the Center-Surround Distribution […]

CUDA

Dec, 18

A Single (Unified) Shader GPU Microarchitecture for Embedded Systems

We present and evaluate the TILA-rin GPU microarchitecture for embedded systems using the ATTILA GPU simulation framework. We use a trace from an execution of the Unreal Tournament 2004 PC game to evaluate and compare the performance of the proposed embedded GPU against a baseline GPU architecture for the PC. We evaluate the different […]

OpenGL

Dec, 18

A Cross-Input Adaptive Framework for GPU Programs Optimization

Recent years have seen a trend in using graphics processing units (GPUs) as accelerators for general-purpose computing. The inexpensive, single-chip, massively parallel architecture of GPUs has evidently brought factors of speedup to many numerical applications. However, the development of a high-quality GPU application is challenging, due to the large optimization space and complex unpredictable effects […]

CUDA

Dec, 17

Fast Software AES Encryption

This paper presents new software speed records for AES-128 encryption for architectures at both ends of the performance spectrum. On the one side we target the low-end 8-bit AVR microcontrollers and 32-bit ARM microprocessors, while on the other side of the spectrum we consider the high-performing Cell broadband engine and NVIDIA graphics processing units (GPUs). […]

CUDA

Dec, 17

A New Parallel Method of Smith-Waterman Algorithm on a Heterogeneous Platform

The Smith-Waterman algorithm is a classic dynamic programming algorithm for biological sequence alignment. However, with the rapid growth in the number of DNA and protein sequences, the originally sequential algorithm is very time consuming because the same computing task is repeated on large-scale data. Today’s GPU (graphics processing unit) […]

CUDA

Recent source codes

DITRON: Distributed Compiler based on Triton for Parallel Systems

IntelliKit: Agent-first tooling for AMD hardware

CuTile Benchmark Suite: Performance and Productivity Tradeoffs for GPU Kernel Programming on Blackwell Architecture

MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Device Virtual Machine (DVM)

Agentic Code Optimization via Compiler-LLM Cooperation

AutoKernel: Autoresearch for GPU kernels. Give it any PyTorch model, go to sleep, wake up to optimized Triton kernels

Triton-Sanitizer: A Fast and Device-Agnostic Memory Sanitizer for Triton with Rich Diagnostic Context

LLM.Q: Quantized LLM training in pure CUDA/C++

SOL-ExecBench: Speed-of-Light Benchmarking for Real-World GPU Kernels Against Hardware Limits

Most viewed papers (last 30 days)