high performance computing on graphics processing units: hgpu.org

Posts

Mar, 8

PISTON: A Portable Cross-Platform Framework for Data-Parallel Visualization Operators

Due to the wide variety of current and next-generation supercomputing architectures, the development of high-performance parallel visualization and analysis operators frequently requires re-writing the underlying algorithms for many different platforms. In order to facilitate portability, we have devised a framework for creating such operators that employs the data-parallel programming model. By writing the operators using […]

CUDA • OpenCL

Mar, 6

Efficient Relational Algebra Algorithms and Data Structures for GPU

Relational databases remain an important application domain for organizing and analyzing the massive volume of data generated as sensor technology, retail and inventory transactions, social media, computer vision, and new fields continue to evolve. At the same time, processor architectures are beginning to shift towards hierarchical and parallel architectures employing throughput-optimized memory systems, lightweight multi-threading, […]

CUDA

Mar, 6

PartialRC: A Partial Recomputing Method for Efficient Fault Recovery on GPGPUs

GPGPUs are increasingly being used as performance accelerators for HPC (High Performance Computing) applications in CPU/GPU heterogeneous computing systems, including TianHe-1A, the world’s fastest supercomputer in the TOP500 list, built at NUDT (National University of Defense Technology) last year. However, despite their performance advantages, GPGPUs do not provide built-in fault-tolerant mechanisms to offer reliability […]

CUDA

Mar, 6

KUDA: GPU Accelerated Split Race Checker

We propose a novel approach for runtime verification on computers with a large number of computation cores, without any hardware extension to the mainstream PC environment. The goal of the approach is to use all hardware resources to decouple the computational overhead of traditional race checkers by parallelizing the runtime verification. We distinguish between two […]

CUDA

Mar, 6

Synthesizing Software from a ForSyDe Model Targeting GPGPUs

Today, a plethora of parallel execution platforms are available. One platform in particular is the GPGPU, a massively parallel architecture designed for exploiting data parallelism. However, GPGPUs are notoriously difficult to program due to the way data is accessed and processed, and many interconnected factors affect the performance. This makes it an exceptionally challenging task […]

CUDA

Mar, 6

High-Performance Distributed Multi-Model / Multi-Kernel Simulations: A Case-Study in Jungle Computing

High-performance scientific applications require more and more compute power. The concurrent use of multiple distributed compute resources is vital for making scientific progress. The resulting distributed system, a so-called Jungle Computing System, is both highly heterogeneous and hierarchical, potentially consisting of grids, clouds, stand-alone machines, clusters, desktop grids, mobile devices, and supercomputers, possibly with accelerators […]

CUDA

Mar, 2

Parallel Implementation of Similarity Measures on GPU Architecture using CUDA

Image processing and pattern recognition algorithms take more time to execute on a single-core processor. Graphics Processing Units (GPUs) are more popular nowadays due to their speed, programmability, low cost, and larger number of built-in execution cores. Most researchers have started to use GPUs as a processing unit with a single core […]

CUDA

Mar, 2

Ray Tracing Visualization Toolkit

The Ray Tracing Visualization Toolkit (rtVTK) is a collection of programming and visualization tools supporting visual analysis of ray-based rendering algorithms. rtVTK leverages layered visualization within the spatial domain of computation, enabling investigators to explore the computational elements of any ray-based renderer. Renderers utilize a library for recording and processing ray state, and a configurable […]

OpenCL • OpenGL

Mar, 2

Efficient Performance Evaluation of Memory Hierarchy for Highly Multithreaded Graphics Processors

With the emergence of highly multithreaded architectures, performance monitoring techniques face new challenges in efficiently locating sources of performance discrepancies in the program source code. For example, the state-of-the-art performance counters in highly multithreaded graphics processing units (GPUs) report only the overall occurrences of microarchitecture events at the end of program execution. Furthermore, even if […]

CUDA

Mar, 2

Parallel Hashing, Compression and Encryption with OpenCL under OS X

In this dissertation we examine the efficiency of GPUs with a limited number of stream processors (up to 32), located in desktops and laptops, in the execution of algorithms such as hashing (MD5, SHA1), encryption (Salsa20) and compression (LZ78). For the implementation part, the OpenCL framework was used under OS X. The graphics cards tested […]

OpenCL

Mar, 2

GPU Implementation of Split-Field Finite-Difference Time-Domain Method for Drude-Lorentz Dispersive Media

The split-field finite-difference time-domain (SF-FDTD) method can overcome the limitation of ordinary FDTD in analyzing periodic structures under oblique incidence. On the other hand, the huge run times of 3D SF-FDTD are in practice a major burden on its use for the analysis and design of nanostructures, particularly those with dispersive media. Here, details of parallel implementation of 3D […]

CUDA

Mar, 1

Chestnut: A GPU Programming Language for Non-Experts

Graphics processing units (GPUs) are powerful devices capable of rapid parallel computation. GPU programming, however, can be quite difficult, limiting its use to experienced programmers and keeping it out of reach of a large number of potential users. We present Chestnut, a domain-specific GPU parallel programming language for parallel multi-dimensional grid applications. Chestnut is designed […]

CUDA
