high performance computing on graphics processing units: hgpu.org

Posts

Oct, 22

Introducing CURRENNT – the Munich open-source CUDA RecurREnt Neural Network Toolkit

In this article, we introduce CURRENNT, an open-source parallel implementation of deep recurrent neural networks (RNNs) supporting graphics processing units (GPUs) through NVIDIA’s Compute Unified Device Architecture (CUDA). CURRENNT supports uni- and bidirectional RNNs with Long Short-Term Memory (LSTM) memory cells, which overcome the vanishing gradient problem. To our knowledge, CURRENNT is the first publicly […]

CUDA

Oct, 22

Optimization Techniques for Mapping Algorithms and Applications onto CUDA GPU Platforms and CPU-GPU Heterogeneous Platforms

An emerging trend in processor architecture seems to indicate a doubling of the number of cores per chip every two years with the same or decreased clock speed. Of particular interest to this thesis is the class of many-core processors, which are becoming more attractive due to their high performance, low cost, and low power consumption. […]

CUDA

Oct, 22

Fast Parallel Algorithm for Enumerating All Chordless Cycles in Graphs

Finding chordless cycles is an important theoretical problem in graph theory. It can also be applied to practical problems, such as discovering which predators compete for the same food in ecological networks. Motivated by the theoretical interest of the problem and by its significant practical importance, we present in this paper a parallel […]

OpenCL

Oct, 22

3D simulation of complex shading affecting PV systems taking benefit from the power of graphics cards developed for the video game industry

Shading reduces the power output of a photovoltaic (PV) system. The design engineering of PV systems requires modeling and evaluating shading losses. Some PV systems are affected by complex shading scenes whose resulting PV energy losses are very difficult to evaluate with current modeling tools. Several specialized PV design and simulation software packages include the possibility […]

OpenGL

Oct, 22

Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems

The Kernel Polynomial Method (KPM) is a well-established scheme in quantum physics and quantum chemistry to determine the eigenvalue density and spectral properties of large sparse matrices. In this work we demonstrate the high optimization potential and feasibility of peta-scale heterogeneous CPU-GPU implementations of the KPM. At the node level we show that it is […]

CUDA

Oct, 20

A Performance Comparison of Sort and Scan Libraries for GPUs

Sorting and scanning are two fundamental primitives for constructing highly parallel algorithms. A number of libraries now provide implementations of these primitives for GPUs, but there is relatively little information about the performance of these implementations. We benchmark seven libraries for 32-bit integer scan and sort, and sorting 32-bit values by 32-bit integer keys. We show […]

CUDA, OpenCL

Oct, 20

Massively parallel read mapping on GPUs with the q-group index and PEANUT

We present the q-group index, a novel data structure for read mapping tailored towards graphics processing units (GPUs) with a small memory footprint and efficient parallel algorithms for querying and building. On top of the q-group index we introduce PEANUT, a highly parallel GPU-based read mapper. PEANUT provides the possibility to output both the best […]

OpenCL

Oct, 20

Heterogeneous computing with an algorithmic skeleton framework

The Graphics Processing Unit (GPU) is present in almost every modern-day personal computer. Despite their special-purpose design, GPUs have been increasingly used for general computations with very good results. Hence, there is a growing effort from the community to seamlessly integrate this kind of device in everyday computing. However, to fully exploit the […]

OpenCL

Oct, 20

Fast-Fourier-Transform-Based Electrical Noise Measurements

We have shown how the Fourier spectrum and the power spectral density can be estimated in concrete measurements. Moreover, we have derived spectral leakage, which is a systematic error in spectrum computation. The Nyquist-Shannon sampling theorem and aliasing have been discussed. Furthermore, we have implemented a spectrum analyzer using a combination of LabView, GPU computing […]

OpenCL

Oct, 20

High-Dimensional Adaptive Particle Swarm Optimization on Heterogeneous Systems

Much work has recently been reported in parallel GPU-based particle swarm optimization (PSO). Motivated by the encouraging results of these investigations, while also recognizing the limitations of GPU-based methods for big problems using a large amount of data, this paper explores the efficacy of employing other types of parallel hardware for PSO. Most commodity systems […]

CUDA

Oct, 18

A Review of CUDA, MapReduce, and Pthreads Parallel Computing Models

The advent of high performance computing (HPC) and graphics processing units (GPUs) presents an enormous computational resource for large data transactions (big data) that require parallel processing for robust and prompt data analysis. While a number of HPC frameworks have been proposed, parallel programming models present a number of challenges, for instance, how to fully […]

CUDA

Oct, 18

StreamWorks: An Energy-efficient Embedded Co-processor for Stream Computing

Stream processing has emerged as an important model of computation especially in the context of multimedia and communication sub-systems of embedded System-on-Chip (SoC) architectures. The dataflow nature of streaming applications allows them to be most naturally expressed as a set of kernels iteratively operating on continuous streams of data. The kernels are computationally intensive and […]

CUDA

Recent source codes

RepoLaunch: Automating Build and Test Pipeline of Code Repositories on ANY Language and ANY Platform

CONCUR: a benchmark designed to evaluate multithreaded Java code generated by LLMs

HIPRT: Ray Tracing using HIP

MXFP4 Training Support Codebase

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation

CUDABench: Benchmarking LLMs for Text-to-CUDA Generation

CL4SE: A Context Learning Benchmark For Software Engineering Tasks

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

A Safety Report on GPT-5.2, Gemini 3 Pro, Qwen3-VL, Grok 4.1 Fast, Nano Banana Pro, and Seedream 4.5

Most viewed papers (last 30 days)