high performance computing on graphics processing units: hgpu.org

Posts

Jul, 22

Real-time 3D video synthesis from binocular capture system based on commodity graphic hardware

In this paper, a real-time 3D video synthesis method suitable for implementation on commodity graphic hardware is presented. The system consists of pre-calibrated binocular stereo cameras and an NVIDIA GeForce 8 Series graphic card. Recently, most research has focused on improving the quality of depth maps, which is usually time-consuming and unsuitable for real-time reconstruction. […]

Jul, 22

AES Encryption Implementation on CUDA GPU and Its Analysis

The GPU has a good performance ratio and is capable of handling applications with a high level of parallelism despite its low price. Support for integer and logical instructions on the latest generation of GPUs makes it easier for us to implement cipher algorithms with the same instructions. However, decisions such as parallel processing granularity or memory […]

CUDA

Jul, 22

Large Scale Simulations of the Euler Equations on GPU Clusters

The paper investigates the scalability of a parallel Euler solver, using the Vijayasundaram method, on a GPU cluster with 32 Nvidia Geforce GTX 295 boards. The aim of this research is to enable large scale fluid dynamics simulations with up to one billion elements. We investigate communication protocols for the GPU cluster to compensate for […]

CUDA

Jul, 22

Mapping the Arnold web with a GPU-supercomputer

The Arnold diffusion constitutes a dynamical phenomenon which may occur in the phase space of a non-integrable Hamiltonian system whenever the number of the system degrees of freedom is $M \geq 3$. The diffusion is mediated by a web-like structure of resonance channels, which penetrates the phase space and allows the system to explore the […]

CUDA

Jul, 22

Accelerating Radio Astronomy Cross-Correlation with Graphics Processing Units

We present a highly parallel implementation of the cross-correlation of time-series data using graphics processing units (GPUs), which is scalable to hundreds of independent inputs and suitable for the processing of signals from "Large-N" arrays of many radio antennas. The computational part of the algorithm, the X-engine, is implemented efficiently on Nvidia’s Fermi architecture, sustaining […]

CUDA

Jul, 22

A Comparison of xPU Platforms Exemplified with Ray Tracing Algorithms

Over the years, faster hardware – with higher clock rates – has been the usual way to improve computing times in computer graphics. Aside from highly costly parallel solutions affordable only by big industries – like the movie industry – there was no alternative available to desktop users. Nevertheless, this scenario is dramatically changing with […]

CUDA • OpenCL

Jul, 22

A History-Based Performance Prediction Model with Profile Data Classification for Automatic Task Allocation in Heterogeneous Computing Systems

In this paper, we propose a runtime performance prediction model for automatic selection of accelerators to execute kernels in OpenCL. The proposed method is a history-based approach that uses profile data for performance prediction. The profile data are classified into some groups, from each of which its own performance model is derived. As the execution […]

OpenCL

Jul, 22

Hybrid OpenCL: Enhancing OpenCL for Distributed Processing

We have been developing Hybrid OpenCL, which enables the utilization of OpenCL devices by connecting them over the network. Hybrid OpenCL opens a gate to scale up OpenCL environments. By using Hybrid OpenCL, applications written in OpenCL can be easily ported to high performance cluster computers; thus, Hybrid OpenCL can provide more various distributed and […]

OpenCL

Jul, 22

A self-organization based optical flow estimator with GPU implementation (thesis)

This work describes a parallelizable optical flow field estimator based upon a modified batch version of the Self-Organizing Map (SOM). This estimator handles the ill-posedness in gradient-based motion estimation via a novel combination of regression and self-organization. The aperture problem is treated using an algebraic framework that partitions motion estimates obtained from regression into two […]

CUDA

Jul, 22

A self-organization based optical flow estimator with GPU implementation

CUDA

Jul, 22

Parallelizing the Cellular Potts Model on graphics processing units

The Cellular Potts Model (CPM) is a lattice based modeling technique used for simulating cellular structures in computational biology. The computational complexity of the model means that current serial implementations restrict the size of simulation to a level well below biological relevance. Parallelization on computing clusters enables scaling the size of the simulation but marginally […]

CUDA

Jul, 22

A refactoring tool to extract GPU kernels

Significant performance gains can be achieved by using hardware architectures that integrate GPUs with conventional CPUs to form a hybrid and highly parallel computational engine. However, programming these novel architectures is tedious and error prone, reducing their ease of acceptance in an even wider range of computationally intensive applications. In this paper we discuss a […]

CUDA
