high performance computing on graphics processing units: hgpu.org

Posts

Aug, 16

GPU-Accelerated Scalable Solver for Banded Linear Systems

Solving a banded linear system efficiently is important to many scientific and engineering applications. Current solvers achieve good scalability only on linear systems that can be partitioned into independent subsystems. In this paper, we present a GPU-based, scalable Bi-Conjugate Gradient Stabilized solver that can be used to solve a wide range of banded […]

CUDA

Aug, 16

Lossless LZW Data Compression Algorithm on CUDA

Data compression is an important area of information and communication technologies; it seeks to reduce the number of bits used to store or transmit information. It uses memory space efficiently and allows data to be transmitted within a limited bandwidth. Most compression is achieved by removing data redundancy while preserving information content. Data […]

CUDA

Aug, 16

Towards Path Tracing in Games

We investigate GPU path tracing performance in the context of real-time rendering for games. We propose a reformulation of Russian roulette, as well as an efficient implementation of the path regeneration algorithm by Novak et al. [Novak et al. 2010]. We show that a combination of these algorithms provides high performance for a variety of […]

CUDA

Aug, 16

GUESS-ing Polygenic Associations with Multiple Phenotypes Using a GPU-Based Evolutionary Stochastic Search Algorithm

Genome-wide association studies (GWAS) yielded significant advances in defining the genetic architecture of complex traits and disease. Still, a major hurdle of GWAS is narrowing down multiple genetic associations to a few causal variants for functional studies. This becomes critical in multi-phenotype GWAS where detection and interpretability of complex SNP(s)-trait(s) associations are complicated by complex […]

CUDA

Aug, 15

Parallel Gravitation Field Algorithm Based on the CUDA Platform

The Gravitation Field Algorithm (GFA) is a simple but very effective heuristic search algorithm. This algorithm has obvious advantages in multimodal function optimization problems compared with SA and GA. However, obtaining a more precise global optimum requires a large number of initial dusts in the computation, which causes a low efficiency […]

CUDA

Aug, 15

General Transformations for GPU Execution of Tree Traversals

With the advent of programmer-friendly GPU computing environments, there has been much interest in offloading workloads that can exploit the high degree of parallelism available on modern GPUs. Exploiting this parallelism and optimizing for the GPU memory hierarchy is well-understood for regular applications that operate on dense data structures such as arrays and matrices. However, […]

CUDA

Aug, 15

Programming Dense Linear Algebra Kernels on Vectorized Architectures

The high performance computing (HPC) community is obsessed with the general matrix-matrix multiply (GEMM) routine. This obsession is not without reason. Most, if not all, Level 3 Basic Linear Algebra Subroutines (BLAS) can be written in terms of GEMM, and the performance of many higher level linear algebra solvers (e.g., LU, Cholesky) depends on GEMM’s […]

Aug, 15

First experiences with the Intel MIC architecture at LRZ

With the rapidly growing demand for computing power, new accelerator-based architectures have entered the world of high performance computing over roughly the past 5 years. In particular, GPGPUs have recently become very popular; however, programming GPGPUs using programming languages like CUDA or OpenCL is cumbersome and error-prone. Trying to overcome these difficulties, Intel developed their own […]

Aug, 15

Detecting Data Races on OpenCL Kernels with Symbolic Execution

We present an automatic analysis technique for checking data races on OpenCL kernels. Our method defines symbolic execution techniques based on separation logic with suitable abstractions to automatically detect non-benign racy behaviours on kernels.

OpenCL

Aug, 14

Lattice Boltzmann Method for Simulating Turbulent Flows

The lattice Boltzmann method (LBM) is a relatively new method for fluid flow simulations and has recently been gaining popularity due to its simple algorithm and parallel scalability. Although the method has been successfully applied to a wide range of flow physics, its capabilities in simulating turbulent flow are still under-validated. Hence, in this project, a […]

CUDA

Aug, 14

The Yin and Yang of Processing Data Warehousing Queries on GPU Devices

The database community has made significant research efforts to optimize query processing on GPUs in the past few years. However, GPUs have hardly been adopted in major warehousing production systems. In preparation for merging GPUs into warehousing systems, we have identified and addressed several critical issues in a three-dimensional study of […]

CUDA • OpenCL

Aug, 14

GPU Acceleration of a Basket Option Pricing Engine

One of the most important methods for pricing complex derivatives is Monte Carlo simulation. However, this method requires a large amount of computing resources for accurate estimates. Since Monte Carlo simulations used in derivatives pricing are often parallelisable, one way to reduce the computing time is to use GPUs, which allow many copies of the […]

CUDA • OpenCL
