high performance computing on graphics processing units: hgpu.org

Posts

Feb, 20

Introducing ‘Bones’: A Parallelizing Source-to-Source Compiler Based on Algorithmic Skeletons

Recent advances in multi-core and many-core processors require programmers to exploit an increasing amount of parallelism from their applications. Data parallel languages such as CUDA and OpenCL make it possible to take advantage of such processors, but still require a large amount of effort from programmers. A number of parallelizing source-to-source compilers have recently been […]

CUDA • OpenCL

Feb, 20

Review: Kd-tree Traversal Algorithms for Ray Tracing

In this paper we review the traversal algorithms for kd-trees for ray tracing. Ordinary traversal algorithms such as sequential, recursive, and those with neighbour-links have different limitations, which led to several new developments within the last decade. We describe algorithms exploiting ray coherence and algorithms designed with specific hardware architecture limitations such as memory latency […]

Feb, 18

GPU Parallel Statistical and Cube Test Analysis of the SHA-3 Finalist Candidate Hash Functions

The 256-bit versions of the SHA-3 finalist candidate hash functions – BLAKE, Grostl, JH, Keccak, and Skein – were subjected to statistical tests to attempt to disprove the hypothesis that the output bits are uniformly distributed, independent, binary random variables. The hash functions were also subjected to cube tests to attempt to disprove the hypothesis […]

CUDA

Feb, 18

Exploiting Segmentation for Robust 3D Object Matching

While Iterative Closest Point (ICP) algorithms have been successful at aligning 3D point clouds, they do not take into account constraints arising from sensor viewpoints. More recent beam-based models take into account sensor noise and viewpoint, but problems still remain. In particular, good optimization strategies are still lacking for the beam-based model. In situations of […]

CUDA • OpenGL

Feb, 18

Performance Portability with the Chapel Language

It has been widely shown that high-throughput computing architectures such as GPUs offer large performance gains compared with their traditional low-latency counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, loss of portability across different architectures, explicit data movement, and challenges […]

CUDA

Feb, 18

Cone-beam Computed tomography image reconstruction based on GPU

Three-dimensional cone-beam computed tomography (CBCT) image reconstruction has long been a hot issue in the medical imaging field. The computational load of CBCT reconstruction is often huge and the reconstruction time is long. Now the development of computer technology, especially the rapid development of Graphics Processing Unit (GPU) based general-purpose computing technology, enables fast CBCT […]

CUDA

Feb, 17

Bayesian Image Restoration Using A Large-scale Total Patch Variation Prior

Edge-preserving Bayesian restorations using nonquadratic priors are often inefficient in restoring continuous variations and tend to produce block artifacts around edges in ill-posed inverse image restorations. To overcome this, we have proposed a spatial adaptive (SA) prior with improved performance. However, this SA prior restoration suffers from high computational cost and the unguaranteed convergence problem. […]

CUDA

Feb, 17

Proposition for propagated occupation grids for non-rigid moving objects tracking

Autonomous navigation among humans is, however simple it might seem, a difficult subject which draws a lot of attention in these days of increasingly autonomous systems. From a typical scene in a human environment, diverse shapes, behaviours, speeds or colours can be gathered by a lot of sensors and a generic means to perceive space […]

CUDA

Feb, 17

Joint-MAP Tomographic Reconstruction with Patch Similarity Based Mixture Prior Model

Tomographic reconstruction from noisy projections does not yield adequate results. Mathematically, this tomographic reconstruction represents an ill-posed problem due to missing information caused by the presence of noise. Maximum a posteriori (MAP) or Bayesian reconstruction methods offer possibilities to improve the image quality as compared with analytical methods, in particular by introducing a prior to […]

CUDA

Feb, 17

Skeleton-based Automatic Parallelization of Image Processing Algorithms for GPUs

Graphics Processing Units (GPUs) are becoming increasingly important in high performance computing. To maintain high quality solutions, programmers have to efficiently parallelize and map their algorithms. This task is far from trivial, leading to the necessity to automate this process. In this paper, we present a technique to automatically parallelize and map sequential code on […]

CUDA

Feb, 17

GPUs as Storage System Accelerators

Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, peak performance one order of magnitude higher than traditional CPUs. This drop in the cost of computation, as any order-of-magnitude drop in the cost per unit of performance for a class of system components, triggers the opportunity to redesign systems […]

CUDA

Feb, 17

Interactive Manycore Photon Mapping

Photon mapping is a state-of-the-art global illumination rendering algorithm. Photons are traced from the light sources in a first pass and their interactions with scene surfaces are stored. A second pass reconstructs illumination by density estimation, reproducing a wide range of optical phenomena. This thesis addresses the question of how photon mapping can be […]

CUDA

