high performance computing on graphics processing units: hgpu.org

Posts

Feb, 13

Auto-Generation of Parallel Finite-Differencing Code for MPI, TBB and CUDA

Finite-difference methods can be useful for solving certain partial differential equations (PDEs) in the time domain. Compiler technologies can be used to parse an application domain specific representation of these PDEs and build an abstract representation of both the equation and the desired solver. This abstract representation can be used to generate a language-specific implementation. […]

CUDA

Feb, 13

Copperhead: Compiling an embedded data parallel language

Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level […]

CUDA

Feb, 12

Efficient Sparse Voxel Octrees – Analysis, Extensions, and Implementation

This technical report extends our previous paper on sparse voxel octrees. We first discuss the benefits and drawbacks of voxel representations and how the storage space requirements behave for different kinds of content. Then, we explain in detail our compact data structure for storing voxels and an efficient ray cast algorithm that utilizes this structure, […]

CUDA

Feb, 12

Efficient sparse voxel octrees

In this paper we examine the possibilities of using voxel representations as a generic way for expressing complex and feature-rich geometry on current and future GPUs. We present in detail a compact data structure for storing voxels and an efficient algorithm for performing ray casts using this structure. We augment the voxel data with novel […]

CUDA

Feb, 12

Increasing Memory Miss Tolerance for SIMD Cores

Manycore processors with wide SIMD cores are becoming a popular choice for the next generation of throughput oriented architectures. We introduce a hardware technique called “diverge on miss” that allows SIMD cores to better tolerate memory latency for workloads with non-contiguous memory access patterns. Individual threads within a SIMD “warp” are allowed to slip behind […]

Feb, 12

Image Space Gathering

Soft shadows, glossy reflections and depth of field are valuable effects for realistic rendering and are often computed using distribution ray tracing (DRT). These “blurry” effects often need not be accurate and are sometimes simulated by blurring an image with sharper effects, such as blurring hard shadows to simulate soft shadows. One of the most […]

OpenGL

Feb, 12

Spatial splits in bounding volume hierarchies

Bounding volume hierarchies (BVH) have become a widely used alternative to kD-trees as the acceleration structure of choice in modern ray tracing systems. However, BVHs adapt poorly to non-uniformly tessellated scenes, which leads to increased ray shooting costs. This paper presents a novel and practical BVH construction algorithm, which addresses the issue by utilizing spatial […]

CUDA

Feb, 12

Understanding the efficiency of ray traversal on GPUs

We discuss the mapping of elementary ray tracing operations—acceleration structure traversal and primitive intersection—onto wide SIMD/SIMT machines. Our focus is on NVIDIA GPUs, but some of the observations should be valid for other wide machines as well. While several fast GPU tracing methods have been published, very little is actually understood about their performance. Nobody […]

CUDA

Feb, 12

A meshless hierarchical representation for light transport

We introduce a meshless hierarchical representation for solving light transport problems. Precomputed radiance transfer (PRT) and finite elements require a discrete representation of illumination over the scene. Non-hierarchical approaches such as per-vertex values are simple to implement, but lead to long precomputation. Hierarchical bases like wavelets lead to dramatic acceleration, but in their basic form […]

Feb, 12

Loop Transformation Recipes for Code Generation and Auto-Tuning

In this paper, we describe transformation recipes, which provide a high-level interface to the code transformation and code generation capability of a compiler. These recipes can be generated by compiler decision algorithms or savvy software developers. This interface is part of an auto-tuning framework that explores a set of different implementations of the same computation […]

CUDA

Feb, 12

Automated Dynamic Analysis of CUDA Programs

Recent increases in the programmability and performance of GPUs have led to a surge of interest in utilizing them for general-purpose computations. Tools such as NVIDIA’s CUDA allow programmers to use a C-like language to code algorithms for execution on the GPU. Unfortunately, parallel programs are prone to subtle correctness and performance bugs, and CUDA […]

CUDA

Feb, 12

GPU-powered tools boost molecular visualization

Recent advances in experimental structure determination provide a wealth of structural data on huge macromolecular assemblies such as the ribosome or viral capsids, available in public databases. Further structural models arise from reconstructions using symmetry orders or fitting crystal structures into low-resolution maps obtained by electron microscopy or small-angle X-ray scattering experiments. Visual inspection of […]
