
Posts

Feb, 19

Programmability: Design Costs and Payoffs using AMD GPU Streaming Languages and Traditional Multi-Core Libraries

GPGPUs and multi-core processors have come to the forefront of interest in scientific computing. Graphics processors have become programmable, allowing exploitation of their large amounts of memory bandwidth and thread level parallelism in general purpose computing. This paper explores these two architectures, the languages used to program them, and the optimizations used to maximize performance […]
Feb, 19

Decoupled Access/Execute Metaprogramming for GPU-Accelerated Systems

We describe the evaluation of several implementations of a simple image processing filter on an NVIDIA GTX 280 card. Our experimental results show that performance depends significantly on low-level details such as data layout and iteration space mapping which complicate code development and maintenance. We propose extending a CUDA or OpenCL like model with decoupled […]
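As a rough illustration of why these low-level details matter (a generic sketch, not code from the paper), the two CUDA kernels below apply the same trivial per-pixel operation with different thread-to-pixel mappings: the first produces coalesced global memory accesses, the second does not, and the gap between them is exactly the kind of mapping sensitivity the abstract describes. All names and dimensions are hypothetical.

// Mapping A: threadIdx.x runs along image rows, so adjacent threads read adjacent words (coalesced).
__global__ void scale_coalesced(const float *in, float *out, int w, int h)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index
    if (x < w && y < h)
        out[y * w + x] = 2.0f * in[y * w + x];
}

// Mapping B: threadIdx.x runs down image columns, so adjacent threads stride by w words (uncoalesced).
__global__ void scale_strided(const float *in, float *out, int w, int h)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // row index
    int x = blockIdx.y * blockDim.y + threadIdx.y;   // column index
    if (x < w && y < h)
        out[y * w + x] = 2.0f * in[y * w + x];
}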
Feb, 19

Compiler Support for High-level GPU Programming

We design a high-level abstraction of CUDA, called hiCUDA, using compiler directives. It simplifies the task of porting sequential applications to NVIDIA GPUs. This paper focuses on the design and implementation of a source-to-source compiler that translates a hiCUDA program into an equivalent CUDA program, and shows that the performance of CUDA code generated by […]
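For context only (this is plain CUDA, not hiCUDA's directive syntax), the sketch below shows the kind of host-side boilerplate, device allocation, copy-in, launch configuration and copy-out, that a directive-based source-to-source compiler such as hiCUDA generates from an annotated sequential loop; the kernel and variable names are invented.

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical kernel a directive-based compiler might generate for "b[i] = 2*a[i]".
__global__ void scale2(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 2.0f * a[i];
}

int main()
{
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h_a = new float[n], *h_b = new float[n];
    for (int i = 0; i < n; ++i) h_a[i] = float(i);

    // Explicit allocation, transfers and launch: the boilerplate that
    // directive-based approaches hide behind pragmas on the sequential loop.
    float *d_a, *d_b;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    scale2<<<(n + 255) / 256, 256>>>(d_a, d_b, n);
    cudaMemcpy(h_b, d_b, bytes, cudaMemcpyDeviceToHost);
    printf("b[10] = %f\n", h_b[10]);
    cudaFree(d_a); cudaFree(d_b);
    delete[] h_a; delete[] h_b;
    return 0;
}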
Feb, 19

High Performance Relevance Vector Machine on GPUs

The Relevance Vector Machine (RVM) algorithm has been widely utilized in many applications, such as machine learning, image pattern recognition, and compressed sensing. However, the RVM algorithm is computationally expensive. We seek to accelerate the RVM computation for time-sensitive applications by utilizing massively parallel accelerators such as GPUs. In this paper, the computation […]
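For readers unfamiliar with where the cost comes from: in Tipping's standard RVM formulation (a general fact about the algorithm, not a detail taken from this paper), each training iteration re-estimates the weight posterior

\Sigma = \left( \mathbf{A} + \beta \, \Phi^{\top} \Phi \right)^{-1}, \qquad
\mu = \beta \, \Sigma \, \Phi^{\top} \mathbf{t}

where Φ is the N×M design matrix, A = diag(α_1, ..., α_M) holds the hyperparameters and β is the noise precision. The repeated M×M inversion and the Φ^T Φ products are the dense linear-algebra kernels that map naturally onto a GPU.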
Feb, 19

A Generic Approach for Developing Highly Scalable Particle-Mesh Codes for GPUs

We present a general framework for GPU-based low-latency data transfer schemes that can be used for a variety of particle-mesh algorithms [8]. This framework makes it possible to hide the latency of data transfers between GPU-accelerated computing nodes by interleaving them with kernel execution on the GPU. We discuss as an example the fully relativistic […]
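The overlap idea can be illustrated with ordinary CUDA streams. This is a minimal generic sketch, not the paper's particle-mesh framework; the process kernel, buffer names and sizes are invented.

#include <cuda_runtime.h>

// Placeholder kernel standing in for the per-chunk mesh/particle work.
__global__ void process(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;
}

void pipeline()
{
    const int chunkElems = 1 << 20, nChunks = 16;
    float *h_in, *h_out, *d_in[2], *d_out[2];
    cudaMallocHost(&h_in,  (size_t)nChunks * chunkElems * sizeof(float));  // pinned, so async copies can overlap
    cudaMallocHost(&h_out, (size_t)nChunks * chunkElems * sizeof(float));
    cudaStream_t s[2];
    for (int i = 0; i < 2; ++i) {
        cudaMalloc(&d_in[i],  chunkElems * sizeof(float));
        cudaMalloc(&d_out[i], chunkElems * sizeof(float));
        cudaStreamCreate(&s[i]);
    }

    // Ping-pong over two streams: while one chunk is being transferred,
    // the previous chunk's kernel is still executing on the other stream.
    for (int c = 0; c < nChunks; ++c) {
        int i = c % 2;
        cudaMemcpyAsync(d_in[i], h_in + (size_t)c * chunkElems,
                        chunkElems * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        process<<<(chunkElems + 255) / 256, 256, 0, s[i]>>>(d_in[i], d_out[i], chunkElems);
        cudaMemcpyAsync(h_out + (size_t)c * chunkElems, d_out[i],
                        chunkElems * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }
    for (int i = 0; i < 2; ++i) cudaStreamSynchronize(s[i]);
}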
Feb, 19

GPU Accelerated Scalable Parallel Random Number Generators

SPRNG (Scalable Parallel Random Number Generators) is widely used in computational science applications, particularly on parallel systems. The LFG and LCG are two frequently used random number generators in this library. In this paper, LFG and LCG are implemented on GPUs in CUDA. As a library for providing random numbers to GPU scientific applications, GASPRNG […]
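The LCG side is easy to sketch: a 64-bit linear congruential recurrence x_{n+1} = a*x_n + c (mod 2^64), with one independent state per thread. The constants and stream layout below are generic illustrations, not GASPRNG's actual parameterization or leapfrog scheme.

#include <cuda_runtime.h>

// Generic 64-bit LCG: x_{n+1} = a*x_n + c, modulus 2^64 via unsigned overflow.
// Constants are Knuth's MMIX values, used here purely as an example.
__device__ __forceinline__ unsigned long long lcg_next(unsigned long long &state)
{
    state = 6364136223846793005ULL * state + 1442695040888963407ULL;
    return state;
}

// Each thread owns an independent state and fills a strided slice of the output.
__global__ void fill_uniform(float *out, unsigned long long *states, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned long long s = states[tid];
    for (int i = tid; i < n; i += gridDim.x * blockDim.x)
        out[i] = (lcg_next(s) >> 11) * (1.0f / 9007199254740992.0f);  // top 53 bits mapped to [0,1)
    states[tid] = s;   // persist the state for the next call
}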
Feb, 19

Faster File Matching using GPGPUs

We address the problem of file matching by modifying the MD6 algorithm, which is well suited to take advantage of GPU computing. MD6 is a cryptographic hash function that is tree-based and highly parallelizable. When the message M is available initially, the hashing operations can be initiated at different starting points within the message and […]
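The parallelism comes from the tree structure: message chunks are compressed independently at the leaves and the results are combined level by level, so every level is a batch of independent work items. The skeleton below shows only that reduction shape; compress() is a toy stand-in, not the real MD6 compression function.

#include <array>
#include <cstdint>
#include <vector>

// Toy stand-in for the MD6 compression function (NOT the real primitive):
// it just mixes its inputs so the tree-reduction shape can be shown.
using Digest = std::array<uint64_t, 2>;

static Digest compress(const Digest &a, const Digest &b)
{
    return { (a[0] * 0x9E3779B97F4A7C15ULL) ^ b[0],
             (a[1] * 0xC2B2AE3D27D4EB4FULL) ^ b[1] };
}

// Tree hash: leaf digests (already-hashed message chunks) are combined
// pairwise, level by level; each level is embarrassingly parallel, which is
// what a GPU implementation exploits.
static Digest tree_hash(std::vector<Digest> level)
{
    if (level.empty()) return {};
    while (level.size() > 1) {
        std::vector<Digest> next;
        for (size_t i = 0; i + 1 < level.size(); i += 2)
            next.push_back(compress(level[i], level[i + 1]));  // independent work items
        if (level.size() % 2) next.push_back(level.back());    // odd node promoted unchanged
        level.swap(next);
    }
    return level.front();
}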
Feb, 19

Efficiency Considerations of Cauchy Reed-Solomon Implementations on Accelerator and Multi-Core Platforms

The Cauchy variant of the Reed-Solomon algorithm is implemented on accelerator platforms including GPGPU, FPGA, CellBE and ClearSpeed, as well as on an x86 multi-core system. The sustained throughput performance and kernel rates are measured for a 5+3 Reed-Solomon scheme. To compare the different technology platforms, an efficiency metric is introduced and the platforms are categorized […]
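One reason the Cauchy variant ports well to all of these platforms is that it replaces Galois-field multiplications with XORs selected by a binary coding matrix. The CUDA sketch below shows a much-simplified 5+3 layout (5 data blocks, 3 parity blocks) and ignores the bit-matrix packing of a real implementation; names and the matrix contents are hypothetical.

#include <cuda_runtime.h>

#define K 5   // data blocks
#define M 3   // parity blocks

// 0/1 entries derived from the Cauchy coding matrix, uploaded by the host.
__constant__ int codingMatrix[M][K];

// Simplified encoder: parity word w of block p is the XOR of the data words
// selected by the coding matrix. One thread per word offset.
__global__ void encode(const unsigned int *data,    // K blocks, each wordsPerBlock long
                       unsigned int *parity,        // M blocks, each wordsPerBlock long
                       int wordsPerBlock)
{
    int w = blockIdx.x * blockDim.x + threadIdx.x;
    if (w >= wordsPerBlock) return;
    for (int p = 0; p < M; ++p) {
        unsigned int acc = 0;
        for (int d = 0; d < K; ++d)
            if (codingMatrix[p][d])
                acc ^= data[d * wordsPerBlock + w];   // pure XOR, no GF multiplies
        parity[p * wordsPerBlock + w] = acc;
    }
}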
Feb, 19

Using GPU VSIPL & CUDA to Accelerate RF Clutter Simulation

This paper describes a flexible simulator for background Radio Frequency clutter developed at the Georgia Tech Research Institute, and how this simulation was accelerated on NVIDIA GPUs using GPU VSIPL. The paper describes the mathematical basis for the simulation and how it can be used to simulate RF environments and scenarios; introduces […]
Feb, 18

Accelerating Image Feature Comparisons using CUDA on Commodity Hardware

Given multiple images of the same scene, image registration is the process of determining the correct transformation to bring the images into a common coordinate system, i.e., how the images fit together. Feature-based registration applies a transformation function to the input images before performing the correlation step. The result of that transformation, also called feature extraction, […]
Feb, 18

Tetrahedral Interpolation for Deformable Image Registration on GPUs

We speed up the tetrahedral interpolation step of a deformable image registration code called MORFEUS. We implement several versions of the interpolation code on a Fermi GPU (GTX480). Despite the irregularity of the code, we obtained kernel speedups of up to 24.6x, 33.7x and 62.4x on three real-life benchmarks. These numbers do not include the […]
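The interpolation itself is standard barycentric blending: each query point receives w1*v1 + w2*v2 + w3*v3 + w4*v4 over the four vertices of its enclosing tetrahedron. The CUDA sketch below is generic (not MORFEUS's data layout) and assumes the enclosing tetrahedron and weights have already been computed for each point; the irregular gather through the vertex indices is the source of the difficulty the abstract mentions.

#include <cuda_runtime.h>

// One query point per thread: gather the four vertex displacements of the
// enclosing tetrahedron and blend them with precomputed barycentric weights.
__global__ void interpolate(const int4   *tetVerts,   // 4 vertex indices per tetrahedron
                            const float4 *weights,    // barycentric weights per query point
                            const int    *pointTet,   // enclosing tetrahedron per query point
                            const float3 *vertexDisp, // displacement vector at each mesh vertex
                            float3       *out,        // interpolated displacement per query point
                            int nPoints)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= nPoints) return;
    int4   v = tetVerts[pointTet[i]];                 // irregular, data-dependent gather
    float4 w = weights[i];
    float3 a = vertexDisp[v.x], b = vertexDisp[v.y],
           c = vertexDisp[v.z], d = vertexDisp[v.w];
    out[i] = make_float3(w.x * a.x + w.y * b.x + w.z * c.x + w.w * d.x,
                         w.x * a.y + w.y * b.y + w.z * c.y + w.w * d.y,
                         w.x * a.z + w.y * b.z + w.z * c.z + w.w * d.z);
}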
Feb, 18

Optimization of HEP codes on GPUs

Graphics processing units (GPUs) have evolved into high-performance co-processors that can be easily programmed with common high-level languages such as C, Fortran and C++. Today’s GPUs greatly outpace CPUs in arithmetic performance and memory bandwidth, making them the ideal co-processor to accelerate a variety of data-parallel applications. Here, we shall describe the application […]
