high performance computing on graphics processing units: hgpu.org

Posts

Oct, 11

Monte Carlo Path Tracing with OpenCL

We introduce an interactive Monte Carlo path tracer that uses the OpenCL framework. A path tracer draws a photo-realistic image of a 3D scene by simulating physical effects of light. Interactivity enables the user to move around the scene in real time, while OpenCL makes it possible to run the same code on either CPU […]

OpenCL

Oct, 11

Performance Improvement of Multichannel Audio by Graphics Processing Units

Multichannel acoustic signal processing has undergone major development in recent years due to the increased complexity of current audio processing applications. People want to collaborate through communication with the feeling of being together and sharing the same environment, which is referred to as Immersive Audio Schemes. In this phenomenon, several acoustic effects are involved: 3D spatial […]

CUDA • OpenCL

Oct, 11

Leo: A Profile-Driven Dynamic Optimization Framework for GPU Applications

Parallel architectures like GPUs are a tantalizing compute fabric for performance-hungry developers. While GPUs enable order-of-magnitude performance increases in many data-parallel application domains, writing efficient codes that can actually manifest those increases is a non-trivial endeavor, typically requiring developers to exercise specialized architectural features exposed directly in the programming model. Achieving good performance on GPUs […]

CUDA

Oct, 10

Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it […]

CUDA • OpenCL

Oct, 10

Code Refinement of Stencil Codes

A straightforward implementation of an algorithm in a general-purpose programming language usually does not deliver peak performance: compilers often fail to automatically tune the code for certain hardware peculiarities like memory hierarchy or vector execution units. Manually tuning the code is firstly error-prone as well as time-consuming and secondly taints the code by exposing those […]

CUDA

Oct, 10

Parallel implementation of linear repetitive processes identification using subspace algorithms

This paper presents a new parallel approach to the identification of linear repetitive processes based on subspace algorithms. Parallel realizations of these algorithms are tested on various graphics cards that use NVIDIA CUDA technology. The paper describes the implementation of subspace identification algorithms and their parallel speedup, efficiency, throughput, and delay. The parallel approach to the identification […]

CUDA

Oct, 10

Accelerating Protein Coordinate Conversion using GPUs

For modeling proteins in conformational states, two methods of representation are used: internal coordinates and Cartesian coordinates. Each of these representations contains a large amount of structural and simulation information. Different processing steps require one or the other representation. Our goal is to rapidly translate between these coordinate spaces so that a scientist can choose […]

CUDA

Oct, 10

FDTD on Distributed Heterogeneous Multi-GPU Systems

Finite-Difference Time-Domain (FDTD) is a popular technique for modeling computational electrodynamics, and is used within many research areas, such as the development of antennas, ultrasound imaging, and seismic wave propagation. Simulating large domains can, however, be very compute- and memory-demanding, which has motivated the use of cluster computing, and lately also the use of […]

CUDA

Oct, 8

cuDNN: Efficient Primitives for Deep Learning

We present a library that provides optimized implementations for deep learning primitives. Deep learning workloads are computationally intensive, and optimizing the kernels of deep learning workloads is difficult and time-consuming. As parallel architectures evolve, kernels must be reoptimized for new processors, which makes maintaining codebases difficult over time. Similar issues have long been addressed in […]

CUDA

Oct, 8

Movement Tracking in Terrain Conditions Accelerated with CUDA

The paper presents a solution to the problem of movement tracking in images acquired from video cameras monitoring outdoor terrain. The solution is resistant to such adverse factors as fluttering leaves, waving grass, smoke or fog, and the movement of clouds. The presented solution is based on well-known image processing methods; nevertheless, the key was […]

CUDA

Oct, 8

KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

KBLAS is a new open source high performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set […]

CUDA

Oct, 8

A Framework for the Volumetric Integration of Depth Images

Volumetric models have become a popular representation for 3D scenes in recent years. One of the breakthroughs leading to their popularity was KinectFusion, where the focus is on 3D reconstruction using RGB-D sensors. However, monocular SLAM has since also been tackled with very similar approaches. Representing the reconstruction volumetrically as a truncated signed distance function […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)