high performance computing on graphics processing units: hgpu.org

Posts

Oct, 11

Interactive Simulations with Navier-Stokes Equations on many-core Architectures

The Navier-Stokes equations are a mathematical model that describes the behaviour of fluids. They have proven to represent real fluid flows quite well and are the basis for many fluid simulations. In order to exploit the performance provided by modern many-core systems, fluid simulation algorithms must be able to efficiently solve the Navier-Stokes equations in parallel. The […]

OpenCL • OpenGL

Oct, 11

Monte Carlo Path Tracing with OpenCL

We introduce an interactive Monte Carlo path tracer that uses the OpenCL framework. A path tracer draws a photo-realistic image of a 3D scene by simulating physical effects of light. Interactivity enables the user to move around the scene in real time, while OpenCL makes it possible to run the same code on either CPU […]

OpenCL

Oct, 11

Performance Improvement of Multichannel Audio by Graphics Processing Units

Multichannel acoustic signal processing has undergone major development in recent years due to the increased complexity of current audio processing applications. People want to collaborate through communication with the feeling of being together and sharing the same environment, which is known as Immersive Audio Schemes. In this phenomenon, several acoustic effects are involved: 3D spatial […]

CUDA • OpenCL

Oct, 11

Leo: A Profile-Driven Dynamic Optimization Framework for GPU Applications

Parallel architectures like GPUs are a tantalizing compute fabric for performance-hungry developers. While GPUs enable order-of-magnitude performance increases in many data-parallel application domains, writing efficient codes that can actually manifest those increases is a non-trivial endeavor, typically requiring developers to exercise specialized architectural features exposed directly in the programming model. Achieving good performance on GPUs […]

CUDA

Oct, 10

Introducing SLAMBench, a performance and accuracy benchmarking methodology for SLAM

Real-time dense computer vision and SLAM offer great potential for a new level of scene modelling, tracking and real environmental interaction for many types of robot, but their high computational requirements mean that use on mass market embedded platforms is challenging. Meanwhile, trends in low-cost, low-power processing are towards massive parallelism and heterogeneity, making it […]

CUDA • OpenCL

Oct, 10

Code Refinement of Stencil Codes

A straightforward implementation of an algorithm in a general-purpose programming language usually does not deliver peak performance: compilers often fail to automatically tune the code for certain hardware peculiarities like memory hierarchy or vector execution units. Manually tuning the code is, first, error-prone and time-consuming and, second, taints the code by exposing those […]

CUDA

Oct, 10

Parallel implementation of linear repetitive processes identification using subspace algorithms

This paper presents a new parallel approach to the identification of linear repetitive processes based on subspace algorithms. Parallel realizations of these algorithms are tested on various graphics cards that use NVIDIA CUDA technology. The paper describes the implementation of subspace identification algorithms and their parallel speedup, efficiency, throughput, and delay. The parallel approach to the identification […]

CUDA

Oct, 10

Accelerating Protein Coordinate Conversion using GPUs

For modeling proteins in conformational states, two methods of representation are used: internal coordinates and Cartesian coordinates. Each of these representations contains a large amount of structural and simulation information. Different processing steps require one or the other representation. Our goal is to rapidly translate between these coordinate spaces so that a scientist can choose […]

CUDA

Oct, 10

FDTD on Distributed Heterogeneous Multi-GPU Systems

Finite-Difference Time-Domain (FDTD) is a popular technique for modeling computational electrodynamics, and is used within many research areas, such as the development of antennas, ultrasound imaging, and seismic wave propagation. Simulating large domains can, however, be very compute- and memory-intensive, which has motivated the use of cluster computing, and lately also the use of […]

CUDA

Oct, 8

cuDNN: Efficient Primitives for Deep Learning

We present a library that provides optimized implementations for deep learning primitives. Deep learning workloads are computationally intensive, and optimizing the kernels of deep learning workloads is difficult and time-consuming. As parallel architectures evolve, kernels must be reoptimized for new processors, which makes maintaining codebases difficult over time. Similar issues have long been addressed in […]

CUDA

Oct, 8

Movement Tracking in Terrain Conditions Accelerated with CUDA

The paper presents a solution to the problem of movement tracking in images acquired from video cameras monitoring outdoor terrain. The solution is resistant to adverse factors such as fluttering leaves, waving grass, smoke or fog, and moving clouds. The presented solution is based on well-known image processing methods; nevertheless, the key was […]

CUDA

Oct, 8

KBLAS: An Optimized Library for Dense Matrix-Vector Multiplication on GPU Accelerators

KBLAS is a new open-source, high-performance library that provides optimized kernels for a subset of Level 2 BLAS functionalities on CUDA-enabled GPUs. Since the performance of dense matrix-vector multiplication is hindered by the overhead of memory accesses, a double-buffering optimization technique is employed to overlap data motion with computation. After identifying a proper set […]

CUDA
