high performance computing on graphics processing units: hgpu.org

Posts

Dec, 20

Towards global composition of performance-aware components for GPU-based systems

An important program optimization, especially for heterogeneous parallel systems, is performance-aware implementation selection: the (static or dynamic) selection between multiple implementation variants for the same computation, depending on the current execution context (such as currently available resources or performance-affecting parameter values). Doing it for multiple component calls inside a program while considering interferences […]

CUDA
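
As a rough illustration of the implementation selection described in the abstract above, the following Python sketch picks one of two variants of the same computation at call time, based on a single context parameter (the input size). The function names, the size threshold, and the choice of sorting as the computation are all assumptions made for this example; they are not taken from the paper.

```python
from typing import Callable, List, Tuple


def sort_insertion(data: List[int]) -> List[int]:
    """Variant A: insertion sort, assumed cheaper for very small inputs."""
    out = list(data)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        # Shift larger elements right until the slot for key is found.
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def sort_builtin(data: List[int]) -> List[int]:
    """Variant B: the built-in sort, assumed better for larger inputs."""
    return sorted(data)


def select_sort(n: int, threshold: int = 32) -> Callable[[List[int]], List[int]]:
    """Choose a variant from the execution context (here only the size n).

    The threshold is a hypothetical tuning value, not a measured one.
    """
    return sort_insertion if n < threshold else sort_builtin


def sort_dispatch(data: List[int], threshold: int = 32) -> Tuple[str, List[int]]:
    """Run the selected variant and report which one was used."""
    variant = select_sort(len(data), threshold)
    return variant.__name__, variant(data)


if __name__ == "__main__":
    print(sort_dispatch([3, 1, 2]))
    print(sort_dispatch(list(range(100, 0, -1)))[0])
```

A real system would presumably base the choice on measured or predicted performance and on available resources such as a GPU, rather than on a fixed size threshold.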

Dec, 19

Optimizing GPU to GPU Communication on Cray XK7

When developing an application for Cray XK7 systems, optimization of compute kernels is only a small part of maximizing scaling and performance. Programmers must consider the effect of the GPU’s distinct address space and the PCIe bus on application scalability. Without such considerations applications rapidly become limited by transfers to and from the GPU and […]

CUDA

Dec, 19

Experiences Porting a Molecular Dynamics Code to GPUs on a Cray XK7

GPU computing has rapidly gained popularity as a way to achieve higher performance of many scientific applications. In this paper we report on the experience of porting a hybrid MPI+OpenMP molecular dynamics code to a GPU-enabled Cray XK7 to make a hybrid MPI+GPU code. The target machine, Indiana University’s Big Red II, consists of a […]

CUDA

Dec, 19

A Data-Driven Model for Anisotropic Heterogeneous Subsurface Scattering

We present a new BSSRDF representation for editing measured anisotropic heterogeneous translucent materials, such as veined marble, jade, and artificial stones with lighting-blocking discontinuities. Our work is inspired by the SubEdit representation introduced in [1]. Our main contribution is to improve the accuracy of the approximation while keeping it compact and efficient for editing. We decompose the […]

CUDA

Dec, 19

A Two-stage Query by Singing/Humming System on GPU

This paper proposes the use of a GPU (graphics processing unit) to implement a two-stage comparison method for a QBSH (query by singing/humming) system. The system can take a user’s singing or humming and retrieve the top-10 most likely candidates from a database of 8431 songs. In order to speed up the comparison, we apply linear […]

CUDA

Dec, 19

Heterogeneous Programming with Single Operation Multiple Data

Heterogeneity is omnipresent in today’s commodity computational systems, which comprise at least one multi-core Central Processing Unit (CPU) and one Graphics Processing Unit (GPU). Nonetheless, all this computing power is not being exploited in mainstream computing, as the programming of these systems entails many details of the underlying architecture and of its distinct execution models. […]

CUDA

Dec, 18

Tesla vs. Xeon Phi vs. Radeon: A Compiler Writer’s Perspective

Today, most CPU+Accelerator systems incorporate NVIDIA GPUs. Intel Xeon Phi and the continued evolution of AMD Radeon GPUs make it likely we will soon see, and want to program, a wider variety of CPU+Accelerator systems. PGI already supports NVIDIA GPUs, and is working to add support for Xeon Phi and AMD Radeon. Here we explore […]

OpenCL

Dec, 18

Fast Image Alignment with Fourier Moment Matching on GPU

In this paper, we develop a fast and accurate image alignment system which can be applied to image sequences in real time. The proposed image alignment system consists of two main components: the development of a Fourier moment matching system and the implementation of the system on a GPU. The Fourier moment matching is to efficiently find […]

CUDA

Dec, 18

Efficient Multi-GPU Computation of All-Pairs Shortest Paths

We describe a new algorithm for solving the all-pairs shortest-path (APSP) problem for planar graphs and graphs with small separators that exploits the massive on-chip parallelism available in today’s Graphics Processing Units (GPUs). Our algorithm, based on the Floyd-Warshall algorithm, has near optimal complexity in terms of the total number of operations, while its matrix-based […]

CUDA
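
For readers unfamiliar with the baseline named in the abstract above, here is a minimal sketch of the classic sequential Floyd-Warshall algorithm in Python. It is not the paper's multi-GPU algorithm, which the truncated abstract does not describe in full; the example graph and the function name are assumptions made for illustration.

```python
import math
from typing import List


def floyd_warshall(weights: List[List[float]]) -> List[List[float]]:
    """Return all-pairs shortest-path distances for a dense weight matrix.

    weights[i][j] is the weight of edge i -> j, or math.inf if there is
    no such edge. Assumes the graph has no negative cycles.
    """
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is not modified
    for i in range(n):
        dist[i][i] = min(dist[i][i], 0)
    # Allow each vertex k in turn as an intermediate hop.
    for k in range(n):
        for i in range(n):
            dik = dist[i][k]
            if dik == math.inf:
                continue  # no path i -> k, so k cannot improve row i
            for j in range(n):
                candidate = dik + dist[k][j]
                if candidate < dist[i][j]:
                    dist[i][j] = candidate
    return dist


INF = math.inf
# Hypothetical 4-vertex directed graph used only for this example.
graph = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
distances = floyd_warshall(graph)

if __name__ == "__main__":
    for row in distances:
        print(row)
```

The triple loop performs on the order of n^3 operations for n vertices, and for a fixed k the updates over i and j are independent of each other, which is the kind of regular, matrix-based structure that maps well onto GPUs.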

Dec, 18

A comparative analysis of the performance and deployment overhead of parallelized Finite Difference Time Domain (FDTD) algorithms on a selection of high performance multiprocessor computing systems

The parallel FDTD method as used in computational electromagnetics is implemented on a variety of different high performance computing platforms. These parallel FDTD implementations have regularly been compared in terms of performance or purchase cost, but very little systematic consideration has been given to how much effort has been used to create the parallel FDTD […]

CUDA

Dec, 18

GPU Accelerated Semiclassical Initial Value Representation Molecular Dynamics

This paper presents a graphics processing unit (GPU) implementation of the semiclassical initial value representation (SC-IVR) propagator for vibrational molecular spectroscopy calculations. The time-averaging formulation of the SC-IVR for power spectrum calculations is employed. Details about the CUDA implementation of the semiclassical code are provided. Four molecules with an increasing number of atoms are considered […]

CUDA

Dec, 17

Data Structures for Task-based Priority Scheduling

Many task-parallel applications can benefit from attempting to execute tasks in a specific order, as for instance indicated by priorities associated with the tasks. We present three lock-free data structures for priority scheduling with different trade-offs on scalability and ordering guarantees. First we propose a basic extension to work-stealing that provides good scalability, but cannot […]

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
