
Posts

Jun 6

Relativistic Hydrodynamics on Graphic Cards

We show how to accelerate relativistic hydrodynamics simulations using graphics cards (graphics processing units, GPUs). These improvements are highly relevant, e.g., to the field of high-energy nucleus-nucleus collisions at RHIC and the LHC, where (ideal and dissipative) relativistic hydrodynamics is used to calculate the evolution of hot and dense QCD matter. The results reported here […]
Jun 4

Platform 2012, a Many-Core Computing Accelerator for Embedded SoCs: Performance Evaluation of Visual Analytics Applications

P2012 is an area- and power-efficient many-core computing accelerator based on multiple globally asynchronous, locally synchronous processor clusters. Each cluster features up to 16 processors with independent instruction streams sharing a multi-banked, one-cycle-access L1 data memory, a multi-channel DMA engine, and specialized hardware for synchronization and aggressive power management. P2012 is 3D stacking ready […]
Jun 1

MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data […]
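The staging pattern this limitation forces looks roughly like the following sketch (function and buffer names are illustrative, not the paper's API): device data must first be copied into host memory before standard MPI can send it.

    // Manual staging of GPU data through host memory for an MPI send;
    // this copy is the overhead an integrated approach aims to remove.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <cstdlib>

    void send_gpu_buffer(const double* d_buf, int n, int dest) {
        double* h_buf = (double*)malloc(n * sizeof(double));
        cudaMemcpy(h_buf, d_buf, n * sizeof(double), cudaMemcpyDeviceToHost);
        MPI_Send(h_buf, n, MPI_DOUBLE, dest, /*tag=*/0, MPI_COMM_WORLD);
        free(h_buf);
    }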
Jun 1

Generating Device-specific GPU code for Local Operators in Medical Imaging

To cope with the complexity of programming GPU accelerators for medical imaging computations, we developed a framework to describe image processing kernels in a domain-specific language embedded into C++. The description uses decoupled access/execute metadata, which allows the programmer to specify both the execution constraints and the memory access patterns of kernels. A source-to-source compiler […]
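For illustration only (plain C++, not the paper's actual DSL syntax), the decoupled access/execute idea separates what a local operator reads from what it computes:

    // "Access" metadata declares the neighborhood a kernel may read;
    // "Execute" gives the per-pixel computation. A source-to-source
    // compiler can fuse both into a device-specific GPU kernel.
    struct Access {
        int radius = 1;                              // 3x3 window around each pixel
    };
    struct Execute {
        float operator()(const float* win) const {  // win: 9 pixels, row-major
            float sum = 0.0f;
            for (int i = 0; i < 9; ++i) sum += win[i];
            return sum / 9.0f;                       // box blur as a simple local operator
        }
    };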
May 24

Compiler optimizations for directive-based programming for accelerators

Parallel programming is difficult. For regular computation on central processing units, application programming interfaces such as OpenMP, which augment normal sequential programs with preprocessor directives to achieve parallelism, have proven easy for programmers to use and provide good multithreaded performance. OpenACC is a fork of the OpenMP project, which aims to provide a similar […]
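As a minimal sketch of this directive-based style (standard OpenACC clauses, not code from the paper), a single pragma is enough to ask the compiler to offload a loop to an accelerator:

    // SAXPY offloaded with one OpenACC directive; the data clauses
    // tell the compiler which arrays to copy to and from the device.
    void saxpy(int n, float a, const float* x, float* y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }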
May 24

Fine-Grained Resource Sharing for Concurrent GPGPU Kernels

General purpose GPU (GPGPU) programming frameworks such as OpenCL and CUDA allow running individual computation kernels sequentially on a device. However, in some cases it is possible to utilize device resources more efficiently by running kernels concurrently. This raises questions about load balancing and resource allocation that have not previously warranted investigation. For example, what […]
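In CUDA, such concurrency is expressed with streams; a minimal sketch follows (kernel bodies and launch configurations are placeholders):

    // Two independent kernels placed in separate streams may execute
    // concurrently on the device; kernels in one stream run in order.
    #include <cuda_runtime.h>

    __global__ void kernelA(float* x) { /* ... */ }
    __global__ void kernelB(float* y) { /* ... */ }

    void launch_concurrently(float* d_x, float* d_y) {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);
        kernelA<<<64, 256, 0, s1>>>(d_x);   // enqueued in stream s1
        kernelB<<<64, 256, 0, s2>>>(d_y);   // stream s2: may overlap with s1
        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
    }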
May 19

C-DAC’s Efforts – Application Kernels on HPC Cluster with GPU Accelerators

We describe the problem of parallelizing finite difference method (FDM) and finite element method (FEM) computations for a certain class of partial differential equations (PDEs) on a High Performance Computing (HPC) GPU cluster. For FDM, structured grids are employed and optimal data rearrangement operations are performed in the GPU computations. For FEM, unstructured triangular and […]
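A structured-grid FDM kernel typically maps one GPU thread to one grid point; a generic sketch (not the paper's code) for a 2D five-point Laplacian:

    // Five-point finite-difference stencil on a structured 2D grid;
    // interior points only, one thread per point, grid spacing h.
    __global__ void laplacian(const float* u, float* out,
                              int nx, int ny, float h) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i > 0 && i < nx - 1 && j > 0 && j < ny - 1) {
            int k = j * nx + i;   // row-major index into the structured grid
            out[k] = (u[k-1] + u[k+1] + u[k-nx] + u[k+nx] - 4.0f * u[k]) / (h * h);
        }
    }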
May 12

Characterization and Transformation of Unstructured Control Flow in Bulk Synchronous GPU Applications

In this paper, we identify important classes of program control flow in applications targeted at commercially available graphics processing units (GPUs) and characterize their presence in real CUDA and OpenCL workloads. Broadly, control flow can be characterized as structured or unstructured. It is shown that most existing techniques for […]
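A small illustration of the distinction (generic C++, not from the paper): a goto that exits a nested loop creates a branch target that no single loop or conditional encloses, exactly the kind of unstructured flow such transformations must handle.

    // Unstructured control flow: the goto jumps out of both loops at once.
    int find(const int* a, int n, int m, int key) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < m; ++j)
                if (a[i * m + j] == key)
                    goto found;
        return 0;
    found:
        return 1;   // a structured rewrite would use a flag variable instead
    }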
May 9

Enabling task-level scheduling on heterogeneous platforms

OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple heterogeneous devices, regardless of vendor. […]
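The offload path described above starts from a small amount of host code; a minimal sketch using the standard OpenCL C API (error handling omitted, program/kernel setup elided):

    #include <CL/cl.h>

    int main() {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, NULL);
        cl_device_id device;                 // any device: CPU, GPU, or accelerator
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, NULL);
        // ... build a program and kernel, then enqueue work with
        // clEnqueueNDRangeKernel(q, kernel, ...);
        return 0;
    }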
May 4

Generalizing the Utility of GPUs in Large-Scale Heterogeneous Computing Systems

Graphics processing units (GPUs) have been widely used as accelerators in large-scale heterogeneous computing systems. However, current programming models support only the use of local GPUs. To use non-local GPUs, programmers must explicitly call API functions for data communication across computing nodes. As such, programming GPUs in large-scale computing systems is more challenging […]
May 2

GPU Acceleration for the C++ Standard Template Library

Modern programmers must exploit parallelism for performance gains, possibly through the use of an attached or on-chip GPU. To take advantage of the GPU in C++ programs, the programmer must use either a new language (CUDA or OpenCL) or an external library (Thrust). Rather than requiring that programmers learn new tools, modify existing code, and […]
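The library route mentioned above looks like the following with Thrust (the STL-style interface is real Thrust API; the example itself is ours): familiar iterators, but the sort runs on the GPU.

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/copy.h>
    #include <vector>

    int main() {
        std::vector<int> h = {5, 2, 8, 1, 9};
        thrust::device_vector<int> d(h.begin(), h.end());  // host -> device copy
        thrust::sort(d.begin(), d.end());                  // sorted on the GPU
        thrust::copy(d.begin(), d.end(), h.begin());       // device -> host copy
        return 0;
    }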
Apr 25

Radio Astronomy Beam Forming on Many-Core Architectures

Traditional radio telescopes use large steel dishes to observe radio sources. The largest radio telescope in the world, LOFAR, uses tens of thousands of fixed, omnidirectional antennas instead, a novel design that promises ground-breaking research in astronomy. Where traditional telescopes use custom-built hardware, LOFAR uses software to do signal processing in real time. This leads […]
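At its core, beam forming is a phase-weighted sum over antenna signals; a generic sketch of that computation (illustrative only, not LOFAR's pipeline):

    #include <complex>

    std::complex<float> form_beam(const std::complex<float>* samples,
                                  const std::complex<float>* weights,
                                  int n_antennas) {
        std::complex<float> beam(0.0f, 0.0f);
        for (int a = 0; a < n_antennas; ++a)
            beam += weights[a] * samples[a];  // weight encodes per-antenna phase delay
        return beam;
    }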

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
