high performance computing on graphics processing units: hgpu.org

Posts

May, 16

C++ on GPUs Using OpenACC and the PGI Accelerator Compilers, webinar

The fastest supercomputers and clusters use a 64-bit host processor with one or more accelerators per node, most commonly GPUs. These compute accelerators exploit a high degree of parallelism to maximize performance and power efficiency. There are several challenges to effective and productive use of accelerators, the most important of which are managing data movement […]

May, 16

Using GPUs to Accelerate Orthorectification, Atmospheric Correction, and Transformations for Big Data, webinar

Significant improvements in speed for imagery orthorectification, atmospheric correction, and image transformations like Independent Components Analysis (ICA) have been achieved using GPU-based implementations. Additional optimizations, when combined with GPU processing capabilities, can provide a 50x to 100x reduction in the time required to process large imagery. Exelis Visual Information Solutions (VIS) has implemented a CUDA-based […]

May, 16

Scaling Coupled Climate Models to Exascale: OpenACC-enabled ECEarth3 Earth System Model

Climate change due to increasing anthropogenic greenhouse gases and land surface change is currently one of the most relevant environmental concerns. It threatens ecosystems and human societies. However, its impact on the economy and our living standards depends largely on our ability to anticipate its effects and take appropriate action. Earth System Models (ESMs), such […]

CUDA

May, 16

Porting NAHUJ to CUDA

This white paper reports on an enabling effort that involves porting a legacy 2D fluid dynamics Fortran code to NVIDIA GPUs. Given the complexity of both the code and the underlying (custom) numerical method, the natural choice was to use NVIDIA CUDA C to achieve the best possible performance. We achieved over 4.5x speed-up on a single K20 […]

CUDA

May, 16

Enabling CP2K Application for Exascale Computing with Accelerators using OpenACC and OpenCL

CP2K is an application for atomistic and molecular simulation and, with its excellent scalability, is particularly important with regard to use on future exascale systems. The code is well parallelized using MPI and hybrid MPI/OpenMP, typically scaling well to ~1 core per atom in the system. The research on CP2K done within PRACE-1IP stated that […]

CUDA • OpenCL

May, 16

Hybrid Use of OmpSs for a Shock Hydrodynamics Proxy Application

The LULESH proxy application models the behavior of the ALE3D multi-physics code with an explicit shock hydrodynamics problem. It is designed to evaluate interactions between programming models and architectures, using a representative code significantly less complex than the application it models. As identified in the PRACE deliverable D7.2.1 [1], the OmpSs programming model […]

CUDA

May, 16

A Straightforward Preprocessing Approach for Accelerating Convex Hull Computations on the GPU

An effective strategy for accelerating the calculation of convex hulls for point sets is to filter the input points by discarding interior points. In this paper, we present such a straightforward and efficient preprocessing approach by exploiting the GPU. The basic idea behind our approach is to discard the points that lie inside a convex […]

CUDA
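The interior-point filtering idea in the abstract above can be illustrated with a small CPU sketch. The abstract is truncated, so the exact convex shape the paper uses is not known here; as a stand-in, this sketch uses the classic Akl-Toussaint heuristic, which discards points strictly inside the quadrilateral spanned by the four axis-extreme points. The function names `cross` and `filter_interior` are hypothetical, and the code is plain sequential Python rather than the paper's GPU implementation.

```python
def cross(o, a, b):
    """Twice the signed area of triangle (o, a, b); positive if b is left of o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def filter_interior(points):
    """Drop points strictly inside the quadrilateral of axis-extreme points.

    Assumption: Akl-Toussaint style filter, not necessarily the paper's method.
    The surviving points have the same convex hull as the input.
    """
    if len(points) < 4:
        return list(points)

    # Extreme points, with ties broken so the order below is counter-clockwise.
    left = min(points)                                  # smallest x, then smallest y
    right = max(points)                                 # largest x, then largest y
    bottom = min(points, key=lambda p: (p[1], p[0]))    # smallest y, then smallest x
    top = max(points, key=lambda p: (p[1], p[0]))       # largest y, then largest x
    quad = [left, bottom, right, top]

    kept = []
    for p in points:
        # Strictly inside means strictly left of every counter-clockwise edge.
        # A degenerate edge (two coinciding extremes) yields cross == 0, so in
        # that case nothing is discarded, which is conservative but safe.
        inside = all(cross(quad[i], quad[(i + 1) % 4], p) > 0 for i in range(4))
        if not inside:
            kept.append(p)
    return kept


if __name__ == "__main__":
    pts = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2),
           (2, -1), (5, 2), (2, 5), (-1, 2)]
    print(filter_interior(pts))
```

Each point is tested independently of the others, which is why this kind of filter maps naturally onto one GPU thread per point; the reduced point set is then passed to any standard convex hull algorithm.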

May, 15

Multi-GPGPU Cellular Automata Simulations using OpenACC

The Frisch-Hasslacher-Pomeau (FHP) model is a lattice gas cellular automaton designed to simulate fluid flows using exact, purely Boolean arithmetic, without any round-off error. Here we investigate the problem of porting it efficiently to clusters of Fermi-class graphics processing units. To this end, two multi-GPU implementations were developed and examined: one using the NVIDIA […]

CUDA

May, 15

Real-time Image Processing on Low Cost Embedded Computers

In 2012 a federal mandate was imposed that required the FAA to integrate unmanned aerial systems (UAS) into the national airspace (NAS) by 2015 for civilian and commercial use. A significant driver for the increasing popularity of these systems is the rise in open hardware and open software solutions which allow hobbyists to build small […]

May, 15

Parallelization of Shape Diameter Function Computation using OpenCL

The Shape Diameter Function (SDF) is a scalar function that expresses a measure of the diameter of the object’s volume in the neighborhood of each point on the surface of an input mesh. It is fundamental to many computer graphics applications and is used for consistent mesh partitioning and skeletonization. The algorithm sends several rays inside a […]

OpenCL

May, 15

Performance Optimization of GPU ELF-Codes

GPUs (Graphics Processing Units) are of interest for their favorable ratio of GF/s to price. Compared with their beginnings in the early 1980s, today's GPU architectures are more similar to general-purpose architectures, but with (much) larger numbers of cores. The GF100 architecture released by NVIDIA in 2009-2010, for example, has a true hardware cache hierarchy, a […]

CUDA

May, 15

Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers

In this survey paper, we review recent work on frameworks for the high-level, portable programming of heterogeneous multi-/manycore systems (especially, GPU-based systems) using high-level constructs such as annotated user-level software components, skeletons (i.e., predefined generic components) and containers, and discuss the optimization problems that need to be considered in selecting among multiple implementation variants, generating […]

CUDA • OpenCL

Recent source codes

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Data-efficient LLM Fine-tuning for Code Generation

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

GigaAPI: a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential

Coccinelle: a C code transformation engine using SmPL for matches, refactorings, and bug fixing

DuoReduce: MLIR's benchmark

Shamrock: Multi-GPU hydrodynamics for astrophysics

LLMPerf: GPU Performance Modeling meets Large Language Models