high performance computing on graphics processing units: hgpu.org

Posts

Jun, 27

Lettuce: PyTorch-based Lattice Boltzmann Framework

The lattice Boltzmann method (LBM) is an efficient simulation technique for computational fluid mechanics and beyond. It is based on a simple stream-and-collide algorithm on Cartesian grids, which is easily compatible with modern machine learning architectures. While it is becoming increasingly clear that deep learning can provide a decisive stimulus for classical simulation techniques, recent […]

CUDA

Jun, 20

GPUAPI: Multi-level Chapel Runtime API for GPUs

Chapel is inherently well suited not only for homogeneous nodes but also for heterogeneous nodes because it employs the concept of locales, distributed domains, forall/reduce constructs, and implicit communications. However, there is still room for improvement in GPU support in Chapel. This paper addresses some of the key limitations of past approaches […]

CUDA • OpenCL

Jun, 20

Study and evaluation of improved automatic GPU offloading method

With the slowing down of Moore’s law, the use of hardware other than CPUs, such as graphics processing units (GPUs) or field-programmable gate arrays (FPGAs), is increasing. However, when using heterogeneous hardware other than CPUs, the technical skill barriers, such as for compute unified device architecture (CUDA) and open computing language (OpenCL), are high. Therefore, I […]

CUDA • OpenCL

Jun, 20

Benchmarking the Nvidia GPU Lineage: From Early K80 to Modern A100 with Asynchronous Memory Transfers

For many, Graphics Processing Units (GPUs) provide a source of reliable computing power. Recently, Nvidia introduced its 9th generation HPC-grade GPUs, the Ampere 100, claiming significant performance improvements over previous generations, particularly for AI workloads, as well as introducing new architectural features such as asynchronous data movement. But how well does the A100 perform on non-AI […]

Jun, 20

StreamBrain: An HPC Framework for Brain-like Neural Networks on CPUs, GPUs and FPGAs

The modern deep learning method based on backpropagation has surged in popularity and has been used in multiple domains and application areas. At the same time, there are other — less-known — machine learning algorithms with a mature and solid theoretical foundation whose performance remains unexplored. One such example is the brain-like Bayesian Confidence Propagation […]

CUDA

Jun, 20

Experience Report: Writing A Portable GPU Runtime with OpenMP 5.1

GPU runtimes are historically implemented in CUDA or other vendor specific languages dedicated to GPU programming. In this work we show that OpenMP 5.1, with minor compiler extensions, is capable of replacing existing solutions without a performance penalty. The result is a performant and portable GPU runtime that can be compiled with LLVM/Clang to Nvidia […]

CUDA

Jun, 6

DLIO: A Data-Centric Benchmark for Scientific Deep Learning Applications

Deep learning has been shown to be a successful method for various tasks, and its popularity has resulted in numerous open-source deep learning software tools. Deep learning has been applied to a broad spectrum of scientific domains such as cosmology, particle physics, computer vision, fusion, and astrophysics. Scientists have performed a great deal of work to optimize […]

Jun, 6

Data-Driven Analysis and Design of Vulkan Ray-Tracing Applications using Automatic Instrumentation

Modern graphics Application Programming Interfaces (APIs) provide first-class support for ray tracing. Hardware vendors implement drivers for the graphics API including a black-box compiler. The black-box compiler creates architecture-specific binaries that leverage ray-tracing hardware acceleration. Ray-tracing support in modern APIs allows all geometry and shaders to be specified for a single execution. Thus, ray tracing […]

Jun, 6

Optimization of Heterogeneous Systems with AI Planning Heuristics and Machine Learning: A Performance and Energy Aware Approach

Heterogeneous computing systems provide high performance and energy efficiency. However, to optimally utilize such systems, solutions that distribute the work across host CPUs and accelerating devices are needed. In this paper, we present a performance and energy aware approach that combines AI planning heuristics for parameter space exploration with a machine learning model for performance […]

Jun, 6

Exploiting co-execution with oneAPI: heterogeneity from a modern perspective

Efficiently programming heterogeneous systems is a major challenge due to the complexity of their architectures. Intel oneAPI, a new and powerful standards-based unified programming model, built on top of SYCL, addresses these issues. In this paper, oneAPI is provided with co-execution strategies to run the same kernel between different devices, enabling the exploitation of static […]

Jun, 6

Early Experiences Migrating CUDA codes to oneAPI

The heterogeneous computing paradigm represents a real programming challenge due to the proliferation of devices with different hardware characteristics. Recently Intel introduced oneAPI, a new programming environment that allows code developed in DPC++ to be run on different devices such as CPUs, GPUs, and FPGAs, among others. This paper presents our first experiences in porting two […]

CUDA

May, 30

Using Workload Characterization to Guide High Performance Graph Processing

Graph analytics represent an important application domain widely used in many fields such as web graphs, social networks, and Bayesian networks. The sheer size of the graph data sets, combined with the irregular nature of the underlying problem, poses a significant challenge for performance, scalability, and power efficiency of graph processing. With the exponential growth […]

OpenCL

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
