high performance computing on graphics processing units: hgpu.org

Posts

Sep, 15

Code optimization based on source to source transformations using profile guided metrics

Modern high performance processor architectures tackle performance issues by relying heavily on increased vector lengths and advanced memory hierarchies. Manual optimization has become a difficult task. Developers usually trust compilers to address these performance issues automatically, but compilers deploy static performance models and heuristics that force them to remain conservative. On […]

Sep, 15

Parallelizing Multiple Flow Accumulation Algorithm using CUDA and OpenACC

Watershed analysis, as a fundamental component of digital terrain analysis, is based on the Digital Elevation Model (DEM), which is a grid (raster) model of the Earth surface and topography. Watershed analysis consists of computationally and data intensive computing algorithms that need to be implemented by leveraging parallel and high-performance computing methods and techniques. In […]

CUDA

Sep, 15

Characterizing and Predicting Scientific Workloads for Heterogeneous Computing Systems

The next generation of supercomputers will feature a diverse mix of accelerator devices. The increase in heterogeneity is explained by the nature of supercomputing workloads: certain devices offer acceleration, or a shorter time to completion, for particular application programs. Certain characteristics of these programs are fixed and impose fundamental limitations on the workloads regardless of […]

OpenCL

Sep, 15

PySPH: a Python-based framework for smoothed particle hydrodynamics

PySPH is a Python-based framework for particle methods in general and Smoothed Particle Hydrodynamics (SPH) in particular. PySPH allows a user to define a complete SPH simulation using pure Python. High-performance code is generated from this high-level Python code and executed on either multiple cores, or on GPUs, seamlessly. It also supports distributed execution using […]

CUDA • OpenCL

Sep, 15

Efficient Interleaved Batch Matrix Solvers for CUDA

In this paper we present a new methodology for data accesses when solving batches of Tridiagonal and Pentadiagonal matrices that all share the same LHS matrix. By only storing one copy of this matrix there is a significant reduction in storage overheads and the authors show that there is also a performance increase in terms […]

CUDA

Sep, 8

ArborX: A Performance Portable Search Library

Searching for geometric objects that are close in space is a fundamental component of many applications. The performance of search algorithms comes to the forefront as the size of a problem increases both in terms of total object count as well as in the total number of search queries performed. Scientific applications requiring modern leadership-class […]

CUDA

Sep, 8

Fast Code Exploration for Pipeline Processing in FPGA Accelerators

The increasing demand for energy-efficient computing has encouraged the use of Field-Programmable Gate Arrays to create hardware accelerators for large and complex codes. However, implementing such accelerators involves two complex decisions. The first one lies in deciding which code snippet is the best to create an accelerator, and the second one lies in how […]

Sep, 8

FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow

Most sequence-to-sequence (seq2seq) models are autoregressive; they generate each token by conditioning on previously generated tokens. In contrast, non-autoregressive seq2seq models generate all tokens in one pass, which leads to increased efficiency through parallel processing on hardware such as GPUs. However, directly modeling the joint distribution of all tokens simultaneously is challenging, and even with […]

Sep, 8

Compilers for Portable Programming of Heterogeneous Parallel & Approximate Computing Systems

Programming heterogeneous systems such as the System-on-Chip (SoC) processors in modern mobile devices can be extremely complex because a single system may include multiple different parallelism models, instruction sets, and memory hierarchies, and different systems use different combinations of these features. This is further complicated by software and hardware approximate computing optimizations. Different compute units on an […]

OpenCL

Sep, 8

Neural Network Inference on Mobile SoCs

The ever-increasing demand from mobile Machine Learning (ML) applications calls for ever more powerful on-chip computing resources. Mobile devices are empowered with Heterogeneous Multi-Processor Systems on Chips (HMPSoCs) to process ML workloads such as Convolutional Neural Network (CNN) inference. HMPSoCs house several different types of ML-capable components on-die, such as CPU, GPU, and accelerators. These […]

Sep, 1

Compositional Deep Learning in Futhark

We present a design pattern for composing deep learning networks in a typed, higher-order fashion. The exposed library functions are generically typed and the composition structure allows for networks to be trained (using backpropagation) and for trained networks to be used for predicting new results (using forward-propagation). Individual layers in a network can take different […]

CUDA

Sep, 1

Demystifying the MLPerf Benchmark Suite

MLPerf, an emerging machine learning benchmark suite, strives to cover a broad range of applications of machine learning. We present a study on its characteristics and how the MLPerf benchmarks differ from some of the previous deep learning benchmarks like DAWNBench and DeepBench. We find that application benchmarks such as MLPerf (although rich in kernels) […]

CUDA

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
