high performance computing on graphics processing units: hgpu.org

Posts

Feb, 8

Portable Programming Models for Heterogeneous Platforms

With the end of Dennard scaling and the emergence of dark silicon, the bets are high on heterogeneous architectures to achieve both application performance and energy efficiency. However, diversity in heterogeneous architectures poses severe programming challenges in terms of data layout, memory coherence, task partitioning, data distribution, and sharing of virtual addresses. Existing high-level programming languages […]

OpenCL

Feb, 8

High performance high-order numerical methods: applications in ocean modeling

This thesis presents high-order numerical methods for time-dependent simulations of oceanic wave propagation on modern many-core hardware architectures. Simulation of waves such as tsunamis is challenging because of varying fluid depths, propagation across many regions, the requirement of high resolution near the shore, complex nonlinear wave phenomena, and the need for faster-than-real-time predictions. […]

CUDA • OpenCL

Feb, 8

Utilizing GPUs to Accelerate Turbomachinery CFD Codes

GPU computing has established itself as a way to accelerate parallel codes in the high performance computing world. This work focuses on speeding up APNASA, a legacy CFD code used at NASA Glenn Research Center, while also drawing conclusions about the nature of GPU computing and the requirements to make GPGPU worthwhile on legacy codes. […]

Feb, 8

Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB

In this paper we evaluate the performance and energy effectiveness of FPGA and CPU devices for a class of parallel computing applications in which the workload can be distributed in a way that enables simultaneous computing in addition to simple offloading. The FPGA device is programmed via OpenCL using the recent availability of commercial […]

OpenCL

Feb, 8

Integrating GPGPU computations with CPU coroutines in C++

We present results on the integration of two major GPGPU APIs with a reactor-based event processing model in C++ that utilizes coroutines. Given the current lack of a universally usable GPGPU programming interface that gives optimal performance, and the ongoing debates about the style of implementing asynchronous computing in C++, we present a working implementation that allows a uniform and seamless […]

CUDA • OpenCL

Feb, 8

Collaborative design and optimization using Collective Knowledge

Designing faster, more energy efficient and reliable computer systems requires effective collaboration between hardware designers, system programmers and performance analysts, as well as feedback from system users. We present Collective Knowledge (CK), an open framework for reproducible and collaborative design and optimization. CK enables systematic and reproducible experimentation, combined with leading edge predictive analytics to […]

OpenCL

Feb, 6

GPU Hackathons, 2016

Background: General-purpose Graphics Processing Units (GPGPUs) potentially offer exceptionally high memory bandwidth and performance for a wide range of applications. The challenge in utilizing such accelerators has been the difficulty in programming them. The OpenACC Directives for Accelerators offer straightforward pragma extensions to C++ and Fortran to address this programming hurdle, but other GPU programming […]

Feb, 6

FPGA Based Implementation of Deep Neural Networks Using On-chip Memory Only

Deep neural networks (DNNs) demand a very large amount of computation and weight storage, and thus efficient implementation using special-purpose hardware is highly desired. In this work, we have developed an FPGA-based fixed-point DNN system that uses only on-chip memory, so that external DRAM is never accessed. The execution time and energy consumption of the developed […]

Feb, 6

Asynchronous Methods for Deep Reinforcement Learning

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training, allowing all four methods to successfully train neural network controllers. […]

Feb, 6

Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation

This paper focuses on evaluating the impact of different data layouts on the computational efficiency of the GPU-accelerated Inverse Distance Weighting (IDW) interpolation algorithm. First, we redesign and improve our previous GPU implementation, which exploited CUDA dynamic parallelism (CDP). Then we implement three GPU versions, i.e., the naive […]

CUDA
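
As a rough illustration of what "data layout" means for IDW, the sketch below computes the standard IDW estimate (a weighted mean with weights 1/d^p) over the same samples stored in two ways. The two layouts shown, array-of-structures and structure-of-arrays, are assumptions chosen for illustration; the truncated abstract does not say which layouts the paper compares. The function names are hypothetical, and this is plain CPU Python, not the paper's CUDA code.

```python
def idw_aos(points, qx, qy, power=2.0):
    """IDW estimate at (qx, qy); points is a list of (x, y, z) tuples.

    Array-of-structures: each sample's fields are stored together.
    """
    num = 0.0
    den = 0.0
    for x, y, z in points:
        d2 = (x - qx) ** 2 + (y - qy) ** 2
        if d2 == 0.0:
            return z  # query coincides with a sample point
        w = 1.0 / d2 ** (power / 2.0)  # weight = 1 / distance**power
        num += w * z
        den += w
    return num / den


def idw_soa(xs, ys, zs, qx, qy, power=2.0):
    """Same estimate; each coordinate lives in its own array.

    Structure-of-arrays: all x values are contiguous, then all y, then all z.
    """
    num = 0.0
    den = 0.0
    for i in range(len(xs)):
        d2 = (xs[i] - qx) ** 2 + (ys[i] - qy) ** 2
        if d2 == 0.0:
            return zs[i]
        w = 1.0 / d2 ** (power / 2.0)
        num += w * zs[i]
        den += w
    return num / den


# Two samples equally distant from the query point get equal weight.
samples = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
xs, ys, zs = (list(col) for col in zip(*samples))
estimate_aos = idw_aos(samples, 1.0, 0.0)
estimate_soa = idw_soa(xs, ys, zs, 1.0, 0.0)
```

Both functions return the same value; only the memory arrangement differs. On a GPU, a structure-of-arrays arrangement generally lets neighbouring threads read neighbouring addresses, which is one common reason layout can affect throughput.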

Feb, 6

PRISM-PSY: Precise GPU-Accelerated Parameter Synthesis for Stochastic Systems

In this paper we present PRISM-PSY, a novel tool that performs precise GPU-accelerated parameter synthesis for continuous-time Markov chains and time-bounded temporal logic specifications. We redesign, in terms of matrix-vector operations, the recently formulated algorithms for precise parameter synthesis in order to enable effective data-parallel processing, which results in significant acceleration on many-core architectures. High […]

OpenCL

Feb, 6

EIE: Efficient Inference Engine on Compressed Deep Neural Network

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware can help the computation, fetching the weights from DRAM can be as much as two orders of magnitude […]

CUDA
