high performance computing on graphics processing units: hgpu.org

Posts

Feb, 8

Workload distribution and balancing in FPGAs and CPUs with OpenCL and TBB

In this paper we evaluate the performance and energy effectiveness of FPGA and CPU devices for a class of parallel computing applications in which the workload can be distributed in a way that enables simultaneous computing in addition to simple offloading. The FPGA device is programmed via OpenCL using the recent availability of commercial […]

OpenCL

Feb, 8

Integrating GPGPU computations with CPU coroutines in C++

We present results on the integration of two major GPGPU APIs with a reactor-based event processing model in C++ that utilizes coroutines. Given the current lack of a universally usable GPGPU programming interface that gives optimal performance, and ongoing debates about the style of implementing asynchronous computing in C++, we present a working implementation that allows a uniform and seamless […]

CUDA • OpenCL

Feb, 8

Collaborative design and optimization using Collective Knowledge

Designing faster, more energy efficient and reliable computer systems requires effective collaboration between hardware designers, system programmers and performance analysts, as well as feedback from system users. We present Collective Knowledge (CK), an open framework for reproducible and collaborative design and optimization. CK enables systematic and reproducible experimentation, combined with leading edge predictive analytics to […]

OpenCL

Feb, 6

GPU Hackathons, 2016

Background: General-purpose Graphics Processing Units (GPGPUs) potentially offer exceptionally high memory bandwidth and performance for a wide range of applications. The challenge in utilizing such accelerators has been the difficulty in programming them. The OpenACC Directives for Accelerators offer straightforward pragma extensions to C++ and Fortran to address this programming hurdle, but other GPU programming […]

Feb, 6

FPGA Based Implementation of Deep Neural Networks Using On-chip Memory Only

Deep neural networks (DNNs) demand a very large amount of computation and weight storage, and thus efficient implementation using special-purpose hardware is highly desired. In this work, we have developed an FPGA-based fixed-point DNN system that uses only on-chip memory, so that external DRAM is never accessed. The execution time and energy consumption of the developed […]
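The abstract above mentions a fixed-point DNN system but is truncated before giving any details. As a rough illustration of what fixed-point weight storage involves, here is a minimal Python sketch. The function names (`to_fixed`, `fixed_dot`) and the 16-bit word with 8 fractional bits are assumptions chosen for illustration and are not taken from the paper.

```python
def to_fixed(x, frac_bits=8, total_bits=16):
    """Quantize a float to a signed fixed-point integer, saturating on overflow.

    frac_bits and total_bits are illustrative assumptions, not the paper's values.
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    q = int(round(x * scale))
    return max(lo, min(hi, q))


def fixed_dot(a, b, frac_bits=8):
    """Dot product of two fixed-point vectors.

    Each product carries 2 * frac_bits fractional bits, so the accumulated
    sum is shifted right once to return to frac_bits (this floors the result).
    """
    acc = sum(x * y for x, y in zip(a, b))
    return acc >> frac_bits


def to_float(q, frac_bits=8):
    """Convert a fixed-point integer back to a float for inspection."""
    return q / (1 << frac_bits)


weights = [to_fixed(w) for w in (0.5, -0.25)]
inputs = [to_fixed(v) for v in (0.5, 1.0)]
result = to_float(fixed_dot(weights, inputs))  # 0.5*0.5 + (-0.25)*1.0
```

Narrow integer words like these are one reason a whole network's weights can fit in on-chip memory; the actual word lengths used by the authors are not stated in the excerpt.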

Feb, 6

Asynchronous Methods for Deep Reinforcement Learning

We propose a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent for optimization of deep neural network controllers. We present asynchronous variants of four standard reinforcement learning algorithms and show that parallel actor-learners have a stabilizing effect on training allowing all four methods to successfully train neural network controllers. […]

Feb, 6

Impact of data layouts on the efficiency of GPU-accelerated IDW interpolation

This paper focuses on evaluating the impact of different data layouts on the computational efficiency of the GPU-accelerated Inverse Distance Weighting (IDW) interpolation algorithm. First, we redesign and improve our previous GPU implementation, which exploited CUDA dynamic parallelism (CDP). Then we implement three GPU versions, i.e., the naive […]

CUDA
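To make the data-layout question in the IDW entry above concrete, the following Python sketch computes the same interpolated value from an array-of-structures (AoS) layout and a structure-of-arrays (SoA) layout. It is a CPU-side illustration only: the function names, the power parameter default of 2, and the choice of these two layouts are assumptions, since the excerpt does not say which layouts the paper compares.

```python
def idw_aos(points, qx, qy, power=2.0):
    """IDW at (qx, qy); points is a list of (x, y, z) tuples (AoS layout)."""
    num = den = 0.0
    for x, y, z in points:
        d2 = (x - qx) ** 2 + (y - qy) ** 2
        if d2 == 0.0:
            return z  # query coincides with a data point
        w = d2 ** (-power / 2.0)  # weight = 1 / distance**power
        num += w * z
        den += w
    return num / den


def idw_soa(xs, ys, zs, qx, qy, power=2.0):
    """Same computation with coordinates in separate arrays (SoA layout)."""
    num = den = 0.0
    for i in range(len(xs)):
        d2 = (xs[i] - qx) ** 2 + (ys[i] - qy) ** 2
        if d2 == 0.0:
            return zs[i]
        w = d2 ** (-power / 2.0)
        num += w * zs[i]
        den += w
    return num / den


points = [(0.0, 0.0, 1.0), (2.0, 0.0, 3.0)]
xs, ys, zs = (list(col) for col in zip(*points))  # AoS -> SoA conversion

mid_aos = idw_aos(points, 1.0, 0.0)
mid_soa = idw_soa(xs, ys, zs, 1.0, 0.0)
```

Both functions return the same value; on a GPU the layouts differ in how neighbouring threads touch memory, with SoA generally allowing adjacent threads to read adjacent addresses.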

Feb, 6

PRISM-PSY: Precise GPU-Accelerated Parameter Synthesis for Stochastic Systems

In this paper we present PRISM-PSY, a novel tool that performs precise GPU-accelerated parameter synthesis for continuous-time Markov chains and time-bounded temporal logic specifications. We redesign, in terms of matrix-vector operations, the recently formulated algorithms for precise parameter synthesis in order to enable effective data-parallel processing, which results in significant acceleration on many-core architectures. High […]

OpenCL

Feb, 6

EIE: Efficient Inference Engine on Compressed Deep Neural Network

State-of-the-art deep neural networks (DNNs) have hundreds of millions of connections and are both computationally and memory intensive, making them difficult to deploy on embedded systems with limited hardware resources and power budgets. While custom hardware can help the computation, fetching the weights from DRAM can be as much as two orders of magnitude […]

CUDA

Feb, 4

A Performance Analysis Framework for Optimizing OpenCL Applications on FPGAs

Recently, FPGA vendors such as Altera and Xilinx have released OpenCL SDKs for programming FPGAs. However, the architecture of an FPGA is significantly different from that of a CPU or GPU, for which OpenCL was originally designed. Tuning OpenCL code for good performance on FPGAs is still an open problem, since the existing OpenCL tools and models designed […]

OpenCL

Feb, 4

Workshop on Exascale Multi/Many Core Computing Systems, 2016

Context: Exascale computing will revolutionize computational science and engineering by providing 1000x the capabilities of currently available computing systems, while having a similar power footprint. The HPC community is working towards the development of the first Exaflop computer after reaching the Petaflop milestone in 2008. There are concerns that computer designs based on existing […]

Feb, 3

Optimization and Large Scale Computation of an Entropy-Based Moment Closure

We present computational advances and results in the implementation of an entropy-based moment closure, M_N, in the context of linear kinetic equations, with an emphasis on heterogeneous and large-scale computing platforms. Entropy-based closures are known in several cases to yield more accurate results than closures based on standard spectral approximations, such as P_N, but the […]

CUDA

