high performance computing on graphics processing units: hgpu.org

Posts

Mar, 3

Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs

With the ubiquity of accelerators, such as FPGAs and GPUs, the complexity of high-performance programming is increasing beyond the skill-set of the average scientist in domains outside of computer science. It is thus imperative to decouple programming paradigms and architecture-specific implementation from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric […]

CUDA, OpenCL

Mar, 3

cuSten – CUDA Finite Difference and Stencil Library

In this paper we present cuSten, a new library of functions to handle the implementation of 2D finite-difference/stencil programs in CUDA. cuSten wraps data handling, kernel calls and streaming into four easy-to-use functions that speed up the development of numerical codes on GPU platforms. The paper also presents an example of this library applied […]

CUDA

Feb, 24

An Empirically Guided Optimization Framework for FPGA OpenCL

FPGAs have been demonstrated to be capable of very high performance, especially power-performance, but generally at the cost of hand-tuned HDL code by FPGA experts. OpenCL is the leading industry effort in improving performance-programmability. But while it is recognized that optimizing OpenCL code using published best practices is critical to achieving good performance, even optimized […]

OpenCL

Feb, 24

A Package for Multi-Dimensional Monte Carlo Integration on Multi-GPUs

We have developed a Python package, ZMCintegral, for multi-dimensional Monte Carlo integration on multiple Graphics Processing Units (GPUs). The package employs a stratified sampling and heuristic tree search algorithm. We have built two versions of this package, one with TensorFlow and another with Numba; both support general user-defined functions through a user-friendly interface. We have […]
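To make the stratified sampling idea concrete, here is a minimal CPU-only sketch in plain Python. It is not ZMCintegral's code or API: the function name `stratified_mc_integrate` and all of its parameters are hypothetical, the heuristic tree search step is left out, and nothing runs on a GPU. It only shows the basic technique of splitting the integration box into equal cells, sampling each cell uniformly, and summing the per-cell estimates.

```python
import random


def stratified_mc_integrate(f, bounds, strata_per_dim=4,
                            samples_per_stratum=200, seed=0):
    """Estimate the integral of f over a box using stratified sampling.

    f      -- function taking a list of coordinates and returning a float
    bounds -- list of (low, high) pairs, one per dimension
    All names here are illustrative and not taken from ZMCintegral.
    """
    rng = random.Random(seed)
    dim = len(bounds)

    # Each dimension is cut into `strata_per_dim` equal intervals.
    widths = [(hi - lo) / strata_per_dim for lo, hi in bounds]
    cell_volume = 1.0
    for w in widths:
        cell_volume *= w

    total = 0.0
    # Enumerate every cell by decoding a mixed-radix index.
    for cell in range(strata_per_dim ** dim):
        idx = cell
        lows = []
        for d in range(dim):
            idx, k = divmod(idx, strata_per_dim)
            lows.append(bounds[d][0] + k * widths[d])

        # Plain Monte Carlo inside this one cell.
        acc = 0.0
        for _ in range(samples_per_stratum):
            point = [lows[d] + rng.random() * widths[d] for d in range(dim)]
            acc += f(point)

        # Cell estimate = cell volume * mean of f over the cell's samples.
        total += cell_volume * acc / samples_per_stratum
    return total


if __name__ == "__main__":
    # Integral of x*y over the unit square; the exact value is 0.25.
    print(stratified_mc_integrate(lambda p: p[0] * p[1], [(0.0, 1.0), (0.0, 1.0)]))
```

Because every cell receives the same number of samples, no region of the domain is left unsampled, which typically lowers the variance compared with drawing the same number of points uniformly over the whole box.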

Feb, 24

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

It is important to scale out deep neural network (DNN) training to reduce model training time. High communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our investigations have shown that popular open-source DNN systems could achieve a speedup ratio of only 2.5 on 64 GPUs connected by 56 […]

CUDA

Feb, 24

Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures

Real-time systems are ubiquitous in everyday life, e.g., in safety-critical domains such as automotive, avionics or robotics. The correctness of a real-time system depends not only on the correctness of its calculations, but also on the non-functional requirement of adhering to deadlines. Failing to meet a deadline may lead to severe malfunctions, therefore […]

OpenCL

Feb, 24

DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators

The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by a custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but much remains to be optimized on them. We […]

Feb, 17

TensorFlow.js: Machine Learning for the Web and Beyond

TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript […]

CUDA, OpenGL

Feb, 17

Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

GPU computing is becoming increasingly popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption is expensive. Worse, when a DL application cannot completely use a […]

CUDA

Feb, 17

Software-Defined FPGA Accelerator Design for Mobile Deep Learning Applications

Recently, the field of deep learning has received great attention from the scientific community, and it is used to provide improved solutions to many computer vision problems. Convolutional neural networks (CNNs) have been successfully used to attack problems such as object recognition, object detection, semantic segmentation, and scene understanding. The rapid development of deep learning […]

Feb, 17

DeeperLab: Single-Shot Image Parser

We present a single-shot, bottom-up approach for whole image parsing. Whole image parsing, also known as Panoptic Segmentation, generalizes the tasks of semantic segmentation for ‘stuff’ classes and instance segmentation for ‘thing’ classes, assigning both semantic and instance labels to every pixel in an image. Recent approaches to whole image parsing typically employ separate standalone […]

Feb, 17

GPU Accelerated Keccak (SHA3) Algorithm

Hash functions like SHA-1 or MD5 are among the most important cryptographic primitives, especially in the field of information integrity. Because a growing number of methods have been proposed to break these hash algorithms, a competition for a new family of hash functions was held by the US National Institute of Standards and Technology. […]

CUDA