Posts
Mar, 3
Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation
Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become an increasingly important initiative due to large differences in emerging supercomputer architectures. Here we test some key kernels from molecular dynamics (MD) to determine whether the use of the OpenACC directive-based programming model when applied to these kernels can result in performance […]
Mar, 3
Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL
Heterogeneous systems are the core architecture of most of the High Performance Computing nodes, due to their excellent performance and energy efficiency. However, a key challenge that remains is programmability; specifically, releasing the programmer from the burden of managing data and devices with different architectures. To this end, we extend EngineCL to support FPGA devices. […]
Mar, 3
Application level energy measurements and models for hybrid platform with accelerators
High Performance Computing is essential to continued advancement in many scientific and engineering fields. In recent years, due to the scale of the platforms and the breakdown of laws which had long since supported rapid expansion, energy efficiency has emerged as a new design constraint on HPC platforms and applications. This constraint has increased the […]
Mar, 3
Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs
With the ubiquity of accelerators, such as FPGAs and GPUs, the complexity of high-performance programming is increasing beyond the skill-set of the average scientist in domains outside of computer science. It is thus imperative to decouple programming paradigms and architecture-specific implementation from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric […]
Mar, 3
cuSten – CUDA Finite Difference and Stencil Library
In this paper we present cuSten, a new library of functions to handle the implementation of 2D finite-difference/stencil programs in CUDA. cuSten wraps data handling, kernel calls and streaming into four easy to use functions that speed up development of numerical codes on GPU platforms. The paper also presents an example of this library applied […]
Feb, 24
An Empirically Guided Optimization Framework for FPGA OpenCL
FPGAs have been demonstrated to be capable of very high performance, especially power-performance, but generally at the cost of hand-tuned HDL code by FPGA experts. OpenCL is the leading industry effort in improving performance-programmability. But while it is recognized that optimizing OpenCL code using published best practices is critical to achieving good performance, even optimized […]
Feb, 24
A Package for Multi-Dimensional Monte Carlo Integration on Multi-GPUs
We have developed a Python package ZMCintegral for multi-dimensional Monte Carlo integration on multiple Graphics Processing Units(GPUs). The package employs a stratified sampling and heuristic tree search algorithm. We have built two versions of this package: one with Tensorflow and another with Numba, both support general user defined functions with a user-friendly interface. We have […]
Feb, 24
Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes
It is important to scale out deep neural network (DNN) training for reducing model training time. The high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our investigations have shown that popular open-source DNN systems could only achieve 2.5 speedup ratio on 64 GPUs connected by 56 […]
Feb, 24
Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures
Real-time systems are ubiquitous in our everyday life, e.g., in safety-critical domains such as automotive, avionics or robotics. The correctness of a real-time system does not only depend on the correctness of its calculations, but also on the non-functional requirement of adhering to deadlines. Failing to meet a deadline may lead to severe malfunctions, therefore […]
Feb, 24
DNNVM: End-to-End Compiler Leveraging Heterogeneous Optimizations on FPGA-based CNN Accelerators
The convolutional neural network (CNN) has become a state-of-the-art method for several artificial intelligence domains in recent years. The increasingly complex CNN models are both computation-bound and I/O-bound. FPGA-based accelerators driven by custom instruction set architecture (ISA) achieve a balance between generality and efficiency, but there is much on them left to be optimized. We […]
Feb, 17
TensorFlow.js: Machine Learning for the Web and Beyond
TensorFlow.js is a library for building and executing machine learning algorithms in JavaScript. TensorFlow.js models run in a web browser and in the Node.js environment. The library is part of the TensorFlow ecosystem, providing a set of APIs that are compatible with those in Python, allowing models to be ported between the Python and JavaScript […]
Feb, 17
Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications
GPU computing is becoming increasingly more popular with the proliferation of deep learning (DL) applications. However, unlike traditional resources such as CPU or the network, modern GPUs do not natively support fine-grained sharing primitives. Consequently, implementing common policies such as time sharing and preemption are expensive. Worse, when a DL application cannot completely use a […]