18779

Posts

Mar, 10

On the Portability of GPU-Accelerated Applications via Automated Source-to-Source Translation

Over the past decade, accelerator-based supercomputers have grown from 0% to 42% performance share on the TOP500. Ideally, GPUaccelerated code on such systems should be "write once, run anywhere," regardless of the GPU device (or for that matter, any parallel device, e.g., CPU or FPGA). In practice, however, portability can be significantly more limited due […]
Mar, 10

Custom Code Generation for a Graph DSL

Graph algorithms are at the heart of several applications, and achieving high performance with them has become critical due to the tremendous growth of irregular data. However, irregular algorithms are quite challenging to parallelize automatically, due to access patterns influenced by the input graph, which is unavailable until execution. Former research has addressed this issue […]
Mar, 10

GraphVite: A High-Performance CPU-GPU Hybrid System for Node Embedding

Learning continuous representations of nodes is attracting growing interest in both academia and industry recently, due to their simplicity and effectiveness in a variety of applications. Most of existing node embedding algorithms and systems are capable of processing networks with hundreds of thousands or a few millions of nodes. However, how to scale them to […]
Mar, 3

Using Compiler Directives for Performance Portability in Scientific Computing: Kernels from Molecular Simulation

Achieving performance portability for high-performance computing (HPC) applications in scientific fields has become an increasingly important initiative due to large differences in emerging supercomputer architectures. Here we test some key kernels from molecular dynamics (MD) to determine whether the use of the OpenACC directive-based programming model when applied to these kernels can result in performance […]
Mar, 3

Cooperative CPU, GPU, and FPGA heterogeneous execution with EngineCL

Heterogeneous systems are the core architecture of most of the High Performance Computing nodes, due to their excellent performance and energy efficiency. However, a key challenge that remains is programmability; specifically, releasing the programmer from the burden of managing data and devices with different architectures. To this end, we extend EngineCL to support FPGA devices. […]
Mar, 3

Application level energy measurements and models for hybrid platform with accelerators

High Performance Computing is essential to continued advancement in many scientific and engineering fields. In recent years, due to the scale of the platforms and the breakdown of laws which had long since supported rapid expansion, energy efficiency has emerged as a new design constraint on HPC platforms and applications. This constraint has increased the […]
Mar, 3

Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs

With the ubiquity of accelerators, such as FPGAs and GPUs, the complexity of high-performance programming is increasing beyond the skill-set of the average scientist in domains outside of computer science. It is thus imperative to decouple programming paradigms and architecture-specific implementation from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric […]
Mar, 3

cuSten – CUDA Finite Difference and Stencil Library

In this paper we present cuSten, a new library of functions to handle the implementation of 2D finite-difference/stencil programs in CUDA. cuSten wraps data handling, kernel calls and streaming into four easy to use functions that speed up development of numerical codes on GPU platforms. The paper also presents an example of this library applied […]
Feb, 24

An Empirically Guided Optimization Framework for FPGA OpenCL

FPGAs have been demonstrated to be capable of very high performance, especially power-performance, but generally at the cost of hand-tuned HDL code by FPGA experts. OpenCL is the leading industry effort in improving performance-programmability. But while it is recognized that optimizing OpenCL code using published best practices is critical to achieving good performance, even optimized […]
Feb, 24

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

It is important to scale out deep neural network (DNN) training for reducing model training time. The high communication overhead is one of the major performance bottlenecks for distributed DNN training across multiple GPUs. Our investigations have shown that popular open-source DNN systems could only achieve 2.5 speedup ratio on 64 GPUs connected by 56 […]
Feb, 24

A Package for Multi-Dimensional Monte Carlo Integration on Multi-GPUs

We have developed a Python package ZMCintegral for multi-dimensional Monte Carlo integration on multiple Graphics Processing Units(GPUs). The package employs a stratified sampling and heuristic tree search algorithm. We have built two versions of this package: one with Tensorflow and another with Numba, both support general user defined functions with a user-friendly interface. We have […]
Feb, 24

Worst-Case Execution Time Guarantees for Runtime-Reconfigurable Architectures

Real-time systems are ubiquitous in our everyday life, e.g., in safety-critical domains such as automotive, avionics or robotics. The correctness of a real-time system does not only depend on the correctness of its calculations, but also on the non-functional requirement of adhering to deadlines. Failing to meet a deadline may lead to severe malfunctions, therefore […]

* * *

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us: