high performance computing on graphics processing units: hgpu.org

Posts

Oct, 6

Syntix: A Profiling Based Resource Estimator for CUDA Kernels

Trending applications such as AI and data analytics have mandated the use of GPUs in modern datacenters for performance reasons. Current practice dictates to dedicate GPUs to applications, which limits the amount of concurrent users to the available GPUs. That use of GPUs contradicts with the policy of datacenters to oversubscribe resources and accommodate as […]

CUDA

Oct, 6

waLBerla: A block-structured high-performance framework for multiphysics simulations

Programming current supercomputers efficiently is a challenging task. Multiple levels of parallelism on the core, on the compute node, and between nodes need to be exploited to make full use of the system. Heterogeneous hardware architectures with accelerators further complicate the development process. waLBerla addresses these challenges by providing the user with highly efficient building […]

CUDA

Oct, 6

MIOpen: An Open Source Library For Deep Learning Primitives

Deep Learning has established itself to be a common occurrence in the business lexicon. The unprecedented success of deep learning in recent years can be attributed to: abundance of data, availability of gargantuan compute capabilities offered by GPUs, and adoption of open-source philosophy by the researchers and industry. Deep neural networks can be decomposed into […]

OpenCL

Sep, 29

Exascale Deep Learning for Scientific Inverse Problems

We introduce novel communication strategies in synchronous distributed Deep Learning consisting of decentralized gradient reduction orchestration and computational graph-aware grouping of gradient tensors. These new techniques produce an optimal overlap between computation and communication and result in near-linear scaling (0.93) of distributed training up to 27,600 NVIDIA V100 GPUs on the Summit Supercomputer. We demonstrate […]

Sep, 29

Futhark Vulkan Backend

This paper describes the effort, challenges, and limitations involved in the implementation of a Futhark compiler variant using the Vulkan API version 1.1 for compiling Futhark programs targeting GPUs. Compared to the existing OpenCL backend with the same purpose, the more modern Vulkan API could offer some performance benefits and may extend the scope of […]

OpenCL

Sep, 29

Heterogeneous Resource-Elastic Management for FPGAs: Concepts, Theory and Implementation

Despite deployment of FPGAs at the edge and cloud data centers due to their performance and energy advantage, FPGA runtime systems commonly tend to support only one-application-at-a-time and cannot adapt to dynamic workloads with reasonable response times. Therefore, this paper proposes the concepts and theory of resource elasticity for FPGA systems to allow a task […]

OpenCL

Sep, 29

Elastic deep learning in multi-tenant GPU cluster

Multi-tenant GPU clusters are common nowadays due to the huge success of deep learning and training jobs are usually conducted with multiple distributed GPUs. These GPU clusters are managed with various goals including short JCT, high resource utilization and quick response to small jobs. In this paper, we show that elasticity, which is the ability […]

Sep, 29

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point further model increases become harder due to GPU/TPU memory limitations, longer training times, and unexpected model degradation. To address these problems, we present two parameter-reduction techniques to lower memory consumption and increase the training speed […]

Sep, 22

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Recent work in unsupervised language modeling demonstrates that training large neural language models advances the state of the art in Natural Language Processing applications. However, for very large models, memory constraints limit the size of models that can be practically trained. Model parallelism allows us to train larger models, because the parameters can be split […]

Sep, 22

Performance and Power Evaluation of AI Accelerators for Training Deep Learning Models

Deep neural networks (DNNs) have become widely used in many AI applications. Yet, training a DNN requires a huge amount of calculations and it takes a long time and energy to train a satisfying model. Nowadays, many-core AI accelerators (e.g., GPUs and TPUs) play a key role in training DNNs. However, different many-core processors from […]

Sep, 22

Model-Based Warp-Level Tiling for Image Processing Programs on GPUs

The efficient execution of image processing pipelines on GPUs is an area of active research. The state-of-art involves 1) dividing portions of an image into overlapped tiles, where each tile can be processed by a single thread block and 2) fusing loops together to improve memory locality. However, the state-of-the-art has two limitations: 1) synchronization […]

CUDA

Sep, 22

ALPyNA: Acceleration of Loops in Python for Novel Architectures

We present ALPyNA, an automatic loop parallelization framework for Python, which analyzes data dependences within nested loops and dynamically generates CUDA kernels for GPU execution. The ALPyNA system applies classical dependence analysis techniques to discover and exploit potential parallelism. The skeletal structure of the dependence graph is determined statically (if possible) or at runtime; this […]

CUDA