Posts
Apr, 18
Efficient Large-Scale Language Model Training on GPU Clusters
Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these large models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on a single GPU or even on a multi-GPU server; and b) the number of compute operations required […]
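The memory limit in (a) is what motivates splitting a single model across devices. As a minimal illustration, here is a layer-wise (pipeline-style) partition across two GPUs, assuming PyTorch; the layer sizes and stage boundaries are arbitrary, not the configuration used in the paper.

```python
# Illustrative sketch: place consecutive layers on different GPUs so that
# no single device has to hold all of the model's parameters.
import torch
import torch.nn as nn

stage0 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:1")

x = torch.randn(8, 4096, device="cuda:0")
h = stage0(x)               # first stage runs on GPU 0
y = stage1(h.to("cuda:1"))  # activations move to GPU 1 for the second stage
```

Real training systems pipeline micro-batches through such stages and combine this with data and tensor parallelism to keep every GPU busy.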
Apr, 18
A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning
Recently, Deep Neural Networks (DNNs) have achieved great success in handling medical and other complex classification tasks. However, as the size of a DNN model and of the available dataset grows, the training process becomes more complex and computationally intensive, and usually takes longer to complete. In this work, we propose a generic […]
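For contrast with the layer partitioning sketched above, the other common axis is data parallelism: each device holds a full replica of the model, processes its own shard of the data, and gradients are averaged across replicas; a hybrid approach combines both axes. Below is a minimal sketch of the gradient-averaging step, assuming PyTorch and an already-initialized process group (setup omitted). It is a generic illustration, not the method proposed in the paper.

```python
# Sketch of the data-parallel reduction step: every worker computes
# gradients on its own data shard, then all-reduces them so each
# replica applies the same averaged update.
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world
```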
Apr, 11
Multiple-Tasks on Multiple-Devices (MTMD): Exploiting Concurrency in Heterogeneous Managed Runtimes
Modern commodity systems are equipped with a plethora of heterogeneous devices serving different purposes. Being able to exploit such heterogeneous hardware accelerators to their full potential is of paramount importance in the pursuit of higher performance and energy efficiency. Towards these objectives, the reduction of idle time of each device as well as the […]
Apr, 11
Progressive Semantic Segmentation
The objective of this work is to segment high-resolution images without exhausting GPU memory or losing the fine details in the output segmentation map. The memory constraint means that we must either downsample the big image or divide it into local patches for separate processing. However, the former approach would lose the fine […]
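For concreteness, the patch-based alternative can be sketched as tiling the image, segmenting each tile independently, and stitching the label maps back together. The segment_patch callable below is a hypothetical stand-in for any per-patch model; a real pipeline must also deal with tile borders and the global context that separate processing discards.

```python
# Sketch: segment a large image tile by tile to bound peak GPU memory.
import numpy as np

def segment_tiled(image: np.ndarray, tile: int, segment_patch) -> np.ndarray:
    """Apply segment_patch (patch -> per-pixel label map) to each tile."""
    h, w = image.shape[:2]
    out = np.zeros((h, w), dtype=np.int64)
    for y in range(0, h, tile):
        for x in range(0, w, tile):
            patch = image[y:y + tile, x:x + tile]
            out[y:y + tile, x:x + tile] = segment_patch(patch)
    return out
```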
Apr, 11
Performance Monitoring of Multi-FPGA Systems
Field-Programmable Gate Arrays (FPGAs) have been increasingly deployed in datacenters, and there has been a lot of focus on tools that support the development of FPGA applications. Among the most important tools are performance monitors, which provide visibility into the state of the hardware. As application platforms scale from one FPGA to many FPGAs, […]
Apr, 11
Large Scale GPU Based Simulations of Turbulent Bubbly Flow in a Square Duct
In this paper, we present the results of a numerical study of air-water turbulent bubbly flow in a periodic vertical square duct. The study is conducted using a novel numerical technique that leverages the Volume of Fluid method for interface capturing and the Sharp Surface Force method for accurate representation of the surface tension forces. A three-dimensional […]
Apr, 11
Efficient Video Compression via Content-Adaptive Super-Resolution
Video compression is a critical component of Internet video delivery. Recent work has shown that deep learning techniques can rival or outperform human-designed algorithms, but these methods are significantly less compute- and power-efficient than existing codecs. This paper presents a new approach that augments existing codecs with a small, content-adaptive super-resolution model that significantly boosts […]
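One way to read the idea: the sender encodes a downscaled stream with an existing codec, and the receiver restores resolution with a small super-resolution model. A schematic sketch follows, assuming NumPy; encode, decode, and sr_model are hypothetical placeholders, the naive subsampling stands in for a proper resampling filter, and none of this is the paper's actual pipeline.

```python
# Schematic: augment an existing codec with a super-resolution model.
import numpy as np

def downsample(frame: np.ndarray, factor: int) -> np.ndarray:
    return frame[::factor, ::factor]  # naive subsampling for illustration

def compress(frame, encode, factor=2):
    return encode(downsample(frame, factor))  # the codec sees a smaller frame

def reconstruct(bitstream, decode, sr_model):
    return sr_model(decode(bitstream))  # a small model restores the detail
```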
Apr, 5
An Investigation of Atomic Synchronization for Sort-Based Group-By Aggregation on GPUs
Using heterogeneous processing devices, like GPUs, to accelerate relational database operations is a well-known strategy. In this context, the group-by operation is highly interesting for two reasons. Firstly, it incurs large processing costs. Secondly, its results (i.e., aggregates) are usually small, which reduces data movement costs; compensating for data movement is a major challenge for heterogeneous computing. […]
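As a point of reference for what the aggregation computes, here is a minimal CPU-side sketch of a group-by sum, assuming NumPy. np.add.at performs scattered, unbuffered accumulation into one slot per key, which is the role atomic adds play in a GPU implementation.

```python
# Sketch of group-by sum: accumulate one aggregate per key.
import numpy as np

keys = np.array([2, 0, 1, 0, 2, 2])        # group id per row
vals = np.array([5., 1., 4., 2., 3., 6.])  # value per row

sums = np.zeros(keys.max() + 1)
np.add.at(sums, keys, vals)  # scattered accumulation, like atomicAdd per key
print(sums)                  # [ 3.  4. 14.]
```

A sort-based variant instead sorts the rows by key and reduces contiguous runs, which is the setting in which the paper investigates atomic synchronization.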
Apr, 5
Parallel Arbitrary-precision Integer Arithmetic
Arbitrary-precision integer arithmetic computations are driven by applications in solving systems of polynomial equations and public-key cryptography. Such computations arise when high precision is required (with large input values that fit into multiple machine words), or to avoid coefficient overflow due to intermediate expression swell. Meanwhile, the growing demand for faster computation alongside the recent […]
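To make the "multiple machine words" point concrete, here is a minimal sketch of multi-word addition with carry propagation, using 64-bit limbs stored least-significant first. It illustrates the representation only, not the parallel algorithms such work develops.

```python
# Sketch: add two arbitrary-precision integers stored as lists of 64-bit
# "limbs" (least-significant limb first), propagating the carry.
BASE = 1 << 64

def add_limbs(a: list[int], b: list[int]) -> list[int]:
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        s = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0) + carry
        out.append(s % BASE)
        carry = s // BASE
    if carry:
        out.append(carry)
    return out

print(add_limbs([BASE - 1], [1]))  # (2^64 - 1) + 1 = 2^64, i.e. limbs [0, 1]
```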
Apr, 5
Daisen: A Framework for Visualizing Detailed GPU Execution
Graphics Processing Units (GPUs) have been widely used to accelerate artificial intelligence, physics simulation, medical imaging, and information visualization applications. To improve GPU performance, GPU hardware designers need to identify performance issues by inspecting a huge volume of simulator-generated traces. Visualizing the execution traces can reduce the cognitive burden on users and facilitate making sense […]
Apr, 5
LS-CAT: A Large-Scale CUDA AutoTuning Dataset
The effectiveness of Machine Learning (ML) methods depends on access to large, suitable datasets. In this article, we present how we built the LS-CAT (Large-Scale CUDA AutoTuning) dataset, sourced from GitHub, for the purpose of training NLP-based ML models. Our dataset includes 19,683 CUDA kernels focused on linear algebra. In addition to the CUDA […]
Apr, 5
Energy-aware Task Scheduling with Deadline Constraint in DVFS-enabled Heterogeneous Clusters
Energy conservation in large data centers running high-performance computing workloads, such as deep learning with big data, is of critical significance: cutting electricity consumption by even a few percent translates into million-dollar savings. This work studies energy conservation on emerging CPU-GPU hybrid clusters through dynamic voltage and frequency scaling (DVFS). We aim at minimizing the […]
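The underlying trade-off can be illustrated with a toy model: under voltage scaling, dynamic power grows roughly as the cube of frequency (it scales with V^2 * f, and V scales with f), while runtime shrinks only as 1/f, so energy behaves like work * f^2 and the lowest deadline-feasible frequency wins. The sketch below uses that assumed power model, not the paper's formulation.

```python
# Toy DVFS sketch: pick the lowest frequency whose runtime still meets
# the deadline, assuming dynamic power P(f) ~ f^3.
def pick_frequency(work_cycles: float, deadline_s: float,
                   freqs_hz: list[float]) -> float | None:
    feasible = [f for f in freqs_hz if work_cycles / f <= deadline_s]
    if not feasible:
        return None  # no frequency setting can meet the deadline
    # Energy = P(f) * t ~ f^3 * (work / f) = work * f^2, so lower f wins.
    return min(feasible)

print(pick_frequency(2e9, 1.5, [1.0e9, 1.5e9, 2.0e9]))  # 1.5e9 (1.5 GHz)
```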