high performance computing on graphics processing units: hgpu.org

Posts

Dec, 12

High performance computing on Android devices – a case study

High performance computing for low power devices can be useful to speed up calculations on processors that use a lower clock rate than computers for which energy efficiency is not an issue. In this trial, different high performance techniques for Android devices have been compared, with a special focus on the use of the GPU. […]

OpenCL

Dec, 12

GPU backed Data Mining on Android Devices

Choosing an appropriate programming paradigm for high-performance computing on low-power devices can be useful to speed up calculations. Many Android devices have an integrated GPU and – although not officially supported – the OpenCL framework can be used on Android devices for addressing these GPUs. OpenCL supports thread and data parallelism. Applications that use the […]

OpenCL

Dec, 5

Analysis and Comparison of Performance and Power Consumption of Neural Networks on CPU, GPU, TPU and FPGA

In this work, we analyze the performance of neural networks on a variety of heterogenous platforms. We strive to find the best platform in terms of raw benchmark performance, performance per watt and performance per Euro. To reach this goal, we focused on convolutional neural networks and created several micro- and macrobenchmark applications and used […]

Dec, 5

Autotuning CUDA: Applying NLP Techniques to LS-CAT

The abstract relation between hardware parameters and program performance makes setting program parameters a difficult task. Without autotuning, software can miss low-level optimizations, resulting in lower performance. Traditionally, time-consuming trial and error search methods have been the staple of autotuning. Applying Natural language processing (NLP) based machine learning (ML) methods to source code as a […]

CUDA

Dec, 5

Bayesian Optimization for auto-tuning GPU kernels

Finding optimal parameter configurations for tunable GPU kernels is a non-trivial exercise for large search spaces, even when automated. This poses an optimization task on a non-convex search space, using an expensive to evaluate function with unknown derivative. These characteristics make a good candidate for Bayesian Optimization, which has not been applied to this problem […]

CUDA

•

OpenCL

Dec, 5

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations

We present Atos, a task-parallel GPU dynamic scheduling framework that is especially suited to dynamic irregular applications. Compared to the dominant Bulk Synchronous Parallel (BSP) frameworks, Atos exposes additional concurrency by supporting task-parallel formulations of applications with relaxed dependencies, achieving higher GPU utilization, which is particularly significant for problems with concurrency bottlenecks. Atos also offers […]

CUDA

Dec, 5

An Auto-Programming Approach to Vulkan

We propose a novel high-level approach for software development on GPU using Vulkan API. Our goal is to speed-up development and performance studies for complex algorithms on GPU, which is quite difficult and laborious for Vulkan due to large number of HW features low level details. The proposed approach uses auto programming to translate ordinary […]

Nov, 28

Concurrency Mapping to FPGAs with OpenCL: A Case Study with a Shallow Water Kernel

FPGAs have been around for over 30 years and are a viable accelerator for compute-intensive workloads on HPC systems. The adoption of FPGAs for scientific applications has been stimulated recently by the emergence of better programming environments such as High-Level Synthesis (HLS) and OpenCL available through the Xilinx SDSoC design tool. The mapping of the […]

OpenCL

Nov, 28

Generating GPU Compiler Heuristics using Reinforcement Learning

GPU compilers are complex software programs with many optimizations specific to target hardware. These optimizations are often controlled by heuristics hand-designed by compiler experts using time- and resource-intensive processes. In this paper, we developed a GPU compiler autotuning framework that uses off-policy deep reinforcement learning to generate heuristics that improve the frame rates of graphics […]

Nov, 28

Predictive Data Race Detection for GPUs

The high degree of parallelism and relatively complicated synchronization mechanisms in GPUs make writing correct kernels difficult. Data races pose one such concurrency correctness challenge, and therefore, effective methods of detecting as many data races as possible are required. Predictive partial order relations for CPU programs aim to expose data races that can be hidden […]

CUDA

Nov, 28

A Variant RSA Acceleration with Parallelization

The standard RSA relies on multiple big-number modular exponentiation operations and longer key-length is required for better protection. This imposes a hefty time penalty for encryption and decryption. In this study, we analyzed and developed an improved parallel algorithm (PMKRSA) based on the idea of splitting the plaintext into multiple chunks and encrypt the chunks […]

CUDA

Nov, 28

HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

Deep neural networks (DNNs) exploit many layers and a large number of parameters to achieve excellent performance. The training process of DNN models generally handles large-scale input data with many sparse features, which incurs high Input/Output (IO) cost, while some layers are compute-intensive. The training process generally exploits distributed computing resources to reduce training time. […]

high performance computing on graphics processing units: hgpu.org

Posts

High performance computing on Android devices – a case study

GPU backed Data Mining on Android Devices

Analysis and Comparison of Performance and Power Consumption of Neural Networks on CPU, GPU, TPU and FPGA

Autotuning CUDA: Applying NLP Techniques to LS-CAT

Bayesian Optimization for auto-tuning GPU kernels

Atos: A Task-Parallel GPU Dynamic Scheduling Framework for Dynamic Irregular Computations

An Auto-Programming Approach to Vulkan

Concurrency Mapping to FPGAs with OpenCL: A Case Study with a Shallow Water Kernel

Generating GPU Compiler Heuristics using Reinforcement Learning

Predictive Data Race Detection for GPUs

A Variant RSA Acceleration with Parallelization

HeterPS: Distributed Deep Learning With Reinforcement Learning Based Scheduling in Heterogeneous Environments

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)