high performance computing on graphics processing units: hgpu.org

Posts

Mar, 6

Integrating SkePU’s algorithmic skeletons with GPI on a cluster

As processors’ clock-speed flattened out in the early 2000s, multi-core processors became more prevalent and so did parallel programming. However this programming paradigm introduces additional complexities, and to combat this, the SkePU framework was created. SkePU does this by offering a single-threaded interface which executes the user’s code in parallel in accordance to a chosen […]

CUDA

•

OpenCL

Mar, 6

Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs

Mobile GPU, as a ubiquitous and powerful accelerator, plays an important role in accelerating on-device DNN (Deep Neural Network) inference. The frequent-upgrade and diversity of mobile GPUs require automatic kernel generation to empower fast DNN deployment. However, current generated kernels have poor performance. The goal of this paper is to rapidly generate high-performance kernels for […]

OpenCL

Mar, 6

Query Processing on Tensor Computation Runtimes

The huge demand for computation in artificial intelligence (AI) is driving unparalleled investments in new hardware and software systems for AI. This leads to an explosion in the number of specialized hardware devices, which are now part of the offerings of major cloud providers. Meanwhile, by hiding the low-level complexity through a tensor-based interface, tensor […]

CUDA

Mar, 6

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

Protein structure prediction is an important method for understanding gene translation and protein function in the domain of structural biology. AlphaFold introduced the Transformer model to the field of protein structure prediction with atomic accuracy. However, training and inference of the AlphaFold model are time-consuming and expensive because of the special performance characteristics and huge […]

CUDA

Mar, 6

Enabling On-Device Smartphone GPU based Training: Lessons Learned

Deep Learning (DL) has shown impressive performance in many mobile applications. Most existing works have focused on reducing the computational and resource overheads of running Deep Neural Networks (DNN) inference on resource-constrained mobile devices. However, the other aspect of DNN operations, i.e. training (forward and backward passes) on smartphone GPUs, has received little attention thus […]

OpenCL

Feb, 20

gLBM: A GPU enabled Lattice Boltzmann Method Library

Lattice Boltzmann Methods (LBM) are a class of computational fluid dynamics (CFD) algorithms for simulation. Unlike traditional formulations that simulate fluid dynamics on a macroscopic level with a mesh, the LBM characterizes the problem on a mesoscopic level applied to a grid discretization. LBM solves the fluid density problem with collide and stream (relaxation) processes. […]

CUDA

Feb, 20

A ML-based resource utilization OpenCL GPU-kernel fusion model

Massive data parallelism can be achieved by using general-purpose graphics processing units (GPGPU) with the help of the OpenCL framework. When smaller data with higher GPU memory is executed, it results in a low resource utilization ratio and energy inefficiencies. Up until now, there is no existing model to share GPU for further execution. In […]

OpenCL

Feb, 20

A Comprehensive Benchmark of Deep Learning Libraries on Mobile Devices

Deploying deep learning (DL) on mobile devices has been a notable trend in recent years. To support fast inference of on-device DL, DL libraries play a critical role as algorithms and hardware do. Unfortunately, no prior work ever dives deep into the ecosystem of modern DL libs and provides quantitative results on their performance. In […]

OpenCL

•

OpenGL

Feb, 20

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Sparse Matrix-Matrix Multiplication (SpMM) has served as fundamental components in various domains. Many previous studies exploit GPUs for SpMM acceleration because GPUs provide high bandwidth and parallelism. We point out that a static design does not always improve the performance of SpMM on different input data (e.g., >85% performance loss with a single algorithm). In […]

CUDA

Feb, 20

Lightning: Scaling the GPU Programming Model Beyond a Single GPU

The GPU programming model is primarily designed to support the development of applications that run on one GPU. However, just a single GPU is limited in its capabilities in terms of memory capacity and compute power. To handle large problems that exceed these capabilities, one must rewrite application code to manually transfer data between GPU […]

CUDA

•

OpenCL

Feb, 13

Electrical-Level Attacks on CPUs, FPGAs, and GPUs: Survey and Implications in the Heterogeneous Era

Given the need for efficient high-performance computing, computer architectures combining CPUs, GPUs, and FPGAs are nowadays prevalent. However, each of these components suffers from electrical-level security risks. Moving to heterogeneous systems, with the potential of multitenancy, it is essential to understand and investigate how the security vulnerabilities of individual components may affect the system as […]

Feb, 13

Pattern-based Programming Abstractions for Heterogeneous Parallel Computing

Contemporary computer architectures utilize wide multi-core processors, accelerators such as GPUs, and clustering of individual computers into complex large-scale systems. These hardware trends are prevalent across computers of all sizes, from the largest supercomputers down to the smallest mobile phones. While these innovations provide high peak computing performance, software developers find it increasingly difficult to […]

CUDA

•

OpenCL

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Integrating SkePU’s algorithmic skeletons with GPI on a cluster

Romou: Rapidly Generate High-Performance Tensor Kernels for Mobile GPUs

Query Processing on Tensor Computation Runtimes

FastFold: Reducing AlphaFold Training Time from 11 Days to 67 Hours

Enabling On-Device Smartphone GPU based Training: Lessons Learned

gLBM: A GPU enabled Lattice Boltzmann Method Library

A ML-based resource utilization OpenCL GPU-kernel fusion model

A Comprehensive Benchmark of Deep Learning Libraries on Mobile Devices

Heuristic Adaptability to Input Dynamics for SpMM on GPUs

Lightning: Scaling the GPU Programming Model Beyond a Single GPU

Electrical-Level Attacks on CPUs, FPGAs, and GPUs: Survey and Implications in the Heterogeneous Era

Pattern-based Programming Abstractions for Heterogeneous Parallel Computing

Recent source codes

QArray

Celerity: High-level C++ for Accelerator Clusters

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 second

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Optical flow algorithms for SYCL

OpenMP5-Offload-OpenMC-Intel-PVC

Most viewed papers (last 30 days)