Posts
Apr 16
Understanding Performance Portability of Bioinformatics Applications in SYCL on an NVIDIA GPU
Our goal is to better understand the performance portability of SYCL kernels on a GPU. Toward this goal, we migrate representative kernels of bioinformatics applications from CUDA to SYCL, evaluate their performance on an NVIDIA GPU, and explain the performance gaps through profiling and analysis. We hope that the findings provide valuable […]
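As a hedged illustration of the kind of migration the post studies, the sketch below ports a trivial, hypothetical CUDA vector-add kernel to SYCL 2020 using unified shared memory; it is not one of the paper's bioinformatics kernels, only the general shape of the translation.

```cpp
#include <sycl/sycl.hpp>
#include <iostream>

// CUDA original, for reference:
//   __global__ void vadd(const float* a, const float* b, float* c, int n) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       if (i < n) c[i] = a[i] + b[i];
//   }

int main() {
    constexpr int n = 1 << 20;
    sycl::queue q;  // default device, e.g. an NVIDIA GPU via the CUDA backend

    // USM shared allocations take the place of cudaMallocManaged.
    float* a = sycl::malloc_shared<float>(n, q);
    float* b = sycl::malloc_shared<float>(n, q);
    float* c = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // A parallel_for over a 1D range replaces the <<<grid, block>>> launch;
    // the runtime chooses the work-group geometry.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    std::cout << "c[0] = " << c[0] << "\n";  // expect 3
    sycl::free(a, q); sycl::free(b, q); sycl::free(c, q);
}
```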
Apr 16
Kernel Tuning Toolkit
Kernel Tuning Toolkit (KTT) is an autotuning framework for CUDA, OpenCL and Vulkan kernels. KTT provides advanced autotuning features, such as support for both dynamic (online) and offline tuning, and the ability to tune multiple kernels together with shared tuning parameters. Furthermore, it offers customization features that make integration into larger software suites possible. The […]
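KTT's actual API is not reproduced here; as a minimal hand-rolled sketch of the offline-tuning idea such frameworks automate, the loop below times a stand-in compute routine across a small space of tile sizes and keeps the fastest configuration. All names are illustrative.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// Stand-in "kernel": a tiled matrix multiply whose runtime depends on the
// tuning parameter `tile`, playing the role of a CUDA/OpenCL kernel.
static void matmul_tiled(const std::vector<float>& A, const std::vector<float>& B,
                         std::vector<float>& C, int n, int tile) {
    for (int ii = 0; ii < n; ii += tile)
        for (int kk = 0; kk < n; kk += tile)
            for (int i = ii; i < std::min(ii + tile, n); ++i)
                for (int k = kk; k < std::min(kk + tile, n); ++k)
                    for (int j = 0; j < n; ++j)
                        C[i * n + j] += A[i * n + k] * B[k * n + j];
}

int main() {
    const int n = 512;
    std::vector<float> A(n * n, 1.0f), B(n * n, 1.0f), C(n * n);

    int best_tile = 0;
    double best_ms = 1e30;
    for (int tile : {8, 16, 32, 64, 128}) {  // the tuning space
        std::fill(C.begin(), C.end(), 0.0f);
        auto t0 = std::chrono::steady_clock::now();
        matmul_tiled(A, B, C, n, tile);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        std::printf("tile=%3d  %8.2f ms\n", tile, ms);
        if (ms < best_ms) { best_ms = ms; best_tile = tile; }
    }
    std::printf("best configuration: tile=%d\n", best_tile);
}
```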
Apr 16
Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators
Over the last decade, most of the increase in computing power has come from advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance on various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, introduced […]
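For readers unfamiliar with the model, a minimal, generic OpenMP offloading example (not the paper's benchmark code) looks like this: the `target teams distribute parallel for` construct launches the loop on an attached accelerator, and `map` clauses describe the data movement.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    float* xp = x.data();
    float* yp = y.data();

    // Offload the loop to an accelerator if one is present; otherwise the
    // runtime falls back to the host. The map() clauses move the arrays.
    #pragma omp target teams distribute parallel for \
        map(to: xp[0:n]) map(tofrom: yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] += 2.0f * xp[i];

    std::printf("y[0] = %f\n", yp[0]);  // expect 4.0
}
```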
Apr 16
Energy-Efficient GPU Clusters Scheduling for Deep Learning
Training deep neural networks (DNNs) is a major workload in data centers today, resulting in tremendously fast growth in energy consumption. It is important to reduce energy consumption while still completing DL training jobs early. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average Job Completion […]
Apr 16
ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels
GPU-based HPC clusters are attracting more scientific application developers due to their extensive parallelism and energy efficiency. To achieve portability across a variety of multi-/many-core architectures, a popular choice for an application developer is to use directive-based parallel programming models such as OpenMP. However, even with OpenMP, the developer must choose from […]
Apr 2
ytopt: Autotuning Scientific Applications for Energy Efficiency at Large Scales
As we enter the exascale computing era, efficiently utilizing power and optimizing the performance of scientific applications under power and energy constraints have become both critical and challenging. We propose a low-overhead autotuning framework to autotune performance and energy for various hybrid MPI/OpenMP scientific applications at large scales and to explore the tradeoffs between application runtime […]
Apr 2
Task parallelism-based architectures on FPGA to optimize the energy efficiency of AI at the edge
In the world of artificial intelligence (AI) at the edge, energy efficiency must be the primary concern in how we approach deep neural network (DNN) applications. In many applications the speed of obtaining an inference can be critical, but many applications easily meet their time requirements, and the energy needed to calculate the […]
Apr 2
Managing heterogeneous device memory using C++17 memory resources
Programmers using the C++ programming language are increasingly taught to manage memory implicitly through containers provided by the C++ standard library. However, heterogeneous programming platforms often require explicit allocation and deallocation of memory. This discrepancy in memory management strategies can be daunting and problematic for C++ developers who are not already familiar with heterogeneous programming. […]
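A minimal sketch of the C++17 machinery in question (standard `std::pmr`, not the paper's own library): a custom `memory_resource`, here a hypothetical logging resource standing in for one that would call a device allocator, is plugged into an ordinary `std::pmr::vector`, so the container code itself never changes.

```cpp
#include <cstddef>
#include <cstdio>
#include <memory_resource>
#include <vector>

// Toy memory_resource that forwards to new/delete but logs each call,
// standing in for a resource that would allocate device or pinned memory.
class logging_resource : public std::pmr::memory_resource {
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        std::printf("allocate %zu bytes\n", bytes);
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }
    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::printf("deallocate %zu bytes\n", bytes);
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }
    bool do_is_equal(const std::pmr::memory_resource& o) const noexcept override {
        return this == &o;
    }
};

int main() {
    logging_resource device_like;
    // Container code stays ordinary C++; only the upstream resource changes.
    std::pmr::vector<float> v(&device_like);
    v.resize(1024);
}
```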
Apr 2
PopSparse: Accelerated block sparse matrix multiplication on IPU
Reducing the computational cost of running large-scale neural networks using sparsity has attracted great attention in the deep learning community. While much success has been achieved in reducing FLOP and parameter counts while maintaining acceptable task performance, achieving actual speed improvements has typically been much more difficult, particularly on general-purpose accelerators (GPAs) such […]
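As a generic, hedged illustration of why block sparsity helps (a plain-CPU sketch, not PopSparse or IPU code): in a block-CSR layout only the nonzero dense blocks are stored, so each inner loop is a small dense micro-kernel rather than scattered scalar accesses.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Minimal block-CSR (BSR) sparse-matrix-times-vector product, y = A * x.
struct Bsr {
    int n_block_rows, b;                // block grid height, block size
    std::vector<int> row_ptr, col_idx;  // CSR structure over blocks
    std::vector<float> vals;            // nonzero blocks, row-major, b*b each
};

void bsr_matvec(const Bsr& A, const std::vector<float>& x, std::vector<float>& y) {
    for (int br = 0; br < A.n_block_rows; ++br)
        for (int k = A.row_ptr[br]; k < A.row_ptr[br + 1]; ++k) {
            const float* blk = &A.vals[(std::size_t)k * A.b * A.b];
            int col0 = A.col_idx[k] * A.b;
            // Dense b x b micro-kernel over the stored block only.
            for (int i = 0; i < A.b; ++i)
                for (int j = 0; j < A.b; ++j)
                    y[br * A.b + i] += blk[i * A.b + j] * x[col0 + j];
        }
}

int main() {
    // One nonzero 2x2 block at block position (0, 1) in a 2x2 block grid.
    Bsr A{2, 2, {0, 1, 1}, {1}, {1, 2, 3, 4}};
    std::vector<float> x{1, 1, 1, 1}, y(4, 0.0f);
    bsr_matvec(A, x, y);
    std::printf("y = %g %g %g %g\n", y[0], y[1], y[2], y[3]);  // 3 7 0 0
}
```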
Apr 2
Pgx: Hardware-accelerated parallel game simulation for reinforcement learning
We propose Pgx, a collection of board game simulators written in JAX. Thanks to JAX's auto-vectorization and just-in-time compilation, Pgx scales easily to thousands of parallel executions on GPU/TPU accelerators. We found that Pgx simulation on a single A100 GPU is 10x faster than that of existing reinforcement learning libraries. Pgx implements […]
Mar 26
Comparing SYCL data transfer strategies for tracking use cases
The aim of this work is to compare the performance and ease of programming of the various data transfer strategies provided by SYCL 2020: buffers/accessors on the one hand, and the different storage types exposed by Unified Shared Memory (USM) on the other. We measured the relative performance of USM exclusively located either on the […]
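A minimal sketch of the two SYCL 2020 strategies being compared, applied to a toy scaling kernel (the kernel itself is an assumed example): buffers/accessors let the runtime schedule transfers implicitly, while device USM makes every transfer an explicit `memcpy`.

```cpp
#include <sycl/sycl.hpp>
#include <vector>

int main() {
    constexpr size_t n = 1024;
    sycl::queue q;
    std::vector<float> host(n, 1.0f);

    // Strategy 1: buffers/accessors -- the runtime schedules transfers.
    {
        sycl::buffer<float, 1> buf(host.data(), sycl::range<1>(n));
        q.submit([&](sycl::handler& h) {
            sycl::accessor acc(buf, h, sycl::read_write);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { acc[i] *= 2.0f; });
        });
    }  // buffer destructor synchronizes and writes data back to host

    // Strategy 2: device USM -- every transfer is an explicit memcpy.
    float* dev = sycl::malloc_device<float>(n, q);
    q.memcpy(dev, host.data(), n * sizeof(float)).wait();
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) { dev[i] *= 2.0f; }).wait();
    q.memcpy(host.data(), dev, n * sizeof(float)).wait();
    sycl::free(dev, q);
}
```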
Mar 26
E2C: A Visual Simulator to Reinforce Education of Heterogeneous Computing Systems
With the increasing popularity of accelerator technologies (e.g., GPUs and TPUs) and the emergence of domain-specific computing via ASICs and FPGAs, heterogeneity and its ramifications for performance have become more critical to understand than ever before. However, it is challenging to effectively educate students about the potential impacts of heterogeneity on the […]