high performance computing on graphics processing units: hgpu.org

Posts

Oct, 16

PMT: Power Measurement Toolkit

Efficient use of energy is essential for today’s supercomputing systems, as energy cost is generally a major component of their operational cost. Research into "green computing" is needed to reduce the environmental impact of running these systems. As such, several scientific communities are evaluating the trade-off between time-to-solution and energy-to-solution. While the runtime of an […]

Oct, 16

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU

Dynamic graph neural network (DGNN) is becoming increasingly popular because of its widespread use in capturing dynamic features in the real world. A variety of dynamic graph neural networks designed from algorithmic perspectives have succeeded in incorporating temporal information into graph processing. Despite the promising algorithmic performance, deploying DGNNs on hardware presents additional challenges due […]

CUDA

Oct, 9

Towards Performance Portable Programming for Distributed Heterogeneous Systems

Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware in the future. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. […]

CUDA

•

OpenCL

Oct, 9

Decompiling x86 Deep Neural Network Executables

Due to their widespread use on heterogeneous hardware devices, deep learning (DL) models are compiled into executables by DL compilers to fully leverage low-level hardware primitives. This approach allows DL computations to be undertaken at low cost across a variety of computing platforms, including CPUs, GPUs, and various hardware accelerators. We present BTD (Bin to […]

CUDA

•

OpenCL

Oct, 9

Benchmarking optimization algorithms for auto-tuning GPU kernels

Recent years have witnessed phenomenal growth in the application, and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU program (kernel) is challenging, and generally only certain specific kernel configurations lead to significant increases in performance. Auto-tuning is the process of […]

CUDA

•

OpenCL

Oct, 9

Performance portability study of epistasis detection using SYCL on NVIDIA GPU

We describe the experience of converting a CUDA implementation of a high-order epistasis detection algorithm to SYCL. The goals are for our work to be useful to application and compiler developers with a detailed description of migration paths between CUDA and SYCL. Evaluating the CUDA and SYCL applications on an NVIDIA V100 GPU, we find […]

CUDA

Oct, 9

cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs

Zero-knowledge proof (ZKP) is a critical cryptographic protocol, and it has been deployed in various privacy-preserving applications such as cryptocurrencies and verifiable machine learning. Unfortunately, ZKP has a high overhead on its proof generation step, which consists of several time-consuming operations, including large-scale matrix-vector multiplication (MUL), number-theoretic transform (NTT), and multi-scalar multiplication (MSM) on elliptic […]

CUDA

Oct, 2

An OpenCL-Based FPGA Accelerator for Faster R-CNN

In recent years, convolutional neural network (CNN)-based object detection algorithms have made breakthroughs, and much of the research corresponds to hardware accelerator designs. Although many previous works have proposed efficient FPGA designs for one-stage detectors such as Yolo, there are still few accelerator designs for faster regions with CNN features (Faster R-CNN) algorithms. Moreover, CNN’s […]

OpenCL

Oct, 2

Efficient Quantized Sparse Matrix Operations on Tensor Cores

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups […]

CUDA

Oct, 2

Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform

This paper assesses and reports the experience of eleven application teams working to build, validate, and benchmark several HPC applications on a novel GPU-accelerated Arm testbed. The testbed consists of the latest, at time of writing, Arm Devkits from NVIDIA with server-class Arm CPUs and NVIDIA A100 GPUs. The applications and mini-apps are written using […]

CUDA

Oct, 2

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems

Sparse linear algebra kernels play a critical role in numerous applications, covering from exascale scientific simulation to large-scale data analytics. Offloading linear algebra kernels on one GPU will no longer be viable in these applications, simply because the rapidly growing data volume may exceed the memory capacity and computing power of a single GPU. Multi-GPU […]

CUDA

Oct, 2

Exploiting dynamic sparse matrices for performance portable linear algebra operations

Sparse matrices and linear algebra are at the heart of scientific simulations. More than 70 sparse matrix storage formats have been developed over the years, targeting a wide range of hardware architectures and matrix types. Each format is developed to exploit the particular strengths of an architecture, or the specific sparsity patterns of matrices, and […]

CUDA

high performance computing on graphics processing units: hgpu.org

Posts

PMT: Power Measurement Toolkit

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU

Towards Performance Portable Programming for Distributed Heterogeneous Systems

Decompiling x86 Deep Neural Network Executables

Benchmarking optimization algorithms for auto-tuning GPU kernels

Performance portability study of epistasis detection using SYCL on NVIDIA GPU

cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs

An OpenCL-Based FPGA Accelerator for Faster R-CNN

Efficient Quantized Sparse Matrix Operations on Tensor Cores

Early Application Experiences on a Modern GPU-Accelerated Arm-based HPC Platform

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems

Exploiting dynamic sparse matrices for performance portable linear algebra operations

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)