high performance computing on graphics processing units: hgpu.org

Posts

Oct, 16

Distributed, combined CPU and GPU profiling within HPX using APEX

Benchmarking and comparing performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is constructed with an asynchronous, many-task (AMT) runtime offloading work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture […]

CUDA

Oct, 16

Dataloader Parameter Tuner: An Automated Dataloader Parameter Tuner for Deep Learning Models

Deep learning has recently become one of the most compute/data-intensive methods and is widely used in many research areas and businesses. One of the critical challenges of deep learning is that it has many parameters that can be adjusted, and the optimal value may need to be determined for faster operation and high accuracy. The […]

CUDA

Oct, 16

PMT: Power Measurement Toolkit

Efficient use of energy is essential for today’s supercomputing systems, as energy cost is generally a major component of their operational cost. Research into "green computing" is needed to reduce the environmental impact of running these systems. As such, several scientific communities are evaluating the trade-off between time-to-solution and energy-to-solution. While the runtime of an […]

Oct, 16

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU

Dynamic graph neural network (DGNN) is becoming increasingly popular because of its widespread use in capturing dynamic features in the real world. A variety of dynamic graph neural networks designed from algorithmic perspectives have succeeded in incorporating temporal information into graph processing. Despite the promising algorithmic performance, deploying DGNNs on hardware presents additional challenges due […]

CUDA

Oct, 9

Towards Performance Portable Programming for Distributed Heterogeneous Systems

Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware in the future. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures. […]

CUDA

•

OpenCL

Oct, 9

Decompiling x86 Deep Neural Network Executables

Due to their widespread use on heterogeneous hardware devices, deep learning (DL) models are compiled into executables by DL compilers to fully leverage low-level hardware primitives. This approach allows DL computations to be undertaken at low cost across a variety of computing platforms, including CPUs, GPUs, and various hardware accelerators. We present BTD (Bin to […]

CUDA

•

OpenCL

Oct, 9

Benchmarking optimization algorithms for auto-tuning GPU kernels

Recent years have witnessed phenomenal growth in the application, and capabilities of Graphical Processing Units (GPUs) due to their high parallel computation power at relatively low cost. However, writing a computationally efficient GPU program (kernel) is challenging, and generally only certain specific kernel configurations lead to significant increases in performance. Auto-tuning is the process of […]

CUDA

•

OpenCL

Oct, 9

Performance portability study of epistasis detection using SYCL on NVIDIA GPU

We describe the experience of converting a CUDA implementation of a high-order epistasis detection algorithm to SYCL. The goals are for our work to be useful to application and compiler developers with a detailed description of migration paths between CUDA and SYCL. Evaluating the CUDA and SYCL applications on an NVIDIA V100 GPU, we find […]

CUDA

Oct, 9

cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs

Zero-knowledge proof (ZKP) is a critical cryptographic protocol, and it has been deployed in various privacy-preserving applications such as cryptocurrencies and verifiable machine learning. Unfortunately, ZKP has a high overhead on its proof generation step, which consists of several time-consuming operations, including large-scale matrix-vector multiplication (MUL), number-theoretic transform (NTT), and multi-scalar multiplication (MSM) on elliptic […]

CUDA

Oct, 2

An OpenCL-Based FPGA Accelerator for Faster R-CNN

In recent years, convolutional neural network (CNN)-based object detection algorithms have made breakthroughs, and much of the research corresponds to hardware accelerator designs. Although many previous works have proposed efficient FPGA designs for one-stage detectors such as Yolo, there are still few accelerator designs for faster regions with CNN features (Faster R-CNN) algorithms. Moreover, CNN’s […]

OpenCL

Oct, 2

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems

Sparse linear algebra kernels play a critical role in numerous applications, covering from exascale scientific simulation to large-scale data analytics. Offloading linear algebra kernels on one GPU will no longer be viable in these applications, simply because the rapidly growing data volume may exceed the memory capacity and computing power of a single GPU. Multi-GPU […]

CUDA

Oct, 2

Efficient Quantized Sparse Matrix Operations on Tensor Cores

The exponentially growing model size drives the continued success of deep learning, but it brings prohibitive computation and memory cost. From the algorithm perspective, model sparsification and quantization have been studied to alleviate the problem. From the architecture perspective, hardware vendors provide Tensor cores for acceleration. However, it is very challenging to gain practical speedups […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Distributed, combined CPU and GPU profiling within HPX using APEX

Dataloader Parameter Tuner: An Automated Dataloader Parameter Tuner for Deep Learning Models

PMT: Power Measurement Toolkit

Bottleneck Analysis of Dynamic Graph Neural Network Inference on CPU and GPU

Towards Performance Portable Programming for Distributed Heterogeneous Systems

Decompiling x86 Deep Neural Network Executables

Benchmarking optimization algorithms for auto-tuning GPU kernels

Performance portability study of epistasis detection using SYCL on NVIDIA GPU

cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs

An OpenCL-Based FPGA Accelerator for Faster R-CNN

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems

Efficient Quantized Sparse Matrix Operations on Tensor Cores

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Most viewed papers (last 30 days)