high performance computing on graphics processing units: hgpu.org

Posts

Oct, 24

Monitoring Collective Communication Among GPUs

Communication among devices in multi-GPU systems plays an important role in terms of performance and scalability. In order to optimize an application, programmers need to know the type and amount of the communication happening among GPUs. Although there are prior works to gather this information in MPI applications on distributed systems and multi-threaded applications on […]

CUDA

Oct, 24

Principal Kernel Analysis: A Tractable Methodology to Simulate Scaled GPU Workloads

Simulating all threads in a scaled GPU workload results in prohibitive simulation cost. Cycle-level simulation is orders of magnitude slower than native silicon, so the only solution is to reduce the amount of work simulated while accurately representing the program. Existing solutions to simulate GPU programs either scale the input size, simulate the first several billion […]

CUDA

Oct, 24

OMB-Py: Python Micro-Benchmarks for Evaluating Performance of MPI Libraries on HPC Systems

Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides an easy-to-use programming interface while allowing library developers to enhance the performance of their applications by harnessing the computing power offered by High Performance Computing (HPC) platforms. […]

CUDA

Oct, 24

Least Squares on GPUs in Multiple Double Precision

This paper describes the application of the code generated by the CAMPARY software to accelerate the solving of linear systems in the least squares sense on Graphics Processing Units (GPUs), in double double, quad double, and octo double precision. The goal is to use accelerators to offset the cost overhead caused by multiple double precision […]

CUDA

Oct, 17

Homomorphic-Encrypted Volume Rendering

Computationally demanding tasks are typically calculated in dedicated data centers, and real-time visualizations also follow this trend. Some rendering tasks, however, require the highest level of confidentiality so that no other party, besides the owner, can read or see the sensitive data. Here we present a direct volume rendering approach that performs volume rendering directly […]

Oct, 17

Accelerating LBM on a Tightly-Coupled Field Programmable Gate Array

With the end of Dennard Scaling and the imminent end of Moore’s Law, the search for new ways to improve performance in computing systems is increasing. Nowadays, the main approach is to offload the application to hardware accelerators. However, while this is a power-efficient approach, the development process for such accelerators is costly and time-consuming. In this […]

OpenCL

Oct, 17

Accelerating AutoDock VINA with GPUs

AutoDock VINA is one of the most-used docking tools in the early stage of modern drug discovery. It uses a Monte Carlo-based iterated search method and a multithreading parallelism scheme on multicore machines to improve docking accuracy and speed. However, virtual screening from huge compound databases is common for modern drug discovery, which puts forward a […]

OpenCL

Oct, 17

Artificial Intelligence in Electric Machine Drives: Advances and Trends

This review paper systematically summarizes the existing literature on applying classical AI techniques and advanced deep learning algorithms to electric machine drives. It is anticipated that with the rapid progress in deep learning models and embedded hardware platforms, AI-based data-driven approaches will become increasingly popular for the automated high-performance control of electric machines. Additionally, this […]

Oct, 17

Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure

Enterprises and labs performing computationally expensive data science applications sooner or later face the problem of scale but unconnected infrastructure. For this up-scaling process, an IT service provider can be hired or in-house personnel can attempt to implement a software stack. The first option can be quite expensive if it is just about connecting several […]

CUDA

Oct, 10

AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference

Leveraging parallel hardware (e.g., GPUs) to conduct deep neural network (DNN) training and inference, though it significantly speeds up computation, raises several data privacy concerns. Trusted execution environments (TEEs) have emerged as a promising solution to enable privacy-preserving inference and training. TEEs, however, have limited memory and computation resources, which renders them not comparable to untrusted parallel […]

Oct, 10

Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations

Reinforcement Learning (RL) has achieved significant success in application domains such as robotics, games, health care and others. However, training RL agents is very time consuming. Current implementations exhibit poor performance due to challenges such as irregular memory accesses and synchronization overheads. In this work, we propose a framework for generating scalable reinforcement learning implementations […]

Oct, 10

GCN Inference Acceleration using High-Level Synthesis

GCN (Graph Convolutional Network) has become a promising solution for many applications, such as recommendation systems and social data mining. Many of these applications require low-latency GCN inference. In this paper, we provide a case study of GCN inference acceleration on FPGA. We explore a high-level synthesis programming model to achieve low-latency inference. First, […]

OpenCL

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA
