Posts

Jul, 18

OpenCL FPGA Optimization guided by memory accesses and roofline model analysis applied to tomography acceleration

Backward projection is one of the most time-consuming steps in method-based iterative-reconstruction computed tomography. The memory access pattern of 3D backprojection is potentially regular enough to efficiently exploit the computational power of acceleration boards based on GPUs or FPGAs. High-level tools such as HLS or OpenCL make it easier to take such particular memory accesses into account during the design […]
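
As a rough illustration of the memory access pattern described above, the following NumPy sketch accumulates parallel-beam projections into a 2D image. It is illustrative only: the geometry, filtering, and the paper's actual OpenCL FPGA kernels are not reproduced, and all names and data are made up.

    import numpy as np

    def backproject(sinogram, angles, n):
        """Accumulate parallel-beam projections into an n x n image."""
        image = np.zeros((n, n))
        xs = np.arange(n) - n / 2.0
        ys = np.arange(n) - n / 2.0
        X, Y = np.meshgrid(xs, ys)
        for proj, theta in zip(sinogram, angles):
            # Detector coordinate of every pixel for this view: the regular,
            # view-dependent gather the abstract refers to.
            t = X * np.cos(theta) + Y * np.sin(theta) + n / 2.0
            idx = np.clip(t.astype(int), 0, proj.size - 1)
            image += proj[idx]
        return image

    # Hypothetical usage with random data: 180 views, 128 detector bins.
    sino = np.random.rand(180, 128)
    img = backproject(sino, np.deg2rad(np.arange(180)), 128)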
Jul, 18

Accelerating Regular-Expression Matching on FPGAs with High-Level Synthesis

The importance of security infrastructures for high-throughput networks has rapidly grown as a result of expanding internet traffic and increasingly high-bandwidth connections. Intrusion-detection systems (IDSs), such as SNORT, rely upon rule sets designed to alert system administrators to malicious packets. Methods for deep-packet inspection, which often depend upon regular-expression searches, can be accelerated on programmable-logic […]
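
For readers unfamiliar with the software side of such rule sets, here is a minimal Python sketch of regular-expression payload scanning. The rule names and payload are invented, and this is a plain CPU baseline, not the HLS/FPGA design the paper describes.

    import re

    # Hypothetical rule set: names and patterns are made up for illustration.
    rules = {
        "suspicious-uri": re.compile(rb"/etc/passwd"),
        "nop-sled": re.compile(rb"\x90{16,}"),
    }

    def scan(payload: bytes):
        """Return the names of all rules whose pattern occurs in the payload."""
        return [name for name, pattern in rules.items() if pattern.search(payload)]

    print(scan(b"GET /etc/passwd HTTP/1.1"))   # ['suspicious-uri']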
Jul, 18

Designing a high-performance boundary element library with OpenCL and Numba

The Bempp boundary element library is a well-known library for the simulation of a range of electrostatic, acoustic and electromagnetic problems in homogeneous bounded and unbounded domains. It originally started as a traditional C++ library with a Python interface. Over the last two years we have completely redesigned Bempp as a native Python library, […]
Jul, 18

Optimisation and GPU code generation of Stencils for Futhark

Stencils are a common computational pattern in scientific computing. Exploiting parallelism is central to achieving faster execution times for stencils that operate on large amounts of data; for this reason, stencils are well suited to a GPGPU setting. However, programming stencils to run on massively parallel hardware […]
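
As a small reminder of what a stencil computation looks like, the sketch below performs a five-point Jacobi sweep in NumPy. Futhark's generated GPU code is structured very differently; the grid size and number of sweeps here are arbitrary.

    import numpy as np

    def jacobi_step(grid):
        """One sweep: each interior cell becomes the mean of its four neighbours."""
        out = grid.copy()
        out[1:-1, 1:-1] = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1] +
                                  grid[1:-1, :-2] + grid[1:-1, 2:])
        return out

    g = np.random.rand(256, 256)
    for _ in range(10):          # ten relaxation sweeps over the whole grid
        g = jacobi_step(g)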
Jul, 18

GPTPU: Accelerating Applications using Edge Tensor Processing Units

Neural network (NN) accelerators have been integrated into a wide spectrum of computer systems to accommodate the rapidly growing demands for artificial intelligence (AI) and machine learning (ML) applications. NN accelerators share the idea of providing native hardware support for operations on multidimensional tensor data. Therefore, NN accelerators are theoretically tensor processors that can improve system […]
Jul, 11

Bringing OpenCL to Commodity RISC-V CPUs

The importance of open-source hardware has been increasing in recent years with the introduction of the RISC-V Open ISA. This has also accelerated the push for support of the open-source software stack from compiler tools to full-blown operating systems. Parallel computing with today’s Application Programming Interfaces such as OpenCL has proven to be effective at […]
Jul, 11

KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks

Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC’s larger memory footprint hinders its applicability to large models. We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific […]
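
To make the memory-footprint trade-off concrete, the schematic NumPy contrast below compares a first-order SGD step with a generic preconditioned update that stores a full curvature matrix. This is not K-FAC or KAISA; it only illustrates, under made-up data, where the extra memory and computation come from.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 500
    w = rng.random(n)                    # parameters
    grad = rng.random(n)                 # gradient from one mini-batch
    lr = 0.01

    # First-order SGD: only the gradient is needed.
    w_sgd = w - lr * grad

    # Second-order style update: an n x n curvature estimate must be stored
    # and solved against, which is where the larger footprint arises.
    M = rng.random((n, n))
    curvature = M @ M.T / n + np.eye(n)  # symmetric positive definite stand-in
    w_precond = w - lr * np.linalg.solve(curvature, grad)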
Jul, 11

Dynamic Adaptation Techniques and Opportunities to Improve HPC Runtimes

Exascale, a new era of computing, is knocking at the door. Leaving behind the days of high-frequency, single-core processors, the new paradigm of multicore/manycore processors in complex heterogeneous systems dominates today’s HPC landscape. With the advent of accelerators and special-purpose processors alongside general processors, the role of high performance computing (HPC) runtime systems has […]
Jul, 11

Block Conjugate Gradient Solver in OpenCL

The conjugate gradient method for solving certain systems of linear equations is widely used due to its iterative nature and fast convergence. At its core, the algorithm consists of simple matrix and vector operations that can be performed in parallel, with potential for significant speedup. With the advent of GPGPU computing and accompanying programming models like OpenCL, […]
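
For reference, a textbook single right-hand-side conjugate gradient iteration is sketched below in NumPy to show the matrix-vector and vector-vector operations the abstract refers to; the block variant and the paper's OpenCL kernels are not reproduced, and the test system is made up.

    import numpy as np

    def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
        x = np.zeros_like(b)
        r = b - A @ x                 # residual
        p = r.copy()                  # search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p                # dominant, highly parallel mat-vec step
            alpha = rs_old / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x

    # Hypothetical symmetric positive definite test system.
    n = 200
    M = np.random.rand(n, n)
    A = M @ M.T + n * np.eye(n)
    b = np.random.rand(n)
    x = conjugate_gradient(A, b)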
Jul, 11

Opening the Black Box: Performance Estimation during Code Generation for GPUs

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance […]
Jul, 4

A Sorting Library for FPGA Implementation in OpenCL Programming

In this study, we focus on data sorting, which is a fundamental operation, and we present a sorting library that can be used with the OpenCL programming model for field-programmable gate arrays (FPGAs). Our sorting library is built by combining three hardware sorting algorithms. It consumes more than twice the overall hardware resources compared […]
Jul, 4

Productivity, Portability, Performance: Data-Centric Python

Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. […]
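
As a tiny example of the productivity the abstract mentions, the snippet below contrasts a vectorised NumPy expression with an equivalent pure-Python loop. This is generic NumPy usage, not the paper's data-centric toolchain, and the arrays are arbitrary.

    import numpy as np

    a = np.random.rand(1_000_000)
    b = np.random.rand(1_000_000)

    # Vectorised expression: dispatched to optimised native code by NumPy.
    c = 2.0 * a + b

    # Equivalent, but far slower, pure-Python loop.
    c_loop = np.empty_like(a)
    for i in range(a.size):
        c_loop[i] = 2.0 * a[i] + b[i]

    assert np.allclose(c, c_loop)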


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org