high performance computing on graphics processing units: hgpu.org

Posts

Oct, 24

Least Squares on GPUs in Multiple Double Precision

This paper describes the application of the code generated by the CAMPARY software to accelerate the solving of linear systems in the least squares sense on Graphics Processing Units (GPUs), in double double, quad double, and octo double precision. The goal is to use accelerators to offset the cost overhead caused by multiple double precision […]

CUDA

Oct, 17

Homomorphic-Encrypted Volume Rendering

Computationally demanding tasks are typically calculated in dedicated data centers, and real-time visualizations also follow this trend. Some rendering tasks, however, require the highest level of confidentiality so that no other party, besides the owner, can read or see the sensitive data. Here we present a direct volume rendering approach that performs volume rendering directly […]

Oct, 17

Accelerating LBM on a Tightly-Coupled Field Programmable Gate Array

With the end of Dennard Scaling and the imminent end of Moore’s Law, the search for new ways to improve performance in computing systems is increasing. Nowadays, the main approach is to use hardware accelerations to offload the application. However, while this is a power-efficient approach, their development process is costly and time-consuming. In this […]

OpenCL

Oct, 17

Accelerating AutoDock VINA with GPUs

AutoDock VINA is one of the most-used docking tools in the early stage of modern drug discovery. It uses a Monte-Carlo based iterated search method and multithreading parallelism scheme on multicore machines to improve docking accuracy and speed. However, virtual screening from huge compound databases is common for modern drug discovery, which puts forward a […]

OpenCL

Oct, 17

Artificial Intelligence in Electric Machine Drives: Advances and Trends

This review paper systematically summarizes the existing literature on applying classical AI techniques and advanced deep learning algorithms to electric machine drives. It is anticipated that with the rapid progress in deep learning models and embedded hardware platforms, AI-based data-driven approaches will become increasingly popular for the automated high-performance control of electric machines. Additionally, this […]

Oct, 17

Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure

Enterprises and labs performing computationally expensive data science applications sooner or later face the problem of scale but unconnected infrastructure. For this up-scaling process, an IT service provider can be hired or in-house personnel can attempt to implement a software stack. The first option can be quite expensive if it is just about connecting several […]

CUDA

Oct, 10

AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference

Leveraging parallel hardware (e.g. GPUs) to conduct deep neural network (DNN) training/inference, though significantly speeds up the computations, raises several data privacy concerns. Trusted execution environments (TEEs) have emerged as a promising solution to enable privacy-preserving inference and training. TEEs, however, have limited memory and computation resources which renders it not comparable to untrusted parallel […]

Oct, 10

Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations

Reinforcement Learning (RL) has achieved significant success in application domains such as robotics, games, health care and others. However, training RL agents is very time consuming. Current implementations exhibit poor performance due to challenges such as irregular memory accesses and synchronization overheads. In this work, we propose a framework for generating scalable reinforcement learning implementations […]

Oct, 10

GCN Inference Acceleration using High-Level Synthesis

GCN (Graph Convolutional Network) has become a promising solution for many applications, such as recommendation systems, social data mining, etc. Many of these applications requires low latency GCN inference. In this paper, we provide a case study of a GCN inference acceleration on FPGA. We explore high-level synthesis programming model to achieve low-latency inference. First, […]

OpenCL

Oct, 10

Large-eddy simulations with ClimateMachine: a new open-source code for atmospheric simulations on GPUs and CPUs

We introduce ClimateMachine, a new open-source atmosphere modeling framework using the Julia language to be performance portable on central processing units (CPUs) and graphics processing units (GPUs). ClimateMachine uses a common framework both for coarser-resolution global simulations and for high-resolution, limited-area large-eddy simulations (LES). Here, we demonstrate the LES configuration of the atmosphere model in […]

CUDA

Oct, 10

Implementation of Parallel Simplified Swarm Optimization in CUDA

As the acquisition cost of the graphics processing unit (GPU) has decreased, personal computers (PC) can handle optimization problems nowadays. In optimization computing, intelligent swarm algorithms (SIAs) method is suitable for parallelization. However, a GPU-based Simplified Swarm Optimization Algorithm has never been proposed. Accordingly, this paper proposed Parallel Simplified Swarm Optimization (PSSO) based on the […]

CUDA

Oct, 3

HLS Portability from Intel to Xilinx: A Case Study

Field-programmable gate arrays (FPGAs) are a hardware accelerator option that is growing in popularity. However, FPGAs are notoriously hard to program. To this end, high-level synthesis (HLS) tools have been developed to allow programmers to design hardware accelerators with FPGAs using familiar software languages. The two largest FPGA vendors, Intel and Xilinx, support both C/C++ […]

OpenCL

high performance computing on graphics processing units: hgpu.org

Posts

Least Squares on GPUs in Multiple Double Precision

Homomorphic-Encrypted Volume Rendering

Accelerating LBM on a Tightly-Coupled Field Programmable Gate Array

Accelerating AutoDock VINA with GPUs

Artificial Intelligence in Electric Machine Drives: Advances and Trends

Beyond Desktop Computation: Challenges in Scaling a GPU Infrastructure

AsymML: An Asymmetric Decomposition Framework for Privacy-Preserving DNN Training and Inference

Parallel Actors and Learners: A Framework for Generating Scalable RL Implementations

GCN Inference Acceleration using High-Level Synthesis

Large-eddy simulations with ClimateMachine: a new open-source code for atmospheric simulations on GPUs and CPUs

Implementation of Parallel Simplified Swarm Optimization in CUDA

HLS Portability from Intel to Xilinx: A Case Study

Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)