high performance computing on graphics processing units: hgpu.org

Posts

Oct, 10

Implementation of Parallel Simplified Swarm Optimization in CUDA

As the acquisition cost of graphics processing units (GPUs) has decreased, personal computers (PCs) can now handle optimization problems. In optimization computing, intelligent swarm algorithms (SIAs) are well suited to parallelization. However, a GPU-based Simplified Swarm Optimization algorithm has never been proposed. Accordingly, this paper proposes Parallel Simplified Swarm Optimization (PSSO) based on the […]

CUDA

Oct, 10

GCN Inference Acceleration using High-Level Synthesis

Graph Convolutional Networks (GCNs) have become a promising solution for many applications, such as recommendation systems and social data mining. Many of these applications require low-latency GCN inference. In this paper, we provide a case study of GCN inference acceleration on an FPGA. We explore a high-level synthesis programming model to achieve low-latency inference. First, […]

OpenCL

Oct, 3

HLS Portability from Intel to Xilinx: A Case Study

Field-programmable gate arrays (FPGAs) are a hardware accelerator option that is growing in popularity. However, FPGAs are notoriously hard to program. To this end, high-level synthesis (HLS) tools have been developed to allow programmers to design hardware accelerators with FPGAs using familiar software languages. The two largest FPGA vendors, Intel and Xilinx, support both C/C++ […]

OpenCL

Oct, 3

Unified Shader Programming in C++

In real-time graphics, the strict separation of programming languages and environments for host (CPU) code and GPU code results in code duplication, subtle compatibility bugs, and additional development and maintenance costs. In contrast, popular general-purpose GPU (GPGPU) programming models like CUDA and C++ AMP avoid many of these issues by presenting unified programming environments where […]

Oct, 3

Intel oneAPI DPC++ FPGA Optimization Guide

The Intel® oneAPI FPGA Optimization Guide provides guidance on leveraging the functionalities of Data Parallel C++ (DPC++) to optimize your design. This document assumes that you are familiar with SYCL* concepts and application programming interfaces (APIs), as described in the SYCL* Specification version 1.2.1 by the Khronos* Group. It also assumes that you have experience […]

Oct, 3

Embedded Software Synthesis using Heterogeneous Dataflow Models

Dataflow process networks (DPNs) consist of statically defined process nodes with First-In-First-Out (FIFO) buffered point-to-point connections. DPNs are intrinsically data-driven, i.e., node actions are not synchronized with each other and may fire whenever sufficient input operands have arrived at a node. In this original form, DPNs are data-driven and therefore a suitable model of computation (MoC) […]

OpenCL

Oct, 3

Accelerating Encrypted Computing on Intel GPUs

Homomorphic Encryption (HE) is an emerging encryption scheme that allows computations to be performed directly on encrypted messages. This property provides promising applications such as privacy-preserving deep learning and cloud computing. Prior works have been proposed to enable practical privacy-preserving applications with architectural-aware optimizations on CPUs, GPUs and FPGAs. However, there is no systematic optimization […]

Sep, 26

Taskflow: A Lightweight Parallel and Heterogeneous Task Graph Computing System

Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in the implementation of parallel and heterogeneous decomposition strategies on a heterogeneous computing platform. Our programming model distinguishes itself as a very general class of task graph […]

CUDA

Sep, 26

Small-Bench NLP: Benchmark for small single GPU trained models in Natural Language Processing

Recent progress in the Natural Language Processing domain has given us several State-of-the-Art (SOTA) pretrained models which can be finetuned for specific tasks. These large models with billions of parameters trained on numerous GPUs/TPUs over weeks are leading in the benchmark leaderboards. In this paper, we discuss the need for a benchmark for cost and […]

Sep, 26

IgNet. A Super-precise Convolutional Neural Network

Convolutional neural networks (CNNs) are known to be an effective means of detecting and analyzing images. Their power rests essentially on the ability to extract features that images have in common. There exist, however, images involving unique, irregular features or details. One example is a collection of unusual children's drawings reflecting the kids' imagination and individuality. These […]

OpenCL

Sep, 26

An Experimental Study of SYCL Task Graph Parallelism for Large-Scale Machine Learning Workloads

Task graph parallelism has emerged as an important tool to efficiently execute large machine learning workloads on GPUs. Users describe a GPU workload in a task dependency graph rather than aggregated GPU operations and dependencies, allowing the runtime to run whole-graph scheduling optimization to significantly improve the performance. While the new CUDA graph execution model […]

CUDA

Sep, 26

CompilerGym: Robust, Performant Compiler Optimization Environments for AI Research

Interest in applying Artificial Intelligence (AI) techniques to compiler optimizations is increasing rapidly, but compiler research has a high entry barrier. Unlike in other domains, compiler and AI researchers do not have access to the datasets and frameworks that enable fast iteration and development of ideas, and getting started requires a significant engineering investment. What […]

CUDA

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

Recent source codes

SimSYCL: Synchronous, single-threaded, library-only SYCL implementation for debugging and verification

GPU plugin for PySCF

QArray

Celerity: High-level C++ for Accelerator Clusters

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Most viewed papers (last 30 days)