high performance computing on graphics processing units: hgpu.org

Posts

Oct, 6

Live Migration for OpenCL FPGA Accelerators

FPGAs are currently being deployed at a large scale across data-centres for various applications because of their performance and power benefits. In particular, cloud operators have started providing FPGAs as a Service. However, to completely integrate FPGAs in a data-centre environment like standard software systems, support for fault tolerance and task migration is essential. […]

OpenCL

Oct, 6

MyCaffe: A Complete C# Re-Write of Caffe with Reinforcement Learning

Over the past few years Caffe, from Berkeley AI Research, has gained a strong following in the deep learning community with over 15K forks on the github.com/BVLC/caffe site. With its well-organized, very modular C++ design it is easy to work with and very fast. However, in the world of Windows development, C# has helped […]

CUDA

Oct, 6

Exascale Deep Learning for Climate Analytics

We extract pixel-level masks of extreme weather patterns using variants of Tiramisu and DeepLabv3+ neural networks. We describe improvements to the software frameworks, input pipeline, and the network training algorithms necessary to efficiently scale deep learning on the Piz Daint and Summit systems. The Tiramisu network scales to 5300 P100 GPUs with a sustained throughput […]

CUDA

Oct, 6

HSTREAM: A directive-based language extension for heterogeneous stream computing

Big data streaming applications require utilization of heterogeneous parallel computing systems, which may comprise multiple multi-core CPUs and many-core accelerating devices such as NVIDIA GPUs and Intel Xeon Phis. Programming such systems requires advanced knowledge of several hardware architectures and device-specific programming models, including OpenMP and CUDA. In this paper, we present HSTREAM, a compiler […]

CUDA

Oct, 6

On Reinforcement Learning for Full-length Game of StarCraft

StarCraft II poses a grand challenge for reinforcement learning. Its main difficulties include huge state and action spaces and a long time horizon. In this paper, we investigate a hierarchical reinforcement learning approach for StarCraft II. The hierarchy involves two levels of abstraction. One is the macro-action automatically extracted from expert’s trajectories, which reduces […]

Sep, 23

Evaluating Performance Portability of Accelerator Programming Models using SPEC ACCEL 1.2 Benchmarks

As heterogeneous architectures are becoming mainstream for HPC systems, application programmers are looking for programming model implementations that offer both performance and portability across platforms. Two directive-based programming models for accelerator programming that aim at doing this are OpenMP 4/4.5 and OpenACC. Many users want to know the difference between these two programming models, the […]

OpenCL

Sep, 23

Parallel LZ77 Decoding using a GPU

Data compression, as a process, aims to satisfy the modern world’s need for speed and efficiency by reducing the cost of storing and transmitting information. Over the past few years, there have been several attempts to improve the performance and reduce the execution times of older compression algorithms by adapting them to make use of […]

CUDA

Sep, 23

Scalability Analysis of Synchronous Data-Parallel Artificial Neural Network (ANN) Learners

Artificial Neural Networks (ANNs) have been established as one of the most important algorithmic tools in the Machine Learning (ML) toolbox over the past few decades. ANNs’ recent rise to widespread acceptance can be attributed to two developments: (1) the availability of large-scale training and testing datasets; and (2) the availability of new computer architectures […]

OpenCL

Sep, 23

Support for Parallel Scan in OpenMP

Prefix Scan (or simply scan) is an operator that computes all the partial sums of a vector. A scan operation results in a vector where each element is the sum of the preceding elements in the original vector up to the corresponding position. Scan is a key operation in many relevant problems like sorting, lexical […]
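As a rough illustration of the scan operator described in this excerpt, the sketch below computes an inclusive prefix sum two ways: directly, and with the two-pass blocked scheme that parallel scans commonly follow (scan each block independently, then add each block's offset). It is plain sequential Python, not the OpenMP support the paper proposes; the function names `inclusive_scan` and `blocked_scan` and the `num_blocks` parameter are assumptions made for this example.

```python
from itertools import accumulate


def inclusive_scan(values):
    """Return the running sums of `values` (inclusive prefix scan)."""
    return list(accumulate(values))


def blocked_scan(values, num_blocks):
    """Inclusive scan in two passes over `num_blocks` chunks.

    Pass 1 scans every chunk on its own; these scans are independent,
    so a parallel runtime could hand one chunk to each thread.
    Pass 2 adds to each chunk the total of all chunks before it.
    This version runs sequentially and only shows the data flow.
    """
    n = len(values)
    if n == 0:
        return []
    size = -(-n // num_blocks)  # ceiling division
    blocks = [values[i:i + size] for i in range(0, n, size)]

    # Pass 1: local scan of each block.
    local = [list(accumulate(block)) for block in blocks]

    # Exclusive scan of the block totals gives each block's offset.
    totals = list(accumulate(block[-1] for block in local))
    offsets = [0] + totals[:-1]

    # Pass 2: shift every local result by its block offset.
    return [x + off for block, off in zip(local, offsets) for x in block]


if __name__ == "__main__":
    data = [3, 1, 4, 1, 5, 9]
    print(inclusive_scan(data))
    print(blocked_scan(data, 3))
```

Both calls should produce the same vector; the blocked form is shown only because its first pass has no dependencies between blocks.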

CUDA, OpenCL

Sep, 23

SoaAlloc: Accelerating Single-Method Multiple-Objects Applications on GPUs

We propose SoaAlloc, a dynamic object allocator for Single-Method Multiple-Objects applications in CUDA. SoaAlloc is the first allocator for GPUs that (a) arranges allocations in a SIMD-friendly Structure of Arrays (SOA) data layout, (b) provides a do-all operation for maximizing the benefit of SOA, and (c) is on par with state-of-the-art memory allocators for raw […]
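To make the Structure of Arrays idea in this excerpt concrete, here is a minimal sketch of a container that stores each field in its own contiguous array and exposes a do-all style operation that applies one method to every allocated object. It is a CPU-side Python illustration only and does not reflect SoaAlloc's actual interface; the class `BodySoA`, its fields `pos` and `vel`, and the function `move` are hypothetical names chosen for this example.

```python
from array import array


class BodySoA:
    """Hypothetical structure-of-arrays store: one contiguous array per field.

    An array-of-structures layout would keep (pos, vel) pairs side by side;
    here all positions are adjacent and all velocities are adjacent, which is
    the layout that lets SIMD lanes read neighbouring elements of one field.
    """

    def __init__(self):
        self.pos = array("d")
        self.vel = array("d")

    def allocate(self, pos, vel):
        """Append a new object and return its index (its 'handle')."""
        self.pos.append(pos)
        self.vel.append(vel)
        return len(self.pos) - 1

    def do_all(self, method, *args):
        """Apply `method` to every allocated object, one index at a time."""
        for i in range(len(self.pos)):
            method(self, i, *args)


def move(store, i, dt):
    """Single method applied to many objects: advance position by vel * dt."""
    store.pos[i] += store.vel[i] * dt


if __name__ == "__main__":
    bodies = BodySoA()
    bodies.allocate(0.0, 1.0)
    bodies.allocate(10.0, -2.0)
    bodies.do_all(move, 0.5)
    print(list(bodies.pos))
```

On a GPU the loop in `do_all` would correspond to one thread per object; the sequential loop here only stands in for that mapping.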

CUDA

Sep, 18

International Conference on Image, Video and Signal Processing (IVSP), 2019

The 2019 International Conference on Image, Video and Signal Processing (IVSP 2019) will be held on 25-28 February 2019 in Shanghai, China. IVSP 2019 aims to provide researchers and practitioners from academia and industry with a forum to report on the latest developments in video, image and signal processing, multimedia and computer graphics. The conference […]

Sep, 18

International Joint Conference on Signals, Systems and Computers (CSSC), 2018

Venue: Khalifa University, Abu Dhabi, UAE. Khalifa University (also known as Khalifa University of Science, Technology & Research, or KUSTAR) is a science-focused university located in Abu Dhabi, United Arab Emirates, with a satellite campus in Sharjah. In 2017 it was ranked as the 401st best university in the world by QS rankings. Founded in 2007 […]

* * *

Recent source codes

Kernel Library for LLM Serving

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Genten: Software for Generalized Tensor Decompositions by Sandia National Laboratories

Interleaved Learning and Exploration: A Self-Adaptive Fuzz Testing Framework for MLIR

Pinocchio: PINpointing Orbit Crossing Collapsed Hierarchical Objects

KernelCoder: trained on a curated dataset of reasoning traces and CUDA kernel pairs

VibeCodeHPC - Multi Agentic Vibe Coding for HPC

Compile-Time Resource Safety for GPU APIs: A Low-Overhead Typestate Framework

exa-AMD: Exascale Accelerated Materials Discovery
