high performance computing on graphics processing units: hgpu.org

Posts

Feb, 22

Exploring Design Space of 3D NVM and eDRAM Caches Using DESTINY Tool (open-source code)

To enable the design of large caches, novel memory technologies (such as non-volatile memory) and novel fabrication approaches (e.g. 3D stacking) have been explored. The existing modeling tools, however, cover only a few memory technologies, CMOS technology nodes and fabrication approaches. We present DESTINY, a tool for modeling 3D (and 2D) cache designs using SRAM, […]

Feb, 19

Reproducible Triangular Solvers for High-Performance Computing

On modern parallel architectures, floating-point computations may become non-deterministic and, therefore, non-reproducible, mainly due to the non-associativity of floating-point operations. We propose an algorithm to solve dense triangular systems by leveraging the standard parallel triangular solver and our recently introduced multi-level exact summation approach. Finally, we present implementations of the proposed fast reproducible triangular solver and […]

OpenCL
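
The non-associativity mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' multi-level exact summation approach: Python's standard `math.fsum`, which returns a correctly rounded sum, is swapped in here as a stand-in for an exact summation routine, and the helper `naive_sum` and the sample values are hypothetical, chosen only for illustration.

```python
import math

def naive_sum(values):
    """Left-to-right floating-point summation, as a sequential loop would do."""
    total = 0.0
    for v in values:
        total += v
    return total

# The same three numbers in two summation orders, as two different
# schedules of a parallel reduction might produce.
order_a = [1e16, 1.0, -1e16]
order_b = [1e16, -1e16, 1.0]

naive_a = naive_sum(order_a)   # 1.0 is absorbed by 1e16 before cancellation
naive_b = naive_sum(order_b)   # cancellation happens first, so 1.0 survives
exact_a = math.fsum(order_a)   # correctly rounded sum, order-independent
exact_b = math.fsum(order_b)

print(naive_a == naive_b)  # False
print(exact_a == exact_b)  # True
```

The naive loop gives a different result for each ordering, while the correctly rounded sum is identical for both, which is the property a reproducible solver needs from its summation step.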

Feb, 19

Fast, Memory-Efficient Construction of Voxelized Shadows

We present a fast and memory-efficient algorithm for generating Compact Precomputed Voxelized Shadows. By performing much of the common sub-tree merging before identical nodes are ever created, we improve construction times by several orders of magnitude for large data structures, and require much less working memory. We also propose a new set of rules […]

CUDA • OpenGL

Feb, 19

Auto-tuning Shallow water simulations on GPUs

Graphics processing units (GPUs) have gained popularity in scientific computing in recent years because of the massive computing power they can provide for parallel tasks. While GPUs are powerful, it is hard to fully utilize that power. Part of this difficulty comes from the many parameters available, and tuning of […]

CUDA

Feb, 19

Memory-efficient Adaptive Subdivision for Software Rendering on the GPU

The adaptive subdivision step for surface tessellation is a key component of the Reyes rendering pipeline. While this operation has been successfully parallelized for execution on the GPU using a breadth-first traversal, the resulting implementations are limited by their high worst-case memory consumption and high global memory bandwidth utilization. This report proposes an alternate strategy […]

OpenCL

Feb, 19

NMF-mGPU: non-negative matrix factorization on multi-GPU systems

BACKGROUND: In the last few years, the Non-negative Matrix Factorization (NMF) technique has gained great interest in the bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessor cluster. […]

CUDA

Feb, 13

NUPAR: A Benchmark Suite for Modern GPU Architectures

Heterogeneous systems consisting of multi-core CPUs, Graphics Processing Units (GPUs) and many-core accelerators have gained widespread use by application developers and data-center platform developers. Modern-day heterogeneous systems have evolved to include advanced hardware and software features to support a spectrum of application patterns. Heterogeneous programming frameworks such as CUDA, OpenCL, and OpenACC have all […]

CUDA • OpenCL

Feb, 13

Locally-Oriented Programming: A Simple Programming Model for Stencil-Based Computations on Multi-Level Distributed Memory Architectures

Emerging hybrid accelerator architectures for high performance computing are often suited for the use of a data-parallel programming model. Unfortunately, programmers of these architectures face a steep learning curve that frequently requires learning a new language (e.g., OpenCL). Furthermore, the distributed (and frequently multi-level) nature of the memory organization of clusters of these machines provides […]

OpenCL

Feb, 13

Quadratic Pseudo-Boolean Optimization for Scene Analysis using CUDA

Many problems in early computer vision, like image segmentation, image reconstruction, 3D vision or object labeling, can be modeled by Markov Random Fields (MRF). General algorithms to optimize an MRF, like Simulated Annealing, Belief Propagation or Iterated Conditional Modes, are either slow or produce low-quality results [Rother 07]. On the other hand, in the […]

CUDA

Feb, 13

Large-Scale Deep Learning on the YFCC100M Dataset

We present a work-in-progress snapshot of learning with a 15 billion parameter deep learning network on HPC architectures applied to the largest publicly available natural image and video dataset released to date. Recent advancements in unsupervised deep neural networks suggest that scaling up such networks in both model and training dataset size can yield significant improvements […]

CUDA

Feb, 13

Primal Dual Affine Scaling on GPUs

Here we present an implementation of the Primal-Dual Affine Scaling method to solve linear optimization problems on GPU-based systems. Strategies to convert the system generated by the complementary slackness theorem into a symmetric system are given. A new CUDA-friendly technique to solve the resulting symmetric positive definite subsystem is also developed. Various strategies to reduce […]

CUDA

Feb, 12

A Real-time GPU Implementation of the SIFT Algorithm for Large-Scale Video Analysis Tasks

The SIFT algorithm is one of the most popular feature extraction methods and is therefore widely used in all sorts of video analysis tasks, like instance search and duplicate/near-duplicate detection. We present an efficient GPU implementation of the SIFT descriptor extraction algorithm using CUDA. The major steps of the algorithm are presented and for each step […]

CUDA

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
