high performance computing on graphics processing units: hgpu.org

Posts

Sep, 10

CuNeuQuant: A CUDA Implementation of the NeuQuant Image Quantization Algorithm

Color quantization is an often performed prestep in many image processing and computer vision applications. Quantization is defined as the process of selecting a palette of representative colors P which can replace the original colors C in an image such that |P| << |C| and the perceptual distortion of the reduced color image is minimized. […]

CUDA

Sep, 10

GPU-Accelerated Monte Carlo Simulations of Dense Stellar Systems

Computing the interactions between the stars within dense stellar clusters is a problem of fundamental importance in theoretical astrophysics. However, simulating realistically sized clusters of about 10^6 stars is computationally intensive and often takes a long time to complete. This paper presents the acceleration of a Monte Carlo algorithm for simulating stellar cluster evolution using […]

CUDA

Sep, 10

A Parallel Twig Join Algorithm for XML Processing using a GPGPU

With an increasing amount of data and demand for fast query processing, the efficiency of database operations continues to be a challenging task. A common approach is to leverage parallel hardware platforms. With the introduction of general-purpose GPU (Graphics Processing Unit) computing, massively parallel hardware has become available within commodity hardware. XML is based on […]

CUDA

Sep, 10

Accelerating Boosting-based Face Detection on GPUs

The goal of face detection is to determine the presence of faces in arbitrary images, along with their locations and dimensions. As with any graphics workload, these algorithms benefit from data-level parallelism. Existing parallelization efforts focus strictly on mapping different divide-and-conquer strategies onto multicore CPUs and GPUs. However, even the most […]

CUDA

Sep, 10

Performance Improvement of TOUGH2 Simulation with Graphics Processing Unit

We tried to accelerate TOUGH2 simulation by introducing a linear computation routine using a Graphics Processing Unit (GPU). Libraries for GPU computation were introduced, and new solvers for linear equations were developed. Of those, CLLUSTB, an ILU-preconditioned BiCGSTAB solver built with CULA Sparse, demonstrated good performance both in […]

CUDA

Sep, 8

Performance Evaluation of Concurrent Lock-free Data Structures on GPUs

Graphics processing units (GPUs) have emerged as a strong candidate for high-performance computing. While regular data-parallel computations with little or no synchronization are easy to map on the GPU architectures, it is a challenge to scale up computations on dynamically changing pointer-linked data structures. The traditional lock-based implementations are known to offer poor scalability due […]

CUDA

Sep, 8

Lost in Translation: Challenges in Automating CUDA-to-OpenCL Translation

The use of accelerators in high-performance computing is increasing. The most commonly used accelerator is the graphics processing unit (GPU) because of its low cost and massively parallel performance. The two most common programming environments for GPU accelerators are CUDA and OpenCL. While CUDA runs natively only on NVIDIA GPUs, OpenCL is an open standard […]

CUDA • OpenCL

Sep, 8

Mastering Software Variant Explosion for GPU Accelerators

Mapping algorithms efficiently to the target hardware poses a challenge for algorithm designers. This is particularly true for heterogeneous systems hosting accelerators like graphics cards. While algorithm developers have profound knowledge of the application domain, they often lack detailed insight into the underlying hardware of accelerators in order to exploit the provided […]

CUDA • OpenCL

Sep, 8

Supporting Heterogenous Computing Environments in SaC

From laptops to supercomputer nodes, hardware architectures are becoming increasingly heterogeneous, combining at least multiple general-purpose cores with one or even multiple GPGPU accelerators. Taking effective advantage of such systems’ capabilities becomes increasingly challenging. SaC is a functional array programming language with support for fully automatic parallelization following a data-parallel approach. As such, many SaC programs […]

CUDA

Sep, 8

OpenACC Implementations Comparison

Using GPUs for general-purpose programming is much easier nowadays than in previous years. In the very beginning, Brook-GPU and Close To Metal were the approaches used to explore the new possibilities of hardware accelerators. After that, CUDA and OpenCL were released. They have been adopted by many programmers due to their advantages; however, […]

CUDA

Sep, 7

Can GPGPU Programming Be Liberated from the Data-Parallel Bottleneck?

Heterogeneous parallel primitives (HPP) addresses two major shortcomings in current GPGPU programming models: it supports full composability by defining abstractions and increases flexibility in execution by introducing braided parallelism.

OpenCL

Sep, 7

High Performance Parallel Implementation of Compressive Sensing SAR Imaging

Compressive sensing (CS) theory has been applied to SAR imaging systems in many ways, and it shows a significant reduction in the amount of sampling data at the cost of much longer reconstruction time. In this paper, we investigate the development and optimization of the Iterative Shrinkage/Thresholding (IST) algorithm applied to CS reconstruction of SAR […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
