high performance computing on graphics processing units: hgpu.org

Posts

Nov, 21

A new approach for sparse matrix vector product on NVIDIA GPUs

The sparse matrix vector product (SpMV) is a key operation in engineering and scientific computing and, hence, it has been subjected to intense research for a long time. The irregular computations involved in SpMV make its optimization challenging. Therefore, enormous effort has been devoted to devising data formats to store the sparse matrix with the […]

CUDA

Nov, 21

GPU-Based Image Processing Use Cases: A High-Level Approach

This paper addresses the gap between envisioned hardware-virtualized techniques for GPU programming and a conventional approach from the point of view of an application engineer taking software engineering aspects like maintainability, understandability and productivity, and resulting achieved gain in performance and scalability into account. This gap is discussed on the basis of use cases from […]

CUDA

Nov, 21

CT image reconstruction with half precision floating-point values

PURPOSE: Analytic CT image reconstruction is a computationally demanding task. Currently, the even more demanding iterative reconstruction algorithms find their way into clinical routine because their image quality is superior to analytic image reconstruction. The authors thoroughly analyze a so far unconsidered but valuable tool of tomorrow’s reconstruction hardware (CPU and GPU) that allows implementing […]

CUDA

Nov, 21

Efficient GPGPU-based parallel packet classification

With the rapid growth of network technologies, many new web services have been developed to provide various applications and computing functions. These services rely deeply on the internet. Therefore, packet classification is an important issue of network security that typically adopts a flexible packet filtering system to classify each processed packet. Traditional packet classification requires […]

CUDA

Nov, 21

Conflux: Embedding Massively Parallel Semantics in a High-Level Programming Language

Of late, massively parallel devices have become mainstream and are widely used in research and industry. But despite recent advances of the API, programming these devices has proven to be a difficult and error-prone task. We have designed Conflux, an embedded domain-specific language that integrates massively parallel semantics into a high-level programming language. […]

CUDA

Nov, 21

Graph-based Parallel Analysis of Large Analog Circuits Based on GPU Platforms

In this paper, we propose a new parallel analysis method for large analog circuits using a determinant decision diagram (DDD) based graph technique. The DDD-based symbolic analysis technique enables exact symbolic analysis of very large analog circuits. Once the circuit small-signal characteristics are represented by DDDs, evaluation of the DDDs will give exact numerical values. In this paper, […]

CUDA

Nov, 21

Challenge benchmarks that must be conquered to sustain the GPU revolution

The shift from GPUs to GPGPUs has brought with it many changes to the GPU architecture (e.g. more caches, more concurrent kernels, better synchronization). As GPUs press further into the general-purpose domain, architects must continue to address the performance of challenging workloads. This paper presents a set of challenge benchmarks and their key performance limitations […]

CUDA

Nov, 21

PATUS: A Code Generation and Autotuning Framework For Parallel Iterative Stencil Computations on Modern Microarchitectures

Stencil calculations comprise an important class of kernels in many scientific computing applications ranging from simple PDE solvers to constituent kernels in multigrid methods as well as image processing applications. In such types of solvers, stencil kernels are often the dominant part of the computation, and an efficient parallel implementation of the kernel is therefore […]

Nov, 20

Efficient Stack-less BVH Traversal for Ray Tracing

We propose a new, completely iterative traversal algorithm for ray tracing bounding volume hierarchies that is based on storing a parent pointer with each node, and on using simple state logic to infer which node to traverse next. Though our traversal algorithm does re-visit internal nodes, it intersects each visited node only once, and in […]

CUDA

Nov, 20

Implementing a Finite Difference-Based Real-time Sound Synthesizer using GPUs

In this paper, we describe an implementation of a real-time sound synthesizer using Finite Difference-based simulation of a two-dimensional membrane. Finite Difference (FD) methods can be the basis for physics-based music instrument models that generate realistic audio output. However, such methods are compute-intensive; large simulations cannot run in real time on current CPUs. Many current […]

CUDA

Nov, 20

Spatial interpolation in massively parallel computing environments

Prediction of environmental phenomena at non-observed locations is a fundamental task in geographic information science. Often, samples are taken at a limited number of sensor locations and spatial and spatio-temporal interpolation is used to generate continuous maps. The computational cost of the underlying algorithms usually grows with the number of data entering the interpolation and […]

CUDA

Nov, 20

Soft Error Resilient QR Factorization for Hybrid System

As general purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing for their raw performance advantages compared to CPUs, fault tolerance has become more of a concern than it was when they were used exclusively for graphics applications. The pairing of GPUs with CPUs to form a hybrid computing […]

CUDA
