high performance computing on graphics processing units: hgpu.org

Posts

Apr, 12

Wire Speed Name Lookup: A GPU-based Approach

This paper studies the name lookup issue with longest prefix matching, which is widely used in URL filtering, content routing/switching, etc. Recently Content-Centric Networking (CCN) has been proposed as a clean slate future Internet architecture to naturally fit the content-centric property of today’s Internet usage: instead of addressing end hosts, the Internet should operate based […]

CUDA

Apr, 12

Real-time Subsurface Scattering for Particle-based Fluids using Finite Volume Method

We present a real-time subsurface scattering simulation to perform real-time rendering of translucent particle-based fluids. After particle-based fluid simulation, we immediately build voxelized fluids, called Voronoi fluids, with particle locations and neighbour lists using GPUs. We then perform a multiple subsurface scattering simulation over the Voronoi fluids with the diffusion equation (DE). We employ Finite […]

CUDA • OpenGL

Apr, 10

Batched Kronecker product for 2-D matrices and 3-D arrays on NVIDIA GPUs

We describe an interface and an implementation for performing Kronecker product actions on NVIDIA GPUs for multiple small 2-D matrices and 3-D arrays processed in parallel as a batch. This method is suited to cases where the Kronecker product component matrices are identical but the operands in a matrix-free application vary in the batch. Any […]

CUDA

Apr, 10

CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions

BACKGROUND: The maximal sensitivity for local alignments makes the Smith-Waterman algorithm a popular choice for protein sequence database search based on pairwise alignment. However, the algorithm is compute-intensive due to a quadratic time complexity. Corresponding runtimes are further compounded by the rapid growth of sequence databases. RESULTS: We present CUDASW++ 3.0, a fast Smith-Waterman protein […]

CUDA

Apr, 9

Modeling of High Performance Programs to Support Heterogeneous Computing

In order to harness the power of multicore CPUs and GPUs, HPC (High Performance Computing) programmers and even end-users need new tools and techniques to express their core problem, divide that core problem into sub-problems, allocate computational resources for the sub-problems, execute the resources, and collect results. HPC users focus more on the problem […]

CUDA • OpenCL

Apr, 9

OpenCL Fast Fourier Transform

The Fast Fourier Transform is one of the most important numerical algorithms in history. It has a wide range of applications: audio signal processing, medical imaging, image processing, pattern recognition, computational chemistry, error-correcting codes and spectral methods for PDEs. The goal of this project is to implement an OpenCL-based FFT algorithm that has comparable performance […]

OpenCL

Apr, 9

Accelerating Image Reconstruction in Three-Dimensional Optoacoustic Tomography on Graphics Processing Units

PURPOSE: Optoacoustic tomography (OAT) is inherently a three-dimensional (3D) inverse problem. However, most studies of OAT image reconstruction still employ two-dimensional (2D) imaging models. One important reason is that 3D image reconstruction is computationally burdensome. The aim of this work is to accelerate existing image reconstruction algorithms for 3D OAT by use of parallel programming […]

CUDA

Apr, 9

A Performance Comparison of Different Graphics Processing Units Running Direct N-Body Simulations

Hybrid computational architectures based on the joint power of Central Processing Units and Graphics Processing Units (GPUs) are becoming popular and powerful hardware tools for a wide range of simulations in biology, chemistry, engineering, physics, etc. In this paper we present a comparison of performance of various GPUs available on the market when applied to the […]

OpenCL

Apr, 9

A GEMM interface and implementation on NVIDIA GPUs for multiple small matrices

We present an interface and an implementation of the General Matrix Multiply (GEMM) routine for multiple small matrices processed simultaneously on NVIDIA graphics processing units (GPUs). We focus on matrix sizes under 16. The implementation can be easily extended to larger sizes. For single precision matrices, our implementation is 30% to 600% faster than the […]

CUDA

Apr, 8

pVOCL: Power-Aware Dynamic Placement and Migration in Virtualized GPU Environments

Power-hungry Graphics processing unit (GPU) accelerators are ubiquitous in high performance computing data centers today. GPU virtualization frameworks introduce new opportunities for effective management of GPU resources by decoupling them from application execution. However, power management of GPU-enabled server clusters faces significant challenges. The underlying system infrastructure shows complex power consumption characteristics depending on the […]

OpenCL

Apr, 8

A Hardware Multithreaded SpMV Kernel for the Convey HC-2ex

Applications exhibiting irregular behavior through poor memory locality have been a constant challenge for high-performance computing. Architectures supporting hardware multithreading (e.g. Tera MTA and Cray XMT) have been shown to deliver superior performance on such applications by masking memory latency. FPGAs have outperformed traditional architectures on applications that exhibit very large spatial locality and where […]

CUDA

Apr, 8

Load Balancing in a Changing World: Dealing with Heterogeneity and Performance Variability

Fully utilizing the power of modern heterogeneous systems requires judiciously dividing work across all of the available computational devices. Existing approaches for partitioning work require offline training and generate fixed partitions that fail to respond to fluctuations in device performance that occur at run time. We present a novel dynamic approach to work partitioning that […]

OpenCL

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs

Most viewed papers (last 30 days)