high performance computing on graphics processing units: hgpu.org

Posts

Feb, 11

Accurate Cross-Architecture Performance Modeling for Sparse Matrix-Vector Multiplication (SpMV) on GPUs

This paper presents an integrated analytical and profile-based cross-architecture performance modeling tool to specifically provide inter-architecture performance prediction for Sparse Matrix-Vector Multiplication (SpMV) on NVIDIA GPU architectures. To design and construct the tool, we investigate the inter-architecture relative performance for multiple SpMV kernels. For a sparse matrix, based on its SpMV kernel performance measured on […]

CUDA
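
For readers unfamiliar with the operation being modeled above, here is a minimal sketch of SpMV in plain Python. The truncated abstract does not say which storage formats or kernels the paper studies, so the CSR (compressed sparse row) layout and the names `spmv_csr`, `row_ptr`, `col_idx` and `values` are assumptions chosen for illustration only.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx and values hold their column indices and values.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        # Only the stored nonzeros of this row contribute to y[row].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# Illustrative 3x3 matrix (not taken from the paper):
# [[4, 0, 1],
#  [0, 2, 0],
#  [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
x = [1.0, 2.0, 3.0]

y = spmv_csr(row_ptr, col_idx, values, x)
print(y)
```

On a GPU the outer loop over rows is typically what gets distributed across threads; the irregular, data-dependent reads of `x` are one reason SpMV performance varies so much from matrix to matrix.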

Feb, 11

Implementing the Projected Spatial Rich Features on a GPU

The Projected Spatial Rich Model (PSRM) generates powerful steganalysis features, but requires the calculation of tens of thousands of convolutions with image noise residuals. This makes it very slow: the reference implementation takes an impractical 20–30 minutes per 1 megapixel (Mpix) image. We present a case study which first tweaks the definition of the PSRM […]

CUDA

Feb, 11

Genetically Improved CUDA C++ Software

Genetic Programming (GP) may dramatically increase the performance of software written by domain experts. GP and autotuning are used to optimise and refactor legacy GPGPU C code for modern parallel graphics hardware and software. Speed ups of more than six times on recent nVidia GPU cards are reported compared to the original kernel on the […]

CUDA

Feb, 9

GPGPU-Assisted Subpixel Tracking Method for Fiducial Markers

With the aim of achieving highly accurate position estimation, we propose in this paper a method for efficiently and accurately detecting the 3D positions and poses of traditional fiducial markers with black frames in high-resolution images taken by ordinary web cameras. Our tracking method can be efficiently executed utilizing GPGPU computation, and in order to […]

OpenCL

Feb, 9

Benchmarks for Intel MIC Architecture

Intel Many Integrated Core (MIC) Architecture combines about 60 cores onto a single chip. Intel MIC, brand-named Xeon Phi, offers a theoretical maximum of more than 3 double precision GFLOPs than Intel Xeon E5 core. We carry out benchmarks for Intel MIC with a Monte Carlo simulation of the LIBOR Market Model. The results show […]

Feb, 9

Extending the SkelCL Skeleton Library for Stencil Computations on Multi-GPU Systems

The implementation of stencil computations on modern, massively parallel systems with GPUs and other accelerators currently relies on manually-tuned coding using low-level approaches like OpenCL and CUDA, which makes it a complex, time-consuming, and error-prone task. We describe how stencil computations can be programmed in our SkelCL approach that combines high level of programming abstraction […]

OpenCL

Feb, 9

A Scalable Multi-Path Microarchitecture for Efficient GPU Control Flow

Graphics processing units (GPUs) are increasingly used for non-graphics computing. However, applications with divergent control flow incur performance degradation on current GPUs. These GPUs implement the SIMT execution model by serializing the execution of different control flow paths encountered by a warp. This serialization can mask thread level parallelism among the scalar threads comprising a […]
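
The serialization described above can be illustrated with a toy model, sketched here in Python. It is not the microarchitecture proposed in the paper; the branch, the four-lane "warp" and the name `run_warp_with_branch` are hypothetical and serve only to show how divergent paths are executed one after another under an active mask.

```python
def run_warp_with_branch(inputs):
    """Toy model of a warp executing:
        if x % 2 == 0: x = x // 2
        else:          x = 3 * x + 1
    Each control flow path is executed in its own pass, with only the
    lanes that took that path active. Returns (outputs, passes).
    """
    mask_then = [x % 2 == 0 for x in inputs]
    mask_else = [not taken for taken in mask_then]
    paths = (
        (mask_then, lambda x: x // 2),
        (mask_else, lambda x: 3 * x + 1),
    )

    outputs = list(inputs)
    passes = 0
    for mask, op in paths:
        if not any(mask):
            continue  # no lane took this path, so it costs nothing
        passes += 1   # one serialized pass over the whole warp
        for lane, active in enumerate(mask):
            if active:
                outputs[lane] = op(inputs[lane])
    return outputs, passes


uniform = run_warp_with_branch([2, 4, 6, 8])    # all lanes take the same path
divergent = run_warp_with_branch([1, 2, 3, 4])  # lanes split across both paths
print(uniform, divergent)
```

In this model the uniform warp needs one pass and the divergent warp needs two, even though each lane does the same amount of work in both cases.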

Feb, 9

GPU-based high-performance computing for radiation therapy

Recent developments in radiotherapy demand high computational power to solve challenging problems in a timely fashion in a clinical environment. The graphics processing unit (GPU), as an emerging high-performance computing platform, has been introduced to radiotherapy. It is particularly attractive due to its high computational power, small size, and low cost for facility deployment […]

CUDA

Feb, 8

Fast 2-D Ultrasound Strain Imaging: The Benefits of Using a GPU

Deformation of tissue can be accurately estimated from radio-frequency ultrasound data using a 2-dimensional normalized cross correlation (NCC)-based algorithm. This procedure, however, is very computationally time-consuming. A major time reduction can be achieved by parallelizing the numerous computations of NCC. In this paper, two approaches for parallelization have been investigated: the OpenMP interface on a […]

CUDA
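
As a rough illustration of the quantity being parallelized above, here is a minimal Python sketch of normalized cross correlation between two equally sized 2-D windows. The truncated abstract does not give the exact formula, so the zero-mean variant used here, the function name `ncc` and the sample windows are assumptions.

```python
import math


def ncc(window_a, window_b):
    """Zero-mean normalized cross correlation of two equally sized 2-D windows.

    Returns a value in [-1, 1], or 0.0 if either window is constant.
    """
    a = [v for row in window_a for v in row]
    b = [v for row in window_b for v in row]
    if len(a) != len(b):
        raise ValueError("windows must have the same number of samples")

    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    den = math.sqrt(
        sum((p - mean_a) ** 2 for p in a) * sum((q - mean_b) ** 2 for q in b)
    )
    return num / den if den else 0.0


# Illustrative windows (not ultrasound data).
reference = [[1.0, 2.0], [3.0, 4.0]]
scaled = [[2.0, 4.0], [6.0, 8.0]]        # same pattern, different gain
inverted = [[4.0, 3.0], [2.0, 1.0]]      # reversed pattern

same_pattern = ncc(reference, scaled)
opposite_pattern = ncc(reference, inverted)
print(same_pattern, opposite_pattern)
```

Each window pair is scored independently of every other pair, which is why a displacement search that evaluates many such scores lends itself to one-thread-per-candidate parallelization.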

Feb, 8

State-Based Gauss-Seidel Framework for Real-time 2D Ultrasound Image Sequence Denoising on GPUs

Ultrasound image sequences are contaminated mainly by multiplicative noise, and usually by additive noise as well. Over the past few decades, several works have focused on removing noise from ultrasound images, such as the JY model [1] and the variational model, which were […]

CUDA

Feb, 8

Fast 3D Graphics Rendering Technique with CUDA Parallel Processing

3D graphics rendering has been used to express realistic, 3-dimensional, and emphasized effects in graphics. As 3D graphics rendering developed and became more prevalent, the need for faster data processing grew as well, leading to the development of the GPU (Graphics Processing Unit) and of shading languages for GPUs such as GLSL (OpenGL Shading […]

CUDA

OpenGL

Feb, 8

Developmental Directions in Parallel Accelerators

Parallel accelerators such as massively-cored graphical processing units or many-cored co-processors such as the Xeon Phi are becoming widespread and affordable on many systems including blade servers and even desktops. The use of a single such accelerator is now quite common for many applications, but the use of multiple devices and hybrid combinations is still […]

CUDA

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems