high performance computing on graphics processing units: hgpu.org

Posts

Aug, 23

High performance content-based matching using GPUs

Matching incoming event notifications against received subscriptions is a fundamental part of every publish-subscribe infrastructure. In the case of content-based systems this is a fairly complex and time consuming task, whose performance impacts that of the entire system. In the past, several algorithms have been proposed for efficient content-based event matching. While they differ in […]

CUDA

Aug, 23

Workload and network-optimized computing systems

This paper describes a recent system-level trend toward the use of massive on-chip parallelism combined with efficient hardware accelerators and integrated networking to enable new classes of applications and computing-systems functionality. This system transition is driven by semiconductor physics and emerging network-application requirements. In contrast to general-purpose approaches, workload and network-optimized computing provides significant cost, […]

Aug, 23

Efficient implementation of GPGPU synchronization primitives on CPUs

The GPGPU model represents a style of execution where thousands of threads execute in a data-parallel fashion, with a large subset (typically 10s to 100s) needing frequent synchronization. As the GPGPU model evolves to target both GPUs and CPUs as acceleration targets, thread synchronization becomes an important problem when running on CPUs. CPUs have little hardware […]

OpenCL

Aug, 23

Performance Modelling and Traffic Characterisation of Optical Networks

A review is carried out on the traffic characteristics of an optical carrier’s OC-192 link, based on the IP packet size distribution, traffic burstiness and self-similarity. The generalised exponential (GE) distribution is employed to model the interarrival times of bursty traffic flows of IP packets whilst self-similar traffic is generated for each wavelength of each […]

CUDA

Aug, 22

Auto-tuning 3-D FFT library for CUDA GPUs

Existing implementations of FFTs on GPUs are optimized for specific transform sizes like powers of two, and exhibit unstable and peaky performance, i.e., they do not perform as well in other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA generates high performance CUDA kernels for FFTs of varying transform sizes, alleviating this […]

CUDA

Aug, 22

Evaluation of streaming aggregation on parallel hardware architectures

We present a case study parallelizing streaming aggregation on three different parallel hardware architectures. Aggregation is a performance-critical operation for data summarization in stream computing, and is commonly found in sense-and-respond applications. Currently available commodity parallel hardware provides promise as accelerators for streaming aggregation. However, how streaming aggregation can map to the different parallel architectures […]

CUDA

Aug, 22

A taxonomy of accelerator architectures and their programming models

As the clock frequency of silicon chips is leveling off, the computer architecture community is looking for different solutions to continue application performance scaling. One such solution is the multicore approach, i.e., using multiple simple cores that enable higher performance than wide superscalar processors, provided that the workload can exploit the parallelism. Another emerging alternative […]

Aug, 22

Floating-point data compression at 75 Gb/s on a GPU

Numeric simulations often generate large amounts of data that need to be stored or sent to other compute nodes. This paper investigates whether GPUs are powerful enough to make real-time data compression and decompression possible in such environments, that is, whether they can operate at the 32- or 40-Gb/s throughput of emerging network cards. The […]

CUDA

Aug, 22

A fast GEMM implementation on the Cypress GPU

We present benchmark results of optimized dense matrix multiplication kernels for the Cypress GPU. We write general matrix multiply (GEMM) kernels for single (SP), double (DP) and double-double (DDP) precision. Our SGEMM and DGEMM kernels show ~2 Tflop/s and ~470 Gflop/s, respectively. These results for SP and DP correspond to 73% and 87% of […]

Aug, 22

Cost-aware function migration in heterogeneous systems

Today’s approaches towards heterogeneous computing rely on either the programmer or dedicated programming models to efficiently integrate heterogeneous components. In this work, we propose an adaptive cost-aware function-migration mechanism built on top of a light-weight hardware abstraction layer. With this mechanism, the highly dynamic task of choosing the most beneficial processing unit will be hidden […]

Aug, 22

Towards metaprogramming for parallel systems on a chip

We demonstrate that the performance of commodity parallel systems significantly depends on low-level details, such as storage layout and iteration space mapping, which motivates the need for tools and techniques that separate a high-level algorithm description from low-level mapping and tuning. We propose to build a tool based on the concept of decoupled Access/Execute metadata […]

CUDA

Aug, 22

Load Balancing versus Occupancy Maximization on Graphics Processing Units: The Generalized Hough Transform as a Case Study

Programs developed under the Compute Unified Device Architecture obtain the highest performance rate when the exploitation of hardware resources on a Graphics Processing Unit (GPU) is maximized. To achieve this, load balancing among threads and a high value of processor occupancy, i.e. the ratio of active threads, are indispensable. However, in certain […]

CUDA

* * *

Recent source codes

UniCoder: Unified Visual-to-Code Generation via Symbolic Rewards and Reference-Guided Code Optimization

CuFuzz: An API-Knowledge-Graph Coverage-Driven Fuzzing Framework for CUDA Libraries

AutoPass: Evidence-Guided LLM Agents for Compiler Performance Tuning

Probe-and-Refine Tuning of Repository Guidance for AI Coding Agents

CUDAnalyst (CUDA + Analyst)

CodegenBench

KernelBenchX: A Comprehensive Benchmark for Evaluating LLM-Generated GPU Kernels

CUDA Kernel Fusion Benchmarks

IntelliKit: Agent-first tooling for AMD hardware

DITRON: Distributed Compiler based on Triton for Parallel Systems
