high performance computing on graphics processing units: hgpu.org

Posts

Sep, 30

Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

For some classes of problems, NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well-known solution for the common case of linear image filtering - small fixed […]

CUDA

Sep, 30

CBench: Analyzing Compute Performance for Modern NVIDIA and AMD GPUs

General-purpose GPU computation is a fast-growing field with a variety of applications. For maximum performance, though, mapping high-level parallel algorithms to vendor hardware requires a solid grasp of both the algorithm’s computational requirements and the microarchitectural limitations of the GPU. This work aims to explore the performance of high and low arithmetic intensity […]

CUDA

•

OpenCL

Sep, 30

FATSEA-An Architectural Simulator for General Purpose Computing on GPUs

We present FATSEA, a functional and performance evaluation simulator written in C++ to handle kernels written in the CUDA programming language and aimed at GPGPU computing. FATSEA takes Parallel Thread eXecution (PTX) code as input, a device-independent code format generated by the NVIDIA CUDA compiler, to validate results and estimate performance […]

CUDA

Sep, 30

Translating GPU binaries to tiered SIMD architectures with Ocelot

Parallel Thread Execution ISA (PTX) is a virtual instruction set used by NVIDIA GPUs that explicitly expresses hierarchical MIMD and SIMD style parallelism in an application. In such a programming model, the programmer and compiler are left with the not trivial, but not impossible, task of composing applications from parallel algorithms and data structures. Once […]

Sep, 30

Accelerating Geospatial Analysis on GPUs using CUDA

Inverse distance weighting (IDW) interpolation and viewshed are two popular algorithms for geospatial analysis. IDW interpolation assigns geographical values to unknown spatial points by using values from a usually scattered set of known points, and viewshed identifies the cells in a spatial raster that can be seen by observers. Although the implementations of both algorithms […]
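As a rough illustration of the IDW idea described in this abstract, the sketch below estimates a value at an unknown point as a distance-weighted average of known points. It is a plain CPU version in Python, not the paper's CUDA implementation; the function name `idw_interpolate`, the `power` parameter, and the sample data are assumptions made for this example.

```python
import math

def idw_interpolate(known, query, power=2.0):
    """Estimate the value at `query` from scattered known points.

    known: list of (x, y, value) tuples (hypothetical sample format).
    query: (x, y) location whose value is unknown.
    power: distance exponent; 2.0 is a common choice, assumed here.
    """
    qx, qy = query
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in known:
        dist = math.hypot(x - qx, y - qy)
        if dist == 0.0:
            # The query coincides with a known point: return its value directly.
            return value
        weight = 1.0 / dist ** power  # closer points get larger weights
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total

# Two known points, query exactly halfway between them.
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint_value = idw_interpolate(samples, (1.0, 0.0))
```

Every query point is independent of the others, which is presumably why the algorithm is a natural candidate for one-thread-per-point GPU parallelization.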

CUDA

Sep, 30

Accelerating Foreign-Key Joins using Asymmetric Memory Channels

Indexed Foreign-Key Joins expose a very asymmetric access pattern: the Foreign-Key Index is scanned sequentially, whilst the Primary-Key table is the target of many quasi-random lookups, which are the dominant cost factor. To reduce the cost of the random lookups, the fact table can be (re-)partitioned at runtime to increase access locality on the dimension table, […]

OpenCL

Sep, 30

Accelerating data mining workloads: current approaches and future challenges in system architecture design

Conventional systems based on general-purpose processors cannot keep pace with the exponential increase in the generation and collection of data. It is therefore important to explore alternative architectures that can provide the computational capabilities required to analyze ever-growing datasets. Programmable graphics processing units (GPUs) offer computational capabilities that surpass even high-end multi-core central processing units […]

CUDA

Sep, 30

A Polyphase Filter For GPUs And Multi-Core Processors

Radio astronomy is a subfield of astronomy that studies celestial objects at radio frequencies. Unlike visible light, these radio signals are not blocked by earth’s atmosphere, making it possible to detect them from the ground. Radio emissions have been observed from a number of celestial bodies, including stars and galaxies. Some celestial bodies that can […]

OpenCL

Sep, 30

Adding special-purpose processor support to the Erlang VM

This thesis investigates the possibility of extending the Erlang runtime system so that it can take advantage of special-purpose compute units, such as GPUs and DSPs. Furthermore, it investigates whether certain parts of an Erlang system can be accelerated with the help of these devices.

OpenCL

Sep, 29

Many-threaded implementation of differential evolution for the CUDA platform

Differential evolution is an efficient populational meta-heuristic optimization algorithm successful in solving difficult real-world problems. Due to the simplicity of its operations and data structures, it is suitable for a parallel implementation on multicore systems and on the GPU. In this paper, we design a simple yet highly parallel implementation of the […]
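To make the "simplicity of its operations" concrete, here is a minimal sequential sketch of the classic DE/rand/1/bin variant in Python. It is not the many-threaded CUDA design from the paper; the function names, parameter defaults (`f`, `cr`, `pop_size`), and the sphere test function are all assumptions chosen for illustration.

```python
import random

def differential_evolution(fitness, dim, bounds, pop_size=20,
                           f=0.5, cr=0.9, generations=200, seed=0):
    """Minimize `fitness` with a basic DE/rand/1/bin loop (illustrative only)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pop = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(pop_size)]
    scores = [fitness(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one gene comes from the mutant
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    gene = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    gene = min(max(gene, lo), hi)  # clamp to the search bounds
                else:
                    gene = pop[i][d]
                trial.append(gene)
            trial_score = fitness(trial)
            # Greedy selection: a trial replaces its parent only if it is no worse.
            # Note: this sketch updates the population in place within a generation.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]

def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(v * v for v in x)

best_point, best_score = differential_evolution(sphere, dim=3, bounds=(-5.0, 5.0))
```

Each individual's trial vector depends only on a few other population members and a fitness evaluation, which is the kind of per-individual independence that a GPU implementation could assign to separate threads.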

CUDA

Sep, 29

Active thread compaction for GPU path tracing

Modern GPUs like NVIDIA’s Fermi internally operate in a SIMD manner by ganging multiple (32) scalar threads together into SIMD warps; if a warp’s threads diverge, the warp serially executes both branches, temporarily disabling threads that are not on that path. In this paper, we explore and thoroughly analyze the concept of active thread compaction, i.e., […]
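The cost of sparsely populated warps can be illustrated with a small CPU-side model in Python. The abstract's own definition of active thread compaction is truncated, so this sketch assumes the generic idea of packing still-active threads into contiguous slots; the helper names and the example activity pattern are hypothetical, and only the warp size of 32 comes from the text.

```python
WARP_SIZE = 32  # threads per warp, as stated in the abstract

def warps_needed(active_flags):
    """Count warps that contain at least one active thread and so must still run."""
    count = 0
    for start in range(0, len(active_flags), WARP_SIZE):
        if any(active_flags[start:start + WARP_SIZE]):
            count += 1
    return count

def compact(active_flags):
    """Return the indices of active threads, packed contiguously."""
    return [i for i, alive in enumerate(active_flags) if alive]

def warps_after_compaction(active_flags):
    """Warps needed once the active threads are packed into full warps."""
    active_count = len(compact(active_flags))
    return (active_count + WARP_SIZE - 1) // WARP_SIZE

# Hypothetical pattern: 128 threads, every fourth one still has work to do.
flags = [i % 4 == 0 for i in range(128)]
before = warps_needed(flags)            # all four warps hold some active thread
after = warps_after_compaction(flags)   # the 32 survivors fit in a single warp
```

In this model the same 32 active threads occupy four mostly idle warps before packing and one full warp afterwards; the model ignores the cost of the compaction step itself.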

CUDA

Sep, 29

Evolving CUDA PTX programs by quantum inspired linear genetic programming

The tremendous computing power of Graphics Processing Units (GPUs) can be used to accelerate the evolution process in Genetic Programming (GP). The automatic generation of code using the GPU usually follows one of two approaches: compiling each evolved program or interpreting multiple programs. Both approaches, however, have performance drawbacks. In this work, we propose a novel approach […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
