high performance computing on graphics processing units: hgpu.org

Posts

Sep, 30

Stack-less SIMT reconvergence at low cost

Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, […]

Sep, 30

A PTX Code Generator for LLVM

Today’s GPGPU architectures and corresponding high-level programming languages like CUDA replace the traditionally restricted GPU pipelines. Proprietary compilers translate these languages into native GPU assembly. Unfortunately, these compilers are non-customizable and restricted to static compilation. High-performance applications currently require particular manual optimizations. To overcome these cumbersome manual optimizations, this thesis develops […]

CUDA

Sep, 30

Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

For some classes of problems, NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well-known solution for the common case of linear image filtering - small fixed […]

CUDA

Sep, 30

CBench: Analyzing Compute Performance for Modern NVIDIA and AMD GPUs

General-purpose GPU computation is a fast-growing field with a variety of applications. For maximum performance, though, mapping high-level parallel algorithms to vendor hardware requires a solid grasp of both the algorithm’s computational requirements and the microarchitectural limitations of the GPU. This work aims to explore the performance of high and low arithmetic intensity […]

CUDA, OpenCL

Sep, 30

FATSEA-An Architectural Simulator for General Purpose Computing on GPUs

We present FATSEA, a functional and performance evaluation simulator written in C++ to handle kernels written in the CUDA programming language and aimed at GPGPU computing. FATSEA takes Parallel Thread eXecution (PTX) code as input, which is a device-independent code format generated by the NVIDIA CUDA compiler, to validate results and estimate performance […]

CUDA

Sep, 30

Translating GPU binaries to tiered SIMD architectures with Ocelot

Parallel Thread Execution ISA (PTX) is a virtual instruction set used by NVIDIA GPUs that explicitly expresses hierarchical MIMD and SIMD style parallelism in an application. In such a programming model, the programmer and compiler are left with the not trivial, but not impossible, task of composing applications from parallel algorithms and data structures. Once […]

Sep, 30

Accelerating Geospatial Analysis on GPUs using CUDA

Inverse distance weighting (IDW) interpolation and viewshed are two popular algorithms for geospatial analysis. IDW interpolation assigns geographical values to unknown spatial points by using values from a usually scattered set of known points, and viewshed identifies the cells in a spatial raster that can be seen by observers. Although the implementations of both algorithms […]

CUDA
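
The IDW interpolation step described in the abstract above can be illustrated with a minimal sequential sketch. This is not the paper's CUDA implementation; the function name `idw_interpolate`, the `power` parameter, and the sample points are assumptions introduced only for illustration.

```python
import math


def idw_interpolate(known_points, query, power=2.0):
    """Estimate the value at `query` from scattered known points.

    known_points: list of (x, y, value) tuples (hypothetical layout).
    query: (x, y) location of the unknown point.
    power: distance exponent; 2.0 is a common choice, assumed here.
    """
    if not known_points:
        raise ValueError("at least one known point is required")
    qx, qy = query
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in known_points:
        dist = math.hypot(x - qx, y - qy)
        if dist == 0.0:
            # The query coincides with a known point: return its value directly.
            return value
        weight = 1.0 / dist ** power  # closer points get larger weights
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0)]
midpoint_value = idw_interpolate(samples, (1.0, 0.0))
```

Each query point is independent of every other, which is presumably why the algorithm maps well to one GPU thread per output cell.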

Sep, 30

Accelerating Foreign-Key Joins using Asymmetric Memory Channels

Indexed Foreign-Key Joins expose a very asymmetric access pattern: the Foreign-Key Index is sequentially scanned whilst the Primary-Key table is the target of many quasi-random lookups, which are the dominant cost factor. To reduce the cost of the random lookups, the fact table can be (re-)partitioned at runtime to increase access locality on the dimension table, […]

OpenCL

Sep, 30

Accelerating data mining workloads: current approaches and future challenges in system architecture design

Conventional systems based on general-purpose processors cannot keep pace with the exponential increase in the generation and collection of data. It is therefore important to explore alternative architectures that can provide the computational capabilities required to analyze ever-growing datasets. Programmable graphics processing units (GPUs) offer computational capabilities that surpass even high-end multi-core central processing units […]

CUDA

Sep, 30

A Polyphase Filter For GPUs And Multi-Core Processors

Radio astronomy is a subfield of astronomy that studies celestial objects at radio frequencies. Unlike visible light, these radio signals are not blocked by Earth’s atmosphere, making it possible to detect them from the ground. Radio emissions have been observed from a number of celestial bodies, including stars and galaxies. Some celestial bodies that can […]

OpenCL

Sep, 30

Adding special-purpose processor support to the Erlang VM

This thesis investigates the possibility of extending the Erlang runtime system so that it can take advantage of special-purpose compute units, such as GPUs and DSPs. Furthermore, it investigates whether certain parts of an Erlang system can be accelerated with the help of these devices.

OpenCL

Sep, 29

Many-threaded implementation of differential evolution for the CUDA platform

Differential evolution is an efficient population-based metaheuristic optimization algorithm that is successful in solving difficult real-world problems. Due to the simplicity of its operations and data structures, it is suitable for a parallel implementation on multicore systems and on the GPU. In this paper, we design a simple yet highly parallel implementation of the […]

CUDA
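
To make the "simplicity of its operations and data structures" concrete, here is a minimal sequential sketch of one common differential evolution variant (often called DE/rand/1/bin). It is not the paper's many-threaded CUDA design; the function name, the parameter defaults (`f`, `cr`, `pop_size`, `generations`), and the sphere objective are assumptions chosen for illustration.

```python
import random


def differential_evolution(objective, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=200, seed=0):
    """Minimize `objective` over box `bounds` with a DE/rand/1/bin sketch."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Population: a flat list of candidate vectors, initialized uniformly.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct individuals, all different from i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # guarantees one mutated coordinate
            trial = []
            for d in range(dim):
                if d == forced or rng.random() < cr:
                    lo, hi = bounds[d]
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    trial.append(min(max(value, lo), hi))  # clamp to bounds
                else:
                    trial.append(pop[i][d])
            trial_score = objective(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def sphere(vec):
    return sum(v * v for v in vec)


best_vec, best_score = differential_evolution(sphere, [(-5.0, 5.0)] * 2)
```

Note that this sketch replaces individuals in place within a generation; a GPU version would more likely evaluate all trial vectors of a generation in parallel before selection.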

Recent source codes

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler