high performance computing on graphics processing units: hgpu.org

Posts

Oct, 1

A Comprehensive Performance Comparison of CUDA and OpenCL

This paper presents a comprehensive performance comparison between CUDA and OpenCL. We have selected 16 benchmarks ranging from synthetic applications to real-world ones. We make an extensive analysis of the performance gaps taking into account programming models, optimization strategies, architectural details, and underlying compilers. Our results show that, for most applications, CUDA performs at most […]

CUDA • OpenCL

Oct, 1

Accelerating Vector Calculations on GPU

Multicore computational accelerators such as Graphics Processing Units (GPUs) have become common for high-performance computing at larger scale. Programming GPUs requires detailed knowledge of the underlying architecture in order to get maximum performance. In this paper we present a solution for vector distance calculation on NVIDIA’s parallel computing architecture CUDA (Compute Unified Device Architecture), where […]

CUDA
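
The truncated abstract above does not say which distance is computed or how the kernel is organised. As a rough illustration only, the sketch below assumes a Euclidean distance and mimics, in plain Python on the CPU, the map-then-tree-reduce pattern that GPU implementations commonly use. The function names `tree_reduce_sum` and `euclidean_distance` are hypothetical and are not taken from the paper.

```python
import math


def tree_reduce_sum(values):
    """Sum values by repeated pairwise halving.

    Each inner-loop iteration stands in for one GPU thread adding a
    partner element, as in a typical shared-memory reduction.
    """
    data = list(values)
    n = len(data)
    while n > 1:
        half = (n + 1) // 2          # size of the surviving lower part
        for i in range(n // 2):      # one "thread" per pair
            data[i] += data[i + half]
        n = half
    return data[0] if data else 0.0


def euclidean_distance(a, b):
    """Assumed metric: Euclidean distance between equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Map step: independent per-element work (one element per thread on a GPU).
    squared = [(x - y) ** 2 for x, y in zip(a, b)]
    # Reduce step: combine the partial results.
    return math.sqrt(tree_reduce_sum(squared))
```

For example, `euclidean_distance([0, 0], [3, 4])` evaluates to `5.0`. On a real GPU the map and reduce steps would run in parallel; here they run sequentially and only show the data-access pattern.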

Oct, 1

Large Scale DNA Sequence Alignment and Kernel Method Implemented with GPUs

Large-scale DNA sequence alignment and the kernel method in molecular biology play critical roles in bioinformatics. Both are successfully implemented on the Brook+ platform with AMD’s GPUs. Targeting the characteristics of graphics stream processors, we propose internal and external approaches that work cooperatively to improve the performance of the two algorithms. The experiments show […]

Oct, 1

Interactive Soft Tissue for Surgical Simulation

Medical simulation has the potential to revolutionise the training of medical practitioners. Advantages include reduced risk to patients, increased access to rare scenarios and virtually unlimited repeatability. However, in order to fulfil its potential, medical simulators require techniques to provide realistic user interaction with the simulated patient. Specifically, compelling real-time simulations that allow the trainee […]

CUDA

Oct, 1

Image registration on GPU

Image registration is a fundamental step in many applications involving image analysis. It consists of optimizing a similarity metric to find a spatial transformation that matches two images (in 3D). In medical imaging it is used to build atlases (by registering a population) or to align a patient to a template to detect pathologies. The main […]

OpenCL
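
The image registration abstract above describes the task as optimizing a similarity metric over spatial transformations. A minimal sketch of that idea follows. It rests on several simplifying assumptions that are not from the paper: 2D images instead of 3D, integer translations as the only transformation, mean squared difference as the metric, and exhaustive search on the CPU as the optimizer. The names `mean_squared_difference` and `register_translation` are hypothetical.

```python
from itertools import product


def mean_squared_difference(fixed, moving, dx, dy):
    """Cost of shifting `moving` by (dx, dy), averaged over the overlap.

    Images are lists of rows of numbers with identical dimensions.
    """
    h, w = len(fixed), len(fixed[0])
    total, count = 0.0, 0
    for y in range(h):
        for x in range(w):
            sx, sy = x - dx, y - dy  # pixel of `moving` that lands on (x, y)
            if 0 <= sx < w and 0 <= sy < h:
                diff = fixed[y][x] - moving[sy][sx]
                total += diff * diff
                count += 1
    # No overlap at all is treated as the worst possible match.
    return total / count if count else float("inf")


def register_translation(fixed, moving, max_shift=3):
    """Try every integer shift up to max_shift and return the best (dx, dy)."""
    candidates = product(range(-max_shift, max_shift + 1), repeat=2)
    return min(
        candidates,
        key=lambda s: mean_squared_difference(fixed, moving, s[0], s[1]),
    )
```

Each candidate shift is scored independently of the others, which is the kind of work that maps naturally onto GPU threads. Practical registration typically uses richer transformations and iterative optimizers rather than brute-force search.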

Sep, 30

Exploring The Latency and Bandwidth Tolerance of CUDA Applications

CUDA applications represent a new body of parallel programs. Although several paradigms exist for programming distributed systems and many-core processors, many users struggle to achieve a program that is scalable across systems with different hardware characteristics. This paper explores the scalability of CUDA applications on systems with varying interconnect latencies, hiding a hardware detail from […]

CUDA

Sep, 30

Architecture-Aware Mapping and Optimization on Heterogeneous Computing Systems

The emergence of scientific applications embedded with multiple modes of parallelism has made heterogeneous computing systems indispensable in high performance computing. The popularity of such systems is evident from the fact that three out of the top five fastest supercomputers in the world employ heterogeneous computing, i.e., they use dissimilar computational units. A closer look […]

CUDA • OpenCL

Sep, 30

Real-Time Handling of GPU Interrupts in LITMUS^RT

Graphics processing units (GPUs) are becoming increasingly important in today’s platforms as their increased generality allows for them to be used as powerful co-processors. However, unlike standard CPUs, GPUs are treated as I/O devices and require the use of interrupts to facilitate communication with the CPU. Interrupts cause delays in the execution of real-time tasks, […]

CUDA

Sep, 30

Enhancing Data Locality for Dynamic Simulations through Asynchronous Data Transformations and Adaptive Control

Many dynamic simulation programs contain complex, irregular memory reference patterns, and require runtime optimizations to enhance data locality. Current approaches periodically stop the execution of an application to reorder the computation or data based on the current program state to improve the data locality for the next period of execution. In this work, we examine […]

CUDA

Sep, 30

Stack-less SIMT reconvergence at low cost

Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, […]

Sep, 30

A PTX Code Generator for LLVM

Today’s GPGPU architectures and corresponding high-level programming languages like CUDA replace the traditionally restricted GPU pipelines. Proprietary compilers translate these languages into native GPU assembly. Unfortunately, these compilers are non-customizable and restricted to static compilation. High-performance applications currently require particular manual optimizations. To overcome these cumbersome manual optimizations, this thesis develops […]

CUDA

Sep, 30

Adaptable Two-Dimension Sliding Windows on NVIDIA GPUs with Runtime Compilation

For some classes of problems, NVIDIA CUDA abstraction and hardware properties combine with problem characteristics to limit the specific problem instances that can be effectively accelerated. As a real-world example, a two-dimensional correlation-based template-matching MATLAB application is considered. While this problem has a well-known solution for the common case of linear image filtering, small fixed […]

CUDA

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

gpu_tracker: Python package for tracking and profiling GPU utilization in both desktop and high-performance computing environments

* * *

Recent source codes

QArray

Celerity: High-level C++ for Accelerator Clusters

CIFAR-10 Airbench: 94% on CIFAR-10 in 3.29 seconds

gpu_tracker: Context manager and CLI that tracks the computational-resource-usage of a code block or shell command, particularly the GPU usage

LOOPer: a polyhedral compiler for expressing fast and portable data parallel algorithms

OpenMC Monte Carlo Code

Polygeist: C/C++ frontend for MLIR

Parallel Gaussian process with kernel approximation in CUDA

Optical flow algorithms for SYCL

OpenMP5-Offload-OpenMC-Intel-PVC
