
Posts

Oct 6

Parallelisation of Java for Graphics Processors

The aim of the project was to allow extraction and compilation of Java virtual machine bytecode for parallel execution on graphics cards, specifically under the NVIDIA CUDA framework, by both explicit and automatic means. The resulting compiler successfully extracts and compiles code from class files into CUDA C++, and outputs transformed classes that […]
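As a rough sketch of what such a translation produces, consider a simple data-parallel Java loop lowered to a CUDA C++ kernel. The code below is a hypothetical illustration of the general bytecode-to-GPU mapping, not output from the project's compiler:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Hypothetical sketch: the kind of CUDA C++ a bytecode-to-GPU compiler
    // might emit for a data-parallel Java loop such as
    //   for (int i = 0; i < n; i++) out[i] = a[i] * b[i];
    __global__ void mapMultiply(const float *a, const float *b, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
        if (i < n) out[i] = a[i] * b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *out;
        cudaMallocManaged(&a, n * sizeof(float));   // unified memory keeps the
        cudaMallocManaged(&b, n * sizeof(float));   // sketch short; a compiler
        cudaMallocManaged(&out, n * sizeof(float)); // would manage copies itself
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }
        mapMultiply<<<(n + 255) / 256, 256>>>(a, b, out, n);
        cudaDeviceSynchronize();
        printf("out[0] = %f\n", out[0]);            // expect 2.0
        cudaFree(a); cudaFree(b); cudaFree(out);
        return 0;
    }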
Oct 6

Multi-core programming with OpenCL: performance and portability: OpenCL in a memory bound scenario

With the advent of multi-core processors, desktop computers have become multiprocessors that require parallel programming to be utilized efficiently. Efficient and portable parallel programming of future multi-core processors and GPUs is one of today’s most important challenges within computer science. Okuda Laboratory at The University of Tokyo in Japan focuses on solving engineering challenges with parallel […]
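To make "memory bound" concrete: in a kernel like the copy below, each thread performs one load and one store and essentially no arithmetic, so runtime is set by memory bandwidth rather than compute throughput. A minimal CUDA sketch that measures effective bandwidth (the paper itself uses OpenCL):

    #include <cstdio>
    #include <cuda_runtime.h>

    // A memory-bound kernel: one load and one store per element, no math.
    __global__ void copyKernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = in[i];
    }

    int main() {
        const int n = 1 << 24;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));

        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);
        cudaEventRecord(start);
        copyKernel<<<(n + 255) / 256, 256>>>(in, out, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double gb = 2.0 * n * sizeof(float) / 1e9;  // read + write of n floats
        printf("effective bandwidth: %.1f GB/s\n", gb / (ms / 1e3));
        cudaFree(in); cudaFree(out);
        return 0;
    }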
Oct 6

Python for Development of OpenMP and CUDA Kernels for Multidimensional Data

Design of data structures for high performance computing (HPC) is one of the principal challenges facing researchers looking to utilize heterogeneous computing machinery. Heterogeneous systems derive cost, power, and speed efficiency by being composed of the appropriate hardware for the task. Yet, each type of processor requires a specific organization of the application state in […]
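The data-organization problem the abstract refers to can be seen in miniature with particle data: a structure-of-arrays (SoA) layout lets consecutive GPU threads read consecutive addresses (coalesced), while an array-of-structures layout strides them apart. A small illustrative CUDA sketch with hypothetical types, not from the paper:

    #include <cstdio>
    #include <cuda_runtime.h>

    struct ParticlesSoA {   // GPU-friendly: one array per field
        float *x, *y, *z;
    };

    __global__ void shiftX(ParticlesSoA p, float dx, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) p.x[i] += dx;   // thread i touches p.x[i]: fully coalesced
    }

    int main() {
        const int n = 1 << 20;
        ParticlesSoA p;
        cudaMallocManaged(&p.x, n * sizeof(float));
        cudaMallocManaged(&p.y, n * sizeof(float));
        cudaMallocManaged(&p.z, n * sizeof(float));
        for (int i = 0; i < n; i++) p.x[i] = 0.0f;
        shiftX<<<(n + 255) / 256, 256>>>(p, 1.5f, n);
        cudaDeviceSynchronize();
        printf("p.x[42] = %f\n", p.x[42]);   // expect 1.5
        return 0;
    }

A CPU, by contrast, often prefers the array-of-structures form for cache locality within a single particle, which is exactly why each processor type wants its own organization of the application state.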
Oct 6

Accelerating a climate physics model with OpenCL

Open Computing Language (OpenCL) is fast becoming the standard for heterogeneous parallel computing. It is designed to run on CPUs, GPUs, and other accelerator architectures. By implementing a real-world application, a solar radiation model component widely used in climate and weather models, we show that the OpenCL multi-threaded programming and execution model can dramatically increase […]
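A common pattern for radiation components of this kind, sketched below in CUDA for brevity (the paper uses OpenCL), is one thread per vertical atmospheric column: columns are independent, while levels within a column are swept sequentially. The physics here is a stand-in:

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void columnSweep(const float *flux_in, float *heating,
                                int ncols, int nlevels) {
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (col >= ncols) return;
        for (int k = 0; k < nlevels; k++) {      // serial sweep down the column
            int idx = k * ncols + col;           // level-major layout: coalesced
            heating[idx] = 0.1f * flux_in[idx];  // stand-in for the real physics
        }
    }

    int main() {
        const int ncols = 8192, nlevels = 60, n = ncols * nlevels;
        float *flux, *heat;
        cudaMallocManaged(&flux, n * sizeof(float));
        cudaMallocManaged(&heat, n * sizeof(float));
        for (int i = 0; i < n; i++) flux[i] = 1.0f;
        columnSweep<<<(ncols + 255) / 256, 256>>>(flux, heat, ncols, nlevels);
        cudaDeviceSynchronize();
        printf("heat[0] = %f\n", heat[0]);
        return 0;
    }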
Oct 6

Static GPU threads and an improved scan algorithm

Current GPU programming systems automatically distribute the work on all GPU processors based on a set of fixed assumptions, e.g. that all tasks are independent of each other. We show that automatic distribution limits algorithmic design, and demonstrate that manual work distribution hardly adds any overhead. Our Scan+ algorithm is an improved scan relying on manual […]
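The flavour of manual work distribution can be shown with a fixed-size grid that strides over the data, instead of launching one thread per element and letting the runtime distribute the work. This is only an illustrative reduction, not the paper's Scan+ algorithm:

    #include <cstdio>
    #include <cuda_runtime.h>

    // A fixed grid walks the data in grid-sized strides: the programmer, not
    // the runtime, decides how work maps onto the (static) set of threads.
    __global__ void staticStrideSum(const float *in, float *partial, int n) {
        float acc = 0.0f;
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)   // manual, static work distribution
            acc += in[i];
        atomicAdd(partial, acc);            // float atomics need sm_20 or newer
    }

    int main() {
        const int n = 1 << 22;
        float *in, *sum;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&sum, sizeof(float));
        for (int i = 0; i < n; i++) in[i] = 1.0f;
        *sum = 0.0f;
        staticStrideSum<<<64, 256>>>(in, sum, n);  // fixed grid, independent of n
        cudaDeviceSynchronize();
        printf("sum = %.0f\n", *sum);              // expect 4194304
        return 0;
    }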
Oct 6

GPU-based single-cluster algorithm for the simulation of the Ising model

We present the GPU calculation with the compute unified device architecture (CUDA) for the Wolff single-cluster algorithm of the Ising model. Proposing an algorithm for quasi-block synchronization, we realize the Wolff single-cluster Monte Carlo simulation with CUDA. We perform parallel computations for the newly added spins in the growing cluster. As a result, the […]
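As a loose illustration of the parallel step described here, the hypothetical kernel below examines all spins added to the cluster in the previous step in parallel and probabilistically attaches their aligned neighbours; the paper's quasi-block synchronization and its random number generation are not reproduced (a toy hash stands in for the RNG):

    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ float hashRand(unsigned int seed) {   // toy stand-in RNG
        seed = seed * 1664525u + 1013904223u;
        return (seed & 0xFFFFFF) / 16777216.0f;
    }

    // One growth step: each thread takes one frontier spin and tries to add
    // its aligned neighbours with bond probability p. The host would gather
    // the newly added spins into the next frontier between launches.
    __global__ void growFrontier(const int *spin, int *inCluster,
                                 const int *frontier, int frontierSize,
                                 int L, float p, unsigned int step) {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= frontierSize) return;
        int s = frontier[t];
        int x = s % L, y = s / L;
        int nbr[4] = { y * L + (x + 1) % L, y * L + (x + L - 1) % L,
                       ((y + 1) % L) * L + x, ((y + L - 1) % L) * L + x };
        for (int k = 0; k < 4; k++) {
            int m = nbr[k];
            if (spin[m] == spin[s] && hashRand(step ^ (s * 4 + k)) < p)
                atomicExch(&inCluster[m], 1);   // add aligned neighbour
        }
    }

    int main() {
        const int L = 64, N = L * L;
        int *spin, *inCluster, *frontier;
        cudaMallocManaged(&spin, N * sizeof(int));
        cudaMallocManaged(&inCluster, N * sizeof(int));
        cudaMallocManaged(&frontier, sizeof(int));
        for (int i = 0; i < N; i++) { spin[i] = 1; inCluster[i] = 0; }
        frontier[0] = N / 2;  inCluster[N / 2] = 1;      // seed spin
        growFrontier<<<1, 32>>>(spin, inCluster, frontier, 1, L, 0.6f, 1234u);
        cudaDeviceSynchronize();
        int added = 0;
        for (int i = 0; i < N; i++) added += inCluster[i];
        printf("cluster size after one step: %d\n", added);
        return 0;
    }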
Oct 6

Connected-component identification and cluster update on graphics processing units

Cluster identification tasks occur in a multitude of contexts in physics and engineering such as, for instance, cluster algorithms for simulating spin models, percolation simulations, segmentation problems in image processing, or network analysis. While it has been shown that graphics processing units (GPUs) can result in speedups of two to three orders of magnitude as […]
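One widely used GPU approach to this task is iterative label propagation: each site starts with a unique label and repeatedly adopts the smallest label among same-state neighbours until a full pass makes no change. A minimal CUDA sketch of that general technique (not the paper's implementation):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void propagate(const int *state, int *label, int *changed, int L) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= L || y >= L) return;
        int i = y * L + x, best = label[i];
        if (x > 0     && state[i - 1] == state[i]) best = min(best, label[i - 1]);
        if (x < L - 1 && state[i + 1] == state[i]) best = min(best, label[i + 1]);
        if (y > 0     && state[i - L] == state[i]) best = min(best, label[i - L]);
        if (y < L - 1 && state[i + L] == state[i]) best = min(best, label[i + L]);
        if (best < label[i]) { label[i] = best; *changed = 1; }
    }

    int main() {
        const int L = 256, N = L * L;
        int *state, *label, *changed;
        cudaMallocManaged(&state, N * sizeof(int));
        cudaMallocManaged(&label, N * sizeof(int));
        cudaMallocManaged(&changed, sizeof(int));
        for (int i = 0; i < N; i++) { state[i] = 0; label[i] = i; }  // one cluster
        dim3 block(16, 16), grid(L / 16, L / 16);
        do {                                   // iterate until a pass is quiet
            *changed = 0;
            propagate<<<grid, block>>>(state, label, changed, L);
            cudaDeviceSynchronize();
        } while (*changed);
        printf("label of last site: %d\n", label[N - 1]);   // expect 0
        return 0;
    }

Labels only ever decrease, so concurrent reads of stale values are harmless; any late change simply triggers another pass.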
Oct 5

Profiling Heterogeneous Multi-GPU Systems to Accelerate Cortically Inspired Learning Algorithms

Recent advances in neuroscientific understanding make parallel computing devices modeled after the human neocortex a plausible, attractive, fault-tolerant, and energy-efficient possibility. Such attributes have once again sparked an interest in creating learning algorithms that aspire to reverse-engineer many of the abilities of the brain. In this paper we describe a GPGPU-accelerated extension to an intelligent […]
Oct 5

Democratic Population Decisions Result in Robust Policy-Gradient Learning: A Parametric Study with GPU Simulations

High performance computing on the Graphics Processing Unit (GPU) is an emerging field driven by the promise of high computational power at a low cost. However, GPU programming is a non-trivial task, and architectural limitations raise the question of whether investing effort in this direction is worthwhile. In this work, we use GPU […]
Oct 5

Performance Analysis and Optimisation of the OP2 Framework on Many-core Architectures

This paper presents a benchmarking, performance analysis and optimisation study of the OP2 "active" library, which provides an abstraction framework for the parallel execution of unstructured mesh applications. OP2 aims to decouple the scientific specification of the application from its parallel implementation, and thereby achieve code longevity and near-optimal performance through re-targeting the application to […]
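The defining feature of such unstructured-mesh loops is indirection: per-edge work gathers data from endpoint nodes through a mapping array, which is what makes the access pattern irregular and hard to optimise automatically. The sketch below shows that access pattern in plain CUDA; it deliberately does not use the OP2 API:

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    // Each edge reads its two endpoint nodes through an indirection map.
    __global__ void edgeLoop(const int *edge2node, const float *nodeVal,
                             float *edgeFlux, int nEdges) {
        int e = blockIdx.x * blockDim.x + threadIdx.x;
        if (e >= nEdges) return;
        int a = edge2node[2 * e], b = edge2node[2 * e + 1];   // indirection
        edgeFlux[e] = nodeVal[b] - nodeVal[a];                // per-edge kernel
    }

    int main() {
        const int nNodes = 5, nEdges = 4;
        int   h_map[2 * nEdges] = {0,1, 1,2, 2,3, 3,4};  // a simple chain mesh
        float h_val[nNodes]     = {0, 1, 4, 9, 16};
        int *map; float *val, *flux;
        cudaMallocManaged(&map, sizeof(h_map));
        cudaMallocManaged(&val, sizeof(h_val));
        cudaMallocManaged(&flux, nEdges * sizeof(float));
        memcpy(map, h_map, sizeof(h_map));
        memcpy(val, h_val, sizeof(h_val));
        edgeLoop<<<1, 32>>>(map, val, flux, nEdges);
        cudaDeviceSynchronize();
        for (int e = 0; e < nEdges; e++) printf("flux[%d] = %.0f\n", e, flux[e]);
        return 0;
    }

An abstraction layer like OP2 takes the mesh, the mapping, and the per-edge kernel as its specification and is then free to retarget this loop to CUDA, OpenMP, or MPI backends.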
Oct 5

GPU accelerated 2-D staggered-grid finite difference seismic modelling

The staggered-grid finite difference (FD) method demands significant computational capability and is inefficient for seismic wave modelling in 2-D viscoelastic media on a single PC. To improve computation speed, a graphics processing unit (GPU) accelerated method was proposed, since modern GPUs have become ubiquitous in desktop computers and offer excellent parallelism at a low cost-to-performance ratio. The […]
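For orientation, a staggered grid places velocities at half-offset positions between pressure nodes, so each velocity update differences two adjacent pressure values. A minimal second-order acoustic sketch in CUDA (the paper's scheme is viscoelastic and considerably more elaborate):

    #include <cstdio>
    #include <cuda_runtime.h>

    // One half of a leapfrog step: update horizontal velocity from the
    // pressure gradient. dt_rho_dx lumps dt / (rho * dx) into one factor.
    __global__ void updateVx(float *vx, const float *p, int nx, int nz,
                             float dt_rho_dx) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int k = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= nx - 1 || k >= nz) return;
        int idx = k * nx + i;
        // the velocity node sits between p[i] and p[i+1] on the staggered grid
        vx[idx] -= dt_rho_dx * (p[idx + 1] - p[idx]);
    }

    int main() {
        const int nx = 128, nz = 128, n = nx * nz;
        float *vx, *p;
        cudaMallocManaged(&vx, n * sizeof(float));
        cudaMallocManaged(&p, n * sizeof(float));
        for (int i = 0; i < n; i++) { vx[i] = 0.0f; p[i] = 0.0f; }
        p[(nz / 2) * nx + nx / 2] = 1.0f;                 // point source
        dim3 block(16, 16), grid(nx / 16, nz / 16);
        updateVx<<<grid, block>>>(vx, p, nx, nz, 0.5f);
        cudaDeviceSynchronize();
        printf("vx next to source: %f\n", vx[(nz / 2) * nx + nx / 2]);
        return 0;
    }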
Oct 5

Applying software-managed caching and CPU/GPU task scheduling for accelerating dynamic workloads

In this talk we address two problems frequently encountered by GPU developers: optimizing memory access for kernels with complex input-dependent access patterns, and mapping the computations to a GPU or a CPU in composite applications with multiple dependent kernels. Both require dynamic adaptation and tuning of execution policies to allow high performance for a wide […]
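The simplest form of software-managed caching on a GPU is staging reused data into shared memory once per block, then serving all subsequent reads from there. The sketch below illustrates the idea on a trivial stencil; the talk targets far harder, input-dependent access patterns:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Block size is assumed to be 256 throughout this sketch.
    __global__ void blurTiled(const float *in, float *out, int n) {
        __shared__ float tile[258];               // 256 elements + 2 halo cells
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;
        tile[t] = (i < n) ? in[i] : 0.0f;         // each thread stages one element
        if (threadIdx.x == 0)   tile[0]   = (i > 0)     ? in[i - 1] : 0.0f;
        if (threadIdx.x == 255) tile[257] = (i + 1 < n) ? in[i + 1] : 0.0f;
        __syncthreads();                          // tile is now the software cache
        if (i < n) out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
    }

    int main() {
        const int n = 1 << 20;
        float *in, *out;
        cudaMallocManaged(&in, n * sizeof(float));
        cudaMallocManaged(&out, n * sizeof(float));
        for (int i = 0; i < n; i++) in[i] = 3.0f;
        blurTiled<<<n / 256, 256>>>(in, out, n);
        cudaDeviceSynchronize();
        printf("out[100] = %f\n", out[100]);      // expect 3.0
        return 0;
    }

Each input element is read from global memory once but used three times, which is exactly the reuse a software-managed cache exists to capture.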
