high performance computing on graphics processing units: hgpu.org

Posts

May, 9

Enabling task-level scheduling on heterogeneous platforms

OpenCL is an industry standard for parallel programming on heterogeneous devices. With OpenCL, compute-intensive portions of an application can be offloaded to a variety of processing units within a system. OpenCL is the first standard that focuses on portability, allowing programs to be written once and run seamlessly on multiple, heterogeneous devices, regardless of vendor. […]

OpenCL

May, 9

Divide-and-Conquer 3D Convex Hulls on the GPU

We describe a pure divide-and-conquer parallel algorithm for computing 3D convex hulls. We implement that algorithm on GPU hardware, and find a significant speedup over comparable CPU implementations.

OpenCL

May, 7

iGPU: Exception Support and Speculative Execution on GPUs

Since the introduction of fully programmable vertex shader hardware, GPU computing has made tremendous advances. Exception support and speculative execution are the next steps to expand the scope and improve the usability of GPUs. However, traditional mechanisms to support exceptions and speculative execution are highly intrusive to GPU hardware design. This paper builds on two […]

OpenCL

May, 7

Cross-Platform OpenCL Code and Performance Portability for CPU and GPU Architectures Investigated with a Climate and Weather Physics Model

Current multi- and many-core computing typically involves multi-core Central Processing Units (CPU) and many-core Graphical Processing Units (GPU) whose architectures are distinctly different. To keep longevity of application codes, it is highly desirable to have a programming paradigm to support these current and future architectures. Open Computing Language (OpenCL) is created to address this problem. […]

OpenCL

May, 7

Parallelization of calculations using GPU in optimization approach for macromodels construction

Construction of mathematical models for nonlinear dynamical systems using optimization requires significant computation efforts to solve the optimization task. The most CPU time is required by optimization procedure for goal function calculations, which is repeated many times for different model parameters. This allows to use processors with SIMD architecture of calculation parallelization. The effectiveness of […]

CUDA

May, 7

Implementation of digital down converter in GPU

Giant Metrewave Radio Telescope is undergoing an upgradation. GMRT is mainly used for pulsar, continuum and spectral line observations. Spectral Line observations require more resolution which can be achieved by narrowband mode. Thus to utilize the GMRT correlator resources efficiently and to speed up the further signal processing, Digital Down Converter is of great use. […]

CUDA

May, 7

Implementation and Optimization of Image Processing Algorithms on Embedded GPU

In this paper, we analyze the key factors underlying the implementation, evaluation, and optimization of image processing and computer vision algorithms on embedded GPU using OpenGL ES 2.0 shader model. First, we present the characteristics of the embedded GPU and its inherent advantage when compared to embedded CPU. Additionally, we propose techniques to achieve increased […]

OpenGL

May, 6

Comparison of OpenMP and OpenCL Parallel Processing Technologies

This paper presents a comparison of OpenMP and OpenCL based on the parallel implementation of algorithms from various fields of computer applications. The focus of our study is on the performance of benchmark comparing OpenMP and OpenCL. We observed that OpenCL programming model is a good option for mapping threads on different processing cores. Balancing […]

OpenCL

May, 6

Transparent Accelerator Migration in a Virtualized GPU Environment

This paper presents a framework to support transparent, live migration of virtual GPU accelerators in a virtualized execution environment. Migration is a critical capability in such environments because it provides support for fault tolerance, ondemand system maintenance, resource management, and load balancing in the mapping of virtual to physical GPUs. Techniques to increase responsiveness and […]

OpenCL

May, 6

Effects of Compiler Optimizations in OpenMP to CUDA Translation

One thrust of the OpenMP standard development focuses on support for accelerators. An important question is whether or not OpenMP extensions are needed, and how much performance difference they would make. The same question is relevant for related efforts in support of accelerators, such as OpenACC. The present paper pursues this question. We analyze the […]

CUDA

May, 6

Design of a Hybrid Memory System for General-Purpose Graphics Processing Units

Addressing a limited power budget is a prerequisite for maintaining the growth of computer system performance into and beyond the exascale. Two technologies with the potential to help solve this problem include general-purpose programming on graphics processors and fast non-volatile memories. Combining these technologies could yield devices capable of extreme-scale computation at lower power. The […]

CUDA

May, 6

One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation

As an approach to promoting whole-system synergy on a heterogeneous computing system, compilation of fine-grained SPMD-threaded code (e.g., GPU CUDA code) for multicore CPU has drawn some recent attentions. This paper concentrates on two important sources of inefficiency that limit existing translators. The first is overly strong synchronizations; the second is thread-level partially redundant computations. […]

CUDA

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Enabling task-level scheduling on heterogeneous platforms

Divide-and-Conquer 3D Convex Hulls on the GPU

iGPU: Exception Support and Speculative Execution on GPUs

Cross-Platform OpenCL Code and Performance Portability for CPU and GPU Architectures Investigated with a Climate and Weather Physics Model

Parallelization of calculations using GPU in optimization approach for macromodels construction

Implementation of digital down converter in GPU

Implementation and Optimization of Image Processing Algorithms on Embedded GPU

Comparison of OpenMP and OpenCL Parallel Processing Technologies

Transparent Accelerator Migration in a Virtualized GPU Environment

Effects of Compiler Optimizations in OpenMP to CUDA Translation

Design of a Hybrid Memory System for General-Purpose Graphics Processing Units

One Stone Two Birds: Synchronization Relaxation and Redundancy Removal in GPU-CPU Translation

Recent source codes

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

Most viewed papers (last 30 days)