high performance computing on graphics processing units: hgpu.org

Posts

Nov, 20

A middleware for efficient stream processing in CUDA

This paper presents a middleware capable of out-of-order execution of kernels and data transfers for efficient stream processing in the compute unified device architecture (CUDA). Our middleware runs on CUDA-compatible graphics processing units (GPUs). With the middleware, application developers can easily overlap kernel computation with data transfer between the main memory and […]

CUDA

Nov, 20

Simulating a P system based efficient solution to SAT by using GPUs

P systems are inherently parallel and non-deterministic theoretical computing devices defined inside the field of Membrane Computing. Many P system simulators have been presented in this area, but they are inefficient since they cannot handle the parallelism of these devices. Nowadays, we are witnessing the consolidation of GPUs as a parallel framework to […]

Nov, 20

Simulation of one-layer shallow water systems on multicore and CUDA architectures

The numerical solution of shallow water systems is useful for several applications related to geophysical flows, but the large dimensions of the domains suggest the use of powerful accelerators to obtain numerical results in reasonable times. This paper addresses how to speed up the numerical solution of a first-order well-balanced finite volume scheme for […]

CUDA

Nov, 20

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures – the latest CPU and GPU architectures. […]

Nov, 20

SBLOCK: A Framework for Efficient Stencil-Based PDE Solvers on Multi-core Platforms

We present a new software framework for the implementation of applications that use stencil computations on block-structured grids to solve partial differential equations. A key feature of the framework is the extensive use of automatic source code generation which is used to achieve high performance on a range of leading multi-core processors. Results are presented […]

CUDA

Nov, 20

A GPGPU Transparent Virtualization Component for High Performance Computing Clouds

The GPU Virtualization Service (gVirtuS) presented in this work tries to fill the gap between in-house hosted computing clusters, equipped with GPGPU devices, and pay-for-use high-performance virtual clusters deployed via public or private computing clouds. gVirtuS allows an instanced virtual machine to access GPGPUs in a transparent and hypervisor-independent way, with an overhead […]

Nov, 19

Active Structured Learning for High-Speed Object Detection

High-speed, smooth, and accurate visual tracking of objects in arbitrary, unstructured environments is essential for robotics and human motion analysis. However, building a system that can adapt to arbitrary objects and a wide range of lighting conditions is a challenging problem, especially if hard real-time constraints apply, as in robotics scenarios. In this work, we […]

CUDA

Nov, 19

PantaRay: fast ray-traced occlusion caching of massive scenes

We describe the architecture of a novel system for precomputing sparse directional occlusion caches. These caches are used for accelerating a fast cinematic lighting pipeline that works in the spherical harmonics domain. The system was used as a primary lighting technology in the movie Avatar, and is able to efficiently handle massive scenes of unprecedented […]

Nov, 19

Parallel option pricing with Fourier space time-stepping method on graphics processing units

With the evolution of graphics processing units (GPUs) into powerful and cost-efficient computing architectures, their range of application has expanded tremendously, especially in the area of computational finance. Current research in the area, however, is limited both in terms of the type of options priced and the complexity of stock price models. This paper presents […]

Nov, 19

Simulation of P systems with active membranes on CUDA

P systems, or Membrane Systems, provide a high-level computational modelling framework that combines the structure and dynamic aspects of biological systems in a relevant and understandable way. They are inherently parallel and non-deterministic computing devices. In this article, we discuss the motivation, design principles and key aspects of the implementation of a simulator for the class […]

CUDA

Nov, 19

Parallel hybrid metaheuristics for the flexible job shop problem

A parallel approach to the flexible job shop scheduling problem is presented in this paper. We propose two double-level parallel metaheuristic algorithms based on a new method of neighborhood determination. The algorithms proposed here include two major modules: the machine selection module, executed sequentially, and the operation scheduling module, executed in parallel. On each […]

Nov, 19

Real time ultrasound image denoising

Image denoising is the process of removing the noise that perturbs image analysis methods. In some applications like segmentation or registration, denoising is intended to smooth homogeneous areas while preserving the contours. In many applications like video analysis, visual servoing or image-guided surgical interventions, real-time denoising is required. This paper presents a method for real-time […]

CUDA
