high performance computing on graphics processing units: hgpu.org

Posts

Sep, 5

Efficient Implementation of RLS-Based Adaptive Filters on nVIDIA GeForce Graphics Processing Unit

This paper presents efficient implementation of RLS-based adaptive filters with a large number of taps on nVIDIA GeForce graphics processing unit (GPU) and CUDA software development environment. Modification of the order and the combination of calculations reduces the number of accesses to slow off-chip memory. Assigning tasks into multiple threads also takes memory access order […]

CUDA

Sep, 5

Real-Time Motion Artifact Compensation for PMD-ToF Images

Time-of-Flight (ToF) cameras gained a lot of scientific attention and became a vivid field of research in the last years. A still remaining problem of ToF cameras are motion artifacts in dynamic scenes. This paper presents a new preprocessing method for a fast motion artifact compensation. We introduce a flow-like algorithm that supports motion […]

CUDA

Sep, 5

Work in Progress: Vortex Detection and Visualization for Design of Micro Air Vehicles and Turbomachinery

Vortex detection and visualization is an important technique for computational fluid dynamics (CFD) modelers and analysts. Since vortices are often not just local phenomena, algorithms for detecting the vortex core can be expanded by the use of streamline placement and termination methodologies to appropriately visualize the vortex. We are enhancing an existing VCDetect software tool […]

CUDA

Sep, 5

Transparent CPU-GPU Collaboration for Data-Parallel Kernels on Heterogeneous Systems

Heterogeneous computing on CPUs and GPUs has traditionally used fixed roles for each device: the GPU handles data parallel work by taking advantage of its massive number of cores while the CPU handles non data-parallel work, such as the sequential code or data transfer management. Unfortunately, this work distribution can be a poor solution as […]

OpenCL

Sep, 5

GPU & CPU implementation of Young – Van Vliet’s Recursive Gaussian Smoothing Filter

This document describes an implementation for GPU and CPU of Young and Van Vliet’s recursive Gaussian smoothing as an external module for the Insight Toolkit (ITK), version 4.* (www.itk.org). In the absence of an OpenCL-capable platform, the code will run the CPU implementation as an alternative to the existing Deriche recursive Gaussian smoothing filter in […]

CUDA • OpenCL

Sep, 4

Generation of the Scrambled Halton Sequence Using Accelerators

The Halton sequence is one of the most popular low-discrepancy sequences. In order to satisfy some practical requirements, the original sequence is usually modified in some way. The scrambling algorithm, proposed by Owen, has several theoretical advantages, but on the other hand is difficult to implement in practice due to the trade-off between high memory […]

CUDA

Sep, 4

The discrete dipole approximation code DDscat.C++: features, limitations and plans

We present a new freely available open-source C++ software for numerical solution of the electromagnetic waves absorption and scattering problems within the Discrete Dipole Approximation paradigm. The code is based upon the famous and free Fortran-90 code DDSCAT by B. Draine and P. Flatau. Started as a teaching project, the presented code DDscat.C++ differs from […]

CUDA

Sep, 4

Detecting multiple periodicities in observational data with the multi-frequency periodogram. II. Frequency Decomposer, a parallelized time-series analysis algorithm

This is a parallelized algorithm performing a decomposition of a noisy time series into a number of frequency components. The algorithm analyses all suspicious periodicities that can be revealed, including the ones that look like an alias or noise at a glance, but later may prove to be a real variation. After selection of the […]

CUDA

Sep, 4

Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

With the ease-of-programming, flexibility and yet efficiency, MapReduce has become one of the most popular frameworks for building big-data applications. MapReduce was originally designed for distributed-computing, and has been extended to various architectures, e.g., multi-core CPUs, GPUs and FPGAs. In this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is the […]

Sep, 4

Accelerating a Cloud-Based Software GNSS Receiver

In this paper we discuss ways to reduce the execution time of a software Global Navigation Satellite System (GNSS) receiver that is meant for offline operation in a cloud environment. Client devices record satellite signals they receive, and send them to the cloud, to be processed by this software. The goal of this project is […]

CUDA

Sep, 2

Accurate and Efficient Filtering using Anisotropic Filter Decomposition

Efficient filtering remains an important challenge in computer graphics, particularly when filters are spatially-varying, have large extent, and/or exhibit complex anisotropic profiles. We present an efficient filtering approach for these difficult cases based on anisotropic filter decomposition (IFD). By decomposing complex filters into linear combinations of simpler, displaced isotropic kernels, and precomputing a compact prefiltered […]

CUDA

Sep, 2

Oncilla: A GAS Runtime for Efficient Resource Allocation and Data Movement in Accelerated Clusters

Accelerated and in-core implementations of Big Data applications typically require large amounts of host and accelerator memory as well as efficient mechanisms for transferring data to and from accelerators in heterogeneous clusters. Scheduling for heterogeneous CPU and GPU clusters has been investigated in depth in the high-performance computing (HPC) and cloud computing arenas, but there […]

CUDA
