high performance computing on graphics processing units: hgpu.org

Posts

Nov, 23

Fast 4pi track reconstruction in nuclear emulsion detectors based on GPU technology

Fast 4pi solid angle particle track recognition has long been a challenge in particle physics, especially with nuclear emulsion detectors. Recent advances in computing technology have opened the way for its realization. A fast 4pi solid angle particle track reconstruction based on GPU technology combined with multithreaded programming is reported […]

CUDA

Nov, 23

Dynamic Partitioning-based JPEG Decompression on Heterogeneous Multicore Architectures

With the emergence of social networks and improvements in computational photography, billions of JPEG images are shared and viewed on a daily basis. Desktops, tablets and smartphones constitute the vast majority of hardware platforms used for displaying JPEG images. Despite the fact that these platforms are heterogeneous multicores, no approach exists yet that is capable […]

OpenCL

Nov, 22

Accelerating Sequential Computer Vision Algorithms Using Commodity Parallel Hardware

Since 2004, the clock frequency of CPUs has not increased significantly. Computer Vision applications have an increasing demand for more processing power and are limited by the performance capabilities of sequential processor architectures. The only way to get better performance using commodity hardware is to adopt parallel programming. Many other related research projects have considered […]

OpenCL

Nov, 22

An improved parallel contrast-aware halftoning

Digital image halftoning is a widely used technique. However, achieving high fidelity tone reproduction and structural preservation with low computational time-cost remains a challenging problem. This paper presents a highly parallel algorithm to boost the real-time application of the serial structure-preserving error diffusion. The contrast-aware halftoning approach is one such technique with superior structure preservation, […]

CUDA

Nov, 22

An Optimal Offline Permutation Algorithm on the Hierarchical Memory Machine, with the GPU implementation

The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computation on CUDA-enabled GPUs. The offline permutation is a task to copy numbers stored in an array a of size n to an array b of the same size along a permutation P given in advance. A conventional algorithm […]

CUDA
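
To make the offline permutation task described above concrete, here is a minimal sequential sketch in Python. It assumes the permutation is applied as `b[P[i]] = a[i]`; the excerpt does not state the direction, so that choice, along with the function name `offline_permute`, is an assumption for illustration. It is not the paper's HMM-optimal algorithm or its GPU implementation, only the plain definition of the task.

```python
def offline_permute(a, p):
    """Copy a into a new list b along permutation p, so that b[p[i]] == a[i].

    Assumption: p is a permutation of 0..n-1 known in advance ("offline").
    """
    n = len(a)
    # Reject inputs where p is not a valid permutation of the indices of a.
    if len(p) != n or sorted(p) != list(range(n)):
        raise ValueError("p must be a permutation of range(len(a))")
    b = [None] * n
    for i in range(n):
        # Each source element a[i] is written to its destination slot p[i].
        b[p[i]] = a[i]
    return b


if __name__ == "__main__":
    print(offline_permute([10, 20, 30, 40], [2, 0, 3, 1]))
```

The reads of `a` are sequential while the writes to `b` are scattered, which is the kind of access pattern that makes the task interesting on memory hierarchies such as a GPU's.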

Nov, 22

Optimization of the Oktay-Kronfeld Action Conjugate Gradient Inverter

Improving the Fermilab action to third order in heavy quark effective theory yields the Oktay-Kronfeld action, a promising candidate for precise calculations of the spectra of heavy quark systems and weak matrix elements relevant to searches for new physics. We have optimized the bi-stabilized conjugate gradient inverter in the SciDAC QOPQDP library and are developing […]

CUDA

Nov, 22

Bohrium: Unmodified NumPy Code on CPU, GPU, and Cluster

In this paper we introduce Bohrium, a runtime system for mapping array operations onto a number of different hardware platforms, from multi-core systems to clusters and GPU-enabled systems. As a result, the Bohrium runtime system enables NumPy code to utilize CPUs, GPUs, and clusters. Bohrium integrates seamlessly into NumPy through the implicit data parallelization of array […]

OpenCL

Nov, 21

Experience with Intel’s Many Integrated Core architecture in ATLAS software

Intel recently released the first commercial boards of its Many Integrated Core (MIC) Architecture. MIC is Intel’s solution for the domain of throughput computing, currently dominated by general purpose programming on graphics processors (GPGPU). MIC allows the use of the more familiar x86 programming model and supports standard technologies such as OpenMP, MPI, and Intel’s […]

Nov, 21

Direct Numeric Simulation of Sheared Convective Boundary Layer Entrainment with GPUs

Sheared convective boundary layers (SCBL) are a frequently observed boundary layer in nature and industry. This paper presents work conducted to validate a numerical fluid model of sheared convective boundary layers implemented in Nvidia’s CUDA programming language for graphics processing units. The code is based on a finite-difference implementation of the SIMPLE algorithm using the […]

CUDA

Nov, 21

Towards an interactive and automated script feature analysis of 3D scanned cuneiform tablets

Current digitalization projects of ancient artifacts in the field of cultural heritage produce large amounts of data that cannot be managed and analyzed in a reasonable amount of time by means of conventional philological methods. Therefore, this paper presents a novel approach to performing a fast and interactive 3D script feature extraction, analysis and […]

OpenGL

Nov, 20

Multi-GPU Support on the Marrow Algorithmic Skeleton Framework

With the proliferation of general-purpose GPUs, workload parallelization and data-transfer optimization have become an increasing concern. The natural evolution from using a single GPU is to multiply the number of available processors, which presents new challenges, such as tuning workload decomposition and load balancing, when dealing with heterogeneous systems. Higher-level programming is a very important asset in […]

CUDA • OpenCL

Nov, 20

HyPHI – task based hybrid execution C++ library for the Intel Xeon Phi coprocessor

The Intel Threading Building Blocks (TBB) C++ library introduced task parallelism to a wide audience of application developers. The library is easy to use and powerful, but it is limited to shared-memory machines. In this paper we present HyPHI, a novel library for the Intel Xeon Phi coprocessor for building applications which execute using a […]
