high performance computing on graphics processing units: hgpu.org

Posts

Dec, 4

Exploiting Heterogeneous Systems: Keccak on OpenCL

The use of graphics processing units (GPUs) in high-performance parallel computing continues to grow, often as part of a heterogeneous system. CUDA and OpenCL are APIs that enable programmers to develop GPGPU applications and software for massively parallel processors. On October 2, 2012, NIST announced the winner of its five-year competition to select a new […]

OpenCL

Dec, 4

Cluster-SkePU: A Multi-Backend Skeleton Programming Library for GPU Clusters

SkePU is a C++ template library with a simple and unified interface for expressing data parallel computations in terms of generic components, called skeletons, on multi-GPU systems using CUDA and OpenCL. The smart containers in SkePU, such as Matrix and Vector, perform data management with a lazy memory copying mechanism that reduces redundant data communication. […]

CUDA

Dec, 4

A Hybrid Approach to Parallel Connected Component Labeling Using CUDA

Connected component labeling (CCL) is a mandatory step in image segmentation where each object in an image is identified and uniquely labeled. Sequential CCL is a time-consuming operation and thus is often implemented within a parallel processing framework to reduce execution time. Several parallel CCL methods have been proposed in the literature. Among them are NSZ […]

CUDA
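
To make the labeling task in the entry above concrete, here is a minimal sequential sketch in Python. It is not the hybrid CUDA method from the paper, whose details are not given in the truncated abstract. It assumes a binary image stored as a list of lists, 4-connectivity, and a breadth-first flood fill; the name `label_components` is hypothetical.

```python
from collections import deque


def label_components(image):
    """Label 4-connected foreground regions of a binary image.

    `image` is a list of equal-length rows containing 0 (background)
    or 1 (foreground). Returns (labels, count), where `labels` has the
    same shape as `image` and uses 0 for background and 1..count for
    the components, numbered in raster-scan order of discovery.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0

    for r in range(rows):
        for c in range(cols):
            # Skip background pixels and pixels that already have a label.
            if image[r][c] == 0 or labels[r][c] != 0:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Flood-fill the whole component from this seed pixel.
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] and labels[ny][nx] == 0:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Each pixel is visited a constant number of times, so the sequential cost grows linearly with image size. That per-pixel dependency on neighbours is what parallel approaches have to restructure.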

Dec, 4

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

Solving systems of linear equations is an important problem that spans almost all fields of science and mathematics. When these systems grow in size, iterative methods are used to solve these problems. This paper looks at optimizing these methods for CUDA Architectures. It discusses a multi-threaded CPU implementation, a GPU implementation, and a data optimized […]

CUDA
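
The truncated abstract above does not say which iterative methods the paper optimizes. As an illustration only, the sketch below uses Jacobi iteration, a common textbook choice, in plain Python. The function name `jacobi` and the fixed iteration count are assumptions made for this example, and convergence is only guaranteed for suitable matrices such as strictly diagonally dominant ones.

```python
def jacobi(a, b, iterations=50):
    """Approximate the solution x of a @ x = b with Jacobi iteration.

    `a` is a square matrix as a list of rows with non-zero diagonal
    entries, `b` is the right-hand side. Starts from the zero vector.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for i in range(n):
            # Sum the off-diagonal contributions using the previous iterate.
            off_diag = sum(a[i][j] * x[j] for j in range(n) if j != i)
            x_new[i] = (b[i] - off_diag) / a[i][i]
        x = x_new
    return x
```

Every component of `x_new` depends only on the previous iterate, so the inner loop over `i` can run in parallel. That independence is one reason methods of this kind are often mapped to GPUs.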

Dec, 4

Comparative Study of High Performance Computing Using Multi-core Parallel Systems

Multi-core-based high performance computing systems are available at a reasonable price. The parallel programming paradigm needs to be adjusted to the individual system. Parallel computing systems were compared in this paper. Electroencephalography signals were collected in order to measure the performance of parallel computing for CPU- and GPU-based systems. A CPU-based system showed better […]

CUDA

Dec, 4

HSPA+/LTE-A Turbo Decoder on GPU and Multicore CPU

This paper compares two implementations of reconfigurable and high-throughput turbo decoders. The first implementation is optimized for an NVIDIA Kepler graphics processing unit (GPU), whereas the second implementation is for an Intel Ivy Bridge processor. Both implementations support max-log-MAP and log-MAP turbo decoding algorithms, various code rates, different interleaver types, and all block-lengths, as specified […]

CUDA

Dec, 4

Divergence Analysis

The growing interest in graphics processing units has brought renewed attention to the Single Instruction Multiple Data (SIMD) execution model. SIMD machines give application developers tremendous computational power; however, programming them is still challenging. In particular, developers must deal with memory and control flow divergences. These phenomena stem from a condition that we call data […]

CUDA

Dec, 4

Fingerprint grid enhancement on GPU

This paper presents an optimized GPU (Graphics Processing Unit) implementation of fingerprint image enhancement using a Gabor filter-bank based algorithm. Given a batch of fingerprint images, we apply the Gabor filter bank and compute image variances of the convolution responses. We then select parts of these responses and compose the final enhanced batches. The algorithm […]

CUDA

Dec, 3

Multithreaded Transposition of Square Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors

In-place matrix transposition, a standard operation in linear algebra, is a memory bandwidth-bound operation. The theoretical maximum performance of transposition is the memory copy bandwidth. However, due to non-contiguous memory access in the transposition operation, practical performance is usually lower. The ratio of the transposition rate to the memory copy bandwidth is a measure of […]
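
The access pattern described above can be seen in a naive in-place transpose. This is a minimal single-threaded Python sketch, not the multithreaded Xeon and Xeon Phi code from the paper. It assumes a square matrix stored row-major in a flat list; the name `transpose_in_place` is hypothetical.

```python
def transpose_in_place(a, n):
    """Transpose an n x n matrix stored row-major in the flat list `a`.

    Swaps each element above the diagonal with its mirror below it,
    so no second buffer is needed. Returns `a` for convenience.
    """
    for i in range(n):
        for j in range(i + 1, n):
            upper = i * n + j  # moves by 1 as j advances (contiguous)
            lower = j * n + i  # moves by n as j advances (strided)
            a[upper], a[lower] = a[lower], a[upper]
    return a
```

The `lower` index jumps by `n` elements on every step of the inner loop. In a compiled implementation that stride is the non-contiguous access that keeps practical performance below the memory copy bandwidth.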

Dec, 3

GPU and CPU Cooperative Accelerated Road Detection

In this paper, we propose a fast and robust unstructured road detection method that integrates GPU (Graphics Processing Unit) and CPU implementations. In order to ensure the robustness of the algorithm, a BP (Back Propagation) neural network is employed to learn the color features from a set of samples of both road and off-road regions, […]

Dec, 3

SESH framework: A Space Exploration Framework for GPU Application and Hardware Codesign

Graphics processing units (GPUs) have become increasingly popular accelerators in supercomputers, and this trend is likely to continue. Given their disruptive architecture and variety of optimization options, it is often desirable to understand the dynamics between potential application transformations and potential hardware features when designing future GPUs for scientific workloads. However, current codesign efforts […]

Dec, 3

Real-time High Resolution Fusion of Depth Maps on GPU

A system for live, high-quality surface reconstruction using a single moving depth camera on commodity hardware is presented. High accuracy and a real-time frame rate are achieved by utilizing graphics hardware computing capabilities via OpenCL and by using a sparse data structure for volumetric surface representation. Depth sensor pose is estimated by combining serial texture […]

OpenCL

* * *

Recent source codes

Specx: Speculative task-based runtime system

Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

KISim: Kubernetes Intelligent Scheduling Simulator

Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler
