high performance computing on graphics processing units: hgpu.org

Posts

Sep, 22

Accelerating Habanero-Java Programs with OpenCL Generation

The initial wave of programming models for general-purpose computing on GPUs, led by CUDA and OpenCL, has provided experts with low-level constructs to obtain significant performance and energy improvements on GPUs. However, these programming models are characterized by a challenging learning curve for non-experts due to their complex and low-level APIs. Looking to the future, […]

OpenCL

Sep, 22

Investigating the Performance of Motion Estimation Block-Matching Algorithms on GPU Cards

In the field of video compression, motion estimation (ME) is a process that leads to high computational complexity. Implementing ME block-matching (BM) algorithms on general-purpose Central Processing Units (CPUs) has resulted in poor performance. In this paper we investigate the performance of two BM ME algorithms: Three Step Search (TSS) and Four Step […]

CUDA

Sep, 22

Fast Endmember Extraction for Massive Hyperspectral Sensor Data on GPUs

Hyperspectral imaging sensors are becoming increasingly important in multi-sensor collaborative observation. The spectral mixture problem seriously influences the efficiency of hyperspectral data exploitation, and endmember extraction is one of the key issues. Due to the high computational cost of the algorithms and the massive quantity of hyperspectral sensor data, high-performance computing is in great demand for those scenarios […]

CUDA

Sep, 22

Paralleling Variable Block Size Motion Estimation of HEVC on Multi-Core CPU Plus GPU Platform

Motion estimation with variable block sizes (VBSME) is one of the most complex models in the HEVC encoder. The HEVC standard supports up to 12 variable block sizes ranging from 4×8/8×4 to 64×64 for motion estimation (ME) and motion compensation (MC). This feature contributes substantial coding gain compared with 7 variable block sizes in H.264/AVC […]

CUDA

Sep, 22

Geo-Correction of High-Resolution Imagery Using Fast Template Matching on a GPU in Emergency Mapping Contexts

The increasing availability of satellite imagery acquired by existing and new sensors allows a wide variety of new applications that depend on the use of diverse spectral and spatial resolution data sets. One of the pre-conditions for the use of hybrid image data sets is a consistent geo-correction capacity. We demonstrate how a novel fast […]

Sep, 21

Optimization solutions for the segmented sum algorithmic function

This paper presents optimization solutions for the segmented sum algorithmic function, developed using the Compute Unified Device Architecture (CUDA), a powerful and efficient solution for optimizing a wide range of applications. The parallel segmented sum is often used in building many data processing algorithms, and through its optimization, one can improve the overall […]

CUDA
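As a point of reference for the abstract above, the following is a minimal sequential sketch of what a segmented sum computes. It is not the paper's CUDA implementation; the function name `segmented_sum` and the head-flag representation of segments are assumptions chosen for illustration.

```python
def segmented_sum(values, head_flags):
    """Return one sum per segment.

    head_flags[i] is True where a new segment starts (an assumed
    representation; other layouts such as segment offsets also exist).
    """
    if len(values) != len(head_flags):
        raise ValueError("values and head_flags must have the same length")

    sums = []
    for value, is_head in zip(values, head_flags):
        # Start a new segment on a head flag, or at the very first element.
        if is_head or not sums:
            sums.append(value)
        else:
            sums[-1] += value
    return sums


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5, 6]
    flags = [True, False, True, False, False, True]
    print(segmented_sum(data, flags))
```

A GPU version would typically replace the sequential loop with a parallel scan or reduction per segment; this sketch only fixes the expected result that such an implementation should reproduce.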

Sep, 21

A streaming model for nested data parallelism

Efficient parallel algorithms are often written with embedded knowledge of the back-end that they are meant to be executed on, and if they are not, the translation to the target language often produces inefficient code. A concrete problem is space complexity in nested data parallel (NDP) languages such as NESL and Data Parallel Haskell, where large […]

CUDA

Sep, 21

Performing DCT8x8 Computation on GPU Using NVIDIA CUDA Technology

In this paper, we propose sequential and parallel Discrete Cosine Transform (DCT) implementations in Compute Unified Device Architecture (CUDA) libraries. The introduction of a programmable pipeline in graphics processing units (GPUs) has enabled configurability. The GPU, which is available in every computer, is capable of highly parallel SIMD processing, but its capability is often […]

CUDA
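For readers unfamiliar with the transform named in the title, here is a small sequential sketch of an 8x8 two-dimensional DCT-II with orthonormal scaling, computed as a row pass followed by a column pass. It is a plain reference under those assumptions, not the paper's CUDA code; the names `dct_1d` and `dct_8x8` are hypothetical.

```python
import math


def dct_1d(vec):
    """Orthonormal 1-D DCT-II of a sequence of numbers."""
    n = len(vec)
    out = []
    for k in range(n):
        # The k = 0 term uses a smaller scale so the transform is orthonormal.
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            vec[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * total)
    return out


def dct_8x8(block):
    """2-D DCT of an 8x8 block: transform rows first, then columns."""
    if len(block) != 8 or any(len(row) != 8 for row in block):
        raise ValueError("expected an 8x8 block")
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols holds transformed columns; transpose back to row-major order.
    return [list(row) for row in zip(*cols)]


if __name__ == "__main__":
    flat = [[10.0] * 8 for _ in range(8)]
    coeffs = dct_8x8(flat)
    print(round(coeffs[0][0], 6))
```

Because the 2-D transform separates into independent row and column passes, each row (and then each column) can be processed in parallel, which is one common reason this computation is mapped to GPUs.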

Sep, 21

A GPU Implementation of Parallel Constraint-based Local Search

In this paper we study the performance of constraint-based local search solvers on a GPU. The massively parallel architecture of the GPU makes it possible to explore parallelism at two different levels inside the local search algorithm. First, by executing multiple copies of the algorithm in a multi-walk manner and, second, by evaluating large neighborhoods […]

CUDA

Sep, 21

GPU Accelerated Parameter Estimation by Global Optimization using Interval Analysis

This master thesis treats the topic of non-linear parameter estimation using global optimization methods based on interval analysis (IA), accelerated by parallel implementation on a Graphics Processing Unit (GPU). Global optimization using IA is a mathematically rigorous Branch & Bound-type method, capable of reliably solving global optimization problems with continuously differentiable objective functions, even in […]

CUDA

Sep, 20

Preconditioned conjugate gradient solver for structural problems

Matrix solvers play a crucial role in solving real-world physics problems. In engineering practice, transition analysis is most often used, which requires a series of similar matrices to be solved. However, no specific solver, with or without a preconditioner, can achieve a high performance gain for all matrices. This paper recommends the Conjugate Gradient iterative solver with SSOR approximate […]

CUDA

Sep, 20

Can GPUs Sort Strings Efficiently?

String sorting, or variable-length key sorting, has lagged in performance on the GPU even as fixed-length key sorting has improved dramatically. Radix sort is the fastest sorting method on GPUs. In this paper, we present a fast and efficient string sort on the GPU that is built on the available radix sort. Our method sorts […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs
