high performance computing on graphics processing units: hgpu.org

Posts

Dec, 15

Real time Multi-GPU-based Event Detection in High Definition Videos

Video processing algorithms are an important tool for many computer vision applications, such as motion tracking, video indexing, robot navigation and event detection. However, with new video standards, especially high-definition ones, current implementations no longer meet the needs of real-time processing, even when running on modern hardware. In […]

CUDA

Dec, 15

OpenCL-Accelerated Computation of a 3D SPECT Projection Operator for the Content Adaptive Mesh Model

In this manuscript, we present a preliminary evaluation of a fully 3D projection operator calculation aimed at emission tomography on a non-circular orbit. The proposed methodology uses the content-adaptive mesh model (CAMM) for volumetric data representation. The CAMM is an efficient data representation based on adaptive non-uniform sampling and linear interpolation. The presented projection operator […]

OpenCL

Dec, 13

Data Transfer Matters for GPU Computing

Graphics processing units (GPUs) embrace manycore compute devices where massively parallel compute threads are offloaded from CPUs. This heterogeneous nature of GPU computing raises non-trivial data transfer problems, especially for latency-critical real-time systems. However, even the basic characteristics of data transfers associated with GPU computing are not well studied in the literature. In this paper, […]

CUDA

Dec, 13

GPU hardware acceleration for industrial applications: using computation to push beyond physical limitations

This thesis explores the possibility of utilizing Graphics Processing Units (GPUs) to address the computational demand of algorithms used to mitigate the inherent physical limitations in devices such as microscopes and 3D-scanners. We investigate the outcome and test our methodology for the following case studies: – the narrow field of view found in microscopes. – […]

CUDA

Dec, 13

All-pairs Shortest Path Algorithm based on MPI+CUDA Distributed Parallel Programming Model

Computing shortest paths in a graph is a complex and time-consuming process, and traditional algorithms that rely solely on the CPU as the computing unit can’t meet the demands of real-time processing. In this paper, we present an all-pairs shortest paths algorithm using an MPI+CUDA hybrid programming model, which can […]

CUDA

Dec, 13

TuCCompi: A Multi-Layer Programing Model for Heterogeneous Systems with Auto-Tuning Capabilities

During the last decade, parallel processor architectures have become a powerful tool to deal with massively parallel problems that require High Performance Computing (HPC). The latest trend in HPC is the use of heterogeneous environments that combine different computational power units, such as CPU cores and GPUs. Performance maximization of any GPU parallel implementation of an algorithm […]

CUDA

Dec, 13

Augur: a Modeling Language for Data-Parallel Probabilistic Inference

It is time-consuming and error-prone to implement inference procedures for each new probabilistic model. Probabilistic programming addresses this problem by allowing a user to specify the model and having a compiler automatically generate an inference procedure for it. For this approach to be practical, it is important to generate inference code that has reasonable performance. […]

CUDA

Dec, 12

GPU Based Dose Calculation

The goal of this dissertation was to parallelize a dose calculation code for radiotherapy cancer treatment and explore the suitability of the new Intel Xeon Phi technology for such a task. The source code proved to have many bugs, and as such it took a long time to be able to produce consistent results. Thus, the […]

CUDA

Dec, 12

Development of Bayesian analysis program for extraction of polarisation observables at CLAS

At the mass of a proton, the strong force is not well understood. Various quark models exist, but it is important to determine which quark model(s) are most accurate. Experimentally, finding resonances predicted by some models and not others would give valuable insight into this fundamental interaction. Several labs around the world use photoproduction experiments […]

OpenCL

Dec, 12

Inter-block synchronization on a GPGPU

With the invention of multi-core processing unit technology, the graphics processing unit has evolved from a single-core graphics processing unit to multi-core programmable graphics processing units. Because of the GPU’s architecture, people found that it is not only good at processing graphics-related data, but also suitable for performing general-purpose parallel computations. However, since […]

OpenCL

Dec, 12

Lessons learned from contrasting a BLAS kernel implementations

This work reviews the experience of implementing different versions of the SSPR rank-one update operation of the BLAS library. The main objective was to contrast the implementation effort and complexity of an optimized BLAS routine on CPU versus GPU, not considering performance. This work contributes a sample procedure to compare BLAS kernel implementations, how to start […]

CUDA
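
For readers unfamiliar with the SSPR operation named in the entry above: in BLAS it performs the symmetric rank-one update A := alpha * x * x^T + A, where A is a symmetric matrix held in packed storage. The following is a minimal illustrative sketch only, not code from the paper. It assumes the upper triangle is packed column by column, uses Python floats rather than single precision, and the helper name `sspr_upper` is hypothetical.

```python
def sspr_upper(n, alpha, x, ap):
    """Illustrative (hypothetical) SSPR-style update, done in place.

    Computes A := alpha * x * x^T + A, where only the upper triangle of
    the symmetric n-by-n matrix A is stored, packed column by column in
    the flat list `ap` of length n * (n + 1) / 2.
    """
    if len(x) != n or len(ap) != n * (n + 1) // 2:
        raise ValueError("x or ap has the wrong length for n")
    k = 0  # running index into the packed array
    for j in range(n):          # column of A
        for i in range(j + 1):  # rows 0..j of that column (upper triangle)
            ap[k] += alpha * x[i] * x[j]
            k += 1
    return ap


# Usage: update a zero 3x3 matrix with alpha = 2 and x = (1, 2, 3).
packed = sspr_upper(3, 2.0, [1.0, 2.0, 3.0], [0.0] * 6)
```

Packed storage halves the memory needed for a symmetric matrix, at the cost of the index bookkeeping shown in the loop.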

Dec, 12

Efficient Large-Scale Graph Processing on Hybrid CPU and GPU Systems

The increasing scale and wealth of inter-connected data, such as those accrued by social network applications, demand the design of new techniques and platforms to efficiently derive actionable knowledge from large-scale graphs. However, real-world graphs are famously difficult to process efficiently. Not only do they have a large memory footprint, but most graph algorithms also entail […]

CUDA

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
