high performance computing on graphics processing units: hgpu.org

Posts

Dec, 8

Lightweight Modular Staging and Embedded Compilers: Abstraction Without Regret for High-Level High-Performance Programming

Programs expressed in a high-level programming language need to be translated to a low-level machine dialect for execution. This translation is usually accomplished by a compiler, which is able to translate any legal program to equivalent low-level code. But for individual source programs, automatic translation does not always deliver good results: Software engineering practice demands […]

CUDA

Dec, 5

A Data-Parallel Graphics Pipeline Implemented in OpenCL

This report documents implementation details, results, benchmarks and technical discussions for the work carried out within a master’s thesis at Linköping University. Within the master’s thesis, the field of software rendering is explored in the age of parallel computing. Using the Open Computing Language, a complete graphics pipeline was implemented for use on general processing […]

OpenCL

Dec, 5

Mapping Streaming Applications to OpenCL

Graphics processing units (GPUs) have been gaining popularity in general purpose and high performance computing. A GPU is made up of a number of streaming multiprocessors (SMs), each of which consists of many processing cores. A large number of general-purpose applications have been mapped onto GPUs efficiently. Stream processing applications, however, exhibit properties such as […]

OpenCL

Dec, 5

Parallel Cosegmentation via Submodular Optimization on Anisotropic Diffusion

With a large number of related images being used for applications such as MR spectroscopy imaging, object-of-interest 3D modelling and photo collages, there is a pressing need to accelerate image cosegmentation algorithms. Cosegmentation refers to the process of segmenting common regions from multiple related images. A novel distributed algorithm, CoSand [1], for cosegmentation […]

CUDA

Dec, 5

Gauge Field Generation on Large-Scale GPU-Enabled Systems

Over the past years, GPUs have been successfully applied to the task of inverting the fermion matrix in lattice QCD calculations. Even strong scaling to capability-level supercomputers, corresponding to O(100) GPUs or more, has been achieved. However, strong scaling a whole gauge field generation algorithm to this regime requires significantly more functionality than just having […]

CUDA

Dec, 5

Usage of GPU in LS-DYNA

The increasing computing power of GPUs can be used to improve the performance of CAE systems [1]. Within LS-DYNA an improved direct equation solver can be used, which accelerates the performance of implicit applications by use of a CUDA-based solver [2], [3], [4]. In this paper the performance improvements for different customer input decks for metal […]

CUDA

Dec, 4

Fast Parallel Sorting Algorithms on GPUs

This paper presents a comparative analysis of three widely used parallel sorting algorithms (OddEven sort, Rank sort and Bitonic sort) in terms of sorting rate, sorting time and speed-up on CPU and different GPU architectures. Alongside, we have implemented a novel parallel algorithm, the min-max butterfly network, for finding the minimum and maximum in large data sets. […]

OpenCL

Dec, 4

gR: A GPU-based Router

With growing internet traffic and the increasing complexity of packet processing tasks, router throughput suffers. Modern routers also need to provide additional services such as security and QoS, which further add to the complexity. These issues can be addressed with the massively parallel computing capability of graphics processors. In this paper, we offload two of […]

CUDA

Dec, 4

FusionSim: Characterizing the Performance Benefits of Fused CPU/GPU Systems

We present FusionSim, a modeling framework capable of cycle-accurate simulation of a complete x86-based computer system with (a) a CPU and a GPU on the same die, and (b) a CPU and a GPU connected as separate components. We use FusionSim to characterize the performance of the Rodinia benchmarks on fused and discrete systems. We […]

Dec, 4

A MPI back-end for the OpenACC accULL. Exploiting OpenACC semantics in Message Passing Clusters

The arrival of hardware accelerators on the HPC scene has made unprecedented performance available to developers. However, even expert developers may not be ready to exploit the complex hierarchies of these new heterogeneous systems. We need to find a way to leverage the programming effort in these architectures at the programming language level; otherwise, developers will […]

OpenCL

Dec, 4

Molecular dynamics for long-range interacting systems on Graphic Processing Units

We present implementations of a fourth-order symplectic integrator on graphics processing units for three $N$-body models with long-range interactions of general interest: the Hamiltonian Mean Field, Ring and two-dimensional self-gravitating models. We discuss the algorithms, speedups and errors using one and two GPUs. Speedups can be as high as 140 compared to a serial […]

CUDA

Dec, 3

GPU-Based Implementation of JPEG2000 Encoder

JPEG2000 has become one of the most rewarding image coding standards. It provides a practical set of features which weren’t necessarily available in the previous standards. The features were realized as a result of two new techniques, namely the Discrete Wavelet Transform (DWT), and Embedded Block Coding with Optimized Truncation (EBCOT). The complexity of EBCOT […]

CUDA

* * *

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA
