high performance computing on graphics processing units: hgpu.org

Posts

Nov, 16

Benchmarking performance of a hybrid Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors

The paper deals with parallelization of computing similarity measures between large vectors. Such computations are important components within many applications and consequently are of high importance. Rather than focusing on optimization of the algorithm itself, assuming specific measures, the paper assumes a general scheme for finding similarity measures for all pairs of vectors and investigates […]

Nov, 13

CUDA-API-wrappers: Thin C++-flavored wrappers for the CUDA runtime API

nVIDIA’s Runtime API for CUDA is intended for use both in C and C++ code. As such, it uses a C-style API, the lower common denominator (with a few notable exceptions of templated function overloads). This library of wrappers around the Runtime API is intended to allow us to embrace many of the features of […]

CUDA

Nov, 13

OpenCL-based optimizations for acceleration of object tracking on FPGAs and GPUs

OpenCL support across many heterogeneous nodes (FPGAs, GPUs, CPUs) has increased the programmability of these systems significantly. At the same time, it opens up new challenges and design choices for system designers and application programmers. While OpenCL offers a universal semantic to capture the parallel behavior of applications independent of the target architecture, some customization […]

OpenCL

Nov, 13

Executing Dynamic Data Rate Actor Networks on OpenCL Platforms

Heterogeneous computing platforms consisting of general purpose processors (GPPs) and graphics processing units (GPUs) have become commonplace in personal mobile devices and embedded systems. For years, programming of these platforms was very tedious and simultaneous use of all available GPP and GPU resources required low-level programming to ensure efficient synchronization and data transfer between processors. […]

OpenCL

Nov, 13

Hadoopcl2: Motivating the design of a distributed, heterogeneous programming system with machine-learning applications

Machine learning (ML) algorithms have garnered increased interest as they demonstrate improved ability to extract meaningful trends from large, diverse, and noisy data sets. While research is advancing the state-of-the-art in ML algorithms, it is difficult to drastically improve the real-world performance of these algorithms. Porting new and existing algorithms from single-node systems to multi-node […]

OpenCL

Nov, 13

Fractal Art Generation using GPUs

Fractal image generation algorithms exhibit extreme parallelizability. Using general purpose graphics processing unit (GPU) programming to implement escape-time algorithms for Julia sets of functions,parallel methods generate visually attractive fractal images much faster than traditional methods. Vastly improved speeds are achieved using this method of computation, which allow real-time generation and display of images. A comparison […]

CUDA

Nov, 10

Shuffle Reduction Based Sparse Matrix-Vector Multiplication on Kepler GPU

GPU is the suitable equipment for accelerating computing-intensive applications in order to get the higher throughput for High Performance Computing (HPC). Sparse Matrix-Vector Multiplication (SpMV) is the core algorithm of HPC, so the SpMV’s throughput on GPU may affect the throughput on HPC platform. In the paper, we focus on the latency of reduction routine […]

CUDA

Nov, 10

PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks

Convolutional neural networks (CNNs) have been widely employed in many applications such as image classification, video analysis and speech recognition. Being compute-intensive, CNN computations are mainly accelerated by GPUs with high power dissipations. Recently, studies were carried out exploiting FPGA as CNN accelerator because of its reconfigurability and energy efficiency advantage over GPU, especially when […]

OpenCL

Nov, 10

Using multiple GPUs to accelerate string searching for digital forensic analysis

String searching within a large corpus of data is an important component of digital forensic (DF) analysis techniques such as file carving. The continuing increase in capacity of consumer storage devices requires corresponding im-provements to the performance of string searching techniques. As string search-ing is a trivially-parallelisable problem, GPGPU approaches are a natural fit – […]

OpenCL

Nov, 10

Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors

B-spline based orbital representations are widely used in Quantum Monte Carlo (QMC) simulations of solids, historically taking as much as 50% of the total run time. Random accesses to a large four-dimensional array make it challenging to efficiently utilize caches and wide vector units of modern CPUs. We present node-level optimizations of B-spline evaluations on […]

Nov, 10

Memory layout in GPU implementation of lattice Boltzmann method for sparse 3D geometries

We describe a high-performance implementation of the lattice Boltzmann method (LBM) for sparse 3D geometries on graphic processors (GPU). The main contribution of this work is a data layout that allows to minimise the number of redundant memory transactions during the propagation step of LBM. We show that by using a uniform mesh of small […]

CUDA

Nov, 8

Balancing locality and concurrency: solving sparse triangular systems on GPUs

Many numerical optimisation problems rely on fast algorithms for solving sparse triangular systems of linear equations (STLs). To accelerate the solution of such equations, two types of approaches have been used: on GPUs, concurrency has been prioritised to the disadvantage of data locality, while on multi-core CPUs, data locality has been prioritised to the disadvantage […]

OpenCL

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Benchmarking performance of a hybrid Xeon/Xeon Phi system for parallel computation of similarity measures between large vectors

CUDA-API-wrappers: Thin C++-flavored wrappers for the CUDA runtime API

OpenCL-based optimizations for acceleration of object tracking on FPGAs and GPUs

Executing Dynamic Data Rate Actor Networks on OpenCL Platforms

Hadoopcl2: Motivating the design of a distributed, heterogeneous programming system with machine-learning applications

Fractal Art Generation using GPUs

Shuffle Reduction Based Sparse Matrix-Vector Multiplication on Kepler GPU

PipeCNN: An OpenCL-Based FPGA Accelerator for Large-Scale Convolution Neuron Networks

Using multiple GPUs to accelerate string searching for digital forensic analysis

Optimization and parallelization of B-spline based orbital evaluations in QMC on multi/many-core shared memory processors

Memory layout in GPU implementation of lattice Boltzmann method for sparse 3D geometries

Balancing locality and concurrency: solving sparse triangular systems on GPUs

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)