high performance computing on graphics processing units: hgpu.org

Posts

Dec, 20

A Review on Parallelization of Node based Game Tree Search Algorithms on GPU

Game tree search is a classical problem in game theory and artificial intelligence. The focus of the system is on how to leverage the massive parallelism of GPUs to accelerate game tree algorithms, and on proposing a concise and general parallel game tree algorithm for GPUs. Comparison can be done for […]

CUDA

Dec, 20

A Parallel Recursive Approach for Solving All Pairs Shortest Path Problem on GPU using OpenCL

The all-pairs shortest path (APSP) problem has a large number of practical applications in the real world. We present a highly parallel, recursive solution to the APSP problem based on Kleene’s algorithm. The proposed parallel approach is implemented using OpenCL, an open standard framework that provides a development environment for utilizing massive parallel […]

OpenCL

Dec, 20

SignalPU: A programming model for DSP applications on parallel and heterogeneous clusters

Biomedical imaging, digital communications, acoustic signal processing and many other digital signal processing (DSP) applications are increasingly present in the digital world. They process growing volumes of data, represented with ever greater accuracy, using complex algorithms with time constraints to satisfy. Consequently, a high requirement of […]

CUDA

Dec, 20

Towards an automatic generation of dense linear algebra solvers on parallel architectures

The increasing complexity of new parallel architectures has widened the gap between adaptability and efficiency of the codes. As high performance numerical libraries tend to focus more on performance, we wish to address this issue using a C++ library called NT2. By analyzing the properties of the linear algebra domain that can be extracted from […]

CUDA

Dec, 18

Optimising Hydrodynamics applications for the Cray XC30 with the application tool suite

Power constraints are forcing HPC systems to continue to increase hardware concurrency. Efficiently scaling applications on future machines will be essential for improved science and it is recognised that the "flat" MPI model will start to reach its scalability limits. The optimal approach is unknown, necessitating the use of mini-applications to rapidly evaluate new approaches. […]

CUDA • OpenCL

Dec, 18

Multicore Scheduling of Parallel Real-Time Tasks with Multiple Parallelization Options

Past research on multicore scheduling assumes that a computational unit has already been parallelized into a predetermined number of threads. However, with recent technologies such as OpenCL, a computational unit can be parallelized in many different ways, with the number of threads selectable at runtime. This paper proposes an optimal algorithm for parallelizing and scheduling a set […]

OpenCL

Dec, 18

Efficient GPU Implementation for Single Block Orthogonal Dictionary Learning

Dictionary training for sparse representations involves large volumes of data and complex algorithms, which leads to time-consuming implementations. SBO is an iterative dictionary learning algorithm based on constructing unions of orthonormal bases via singular value decomposition, which represents each data item through a single best-fit orthobase. In this paper we present a […]

OpenCL

Dec, 18

GPU-Powered Coherent Beamforming

GPU-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimised for deployment at the BEST-2 array […]

CUDA

Dec, 18

DeepSpeech: Scaling up end-to-end speech recognition

We present a state-of-the-art speech recognition system developed using end-to-end deep learning. Our architecture is significantly simpler than traditional speech systems, which rely on laboriously engineered processing pipelines; these traditional systems also tend to perform poorly when used in noisy environments. In contrast, our system does not need hand-designed components to model background noise, reverberation, […]

CUDA

Dec, 16

Highly Efficient Forward and Backward Propagation of Convolutional Neural Networks for Pixelwise Classification

We present highly efficient algorithms for performing forward and backward propagation of Convolutional Neural Networks (CNNs) for pixelwise classification on images. For pixelwise classification tasks, such as image segmentation and object detection, surrounding image patches are fed into the CNN for predicting the classes of the centered pixels via forward propagation and for updating CNN parameters via […]

CUDA

Dec, 16

Multi-Centroid PSO Classification Learning on the GPU

Training classifiers can be seen as an optimization problem. With this view, we have developed a method to train a type of nearest centroid classifier with PSO. Results showed an improvement on most of the datasets tested. Additionally, we have developed a method to utilize the developed classifier with datasets containing both numeric and categorical […]

CUDA

Dec, 16

An Optimized GPU Memory Hierarchy Design for an OpenCL Kernel

With the advent of multi- and many-core processors, communication has replaced computation as the performance bottleneck. Most current approaches to the problem try to tolerate memory access latency through a high degree of Thread-Level Parallelism. However, not all applications benefit from such techniques and there is a need to address the weakness of the underlying […]

OpenCL

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

* * *

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

Most viewed papers (last 30 days)