high performance computing on graphics processing units: hgpu.org

Posts

Aug, 13

Orthorectification by Using GPGPU Method

Thanks to the nature of graphics processing, newly released products offer highly parallel processing units with high memory bandwidth and computational power of more than a teraflop. Modern GPUs are not only powerful graphics engines but also highly parallel programmable processors with very fast computing capabilities and high memory bandwidth […]

CUDA

Aug, 13

Real-Time Exact Graph Matching with Application in Human Action Recognition

Graph matching is one of the principal methods to formulate the correspondence between two sets of points in computer vision and pattern recognition. Most formulations are based on the minimization of a difficult energy function which is known to be NP-hard. Traditional methods solve the minimization problem approximately. In this paper, we derive an exact […]

CUDA

Aug, 13

Spiking Neural Networks for Real-Time Infrared Images Processing in Thermo Vision Systems

Thermo vision systems are used in military, police, customs, traffic control, industrial, and other specific applications to collect and process thermo-visual information from infrared images. There is a problem in implementing the developed methods and algorithms for infrared image processing in real-time practical applications of thermo vision systems. Here is […]

CUDA

Aug, 13

Dense Matrix Computation on a Heterogenous Architecture: A Block Synchronous Approach

We present a strategy for efficient use of all components of a heterogenous compute node of a typical current generation cluster. Such nodes often comprise multiple sockets with a multicore processor per socket and one or more accelerators, possibly from different generations and/or types. Our strategy differs from schedulers such as Quark or SuperMatrix in […]

Aug, 13

Multi-GPU-based Swendsen-Wang multi-cluster algorithm for the simulation of two-dimensional q-state Potts model

We present multiple-GPU computing with the Compute Unified Device Architecture (CUDA) for the Swendsen-Wang multi-cluster algorithm of the two-dimensional (2D) q-state Potts model. Extending our algorithm for single GPU computing [Comp. Phys. Comm. 183 (2012) 1155], we realize the GPU computation of the Swendsen-Wang multi-cluster algorithm for multiple GPUs. We implement our code on […]

CUDA

Aug, 11

Real-Time Implementation of Remotely Sensed Hyperspectral Image Unmixing on GPUs

Spectral unmixing is one of the most popular techniques to analyze remotely sensed hyperspectral images. It generally comprises three stages: 1) reduction of the dimensionality of the original image to a proper subspace; 2) automatic identification of pure spectral signatures (called endmembers); and 3) estimation of the fractional abundance of each endmember in each pixel […]

CUDA

Aug, 11

Designing OP2 for GPU architectures

OP2 is an "active" library framework for the solution of unstructured mesh applications. It aims to decouple the specification of a scientific application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the back-end to different multi-core/many-core hardware. This paper presents the design of the current OP2 library for generating efficient […]

Aug, 11

Compiler Optimizations for Industrial Unstructured Mesh CFD Applications on GPUs

Graphical Processing Units (GPUs) have shown acceleration factors over multicores for structured mesh-based Computational Fluid Dynamics (CFD). However, the value remains unclear for dynamic and irregular applications. Our motivating example is HYDRA, an application used in production at Rolls Royce for the simulation of turbomachinery components of jet engines. In previous work we presented three […]

CUDA

Aug, 11

OP2: An Active Library Framework for Solving Unstructured Mesh-based Applications on Multi-Core and Many-Core Architectures

OP2 is an "active" library framework for the solution of unstructured mesh-based applications. It utilizes source-to-source translation and compilation so that a single application code written using the OP2 API can be transformed into different parallel implementations for execution on different back-end hardware platforms. In this paper we present the design of the current OP2 […]

CUDA

Aug, 11

Large Scale Monte Carlo Tree Search on GPU

Monte Carlo Tree Search (MCTS) is a method for making optimal decisions in artificial intelligence (AI) problems, typically for move planning in combinatorial games. It combines the generality of random simulation with the precision of tree search. Research interest in MCTS has risen sharply due to its spectacular success with computer Go and its potential […]

CUDA

Aug, 10

Weighted Block-Asynchronous Iteration on GPU-Accelerated Systems

In this paper, we analyze the potential of using weights for block-asynchronous relaxation methods on GPUs. For this purpose, we introduce different weighting techniques similar to those applied in block-smoothers for multigrid methods. For test matrices taken from the University of Florida Matrix Collection we report the convergence behavior and the total runtime for the […]

CUDA
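
As a rough illustration of what a relaxation weight does in the entry above, here is a minimal sketch of plain (synchronous) weighted Jacobi iteration in Python. It is not the block-asynchronous method from the paper; the function name `weighted_jacobi`, the weight `omega=0.8`, and the small test matrix are all assumptions chosen for illustration.

```python
def weighted_jacobi(A, b, omega=0.8, iters=200):
    """Synchronous weighted Jacobi: x <- x + omega * D^{-1} (b - A x).

    A is a dense square matrix (list of rows) with a nonzero diagonal.
    omega and iters are illustrative defaults, not values from the paper.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        # The residual uses only the previous iterate (no in-place updates).
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        # Scale the diagonal-preconditioned residual by the weight omega.
        x = [x[i] + omega * r[i] / A[i][i] for i in range(n)]
    return x


# Small diagonally dominant system whose exact solution is [1, 1, 1].
A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]
x = weighted_jacobi(A, b)
```

With `omega = 1` this reduces to ordinary Jacobi; other weights damp or amplify each correction, which changes the convergence rate.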

Aug, 10

Optimizing Memory-Bound Numerical Kernels on GPU Hardware Accelerators

Hardware accelerators are becoming ubiquitous in high performance scientific computing. They are capable of delivering an unprecedented level of concurrent execution contexts. High-level programming languages (e.g., CUDA) and profiling tools (e.g., PAPI-CUDA, CUDA Profiler) are paramount to improving productivity while effectively exploiting the underlying hardware. We present an optimized numerical kernel for computing the symmetric matrix-vector product […]

CUDA
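
For readers unfamiliar with the operation named in the entry above, the following is a minimal Python sketch of a symmetric matrix-vector product that stores only the upper triangle. It is not the optimized GPU kernel from the paper; the function name `symv_upper` and the row-by-row packed layout are assumptions made for illustration.

```python
def symv_upper(n, ap, x):
    """Compute y = A x for a symmetric n-by-n matrix A.

    ap holds the upper triangle of A packed row by row:
    A[0][0], A[0][1], ..., A[0][n-1], A[1][1], ..., A[n-1][n-1].
    (Hypothetical layout chosen for this sketch.)
    """
    y = [0.0] * n
    k = 0
    for i in range(n):
        for j in range(i, n):
            a = ap[k]
            k += 1
            # Contribution of A[i][j] to row i.
            y[i] += a * x[j]
            # By symmetry A[j][i] == A[i][j], so reuse the same value for row j.
            if j != i:
                y[j] += a * x[i]
    return y


# A = [[2, 1, 0],
#      [1, 3, 4],
#      [0, 4, 5]]
ap = [2.0, 1.0, 0.0, 3.0, 4.0, 5.0]
y = symv_upper(3, ap, [1.0, 2.0, 3.0])
```

Each stored off-diagonal element is read once and used for two updates. That reuse matters for a memory-bound kernel, because memory traffic rather than arithmetic limits its speed.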
