high performance computing on graphics processing units: hgpu.org

Posts

Nov, 19

Exploiting concurrent kernel execution on graphic processing units

Graphics processing units (GPUs) have been accepted as a powerful and viable coprocessor solution in the high-performance computing domain. To maximize the benefit of GPUs for a multicore platform, a mechanism is needed for CPU threads in a parallel application to share this computing resource for efficient execution. NVIDIA’s Fermi architecture pioneers the feature […]

CUDA

Nov, 19

Towards Faster Cloth Simulation: Examining the Preconditioned Conjugate Gradient

High-quality cloth simulation is based on implicit methods. A variety of methods have been proposed to solve the linear systems of equations, with the conjugate gradient and multi-grid being the most commonly used. In this technical report we examine the preconditioned conjugate gradient method. More precisely, we analyze the quality of different preconditioning schemes […]

OpenCL

Nov, 19

Towards Efficient GPU Sharing on Multicore Processors

Scalable systems employing a mix of GPUs with CPUs are becoming increasingly prevalent in high-performance computing (HPC). The presence of such accelerators introduces significant challenges and complexities to both language developers and end users. This paper provides a close study of efficient coordination mechanisms to handle parallel requests from multiple hosts of control to a […]

CUDA

Nov, 19

ShoveRand: a model-driven framework to easily generate random numbers on GP-GPU

Stochastic simulations are often sensitive to the randomness source that characterizes the statistical quality of their results. Consequently, we need highly reliable Random Number Generators (RNGs) to feed such applications. Recent developments try to shrink the computation time by using more and more General Purpose Graphics Processing Units (GP-GPUs) to speed up stochastic simulations. Such devices […]

CUDA

Nov, 19

SGPU 2: a runtime system for using large applications on clusters of hybrid nodes

In this article, we consider hybrid architectures that consist of standard CPU cores associated with accelerators (such as GPUs). These architectures are increasingly employed in large computing centers. We develop a strategy designed to deal with hybrid computing architectures from the computing performance and programmability points of view. We focus on hybrid computing clusters that […]

CUDA

Nov, 19

Predictive Modeling and Analysis of OP2 on Distributed Memory GPU Clusters

OP2 is an "active" library framework for the development and solution of unstructured mesh-based applications. It aims to decouple the scientific specification of an application from its parallel implementation to achieve code longevity and near-optimal performance through re-targeting the backend to different multi-core/many-core hardware. This paper presents a summary of a predictive performance analysis and […]

CUDA

Nov, 19

Teaching graphics processing and architecture using a hardware prototyping approach

Since its introduction over two decades ago, graphics hardware has continued to evolve to improve rendering performance and increase programmability. While most undergraduate courses in computer graphics focus on rendering algorithms and programming APIs, we have recently created an undergraduate senior elective course that focuses on graphics processing and architecture, with a strong emphasis on […]

OpenGL

Nov, 19

StreamMR: An Optimized MapReduce Framework for AMD GPUs

MapReduce is a programming model from Google that facilitates parallel processing on a cluster of thousands of commodity computers. The success of MapReduce in cluster environments has motivated several studies of implementing MapReduce on a graphics processing unit (GPU), but generally focusing on the NVIDIA GPU. Our investigation reveals that the design and mapping of […]

OpenCL

Nov, 18

Design and Implementation of a PTX Emulation Library

Intel co-founder Gordon E. Moore observed in 1965 that transistor density, the number of transistors that could be placed in an integrated circuit per square inch, increased exponentially, doubling roughly every two years. This would be later known as Moore’s Law, correctly predicting the trend that governed computing hardware manufacturing for the late 20th century. […]

Nov, 18

Particle-based Visualization of Large Cosmological Datasets

Large quantities of simulated cosmological particle-based data cause considerable problems when it comes to real-time visualization. This paper considers an out-of-core approach for solving visualization problems on a single-desktop workstation. The approach proposed in this paper consists of two phases: the data preprocessing and its visualization. During the preprocessing, the cosmological data is hierarchically organized […]

OpenGL

Nov, 18

Tapping the supercomputer under your desk: Solving dynamic equilibrium models with graphics processors

This paper shows how to build algorithms that use graphics processing units (GPUs) installed in most modern computers to solve dynamic equilibrium models in economics. In particular, we rely on the compute unified device architecture (CUDA) of NVIDIA GPUs. We illustrate the power of the approach by solving a simple real business cycle model with […]

CUDA

Nov, 18

The MOPED framework: Object recognition and pose estimation for manipulation

We present MOPED, a framework for Multiple Object Pose Estimation and Detection that seamlessly integrates single-image and multi-image object recognition and pose estimation in one optimized, robust, and scalable framework. We address two main challenges in computer vision for robotics: robust performance in complex scenes, and low latency for real-time operation. We achieve robust performance […]

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository
