high performance computing on graphics processing units: hgpu.org

Posts

Nov, 20

Autotuning GEMMs for Fermi

In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial […]

CUDA

Nov, 20

Hierarchical QR factorization algorithms for multi-core cluster systems

This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms make the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential […]

Nov, 20

Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system […]

CUDA

Nov, 20

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs

GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms […]

CUDA

Nov, 20

Parallelized Incomplete Poisson Preconditioner in Cloth Simulation

Efficient cloth simulation is an important problem for interactive applications that involve virtual humans, such as computer games. A common aspect of many methods that have been developed to simulate cloth is a linear system of equations, which is commonly solved using conjugate gradient or multi-grid approaches. In this paper, we introduce to the computer […]

Nov, 19

Using the High Productivity Language Chapel to Target GPGPU Architectures

It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we […]

CUDA

Nov, 19

Anisotropic mesh coarsening and refinement on GPU architecture

Finite element and finite volume methods on unstructured meshes offer a powerful approach to solving partial differential equations in complex domains. It has diverse application in areas such as industrial and geophysical fluid dynamics, structural mechanics, and radiative transfer. A key strength of the approach is the unstructured meshes' flexibility in conforming to complex geometry […]

CUDA

Nov, 19

Exploiting concurrent kernel execution on graphic processing units

Graphics processing units (GPUs) have been accepted as a powerful and viable coprocessor solution in the high-performance computing domain. In order to maximize the benefit of GPUs for a multicore platform, a mechanism is needed for CPU threads in a parallel application to share this computing resource for efficient execution. NVIDIA’s Fermi architecture pioneers the feature […]

CUDA

Nov, 19

Towards Faster Cloth Simulation: Examining the Preconditioned Conjugate Gradient

High quality cloth simulation is based on implicit methods. A variety of methods have been proposed to solve the linear systems of equations, with the conjugate gradient and multi-grid being the most commonly used. In this technical report we examine the preconditioned conjugate gradient method. More precisely, we analyze the quality of different preconditioning schemes […]

OpenCL

Nov, 19

Towards Efficient GPU Sharing on Multicore Processors

Scalable systems employing a mix of GPUs with CPUs are becoming increasingly prevalent in high-performance computing (HPC). The presence of such accelerators introduces significant challenges and complexities to both language developers and end users. This paper provides a close study of efficient coordination mechanisms to handle parallel requests from multiple hosts of control to a […]

CUDA

Nov, 19

ShoveRand: a model-driven framework to easily generate random numbers on GP-GPU

Stochastic simulations are often sensitive to the randomness source that characterizes the statistical quality of their results. Consequently, we need highly reliable Random Number Generators (RNGs) to feed such applications. Recent developments try to shrink the computation time by using more and more General Purpose Graphics Processing Units (GP-GPUs) to speed-up stochastic simulations. Such devices […]

CUDA

Nov, 19

SGPU 2: a runtime system for using large applications on clusters of hybrid nodes

In this article, we consider hybrid architectures that consist of standard CPU cores associated with accelerators (such as GPUs). These architectures are increasingly employed in large computing centers. We develop a strategy designed to deal with hybrid computing architectures from the computing performance and programmability points of view. We focus on hybrid computing clusters that […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
