high performance computing on graphics processing units: hgpu.org

Posts

Nov, 20

Using mobile GPU for general-purpose computing – a case study of face recognition on smartphones

As GPU becomes an integrated component in handheld devices like smartphones, we have been investigating the opportunities and limitations of utilizing the ultra-low-power GPU in a mobile platform as a general-purpose accelerator, similar to its role in desktop and server platforms. The special focus of our investigation has been on mobile GPU’s role for energy-optimized […]

Nov, 20

Autotuning GEMMs for Fermi

In recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial […]

CUDA

Nov, 20

Hierarchical QR factorization algorithms for multi-core cluster systems

This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms make the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential […]

Nov, 20

Efficient Support for Matrix Computations on Heterogeneous Multi-core and Multi-GPU Architectures

We present a new methodology for utilizing all CPU cores and all GPUs on a heterogeneous multicore and multi-GPU system to support matrix computations efficiently. Our approach is able to achieve the objectives of a high degree of parallelism, minimized synchronization, minimized communication, and load balancing. Our main idea is to treat the heterogeneous system […]

CUDA

Nov, 20

Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs

GPUs are excellent accelerators for data parallel applications with regular data access patterns. It is challenging, however, to optimize computations with irregular data access patterns on GPUs. One such computation is the Symmetric Matrix Vector product (SYMV) for dense linear algebra. Optimizing the SYMV kernel is important because it forms the basis of fundamental algorithms […]

CUDA

Nov, 20

Parallelized Incomplete Poisson Preconditioner in Cloth Simulation

Efficient cloth simulation is an important problem for interactive applications that involve virtual humans, such as computer games. A common aspect of many methods that have been developed to simulate cloth is a linear system of equations, which is commonly solved using conjugate gradient or multi-grid approaches. In this paper, we introduce to the computer […]

Nov, 19

Using the High Productivity Language Chapel to Target GPGPU Architectures

It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we […]

CUDA

Nov, 19

Anisotropic mesh coarsening and refinement on GPU architecture

Finite element and finite volume methods on unstructured meshes offer a powerful approach to solving partial differential equations in complex domains. It has diverse application in areas such as industrial and geophysical fluid dynamics, structural mechanics, and radiative transfer. A key strength of the approach is the unstructured mesh's flexibility in conforming to complex geometry […]

CUDA

Nov, 19

Exploiting concurrent kernel execution on graphic processing units

Graphics processing units (GPUs) have been accepted as a powerful and viable coprocessor solution in high-performance computing domain. In order to maximize the benefit of GPUs for a multicore platform, a mechanism is needed for CPU threads in a parallel application to share this computing resource for efficient execution. NVIDIA’s Fermi architecture pioneers the feature […]

CUDA

Nov, 19

Towards Faster Cloth Simulation: Examining the Preconditioned Conjugate Gradient

High quality cloth simulation is based on implicit methods. A variety of methods have been proposed to solve the linear systems of equations, with the conjugate gradient and multi-grid being the most commonly used. In this technical report we examine the preconditioned conjugate gradient method. More precisely, we analyze the quality of different preconditioning schemes […]

OpenCL

Nov, 19

Towards Efficient GPU Sharing on Multicore Processors

Scalable systems employing a mix of GPUs with CPUs are becoming increasingly prevalent in high-performance computing (HPC). The presence of such accelerators introduces significant challenges and complexities to both language developers and end users. This paper provides a close study of efficient coordination mechanisms to handle parallel requests from multiple hosts of control to a […]

CUDA

Nov, 19

ShoveRand: a model-driven framework to easily generate random numbers on GP-GPU

Stochastic simulations are often sensitive to the randomness source that characterizes the statistical quality of their results. Consequently, we need highly reliable Random Number Generators (RNGs) to feed such applications. Recent developments try to shrink the computation time by using more and more General Purpose Graphics Processing Units (GP-GPUs) to speed-up stochastic simulations. Such devices […]

CUDA

Recent source codes

OpScanner

Atlas CLI: Machine Learning (ML) Lifecycle & Transparency Manager

transformers_tvm: Implementation of Encoder Decoder transformer on TVM

INT v.s. FP: A framework to compare low-bit integer and float-point formats

AutoDock-GPU: AutoDock for GPUs and other accelerators

NCCLX: collective communication framework

Tutoring LLM into a Better CUDA Optimizer

Adaptivity in AdaptiveCpp: Optimizing Performance by Leveraging Runtime Information During JIT-Compilation

Kernel Library for LLM Serving

Neptune: Advanced ML Operator Fusion for Locality and Parallelism on GPUs