high performance computing on graphics processing units: hgpu.org

Posts

Sep, 19

GPU Algorithms for Efficient Exascale Discretizations

In this paper we describe the research and development activities in the Center for Efficient Exascale Discretization within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek […]

CUDA

Sep, 19

A Study of Mixed Precision Strategies for GMRES on GPUs

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision strategies […]

CUDA

Sep, 5

Supporting CUDA for an extended RISC-V GPU architecture

With the rapid development of scientific computation, more and more researchers and developers are committed to implementing various workloads/operations on different devices. Among all these devices, NVIDIA GPU is the most popular choice due to its comprehensive documentation and excellent development tools. As a result, there are abundant resources for hand-writing high-performance CUDA codes. However, […]

CUDA • OpenCL

Sep, 5

Data-Oriented Language Implementation of Lattice-Boltzmann Method for Dense and Sparse Geometries

The performance of lattice-Boltzmann solver implementations usually depends mainly on memory access patterns. Achieving high performance then requires complex code that handles careful data placement and ordering of memory transactions. In this work, we analyse the performance of an implementation based on a new approach called the data-oriented language, which allows the combining of complex […]

CUDA

Sep, 5

WarpDrive: Extremely Fast End-to-End Deep Multi-Agent Reinforcement Learning on a GPU

Deep reinforcement learning (RL) is a powerful framework to train decision-making models in complex dynamical environments. However, RL can be slow as it learns through repeated interaction with a simulation of the environment. Accelerating RL requires both algorithmic and engineering innovations. In particular, there are key systems engineering bottlenecks when using RL in complex environments […]

CUDA

Sep, 5

LocalityGuru: A PTX Analyzer for Extracting Thread Block-level Locality in GPGPUs

Exploiting data locality in GPGPUs is critical for efficiently using the smaller data caches and handling the memory bottleneck problem. This paper proposes a thread block-centric locality analysis, which identifies the locality among the thread blocks (TBs) in terms of a number of common data references. In LocalityGuru, we seek to employ a detailed just-in-time […]

CUDA

Sep, 5

High Performance GPU Code Generation for Matrix-Matrix Multiplication using MLIR: Some Early Results

This report presents some early results on code generation targeting tensor cores on NVIDIA GPUs using the MLIR compiler infrastructure. The state-of-the-art in high-performance deep learning today is primarily driven by manually optimized highly tuned libraries. The approach to develop such libraries is often not modular or reusable to the same extent that compiler infrastructure […]

CUDA

Aug, 29

A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks

In recent years, there has been a convergence of Big Data (BD), High Performance Computing (HPC), and Machine Learning (ML) systems. This convergence is due to the increasing complexity of long data analysis pipelines on separated software stacks. With the increasing complexity of data analytics pipelines comes a need to evaluate their systems, in order […]

Aug, 29

neoSYCL: a SYCL implementation for SX-Aurora TSUBASA

Recently, the high-performance computing world has moved to more heterogeneous architectures. Thus, it has become a standard practice to offload a part of application execution to dedicated accelerators. However, the disadvantage in productivity is still a problem in programming for accelerators. This paper proposes neoSYCL: a SYCL implementation for SX-Aurora TSUBASA, aiming to improve productivity […]

Aug, 29

The Art of Balance: A RateupDB Experience of Building a CPU/GPU Hybrid Database Product

GPU-accelerated database systems have been studied for more than 10 years, ranging from prototyping development to industry products serving in multiple domains of data applications. Existing GPU database research solutions are often focused on specific aspects in parallel algorithms and system implementations for specific features, while industry product development generally concentrates on delivering a whole […]

CUDA

Aug, 29

Performance Portability and Evaluation of Heterogeneous Components of SeisSol Targeted to Upcoming Intel HPC GPUs

For the first time in over 20 years, Intel is selling discrete graphics cards, including products for high-performance computing, scheduled for release in 2022. This thesis investigates programming models for the upcoming Intel GPUs and selects the SYCL standard, provided by oneAPI and hipSYCL, to port the heterogeneous components of SeisSol. The modules in question […]

CUDA • OpenCL

Aug, 29

GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices based on Fine-Grained Structured Weight Sparsity

It is appealing but challenging to achieve real-time deep neural network (DNN) inference on mobile devices because even the powerful modern mobile devices are considered as "resource-constrained" when executing large-scale DNNs. It necessitates the sparse model inference via weight pruning, i.e., DNN weight sparsity, and it is desirable to design a new DNN weight sparsity […]

OpenCL
