
Posts

Jun, 16

Understanding GPU Triggering APIs for MPI+X Communication

GPU-enhanced architectures are now dominant in HPC systems, but message-passing communication involving GPUs with MPI has proven to be both complex and expensive, motivating new approaches that lower such costs. We compare and contrast stream/graph- and kernel-triggered MPI communication abstractions, whose principal purpose is to enhance the performance of communication when GPU kernels create or […]
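
To make the contrast concrete, a minimal sketch (not the paper's code): the classic synchronize-then-send pattern next to a stream-ordered send, using cudaLaunchHostFunc as a portable stand-in for the vendor-specific triggering extensions the paper compares.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Baseline MPI+GPU pattern the triggered APIs aim to improve on:
// the CPU must block on the stream before it may hand data to MPI.
void baseline_send(float* d_buf, int n, int peer, cudaStream_t s) {
    // produce<<<grid, block, 0, s>>>(d_buf, n);  // kernel fills d_buf
    cudaStreamSynchronize(s);                     // CPU stalls here
    MPI_Send(d_buf, n, MPI_FLOAT, peer, 0, MPI_COMM_WORLD);  // CUDA-aware MPI
}

// Stream-triggered flavor, sketched with a host callback: the send is
// enqueued on the stream, ordered after the kernel, and the CPU never
// blocks. Real triggered APIs avoid even this host round trip; note that
// CUDA forbids CUDA API calls inside the callback, so a CUDA-aware
// MPI_Send here is illustrative only.
struct SendCtx { float* buf; int n; int peer; };

void CUDART_CB do_send(void* p) {
    SendCtx* c = static_cast<SendCtx*>(p);
    MPI_Send(c->buf, c->n, MPI_FLOAT, c->peer, 0, MPI_COMM_WORLD);
}

void triggered_send(float* d_buf, int n, int peer, cudaStream_t s) {
    static SendCtx ctx;
    ctx = {d_buf, n, peer};
    cudaLaunchHostFunc(s, do_send, &ctx);  // fires once the kernel completes
}
```
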
Jun, 16

Stencil Computations on AMD and Nvidia Graphics Processors: Performance and Tuning Strategies

Over the last ten years, graphics processors have become the de facto accelerator for data-parallel tasks in various branches of high-performance computing, including machine learning and computational sciences. However, with the recent introduction of AMD-manufactured graphics processors to the world’s fastest supercomputers, tuning strategies established for previous hardware generations must be re-evaluated. In this study, […]
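
As a toy illustration of the tuning surface (not from the paper), a 3-point 1D stencil in CUDA-style code; the block size and shared-memory staging below are exactly the kinds of knobs whose best settings differ between AMD and NVIDIA hardware.

```cuda
#define BLOCK_X 256  // hypothetical default; re-tune per architecture

// Launch with BLOCK_X threads per block. Each block stages its tile plus a
// one-element halo on each side in shared memory before computing.
__global__ void stencil1d(const float* __restrict__ in,
                          float* __restrict__ out, int n) {
    __shared__ float tile[BLOCK_X + 2];
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int l = threadIdx.x + 1;                        // index inside the tile

    if (g < n) tile[l] = in[g];
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1)
        tile[BLOCK_X + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    if (g > 0 && g < n - 1)
        out[g] = 0.25f * tile[l - 1] + 0.5f * tile[l] + 0.25f * tile[l + 1];
}
```
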
Jun, 16

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs

Although Large Language Models (LLMs) have demonstrated significant capabilities in executing complex tasks in a zero-shot manner, they are susceptible to jailbreak attacks and can be manipulated to produce harmful outputs. Recently, a growing body of research has categorized jailbreak attacks into token-level and prompt-level attacks. However, previous work primarily overlooks the diverse key factors […]
Jun, 16

A methodology for comparing optimization algorithms for auto-tuning

Adapting applications to optimally utilize available hardware is no mean feat: the plethora of optimization choices makes manual tuning infeasible. To this end, auto-tuning frameworks are used to automate the task; these in turn use optimization algorithms to efficiently search the vast search spaces. However, there is a lack of comparability in studies […]
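
For intuition, a toy sketch (all names illustrative, not the paper's framework): the measure-and-compare loop below is the skeleton every auto-tuner wraps around its optimization algorithm; replacing the random choice with model-based or evolutionary search is precisely what a fair comparison methodology must account for.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

int main() {
    const int n = 1 << 24;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    const int candidates[] = {32, 64, 128, 256, 512, 1024};
    int best_block = -1;
    float best_ms = 1e30f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int trial = 0; trial < 20; ++trial) {  // fixed search budget
        int block = candidates[rand() % 6];     // the "optimization algorithm"
        int grid = (n + block - 1) / block;
        cudaEventRecord(start);
        saxpy<<<grid, block>>>(2.0f, x, y, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }
    printf("best block size: %d (%.3f ms)\n", best_block, best_ms);
    cudaFree(x); cudaFree(y);
    return 0;
}
```
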
Jun, 16

How much can we gain from Tensor Kernel Fusion on GPUs?

Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks, where it involves combining multiple consecutive kernels into a single larger kernel. This approach aims to enhance performance by reducing the need for slow off-chip memory accesses. Instead, intermediate results between successive kernels are stored in faster on-chip memory like shared […]
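
A minimal, hypothetical example of the idea: two elementwise kernels fused into one, so the intermediate value never makes the round trip through off-chip memory.

```cuda
// Unfused: "tmp" is written to and re-read from global (off-chip) memory.
__global__ void scale(const float* x, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = 2.0f * x[i];
}
__global__ void add(const float* tmp, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = tmp[i] + y[i];
}

// Fused: the intermediate lives in a register, eliminating one global
// write and one global read per element.
__global__ void scale_add_fused(const float* x, const float* y,
                                float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = 2.0f * x[i];  // stays on-chip
        out[i] = t + y[i];
    }
}
```
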
Jun, 9

Memory Interference and Performance Prediction in GPU-Accelerated Heterogeneous Systems

Nowadays, a variety of applications, including automated factories, autonomous vehicles, and Cyber-Physical Systems (CPS), are experiencing significant growth. Given the diverse range of challenges that must be addressed, such as real-time management and visualization of a factory’s current state through a 3D digital twin, trajectory calculation within autonomous vehicles, visualizing Human Machine Interfaces (HMI), […]
Jun, 9

Gaining Cross-Platform Parallelism for HAL’s Molecular Dynamics Package using SYCL

Molecular dynamics simulations are one of the methods in scientific computing that benefit from GPU acceleration. For those devices, SYCL is a promising API for writing portable codes. In this paper, we present the case study of "HAL’s MD package" that has been successfully migrated from CUDA to SYCL. We describe the different strategies that […]
Jun, 9

More Bang For Your Buck(et): Fast and Space-efficient Hardware-accelerated Coarse-granular Indexing on GPUs

In recent work, we have shown that NVIDIA’s ray-tracing cores on RTX-class GPUs can be exploited to realize hardware-accelerated lookups for GPU-resident database indexes. At a high level, the concept materializes all keys as triangles in a 3D scene and indexes them. Lookups are performed by firing rays into the scene and utilizing the […]
Jun, 9

Fast and Practical Strassen’s Matrix Multiplication using FPGAs

Matrix multiplication is a cornerstone operation in a wide array of scientific fields, including machine learning and computer graphics. The standard algorithm for matrix multiplication has a complexity of O(n^3) for n×n matrices. Strassen’s algorithm improves this to O(n^2.807), but its practicality is limited for small to medium matrix sizes due to the large number […]
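
For reference, the exponent follows from Strassen's seven half-size products (versus eight in the standard blocked algorithm) via the master theorem:

```latex
T(n) = 7\,T\!\left(\tfrac{n}{2}\right) + \Theta\!\left(n^{2}\right)
\;\Longrightarrow\;
T(n) = \Theta\!\left(n^{\log_{2} 7}\right) \approx \Theta\!\left(n^{2.807}\right)
```
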
Jun, 9

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem for a directed, weighted graph, whose nodes represent GPU instances and edges capture both […]
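
In standard max-flow terms (our gloss of the abstract, not necessarily the paper's exact formulation), the objective is the usual linear program, where the capacities c_e encode both GPU compute throughput and inter-GPU link bandwidth:

```latex
\max_{f} \sum_{e \in \delta^{+}(s)} f_e
\quad \text{s.t.} \quad
0 \le f_e \le c_e \;\;\forall e \in E,
\qquad
\sum_{e \in \delta^{-}(v)} f_e = \sum_{e \in \delta^{+}(v)} f_e
\;\;\forall v \in V \setminus \{s, t\}
```
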
Jun, 2

Addressing Challenges in Utilizing GPUs for Accelerating Privacy-Preserving Computation

Cloud computing increasingly handles confidential data, such as private inference and private database queries. Two strategies are used for secure computation: (1) employing CPU Trusted Execution Environments (TEEs) like AMD SEV, Intel SGX, or ARM TrustZone, and (2) utilizing emerging cryptographic methods like Fully Homomorphic Encryption (FHE), with libraries such as HElib, Microsoft SEAL, and PALISADE. To […]
Jun, 2

Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)

While Field-Programmable Gate Arrays (FPGAs) exist in many design configurations throughout the data center, cloud, and edge, the promise of performance and flexibility offered by the FPGA often remains unrealized for lack of hardware-design expertise, with most computation remaining in fixed hardware such as CPUs, GPUs, and ASICs (e.g., tensor processors). Identifying programmability as […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org