high performance computing on graphics processing units: hgpu.org

Posts

May, 12

Optimizing Hardware Resource Partitioning and Job Allocations on Modern GPUs under Power Caps

CPU-GPU heterogeneous systems are now commonly used in HPC (High-Performance Computing). However, improving the utilization and energy-efficiency of such systems is still one of the most critical issues. As one single program typically cannot fully utilize all resources within a node/chip, co-scheduling (or co-locating) multiple programs with complementary resource requirements is a promising solution. Meanwhile, […]

CUDA

May, 12

CuPBoP: Making CUDA a Portable Language

CUDA is designed specifically for NVIDIA GPUs and is not compatible with non-NVIDIA devices. Enabling CUDA execution on alternative backends could greatly benefit the hardware community by fostering a more diverse software ecosystem. To address the need for portability, our objective is to develop a framework that meets key requirements, such as extensive coverage, comprehensive […]

CUDA • OpenCL

May, 5

A Survey of Deep Learning Library Testing Methods

In recent years, software systems powered by deep learning (DL) techniques have significantly facilitated people’s lives in many aspects. As the backbone of these DL systems, various DL libraries undertake the underlying optimization and computation. However, like traditional software, DL libraries are not immune to bugs, which can pose serious threats to users’ personal property […]

May, 5

GROMACS on AMD GPU-Based HPC Platforms: Using SYCL for Performance and Portability

GROMACS is a widely-used molecular dynamics software package with a focus on performance, portability, and maintainability across a broad range of platforms. Thanks to its early algorithmic redesign and flexible heterogeneous parallelization, GROMACS has successfully harnessed GPU accelerators for more than a decade. With the diversification of accelerator platforms in HPC and no obvious choice […]

May, 5

Porting HPC Applications to AMD Instinct MI300A Using Unified Memory and OpenMP

AMD Instinct MI300A is the world’s first data center accelerated processing unit (APU) with memory shared between the AMD "Zen 4" EPYC cores and third generation CDNA compute units. A single memory space offers several advantages: i) it eliminates the need for data replication and costly data transfers, ii) it substantially simplifies application development and […]

May, 5

Automatic BLAS Offloading on Unified Memory Architecture: A Study on NVIDIA Grace-Hopper

Porting codes to GPUs often requires major effort. While several tools exist for automatically offloading numerical libraries such as BLAS and LAPACK, they often prove impractical due to the high cost of mandatory data transfer. The new unified memory architecture in NVIDIA Grace-Hopper allows high bandwidth cache-coherent memory access of all memory from both CPU […]

CUDA

May, 5

Experiences with implementing Kokkos’ SYCL backend

With the recent diversification of the hardware landscape in the high-performance computing community, performance-portability solutions are becoming more and more important. One of the most popular choices is Kokkos. In this paper, we describe how Kokkos maps to SYCL 2020, how SYCL had to evolve to enable a full Kokkos implementation, and where we still […]

Apr, 21

SimSYCL: A SYCL Implementation Targeting Development, Debugging, Simulation and Conformance

The open SYCL standard has established itself as a cross-vendor, cross-platform means to develop software which benefits from GPU and accelerator parallelism. Inherent difficulties in portability between and debuggability of programs for these targets remain. However, as we demonstrate, the SYCL specification lends itself to be implemented purely in software in a manner that is […]

Apr, 21

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., AMD/Xilinx Versal ACAP and Intel Stratix 10 […]

Apr, 21

Software Optimization and Orchestration for Heterogeneous and Distributed Architectures

In the context of the Edge-Cloud computing continuum, containerization and orchestration have become two key requirements in software development best practices. Containerization allows for better resource utilization, platform-independent development, and secure software deployment. Orchestration automates the deployment, networking, scaling, and availability of containerized workloads and services. However, there are still several open challenges. First, the […]

Apr, 21

Communication-Efficient Large-Scale Distributed Deep Learning: A Comprehensive Survey

With the rapid growth in the volume of data sets, models, and devices in the domain of deep learning, there is increasing attention on large-scale distributed deep learning. In contrast to traditional distributed deep learning, the large-scale scenario poses new challenges that include fault tolerance, scalability of algorithms and infrastructures, and heterogeneity in data sets, […]

CUDA

Apr, 21

Python-Based Quantum Chemistry Calculations with GPU Acceleration

To meet the increasing demand of quantum chemistry calculations in data-driven chemical research, the collaboration between industrial stakeholders and the quantum chemistry community has led to the development of GPU4PySCF, a GPU-accelerated Python package. This open-source project is accessible via its public GitHub repository. This paper outlines the primary features, innovations, and advantages of this […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)