high performance computing on graphics processing units: hgpu.org

Posts

Jan, 13

CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection

Large language models (LLMs) have been proposed as powerful tools for detecting software vulnerabilities, where task-specific fine-tuning is typically employed to provide vulnerability-specific knowledge to the LLMs for this purpose. However, traditional full-parameter fine-tuning is inefficient for modern, complex LLMs, which contain billions of parameters. Soft prompt tuning has been suggested as a more efficient […]

CUDA

Jan, 13

SCALE-Ahead-Of-Time Compilation of CUDA for AMD GPUs

SCALE is a new solution by Spectral Compute that empowers developers to write code once and deploy it across a range of GPU hardware platforms without modifying the original code. Designed to extend CUDA’s capabilities to AMD GPUs, SCALE maintains CUDA compatibility while introducing novel features that streamline GPU programming. This demo paper presents SCALE’s […]

CUDA

Jan, 13

LeetDecoding: A PyTorch Library for Exponentially Decaying Causal Linear Attention with CUDA Implementations

The machine learning and data science community has made significant, if scattered, progress in accelerating transformer-based large language models (LLMs), and one promising approach is to replace the original causal attention in a generative pre-trained transformer (GPT) with exponentially decaying causal linear attention. In this paper, we present LeetDecoding, which is the first Python package […]

CUDA

Jan, 13

Data Parallel Visualization and Rendering on the RAMSES Supercomputer with ANARI

3D visualization and rendering in HPC are very heterogeneous applications, though fundamentally the tasks involved are well-defined and do not differ much from application to application. The Khronos Group’s ANARI standard seeks to consolidate 3D rendering across sci-vis applications. This paper makes an effort to convey challenges of 3D rendering and visualization with ANARI in […]

Jan, 13

Validation of GPU Computation in Decentralized, Trustless Networks

Verifying computational processes in decentralized networks poses a fundamental challenge, particularly for Graphics Processing Unit (GPU) computations. Our investigation reveals significant limitations in existing approaches: exact recomputation fails due to computational non-determinism across GPU nodes, Trusted Execution Environments (TEEs) require specialized hardware, and Fully Homomorphic Encryption (FHE) faces prohibitive computational costs. To address these challenges, […]

Jan, 6

Enhancing Deployment-Time Predictive Model Robustness for Code Analysis and Optimization

Supervised machine learning techniques have shown promising results in code analysis and optimization problems. However, a learning-based solution can be brittle because minor changes in hardware or application workloads — such as facing a new CPU architecture or code pattern — may jeopardize decision accuracy, ultimately undermining model robustness. We introduce Prom, an open-source library […]

OpenCL

Jan, 6

Finding Missed Code Size Optimizations in Compilers using LLMs

Compilers are complex, and significant effort has been expended on testing them. Techniques such as random program generation and differential testing have proved highly effective and have uncovered thousands of bugs in production compilers. The majority of effort has been expended on validating that a compiler produces correct code for a given input, while less […]

Jan, 6

Debunking the CUDA Myth Towards GPU-based AI Systems

With the rise of AI, NVIDIA GPUs have become the de facto standard for AI system design. This paper presents a comprehensive evaluation of Intel Gaudi NPUs as an alternative to NVIDIA GPUs for AI model serving. First, we create a suite of microbenchmarks to compare Intel Gaudi-2 with NVIDIA A100, showing that Gaudi-2 achieves […]

CUDA

Jan, 6

Performant Automatic BLAS Offloading on Unified Memory Architecture with OpenMP First-Touch Style Data Movement

BLAS is a fundamental building block of advanced linear algebra libraries and many modern scientific computing applications. GPUs are known for their strong arithmetic computing capabilities and are highly suited for BLAS operations. However, porting code to GPUs often requires significant effort, especially for large, complex codes or legacy codes, even for BLAS-heavy applications. While […]

CUDA

Jan, 6

A comparison of HPC-based quantum computing simulators using Quantum Volume

This paper compares quantum computing simulators running on a single CPU or GPU-based HPC node using the Quantum Volume benchmark commonly proposed for comparing NISQ systems. As simulators do not suffer from noise, the metric used in the comparison is the time required to simulate a set Quantum Volume. The results are important to estimate […]

CUDA, OpenCL

Dec, 29

Scalable Access-Pattern Aware I/O Acceleration and Multi-Tiered Data Management for HPC and AI Workloads

The exponential growth of data-intensive scientific simulations and deep learning workloads presents significant challenges for high-performance computing (HPC) systems. These workloads generate massive data volumes at unprecedented velocities, straining the capabilities of existing memory hierarchies, I/O subsystems, and scheduling mechanisms. This dissertation addresses critical challenges in data management and workload scheduling to enhance performance, scalability, and […]

CUDA

Dec, 29

Asynchronous-Many-Task Systems: Challenges and Opportunities – Scaling an AMR Astrophysics Code on Exascale machines using Kokkos and HPX

Dynamic and adaptive mesh refinement is pivotal in high-resolution, multi-physics, multi-model simulations, necessitating precise physics resolution in localized areas across expansive domains. Today’s supercomputers’ extreme heterogeneity presents a significant challenge for dynamically adaptive codes, highlighting the importance of achieving performance portability at scale. Our research focuses on astrophysical simulations, particularly stellar mergers, to elucidate early […]

CUDA

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)