high performance computing on graphics processing units: hgpu.org

Posts

Feb, 18

Graphtoy: Fast Software Simulation of Applications for AMD’s AI Engines

This work presents Graphtoy, a coroutine-based compute graph simulator built in C++20, which can be embedded into a target application for rapid step-by-step prototyping of graphs targeting AMD’s AI Engines, as used in Versal FPGAs and Ryzen 7040 CPUs. By using a molecular docking application as a case study, we demonstrate: 1) how compute graphs […]

Feb, 18

An Evaluative Comparison of Performance Portability across GPU Programming Models

Ensuring high productivity in scientific software development necessitates developing and maintaining a single codebase that can run efficiently on a range of accelerator-based supercomputing platforms. While prior work has investigated the performance portability of a few selected proxy applications or programming models, this paper provides a comprehensive study of a range of proxy applications implemented […]

CUDA

Feb, 18

pSTL-Bench: A Micro-Benchmark Suite for Assessing Scalability of C++ Parallel STL Implementations

Since the advent of parallel algorithms in the C++17 Standard Template Library (STL), the STL has become a viable framework for creating performance-portable applications. Given multiple existing implementations of the parallel algorithms, a systematic, quantitative performance comparison is essential for choosing the appropriate implementation for a particular hardware configuration. In this work, we introduce a […]

CUDA

Feb, 18

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

We introduce QUICK, a group of novel optimized CUDA kernels for the efficient inference of quantized Large Language Models (LLMs). QUICK addresses the shared memory bank-conflict problem of state-of-the-art mixed precision matrix multiplication kernels. Our method interleaves the quantized weight matrices of LLMs offline to skip the shared memory write-back after the dequantization. We demonstrate […]

CUDA

Feb, 12

Multi-line AI-assisted Code Authoring

CodeCompose is an AI-assisted code authoring tool powered by large language models (LLMs) that provides inline suggestions to tens of thousands of developers at Meta. In this paper, we present how we scaled the product from displaying single-line suggestions to multi-line suggestions. This evolution required us to overcome several unique challenges in improving the usability […]

Feb, 12

DeepSeek-Coder: When the Large Language Model Meets Programming – The Rise of Code Intelligence

The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models […]

Feb, 12

Out of kernel tuning and optimizations for portable large-scale docking experiments on GPUs

Virtual screening is an early stage in the drug discovery process that selects the most promising candidates. In the urgent computing scenario, finding a solution in the shortest time frame is critical. Any improvement in the performance of a virtual screening application translates into an increase in the number of candidates evaluated, thereby raising the […]

CUDA

Feb, 12

Evaluating the Wide Area Classroom After 24,000 HPC Students

As of 2023 we have taught more than 24,000 students over the course of 106 events using the Wide Area Classroom, a novel distributed teaching platform. This has been a successful effort gauged by several important metrics. We describe both the technical and logistical structure of these events as well as the specific HPC curriculums […]

Feb, 12

Training DNN Models over Heterogeneous Clusters with Optimal Performance

Adjusting batch sizes and adaptively tuning other hyperparameters can significantly speed up deep neural network (DNN) training. Despite the ubiquity of heterogeneous clusters, existing adaptive DNN training techniques solely consider homogeneous environments. Optimizing distributed DNN training over heterogeneous clusters is technically challenging, and directly adapting existing techniques results in low utilization and poor performance. To […]

CUDA

Feb, 4

Gallatin: A General-Purpose GPU Memory Manager

Dynamic memory management is critical for efficiently porting modern data processing pipelines to GPUs. However, building a general-purpose dynamic memory manager on GPUs is challenging due to the massive parallelism and weak memory coherence. Existing state-of-the-art GPU memory managers, Ouroboros and Reg-Eff, employ traditional data structures such as arrays and linked lists to manage memory […]

CUDA

Feb, 4

Deductive verification for SYCL

A heterogeneous computing system is a system composed of different types of computing units. SYCL is a software development framework with which programs can be developed for such systems. It uses the concept of kernels, where a kernel executes code inside it in parallel, and different kernels can be executed concurrently on multiple computing units. […]

CUDA • OpenCL

Feb, 4

LeftoverLocals: Listening to LLM Responses Through Leaked GPU Local Memory

This paper describes LeftoverLocals: a vulnerability that allows data recovery from GPU memory created by another process on Apple, Qualcomm, and AMD GPUs. LeftoverLocals impacts the security posture of GPU applications, with particular significance to LLMs and ML models that run on impacted GPUs. By recovering local memory, an optimized GPU memory region, we built […]

OpenCL

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
