high performance computing on graphics processing units: hgpu.org

Posts

Jun, 12

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors

Tensor Cores have been an important unit to accelerate Fused Matrix Multiplication Accumulation (MMA) in all NVIDIA GPUs since Volta Architecture. To program Tensor Cores, users have to use either legacy wmma APIs or current mma APIs. Legacy wmma APIs are more easy-to-use but can only exploit limited features and power of Tensor Cores. Specifically, […]

Jun, 12

Onesweep: A Faster Least Significant Digit Radix Sort for GPUs

We present Onesweep, a least-significant digit (LSD) radix sorting algorithm for large GPU sorting problems residing in global memory. Our parallel algorithm employs a method of single-pass prefix sum that only requires ~2n global read/write operations for each digit-binning iteration. This exhibits a significant reduction in last-level memory traffic versus contemporary GPU radix sorting implementations, […]

CUDA

Jun, 5

Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use

This research addresses node-level scalability, portability, and heterogeneous computing challenges facing asynchronous many-task (AMT) runtime systems. These challenges have arisen due to increasing socket/core/thread counts and diversity among supported architectures on current and emerging high-performance computing (HPC) systems. This places greater emphasis on thread scalability and simultaneous use of diverse architectures to maximize node use […]

CUDA

Jun, 5

Dropbear: Machine Learning Marketplaces made Trustworthy with Byzantine Model Agreement

Marketplaces for machine learning (ML) models are emerging as a way for organizations to monetize models. They allow model owners to retain control over hosted models by using cloud resources to execute ML inference requests for a fee, preserving model confidentiality. Clients that rely on hosted models require trustworthy inference results, even when models are […]

Jun, 5

FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems

Edge computing enables smart IoT-based systems via concurrent and continuous execution of latency-sensitive machine learning (ML) applications. These edge-based machine learning systems are often battery-powered (i.e., energy-limited). They use heterogeneous resources with diverse computing performance (e.g., CPU, GPU, and/or FPGAs) to fulfill the latency constraints of ML applications. The challenge is to allocate user requests […]

Jun, 5

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing […]

Jun, 5

End-to-end Optimization of Machine Learning Prediction Queries

Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving […]

May, 29

Fast GPU bounding boxes on tree-structured scenes

Computation of bounding boxes is a fundamental problem in high performance rendering, as it is an input to visibility culling and binning operations. In a scene description structured as a tree, clip nodes and blend nodes entail intersection and union of bounding boxes, respectively. These are straightforward to compute on the CPU using a sequential […]

May, 29

User’s needs influencing HPC technologies

The user requirements imposed by modern challenges are influencing future High Performance Computing (HPC) technologies and use cases. This report analyses a wide range of user requirements and new technologies and their impact on European and worldwide HPC trends, in particular in the PRACE and EuroHPC ecosystems, as well as HPC infrastructures provided by member […]

CUDA

•

OpenCL

May, 29

SOL: Reducing the Maintenance Overhead for Integrating Hardware Support into AI Frameworks

The increased interest in Artificial Intelligence (AI) raised the need for highly optimized and sophisticated AI frameworks. Starting with the Lua-based Torch many frameworks have emerged over time, such as Theano, Caffe, Chainer, CNTK, MxNet, PyTorch, DL4J, or TensorFlow. All of these provide a high level scripting API that allows users to easily design neural […]

CUDA

May, 29

Fault Injection techniques for GPU Reliability Evaluation

A Graphical Processing Unit (GPU) is a computer chip that renders graphics and images by performing rapid mathematical calculations. In recent years, GPUs are exploited for reasons beyond graphics processing as General Purpose GPUs (GPGPUs); they work as hardware accelerators for high-performance computing in many different fields, including safety-critical applications. In these domains, Convolutional Neural […]

CUDA

May, 29

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding

We study lossless acceleration for seq2seq generation with a novel decoding algorithm — Aggressive Decoding. Unlike the previous efforts (e.g., non-autoregressive decoding) speeding up seq2seq generation at the cost of quality loss, our approach aims to yield the identical (or better) generation compared with autoregressive decoding but in a significant speedup, achieved by innovative cooperation […]

CUDA

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

chemtrain-deploy: A parallel and scalable framework for machine learning potentials in million-atom MD simulations

microSYCL: SYCL micro-benchmarks repository

Exploring SYCL as a Portability Layer for High-Performance Computing on CPUs

See all packages

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Dissecting Tensor Cores via Microbenchmarks: Latency, Throughput and Numerical Behaviors

Onesweep: A Faster Least Significant Digit Radix Sort for GPUs

Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use

Dropbear: Machine Learning Marketplaces made Trustworthy with Byzantine Model Agreement

FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

End-to-end Optimization of Machine Learning Prediction Queries

Fast GPU bounding boxes on tree-structured scenes

User’s needs influencing HPC technologies

SOL: Reducing the Maintenance Overhead for Integrating Hardware Support into AI Frameworks

Fault Injection techniques for GPU Reliability Evaluation

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding

Recent source codes

Efficient GPU Implementation of Multi-Precision Integer Division

ParEval: A Parallel Code Evaluation Benchmark

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores

exa-AMD: Exascale Accelerated Materials Discovery

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

Most viewed papers (last 30 days)