high performance computing on graphics processing units: hgpu.org

Posts

Jun, 12

Onesweep: A Faster Least Significant Digit Radix Sort for GPUs

We present Onesweep, a least-significant digit (LSD) radix sorting algorithm for large GPU sorting problems residing in global memory. Our parallel algorithm employs a method of single-pass prefix sum that only requires ~2n global read/write operations for each digit-binning iteration. This exhibits a significant reduction in last-level memory traffic versus contemporary GPU radix sorting implementations, […]

CUDA

Jun, 5

Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use

This research addresses node-level scalability, portability, and heterogeneous computing challenges facing asynchronous many-task (AMT) runtime systems. These challenges have arisen due to increasing socket/core/thread counts and diversity among supported architectures on current and emerging high-performance computing (HPC) systems. This places greater emphasis on thread scalability and simultaneous use of diverse architectures to maximize node use […]

CUDA

Jun, 5

Dropbear: Machine Learning Marketplaces made Trustworthy with Byzantine Model Agreement

Marketplaces for machine learning (ML) models are emerging as a way for organizations to monetize models. They allow model owners to retain control over hosted models by using cloud resources to execute ML inference requests for a fee, preserving model confidentiality. Clients that rely on hosted models require trustworthy inference results, even when models are […]

Jun, 5

FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems

Edge computing enables smart IoT-based systems via concurrent and continuous execution of latency-sensitive machine learning (ML) applications. These edge-based machine learning systems are often battery-powered (i.e., energy-limited). They use heterogeneous resources with diverse computing performance (e.g., CPU, GPU, and/or FPGAs) to fulfill the latency constraints of ML applications. The challenge is to allocate user requests […]

Jun, 5

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

To break the bottlenecks of mainstream cloud-based machine learning (ML) paradigm, we adopt device-cloud collaborative ML and build the first end-to-end and general-purpose system, called Walle, as the foundation. Walle consists of a deployment platform, distributing ML tasks to billion-scale devices in time; a data pipeline, efficiently preparing task input; and a compute container, providing […]

Jun, 5

End-to-end Optimization of Machine Learning Prediction Queries

Prediction queries are widely used across industries to perform advanced analytics and draw insights from data. They include a data processing part (e.g., for joining, filtering, cleaning, featurizing the datasets) and a machine learning (ML) part invoking one or more trained models to perform predictions. These parts have so far been optimized in isolation, leaving […]

May, 29

Fast GPU bounding boxes on tree-structured scenes

Computation of bounding boxes is a fundamental problem in high performance rendering, as it is an input to visibility culling and binning operations. In a scene description structured as a tree, clip nodes and blend nodes entail intersection and union of bounding boxes, respectively. These are straightforward to compute on the CPU using a sequential […]

May, 29

User’s needs influencing HPC technologies

The user requirements imposed by modern challenges are influencing future High Performance Computing (HPC) technologies and use cases. This report analyses a wide range of user requirements and new technologies and their impact on European and worldwide HPC trends, in particular in the PRACE and EuroHPC ecosystems, as well as HPC infrastructures provided by member […]

CUDA

•

OpenCL

May, 29

Fault Injection techniques for GPU Reliability Evaluation

A Graphical Processing Unit (GPU) is a computer chip that renders graphics and images by performing rapid mathematical calculations. In recent years, GPUs are exploited for reasons beyond graphics processing as General Purpose GPUs (GPGPUs); they work as hardware accelerators for high-performance computing in many different fields, including safety-critical applications. In these domains, Convolutional Neural […]

CUDA

May, 29

SOL: Reducing the Maintenance Overhead for Integrating Hardware Support into AI Frameworks

The increased interest in Artificial Intelligence (AI) raised the need for highly optimized and sophisticated AI frameworks. Starting with the Lua-based Torch many frameworks have emerged over time, such as Theano, Caffe, Chainer, CNTK, MxNet, PyTorch, DL4J, or TensorFlow. All of these provide a high level scripting API that allows users to easily design neural […]

CUDA

May, 29

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding

We study lossless acceleration for seq2seq generation with a novel decoding algorithm — Aggressive Decoding. Unlike the previous efforts (e.g., non-autoregressive decoding) speeding up seq2seq generation at the cost of quality loss, our approach aims to yield the identical (or better) generation compared with autoregressive decoding but in a significant speedup, achieved by innovative cooperation […]

CUDA

May, 22

Blockchain Goes Green? Part II: Characterizing the Performance and Cost of Blockchains on the Cloud and at the Edge

While state-of-the-art permissioned blockchains can achieve thousands of transactions per second on commodity hardware with x86/64 architecture, their performance when running on different architectures is not clear. The goal of this work is to characterize the performance and cost of permissioned blockchains on different hardware systems, which is important as diverse application domains are adopting […]

* * *

high performance computing on graphics processing units: hgpu.org

Posts

Onesweep: A Faster Least Significant Digit Radix Sort for GPUs

Portable, Scalable Approaches for Improving Asynchronous Many-Task Runtime Node Use

Dropbear: Machine Learning Marketplaces made Trustworthy with Byzantine Model Agreement

FELARE: Fair Scheduling of Machine Learning Applications on Heterogeneous Edge Systems

Walle: An End-to-End, General-Purpose, and Large-Scale Production System for Device-Cloud Collaborative Machine Learning

End-to-end Optimization of Machine Learning Prediction Queries

Fast GPU bounding boxes on tree-structured scenes

User’s needs influencing HPC technologies

Fault Injection techniques for GPU Reliability Evaluation

SOL: Reducing the Maintenance Overhead for Integrating Hardware Support into AI Frameworks

Lossless Acceleration for Seq2seq Generation with Aggressive Decoding

Blockchain Goes Green? Part II: Characterizing the Performance and Cost of Blockchains on the Cloud and at the Edge

Recent source codes

SYCL Container

CASS: Cuda-Amd aSSembly

Cluser of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

PELSI: Power-Efficient Layer-Switched Inference

Ouroboros: Virtualized Queues for dynamic memory management

MSCCL++: A GPU-driven communication stack for scalable AI applications

Benchmark compute shader of Unity against InteropUnityCUDA

Most viewed papers (last 30 days)