Posts
Aug 13
gZCCL: Compression-Accelerated Collective Communication Framework for GPU Clusters
GPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, but these still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion. In this paper, we propose gZCCL, a general framework […]
Aug 13
A Model Extraction Attack on Deep Neural Networks Running on GPUs
Deep Neural Networks (DNNs) have become ubiquitous due to their performance on prediction and classification problems. However, they face a variety of threats as their usage spreads. Model extraction attacks, which steal DNN models, endanger intellectual property, data privacy, and security. Previous research has shown that system-level side channels can be used to leak the […]
Aug 13
SYnergy: Fine-grained Energy-Efficient Heterogeneous Computing for Scalable Energy Saving
Energy-efficient computing uses power management techniques such as frequency scaling to save energy. Implementing energy-efficient techniques on large-scale computing systems is challenging for several reasons. While most modern architectures, including GPUs, are capable of frequency scaling, these features are often not available on large systems. In addition, achieving higher energy savings requires precise energy tuning […]
Aug 13
Static and Dynamic Analyses for Efficient GPU Execution
In this thesis we describe a host of static and dynamic techniques for efficient execution of GPU programs. Most significant is the array short-circuiting technique, which automatically rewrites array updates and concatenations to happen in-place when deemed safe. The optimization is based on FunMem, an intermediate representation with non-semantic memory information that we also introduce. […]
Aug 13
Isolated Scheduling for Distributed Training Tasks in GPU Clusters
Distributed machine learning (DML) technology makes it possible to train large neural networks in a reasonable amount of time. Meanwhile, as computing power grows much faster than network capacity, network communication has gradually become the bottleneck of DML. Current multi-tenant GPU clusters face network contention caused by the hash-collision problem, which not only further increases […]
Jul 30
Monadic Deep Learning
The Java and Scala community has built a very successful big data ecosystem. However, most neural networks running on it are modeled in dynamically typed programming languages. These dynamically typed deep learning frameworks treat neural networks as differentiable expressions that contain many trainable variables, and perform automatic differentiation on those expressions when training them. […]
Jul 30
Bandicoot: C++ Library for GPU Linear Algebra and Scientific Computing
This report provides an introduction to the Bandicoot C++ library for GPU linear algebra and scientific computing, detailing its user interface and performance characteristics as well as the technical details of its internal design. Bandicoot is the GPU-enabled counterpart to the well-known Armadillo C++ linear algebra library, aimed at allowing users to enable GPU computation […]
Jul 30
Efficiency without Tears: Securing Multilingual Programs with TRINITY
Although most real-world programs are developed in multiple languages in the era of data science, existing security techniques are still limited to single-language programs. Worse yet, languages designed for high-performance computing often omit the necessary security checks in foreign function interfaces (FFI) in pursuit of maximum execution efficiency. As a consequence, security flaws and […]
Jul 30
Fast Knowledge Graph Completion using Graphics Processing Units
Knowledge graphs can be used in many areas related to data semantics, such as question-answering systems and knowledge-based systems. However, currently constructed knowledge graphs need to be supplemented with additional relations to improve their coverage of knowledge; this task is called knowledge graph completion. To add new relations to the existing knowledge graph by using knowledge graph […]
Jul 30
A portable C++ library for memory and compute abstraction on multi-core CPUs and GPUs
We present a C++ library for transparent memory and compute abstraction across CPU and GPU architectures. Our library combines generic data structures like vectors, multi-dimensional arrays, maps, graphs, and sparse grids with basic generic algorithms like arbitrary-dimensional convolutions, copying, merging, sorting, prefix sum, reductions, neighbor search, and filtering. The memory layout of the data structures […]
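Among the generic algorithms the excerpt lists are prefix sum and reductions. As a language-agnostic illustration of their semantics only (the library's actual C++ API is not shown in the excerpt), a minimal sketch:

```python
# Illustrative semantics of two generic algorithms named in the abstract:
# inclusive prefix sum (scan) and reduction with a binary operator.
# This is a plain-Python sketch, not the library's API.
from itertools import accumulate
from functools import reduce

data = [3, 1, 4, 1, 5]

prefix = list(accumulate(data))           # inclusive prefix sum
total = reduce(lambda a, b: a + b, data)  # reduction over the whole array

print(prefix)  # [3, 4, 8, 9, 14]
print(total)   # 14
```

On GPUs, both operations are typically parallelized in logarithmic depth, which is why libraries expose them as built-in primitives rather than leaving users to write sequential loops.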
Jul 24
ProtoX: A First Look
We present a first look at ProtoX, a code generation framework for stencil and pointwise operations that occur frequently in the numerical solution of partial differential equations. ProtoX has Proto as its library frontend and SPIRAL as the backend. Proto is a C++-based domain-specific library which optimizes the algorithms used to compute the […]
Jul 24
qecGPT: decoding Quantum Error-correcting Codes with Generative Pre-trained Transformers
We propose a general framework for decoding quantum error-correcting codes with generative modeling. The model utilizes autoregressive neural networks, specifically Transformers, to learn the joint probability of logical operators and syndromes. The training is unsupervised, requiring no labeled training data, and is thus referred to as pre-training. After the pre-training, […]
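The core modeling idea the excerpt describes, learning a joint probability autoregressively, amounts to factorizing the distribution by the chain rule, p(x1..xn) = ∏ p(xi | x1..x(i-1)). A toy sketch over binary variables, with a hand-written conditional standing in for the learned Transformer (the conditional's form is an assumption purely for illustration):

```python
# Toy autoregressive factorization: p(x1..xn) = prod_i p(x_i | x_1..x_{i-1}).
# A fixed, hypothetical conditional stands in for the Transformer; the real
# model would learn these conditionals from syndrome/logical-operator data.
from itertools import product

def cond_prob(bit, prefix):
    # Hypothetical conditional: the next bit is biased by the prefix parity.
    p1 = 0.8 if sum(prefix) % 2 == 1 else 0.3
    return p1 if bit == 1 else 1.0 - p1

def joint_prob(bits):
    # Chain-rule product of the per-variable conditionals.
    p = 1.0
    for i, b in enumerate(bits):
        p *= cond_prob(b, bits[:i])
    return p

# Because each conditional is normalized, the factorization defines a
# valid joint distribution: probabilities over all outcomes sum to 1.
total = sum(joint_prob(x) for x in product([0, 1], repeat=3))
print(round(total, 10))  # 1.0
```

The same factorization is what lets an autoregressive model both score a given (syndrome, logical-operator) configuration and sample candidates one variable at a time.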