Posts
Nov, 21
Programming Heterogeneous Systems with General and Domain-Specific Frameworks
As chip manufacturing processes approach what is physically possible, the projections made by Moore's Law and Dennard scaling no longer hold, and CPU performance has stagnated over the last decade. At the same time, the performance requirements of many important application areas, ranging from machine learning to scientific computing, […]
Nov, 21
Accelerating JPEG Decompression on GPUs
The JPEG compression format has been the standard for lossy image compression for multiple decades, offering high compression rates at minor perceptual loss in image quality. For GPU-accelerated computer vision and deep learning tasks, such as the training of image classification models, efficient JPEG decoding is essential due to limitations in memory bandwidth. As […]
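As a concrete starting point, the sketch below decodes a JPEG directly on the GPU using torchvision's nvJPEG-backed decoder; a CUDA-enabled torchvision build is assumed, and the file name is illustrative.

```python
# Minimal sketch: GPU-side JPEG decoding via torchvision's nvJPEG-backed
# decoder. Assumes a CUDA-enabled torchvision build; the path is illustrative.
from torchvision.io import read_file, decode_jpeg

raw = read_file("image.jpg")             # raw JPEG bytes, still on the CPU
img = decode_jpeg(raw, device="cuda")    # decoding happens on the GPU (nvJPEG)
print(img.shape, img.dtype, img.device)  # [3, H, W], torch.uint8, cuda:0
```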
Nov, 21
QGTC: Accelerating Quantized GNN via GPU Tensor Core
In recent years, quantized graph neural networks (QGNNs) have attracted considerable research and industry attention due to their high robustness and low computation and memory overhead. Unfortunately, the performance gains of QGNNs have never been realized on modern GPU platforms. To this end, we propose the first Tensor Core (TC) based computing framework, […]
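For intuition about why low-bit workloads map well to specialized hardware paths, here is a toy NumPy illustration of 1-bit matrix multiplication via bit-packing and popcounts; it only mimics the idea and is in no way QGTC's Tensor Core implementation.

```python
# Toy sketch of low-bit GEMM: pack {0,1} values into words and multiply
# via AND + popcount. (Illustrative only; not QGTC's TC kernels.)
import numpy as np

def pack_bits(m: np.ndarray) -> np.ndarray:
    """Pack a {0,1} matrix row-wise into uint8 words."""
    return np.packbits(m.astype(np.uint8), axis=1)

def binary_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """C[i, j] = popcount(row_i(A) AND row_j(B)) for packed 1-bit rows."""
    pa, pb = pack_bits(a), pack_bits(b)
    out = np.zeros((a.shape[0], b.shape[0]), dtype=np.int32)
    for i in range(pa.shape[0]):
        for j in range(pb.shape[0]):
            out[i, j] = np.unpackbits(pa[i] & pb[j]).sum()
    return out

a = np.random.randint(0, 2, (4, 64))
b = np.random.randint(0, 2, (3, 64))
assert (binary_matmul(a, b) == a @ b.T).all()  # matches the float result
```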
Nov, 21
FLOWER: A Comprehensive Dataflow Compiler for High-Level Synthesis
FPGAs have found their way into data centers as accelerator cards, making reconfigurable computing more accessible for high-performance applications. At the same time, new high-level synthesis compilers like Xilinx Vitis and runtime libraries such as XRT attract software programmers into the reconfigurable domain. While software programmers are familiar with task-level and data-parallel programming, FPGAs often […]
Nov, 14
General purpose lattice QCD code set Bridge++ 2.0 for high performance computing
Bridge++ is a general-purpose code set for numerical simulations of lattice QCD, aiming at readable, extensible, and portable code while keeping practically high performance. The previous version of Bridge++ was implemented in double precision with a fixed data layout. To exploit the high arithmetic capability of new processor architectures, we extend the Bridge++ […]
Nov, 14
Performance Optimisations for Heterogeneous Managed Runtime Systems
High demand for increased computational capabilities and power efficiency has led commodity devices to integrate diverse hardware resources. Desktops, laptops, and smartphones have embraced heterogeneity through multi-core Central Processing Units (CPUs), energy-efficient integrated Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), powerful discrete GPUs, and Tensor Processing Units (TPUs). To ease the programmability of […]
Nov, 14
Safe and Practical GPU Acceleration in TrustZone
We present a holistic design for GPU-accelerated computation in a TrustZone TEE. Without pulling the complex GPU software stack into the TEE, we follow a simple approach: record the CPU/GPU interactions ahead of time, and replay the interactions in the TEE at run time. This paper addresses the approach's key missing piece – the recording environment, […]
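The record-and-replay idea can be illustrated with a toy sketch (hypothetical names throughout, not the paper's implementation): interactions with the GPU are logged ahead of time and later re-issued verbatim, so the TEE never needs the full driver stack.

```python
# Toy illustration of record-and-replay (not the paper's code): ahead of
# time, log each CPU->GPU interaction; at run time, replay the log inside
# the TEE instead of hosting the full GPU driver stack there.
from dataclasses import dataclass

@dataclass
class Interaction:
    kind: str    # e.g. "reg_write" (illustrative interaction type)
    addr: int    # MMIO register offset (illustrative)
    value: int

recording: list[Interaction] = []

def record(kind: str, addr: int, value: int) -> None:
    """Recording phase: capture one interaction with the GPU."""
    recording.append(Interaction(kind, addr, value))

def replay(write_mmio) -> None:
    """Replay phase: re-issue the captured interactions in order."""
    for ia in recording:
        write_mmio(ia.addr, ia.value)

# Record outside the TEE, then replay inside it:
record("reg_write", 0x1000, 0x1)   # illustrative register pokes
record("reg_write", 0x1004, 0xFF)
replay(lambda addr, val: print(f"mmio[{addr:#x}] <- {val:#x}"))
```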
Nov, 14
MQBench: Towards Reproducible and Deployable Model Quantization Benchmark
Model quantization has emerged as an indispensable technique to accelerate deep learning inference. While researchers continue to push the frontier of quantization algorithms, existing quantization work is often unreproducible and undeployable. This is because researchers do not choose consistent training pipelines and ignore the requirements for hardware deployments. In this work, we propose Model Quantization […]
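As a baseline for the kind of recipe such a benchmark would standardize, here is a minimal post-training dynamic quantization sketch in PyTorch; the model and layer choices are illustrative, and this is just one of the many algorithm/hardware combinations MQBench aims to cover.

```python
# Minimal sketch: post-training dynamic quantization in PyTorch.
# Model and layer choices are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
model.eval()

# Quantize Linear weights to int8; activations stay float (dynamic scheme).
qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(qmodel(x).shape)  # torch.Size([1, 10])
```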
Nov, 14
Performance Evaluation of Python Parallel Programming Models: Charm4Py and mpi4py
Python is rapidly becoming the lingua franca of machine learning and scientific computing. With the broad use of frameworks such as NumPy, SciPy, and TensorFlow, scientific computing and machine learning are seeing a productivity boost on systems without a corresponding loss in performance. While high-performance libraries often provide adequate performance within a node, distributed computing […]
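To make the comparison concrete, a minimal mpi4py sketch appears below: each rank sums a slice of a range and the partial sums are reduced to rank 0. The script name in the launch command is illustrative.

```python
# Minimal mpi4py sketch: each rank computes a partial sum, reduced to
# rank 0. Run with e.g. `mpiexec -n 4 python sum.py`.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each rank sums its own strided slice of 0..99 (illustrative workload).
local = sum(range(rank, 100, size))
total = comm.reduce(local, op=MPI.SUM, root=0)

if rank == 0:
    print(f"total across {size} ranks: {total}")  # 4950
```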
Nov, 7
Collage: Automated Integration of Deep Learning Backends
Strong demand for the efficient deployment of Deep Learning (DL) applications has prompted the rapid development of a rich DL ecosystem. To keep up with its fast advancement, it is crucial for DL frameworks to efficiently integrate a variety of optimized libraries and runtimes as their backends and generate the fastest possible executable by using them properly. […]
Nov, 7
iGUARD: In-GPU Advanced Race Detection
Newer use cases of GPU (Graphics Processing Unit) computing, e.g., graph analytics, look less like traditional bulk-synchronous GPU programs. To cater to the needs of emerging applications with semantically richer and finer-grained sharing patterns, GPU vendors have been introducing advanced programming features, e.g., scoped synchronization and independent thread scheduling. While these features can speed […]
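The kind of fine-grained sharing bug such a detector targets can be sketched in Numba CUDA (illustrative only, not iGUARD's instrumentation): an unsynchronized read-modify-write on shared data races, while an atomic update does not. A CUDA-capable GPU is assumed.

```python
# Illustrative Numba CUDA kernels: an unsynchronized read-modify-write on
# a shared counter is a data race; cuda.atomic.add makes it race-free.
# (Sketch only; not iGUARD's instrumentation.)
from numba import cuda
import numpy as np

@cuda.jit
def racy_count(counter):
    counter[0] += 1                 # data race: plain read-modify-write

@cuda.jit
def safe_count(counter):
    cuda.atomic.add(counter, 0, 1)  # race-free atomic increment

counter = cuda.to_device(np.zeros(1, dtype=np.int32))
safe_count[64, 128](counter)        # 64 blocks x 128 threads
print(counter.copy_to_host()[0])    # 8192 with atomics; racy_count may lose updates
```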
Nov, 7
Optimizing a Hardware Network Stack to Realize an In-Network ML Inference Application
FPGAs are an interesting platform for the implementation of network-attached accelerators, either in the form of smart network interface cards or as In-Network Processing accelerators. Both application scenarios require a high-throughput hardware network stack. In this work, we integrate such a stack into the open-source TaPaSCo framework and implement a library of easy-to-use design primitives […]