high performance computing on graphics processing units: hgpu.org

Posts

Jun, 16

How much can we gain from Tensor Kernel Fusion on GPUs?

Kernel fusion is a crucial optimization technique for GPU applications, particularly deep neural networks, where it involves combining multiple consecutive kernels into a single larger kernel. This approach aims to enhance performance by reducing the need for slow off-chip memory accesses. Instead, intermediate results between successive kernels are stored in faster on-chip memory like shared […]

CUDA

Jun, 9

Memory Interference and Performance Prediction in GPU-Accelerated Heterogeneous Systems

Nowadays, a variety of applications, including automated factories, autonomous vehicles, and Cyber Physical Systems (CPS), are experiencing significant growth. Given the diverse range of challenges that must be addressed, such as real-time management and visualization of a factory’s current state through a 3D digital twin, trajectory calculation within autonomous vehicles, visualizing Human Machine Interfaces (HMI), […]

CUDA

Jun, 9

Gaining Cross-Platform Parallelism for HAL’s Molecular Dynamics Package using SYCL

Molecular dynamics simulations are one of the methods in scientific computing that benefit from GPU acceleration. For those devices, SYCL is a promising API for writing portable codes. In this paper, we present the case study of "HAL’s MD package" that has been successfully migrated from CUDA to SYCL. We describe the different strategies that […]

CUDA

Jun, 9

More Bang For Your Buck(et): Fast and Space-efficient Hardware-accelerated Coarse-granular Indexing on GPUs

In recent work, we have shown that NVIDIA’s raytracing cores on RTX video cards can be exploited to realize hardware-accelerated lookups for GPU-resident database indexes. On a high level, the concept materializes all keys as triangles in a 3D scene and indexes them. Lookups are performed by firing rays into the scene and utilizing the […]

Jun, 9

Fast and Practical Strassen’s Matrix Multiplication using FPGAs

Matrix multiplication is a cornerstone operation in a wide array of scientific fields, including machine learning and computer graphics. The standard algorithm for matrix multiplication has a complexity of O(n^3) for n×n matrices. Strassen’s algorithm improves this to O(n^2.807), but its practicality is limited for small to medium matrix sizes due to the large number […]

OpenCL

Jun, 9

Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs

This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate inference computation of LLMs over heterogeneous GPUs and network connections as a max-flow problem for a directed, weighted graph, whose nodes represent GPU instances and edges capture both […]

Jun, 2

Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)

While Field-Programmable Gate Arrays (FPGAs) exist in many design configurations throughout the data center, cloud, and edge, the promise of performance and flexibility offered by the FPGA often remains unrealized for lack of hardware design expertise, with most computation remaining in fixed hardware such as CPUs, GPUs, and ASICs (e.g., tensor processors). Identifying programmability as […]

OpenCL

Jun, 2

Addressing Challenges in Utilizing GPUs for Accelerating Privacy-Preserving Computation

Cloud computing increasingly handles confidential data, like private inference and query databases. Two strategies are used for secure computation: (1) employing CPU Trusted Execution Environments (TEEs) like AMD SEV, Intel SGX, or ARM TrustZone, and (2) utilizing emerging cryptographic methods like Fully Homomorphic Encryption (FHE) with libraries such as HElib, Microsoft SEAL, and PALISADE. To […]

CUDA

Jun, 2

Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL

Matrix multiplication is fundamental in the backpropagation algorithm used to train deep neural network models. Libraries like Intel’s MKL or NVIDIA’s cuBLAS implemented new and optimized matrix multiplication techniques that increase performance and reduce computational costs. These techniques can also be implemented in CUDA and SYCL and functions with AVX2 and AVX512 instructions, which have […]

CUDA

Jun, 2

An implementation of tensor product patch smoothers on GPU

We present a GPU implementation of vertex-patch smoothers for higher order finite element methods in two and three dimensions. Analysis shows that they are not memory bound with respect to GPU DRAM, but with respect to on-chip scratchpad memory. Multigrid operations are optimized through localization and reorganized local operations in on-chip memory, achieving minimal global […]

CUDA

Jun, 2

A Survey of Cloud-Based GPU Threats and Their Impact on AI, HPC, and Cloud Computing

Graphics processing units (GPUs) are the hardware engines driving the AI revolution. Large language model (LLM)-powered generative AI (GenAI) became mainstream with the public release of OpenAI’s ChatGPT. AI usage has given rise to innovative AI-powered applications for businesses, productivity, image generation, video generation, data analysis, and social media, among others. Powering AI applications are […]

CUDA • OpenCL

May, 26

Enabling full-speed random access to the entire memory on the A100 GPU

We describe some features of the A100 memory architecture. In particular, we give a technique to reverse-engineer some hardware layout information. Using this information, we show how to avoid TLB issues to obtain full-speed random HBM access to the entire memory, as long as we constrain any particular thread to a reduced access window of […]

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container

Most viewed papers (last 30 days)