
Posts

Aug, 17

The Fused Kernel Library: A C++ API to Develop Highly-Efficient GPU Libraries

Existing GPU libraries often struggle to fully exploit the parallel resources and on-chip memory (SRAM) of GPUs when chaining multiple GPU functions as individual kernels. While Kernel Fusion (KF) techniques like Horizontal Fusion (HF) and Vertical Fusion (VF) can mitigate this, current library implementations often require library developers to manually create fused kernels. Hence, library […]
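
A back-of-the-envelope way to see what Vertical Fusion buys: the unfused version below runs two element-wise stages as separate passes that materialize an intermediate array (much as two separate kernel launches round-trip through global memory), while the fused version pushes each element through both stages in a single pass. This is a plain NumPy analogy with invented "scale" and "add_bias" stages, not the library's C++ API.

# CPU-side analogy of Vertical Fusion (VF); the two stages are hypothetical,
# not part of the Fused Kernel Library.
import numpy as np

def scale(x, a):        # stage 1: multiply every element by a
    return a * x

def add_bias(x, b):     # stage 2: add a constant to every element
    return x + b

def unfused(x, a, b):
    # Two separate "kernels": the intermediate result is materialized in
    # memory between the passes, like launching two kernels back to back.
    tmp = scale(x, a)
    return add_bias(tmp, b)

def fused(x, a, b):
    # One "kernel": each element flows through both stages while it is still
    # "on chip", so the intermediate array never exists.
    out = np.empty_like(x)
    for i in range(x.size):
        out.flat[i] = x.flat[i] * a + b
    return out

x = np.arange(8, dtype=np.float32)
assert np.allclose(unfused(x, 2.0, 1.0), fused(x, 2.0, 1.0))
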
Aug, 17

Luthier: Bridging Auto-Tuning and Vendor Libraries for Efficient Deep Learning Inference

Recent deep learning compilers commonly adopt auto-tuning approaches that search for the optimal kernel configuration in tensor programming from scratch, requiring tens of hours per operation and neglecting crucial optimization factors for parallel computing on asymmetric multicore processors. Meanwhile, hand-optimized inference libraries from hardware vendors provide high performance but lack the flexibility and automation needed […]
Aug, 17

GPUHammer: Rowhammer Attacks on GPU Memories are Practical

Rowhammer is a read disturbance vulnerability in modern DRAM that causes bit-flips, compromising security and reliability. While extensively studied on Intel and AMD CPUs with DDR and LPDDR memories, its impact on GPUs using GDDR memories, critical for emerging machine learning applications, remains unexplored. Rowhammer attacks on GPUs face unique challenges: (1) proprietary mapping of […]
Aug, 17

Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling

This paper presents Block, a distributed scheduling framework designed to optimize load balancing and auto-provisioning across instances in large language model serving frameworks by leveraging contextual information from incoming requests. Unlike popular model serving systems that rely on monolithic and heuristic task schedulers, Block operates as a fully distributed, stateless, and predictive scheduling system to […]
Aug, 17

Performant Unified GPU Kernels for Portable Singular Value Computation Across Hardware and Precision

This paper presents a portable, GPU-accelerated implementation of a QR-based singular value computation algorithm in Julia. The singular value decomposition (SVD) is a fundamental numerical tool in scientific computing and machine learning, providing optimal low-rank matrix approximations. Its importance has increased even more in large-scale machine learning pipelines, including large language models (LLMs), where it […]
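
The low-rank property the excerpt refers to can be checked in a few lines; this uses NumPy rather than the paper's Julia implementation, and simply illustrates the Eckart-Young result that truncating the SVD to the k largest singular values gives the best rank-k approximation.

# Truncated SVD as an optimal low-rank approximation (NumPy illustration).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 100))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 20
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]   # rank-k reconstruction

# Eckart-Young: the Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
expected = np.sqrt(np.sum(s[k:] ** 2))
print(f"rank-{k} error: {err:.6f}  (expected {expected:.6f})")
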
Aug, 10

AGFT: An Adaptive GPU Frequency Tuner for Real-Time LLM Inference Optimization

The explosive growth of interactive Large Language Models (LLMs) has placed unprecedented demands for low latency on cloud GPUs, forcing them into high-power modes and causing escalating energy costs. Real-time inference workloads exhibit significant dynamic volatility, presenting substantial energy-saving opportunities. However, traditional static or rule-based power management strategies struggle to exploit these opportunities without compromising […]
Aug, 10

Understanding the Landscape of Ampere GPU Memory Errors

Graphics Processing Units (GPUs) have become a de facto solution for accelerating high-performance computing (HPC) applications. Understanding their memory error behavior is an essential step toward achieving efficient and reliable HPC systems. In this work, we present a large-scale cross-supercomputer study to characterize GPU memory reliability, covering three supercomputers – Delta, Polaris, and Perlmutter – […]
Aug, 10

ConTraPh: Contrastive Learning for Parallelization and Performance Optimization

With the advancement of HPC platforms, the demand for high-performing applications continues to grow. One effective way to enhance program performance is through parallelization. However, fully leveraging the powerful hardware of HPC platforms poses significant challenges. Even experienced developers must carefully consider factors such as runtime, memory usage, and thread-scheduling overhead. Additionally, achieving successful parallelization […]
Aug, 10

SIGMo: High-Throughput Batched Subgraph Isomorphism on GPUs for Molecular Matching

Subgraph isomorphism is a fundamental graph problem with applications in diverse domains from biology to social network analysis. Of particular interest is molecular matching, which uses a subgraph isomorphism formulation for the drug discovery process. While subgraph isomorphism is known to be NP-complete and computationally expensive, in the molecular matching formulation a number of domain […]
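
For readers unfamiliar with the formulation, the small backtracking matcher below shows what a labeled subgraph isomorphism query looks like for molecule-like graphs. The atom labels and tiny graphs are invented, and the sketch has none of the batching or GPU parallelism the paper is about.

# Minimal backtracking subgraph matcher for node-labeled graphs (illustrative only).
def subgraph_isomorphisms(pattern, target):
    """pattern/target: dicts {node: (label, set_of_neighbors)}. Yields mappings
    of pattern nodes to distinct target nodes that preserve labels and edges."""
    p_nodes = list(pattern)

    def extend(mapping):
        if len(mapping) == len(p_nodes):
            yield dict(mapping)
            return
        p = p_nodes[len(mapping)]
        p_label, p_nbrs = pattern[p]
        for t, (t_label, t_nbrs) in target.items():
            if t in mapping.values() or t_label != p_label:
                continue
            # every already-mapped pattern neighbor must map to a target neighbor
            if all(mapping[q] in t_nbrs for q in p_nbrs if q in mapping):
                mapping[p] = t
                yield from extend(mapping)
                del mapping[p]

    yield from extend({})

# Pattern: a C-C-O chain; target: a slightly larger molecule-like graph.
pattern = {0: ("C", {1}), 1: ("C", {0, 2}), 2: ("O", {1})}
target = {0: ("C", {1}), 1: ("C", {0, 2, 3}), 2: ("O", {1}), 3: ("C", {1})}
for m in subgraph_isomorphisms(pattern, target):
    print(m)   # two embeddings: either terminal carbon can play pattern node 0
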
Aug, 10

DGEMM without FP64 Arithmetic – using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme

As demand for AI workloads grows, more processors offer enhanced performance for the low-precision matrix multiplications those workloads require. However, such operations are difficult to use directly for scientific computing. The Ozaki scheme, an accurate matrix multiplication method proposed by Ozaki et al. in 2012, enables FP64 matrix multiplication […]
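
A toy illustration of the idea (not the paper's FP8 tensor-core implementation): split each FP64 operand into a high and a low slice, form all pairwise slice products in FP32, and accumulate the partial results in FP64. Even this crude per-element two-way split recovers noticeably more accuracy than a single FP32 product; the real Ozaki scheme sizes and scales the slices so that every partial product is exact.

# Simplified Ozaki-style split-and-accumulate matrix product (illustrative only).
import numpy as np

def split_two(X, keep_bits):
    # X_hi keeps roughly the top keep_bits mantissa bits of each element;
    # X_lo holds the remainder, so X_hi + X_lo reconstructs X.
    ax = np.abs(X)
    exp = np.floor(np.log2(np.where(ax == 0, 1.0, ax)))
    ulp = 2.0 ** (exp - keep_bits + 1)
    X_hi = np.round(X / ulp) * ulp
    return X_hi, X - X_hi

def gemm32(X, Y):
    # stand-in for a low-precision (tensor-core) GEMM: inputs and arithmetic in FP32
    return (X.astype(np.float32) @ Y.astype(np.float32)).astype(np.float64)

rng = np.random.default_rng(1)
n = 256
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

A1, A2 = split_two(A, 8)
B1, B2 = split_two(B, 8)

plain_fp32 = gemm32(A, B)
split_fp32 = gemm32(A1, B1) + gemm32(A1, B2) + gemm32(A2, B1) + gemm32(A2, B2)

ref = A @ B
print("plain FP32 error:", np.max(np.abs(plain_fp32 - ref)))
print("split FP32 error:", np.max(np.abs(split_fp32 - ref)))
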
Aug, 3

NPUEval: Optimizing NPU Kernels with LLMs and Open Source Compilers

Neural processing units (NPUs) are gaining prominence in power-sensitive client devices, with AI PCs being defined by their inclusion of these specialized processors. Running AI workloads efficiently on these devices requires libraries of optimized kernels. Creating efficient kernels demands expertise in domain-specific C++ with vector intrinsics and in-depth knowledge of the target architecture. […]
Aug, 3

GBOTuner: Autotuning of OpenMP Parallel Codes with Bayesian Optimization and Code Representation Transfer Learning

Empirical autotuning methods such as Bayesian optimization (BO) are a powerful approach that allows us to optimize tuning parameters of parallel codes as black-boxes. However, BO is an expensive approach because it relies on empirical samples from true evaluations for varying parameter configurations. In this thesis, we present GBOTuner, an autotuning framework for optimizing the […]
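
As a picture of the black-box loop the excerpt describes, the sketch below runs a minimal Bayesian-optimization search over a small discrete tuning space. The synthetic runtime function, the parameter grid, the GP surrogate, and the acquisition rule are all illustrative stand-ins, not GBOTuner's model or its code-representation transfer learning.

# Minimal Bayesian optimization of two "OpenMP-like" tuning parameters (illustrative).
import numpy as np

rng = np.random.default_rng(0)

# Tuning space: (threads, chunk size), encoded as points in R^2.
threads = np.array([1, 2, 4, 8, 16, 32])
chunks = np.array([1, 8, 64, 512])
X = np.array([[t, c] for t in threads for c in chunks], dtype=float)
Xn = (X - X.mean(0)) / X.std(0)          # normalized features for the surrogate

def runtime(t, c):
    # Synthetic, noisy "measured runtime" standing in for a true empirical evaluation.
    base = 100.0 / t + 0.02 * t + 3.0 / (1 + c) + 0.001 * c
    return base * (1 + 0.02 * rng.standard_normal())

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

observed = list(rng.choice(len(X), 3, replace=False))   # a few random starts
y = [runtime(*X[i]) for i in observed]

for _ in range(10):
    Xo, yo = Xn[observed], np.array(y)
    K = rbf(Xo, Xo) + 1e-6 * np.eye(len(Xo))
    Ks = rbf(Xn, Xo)
    mu = yo.mean() + Ks @ np.linalg.solve(K, yo - yo.mean())
    var = 1.0 - np.einsum("ij,ij->i", Ks @ np.linalg.inv(K), Ks)
    # Lower confidence bound: prefer low predicted runtime, keep some exploration.
    lcb = mu - 1.5 * np.sqrt(np.maximum(var, 0.0))
    lcb[observed] = np.inf               # do not re-measure known configurations
    i = int(np.argmin(lcb))
    observed.append(i)
    y.append(runtime(*X[i]))

best = observed[int(np.argmin(y))]
print("best configuration:", {"threads": int(X[best, 0]), "chunk": int(X[best, 1])},
      "runtime ~", round(min(y), 2))
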


HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org