Posts
Jun, 9
Fast and Practical Strassen’s Matrix Multiplication using FPGAs
Matrix multiplication is a cornerstone operation in a wide array of scientific fields, including machine learning and computer graphics. The standard algorithm for matrix multiplication has a complexity of O(n^3) for n×n matrices. Strassen’s algorithm improves this to O(n^2.807), but its practicality is limited for small to medium matrix sizes due to the large number […]
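To make the gap concrete, here is a minimal sketch of one level of Strassen’s recursion in NumPy, assuming square matrices of even size. It is the textbook scheme (seven quadrant products instead of eight), not the paper’s FPGA design:

```python
import numpy as np

# One level of Strassen's recursion: 7 quadrant products instead of
# the classical algorithm's 8. Assumes square matrices of even size.
def strassen_step(A, B):
    h = A.shape[0] // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7   # C11
    C[:h, h:] = M3 + M5             # C12
    C[h:, :h] = M2 + M4             # C21
    C[h:, h:] = M1 - M2 + M3 + M6   # C22
    return C
```

Recursing on the seven products down to a cutoff gives the O(n^log2(7)) ≈ O(n^2.807) bound; the extra additions and memory traffic are what hurt at small and medium sizes.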
Jun, 9
Helix: Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs
This paper introduces Helix, a distributed system for high-throughput, low-latency large language model (LLM) serving on heterogeneous GPU clusters. A key idea behind Helix is to formulate LLM inference over heterogeneous GPUs and network connections as a max-flow problem on a directed, weighted graph, whose nodes represent GPU instances and edges capture both […]
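As a toy illustration of the max-flow framing, not Helix’s actual formulation, per-GPU compute limits can be folded into the graph with the standard node-splitting trick and solved with networkx; the cluster layout and capacity numbers below are hypothetical:

```python
import networkx as nx

G = nx.DiGraph()

# Hypothetical per-GPU throughput caps, modeled via node splitting:
# an internal edge gpu_in -> gpu_out bounds the node's throughput.
gpus = {"a100": 300, "l4": 120, "t4": 60}
for name, cap in gpus.items():
    G.add_edge(f"{name}_in", f"{name}_out", capacity=cap)

# Edge capacities model network bandwidth between instances.
G.add_edge("src", "a100_in", capacity=400)
G.add_edge("src", "l4_in", capacity=400)
G.add_edge("a100_out", "t4_in", capacity=100)
G.add_edge("l4_out", "t4_in", capacity=100)
G.add_edge("t4_out", "sink", capacity=400)

flow_value, _ = nx.maximum_flow(G, "src", "sink")
print(flow_value)  # throughput bound for this toy pipeline
```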
Jun, 2
Addressing Challenges in Utilizing GPUs for Accelerating Privacy-Preserving Computation
Cloud computing increasingly handles confidential data, as in private inference and private database queries. Two strategies are used for secure computation: (1) employing CPU Trusted Execution Environments (TEEs) like AMD SEV, Intel SGX, or ARM TrustZone, and (2) utilizing emerging cryptographic methods like Fully Homomorphic Encryption (FHE) with libraries such as HElib, Microsoft SEAL, and PALISADE. To […]
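The cryptographic route computes directly on ciphertexts. As a toy illustration of that property alone, here is a Paillier scheme in Python; it is additively homomorphic, not fully homomorphic, far simpler than what HElib, SEAL, or PALISADE implement, and its tiny hard-coded primes are insecure:

```python
import math, random

p, q = 293, 433                  # demo-sized primes, insecure
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)             # valid because g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(2, n)
    return pow(1 + n, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

c1, c2 = encrypt(20), encrypt(22)
# Multiplying ciphertexts adds the underlying plaintexts.
assert decrypt(c1 * c2 % n2) == 42
```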
Jun, 2
Machine learning enhanced code optimization for high-level synthesis (ML-ECOHS)
While Field-Programmable Gate Arrays (FPGAs) exist in many design configurations throughout the data center, cloud, and edge, the performance and flexibility the FPGA promises often go unrealized for lack of hardware design expertise, and most computation remains in fixed hardware such as CPUs, GPUs, and ASICs (e.g., tensor processors). Identifying programmability as […]
Jun, 2
Evaluation of computational and energy performance in matrix multiplication algorithms on CPU and GPU using MKL, cuBLAS and SYCL
Matrix multiplication is fundamental to the backpropagation algorithm used to train deep neural network models. Libraries like Intel’s MKL and NVIDIA’s cuBLAS implement new, optimized matrix multiplication techniques that increase performance and reduce computational cost. These techniques can also be implemented in CUDA and SYCL, and in functions using AVX2 and AVX-512 instructions, which have […]
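As a minimal sketch of the kind of measurement such an evaluation involves, the snippet below times a single-precision GEMM through NumPy, which dispatches to whatever optimized BLAS it was built against (e.g. MKL or OpenBLAS); the matrix size is an arbitrary choice:

```python
import time
import numpy as np

n = 4096
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

A @ B                              # warm-up run
t0 = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - t0

# A classical n x n GEMM performs 2*n^3 floating-point operations.
print(f"{2 * n**3 / elapsed / 1e9:.1f} GFLOP/s")
```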
Jun, 2
An implementation of tensor product patch smoothers on GPU
We present a GPU implementation of vertex-patch smoothers for higher order finite element methods in two and three dimensions. Analysis shows that they are not memory bound with respect to GPU DRAM, but with respect to on-chip scratchpad memory. Multigrid operations are optimized through localization and reorganized local operations in on-chip memory, achieving minimal global […]
Jun, 2
A Survey of Cloud-Based GPU Threats and Their Impact on AI, HPC, and Cloud Computing
Graphics processing units (GPUs) are the hardware engines driving the AI revolution. Large language model (LLM)-powered generative AI (GenAI) became mainstream with the public release of OpenAI’s ChatGPT. AI usage has given rise to innovative AI-powered applications for businesses, productivity, image generation, video generation, data analysis, and social media, among others. Powering AI applications are […]
May, 26
Enabling full-speed random access to the entire memory on the A100 GPU
We describe some features of the A100 memory architecture. In particular, we give a technique to reverse-engineer some hardware layout information. Using this information, we show how to avoid TLB issues to obtain full-speed random HBM access to the entire memory, as long as we constrain any particular thread to a reduced access window of […]
May, 26
ArchesWeather: An efficient AI weather forecasting model at 1.5° resolution
One of the guiding principles for designing AI-based weather forecasting systems is to embed physical constraints as inductive priors in the neural network architecture. A popular prior is locality, where the atmospheric data is processed with local neural interactions, like 3D convolutions or 3D local attention windows as in Pangu-Weather. On the other hand, some […]
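As a minimal illustration of the locality prior, a 3D convolution ties each output to a small neighborhood of the (level, lat, lon) grid; the shape below is a hypothetical 1.5° grid, not ArchesWeather’s actual architecture:

```python
import torch
import torch.nn as nn

# Hypothetical input: batch, channels, pressure levels, lat, lon
x = torch.randn(1, 8, 13, 121, 240)   # 1.5 deg: 121 x 240 grid

# Locality prior: each output depends only on a 3x3x3 neighborhood.
local = nn.Conv3d(in_channels=8, out_channels=8,
                  kernel_size=3, padding=1)
print(local(x).shape)  # torch.Size([1, 8, 13, 121, 240])
```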
May, 26
GPU Implementations for Midsize Integer Addition and Multiplication
This paper explores practical aspects of using a high-level functional language for GPU-based arithmetic on “midsize” integers. By this we mean integers of up to about a quarter million bits, which is sufficient for most practical purposes. The goal is to understand whether it is possible to support efficient nested-parallel programs with a small, flexible […]
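The core representation in such work is limb-based: an integer becomes an array of machine words. Below is a minimal sequential sketch in Python (the paper itself targets a parallel functional language; on a GPU the carry chain becomes a parallel scan):

```python
BASE = 1 << 32   # 32-bit limbs, least-significant limb first

def bignum_add(a, b):
    # Schoolbook addition with sequential carry propagation;
    # assumes equal-length limb lists.
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s % BASE)
        carry = s // BASE
    if carry:
        out.append(carry)
    return out

# (2^64 - 1) + 1 == 2^64:
print(bignum_add([0xFFFFFFFF, 0xFFFFFFFF], [1, 0]))  # [0, 0, 1]
```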
May, 26
STuning-DL: Model-Driven Autotuning of Sparse GPU Kernels for Deep Learning
The relentless growth of modern Machine Learning models has spurred the adoption of sparsification techniques that simplify their architectures and reduce their computational demands. Network pruning has demonstrated success in maintaining original network accuracy while shedding significant portions of the original weights. However, leveraging this sparsity efficiently remains challenging due to computational irregularities, particularly in […]
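As a minimal illustration of the setting, here is magnitude pruning followed by a stock CSR kernel in SciPy, not the paper’s autotuned kernels; the shapes and 90% sparsity level are arbitrary:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)

# Magnitude pruning: zero out the 90% smallest weights.
W[np.abs(W) < np.quantile(np.abs(W), 0.9)] = 0.0

W_sparse = csr_matrix(W)   # compressed sparse row storage
x = rng.standard_normal((1024, 16)).astype(np.float32)

# Same math as the dense product, but the sparse kernel's speed
# depends on the irregular nonzero pattern -- what autotuning targets.
assert np.allclose(W_sparse @ x, W @ x, atol=1e-3)
```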
May, 26
Kernel-Centric Optimizations for Deep Neural Networks on GPGPU
Deep learning has achieved remarkable success across various domains, ranging from computer vision to healthcare. General-Purpose Graphics Processing Unit (GPGPU) is one of the major driving forces behind this revolution. GPGPUs offer massive parallel computational power, enabling the training and deployment of large-scale neural networks within practical time and resource constraints. Their programmability also enables […]