Posts

Sep, 7

CrossTL: A Universal Programming Language Translator with Unified Intermediate Representation

We present CrossTL, a universal programming language translator enabling bidirectional translation between multiple languages through a unified intermediate representation called CrossGL. Traditional approaches require separate translators for each language pair, leading to exponential complexity growth. CrossTL uses a single universal IR to facilitate translations between CUDA, HIP, Metal, DirectX HLSL, OpenGL GLSL, Vulkan SPIR-V, Rust, […]
Sep, 7

AnnotationGym: A Generic Framework for Automatic Source Code Annotation

A common approach to code optimization is to insert compiler hints in the source code using annotations. Two major challenges with using annotations effectively are their complexity and lack of portability. This means, first, that significant developer expertise is required, and, second, that the supported annotations, as well as their syntax and use, can vary […]
Sep, 7

Harnessing Batched BLAS/LAPACK Kernels on GPUs for Parallel Solutions of Block Tridiagonal Systems

We present a GPU implementation for the factorization and solution of block-tridiagonal symmetric positive definite linear systems, which commonly arise in time-dependent estimation and optimal control problems. Our method employs a recursive algorithm based on Schur complement reduction, transforming the system into a hierarchy of smaller, independent blocks that can be efficiently solved in parallel […]
Sep, 7

GPU-acceleration of the Discontinuous Galerkin Shallow Water Equations Solver (DG-SWEM) using CUDA and OpenACC

This paper presents a porting of DG-SWEM, a discontinuous Galerkin solver for coastal ocean circulation, and in particular storm surge, to GPU using two separate approaches: CUDA Fortran and OpenACC. Time-explicit discontinuous Galerkin methods have been shown to exhibit a large amount of data parallelism due to the loose coupling between elements, and thus are […]
Sep, 7

Managing Multi Instance GPUs for High Throughput and Energy Savings

Modern GPUs such as the Ampere series (A30, A100) as well as the Hopper series (H100, H200) offer performance as well as security isolation features. They also support a good amount of concurrency, but taking advantage of it can be quite challenging due to the complex constraints on partitioning the chip. In […]
Aug, 31

Scalable Engine and the Performance of Different LLM Models in a SLURM based HPC architecture

This work elaborates on a high-performance computing (HPC) architecture based on Simple Linux Utility for Resource Management (SLURM) [1] for deploying heterogeneous Large Language Models (LLMs) into a scalable inference engine. Dynamic resource scheduling and seamless integration of containerized microservices have been leveraged herein to manage CPU, GPU, and memory allocations efficiently in multi-node […]
Aug, 31

Dissecting CPU-GPU Unified Physical Memory on AMD MI300A APUs

Discrete GPUs are a cornerstone of HPC and data center systems, requiring management of separate CPU and GPU memory spaces. Unified Virtual Memory (UVM) has been proposed to ease the burden of memory management; however, at a high cost in performance. The recent introduction of AMD’s MI300A Accelerated Processing Units (APUs)–as deployed in the El […]
Aug, 31

Scaling GPU-Accelerated Databases beyond GPU Memory Size

There has been considerable interest in leveraging GPUs’ computational power and high memory bandwidth for analytical database workloads. However, their limited memory capacity remains a fundamental limitation for databases whose sizes far exceed the GPU memory size. This challenge is exacerbated by the slow PCIe data transfer speed, which creates a bottleneck in overall system […]
Aug, 31

BePilot: An AI Programming Assistant for Compiler Backend Development

Compiler backends are tasked with generating executable machine code for various processors. As the diversity of processors continues to grow, it is imperative for programmers to tailor specific compiler backends to accommodate each one. However, compiler backend development remains a labor-intensive and time-consuming process, with limited automation tools available. Although large language models (LLMs) have […]
Aug, 31

Accelerating a Linear Programming Algorithm on AMD GPUs

Linear Programming (LP) is a foundational optimization technique with widespread applications in finance, energy trading, and supply chain logistics. However, traditional Central Processing Unit (CPU)-based LP solvers often struggle to meet the latency and scalability demands of dynamic, high-dimensional industrial environments, creating a significant computational challenge. This project addresses these limitations by accelerating linear programming […]
Aug, 24

Profiling Concurrent Vision Inference Workloads on NVIDIA Jetson – Extended

The proliferation of IoT devices and advancements in network technologies have intensified the demand for real-time data processing at the network edge. To address these demands, low-power AI accelerators, particularly GPUs, are increasingly deployed for inference tasks, enabling efficient computation while mitigating cloud-based systems’ latency and bandwidth limitations. Despite their growing deployment, GPUs remain underutilised […]
Aug, 24

Towards Efficient and Practical GPU Multitasking in the Era of LLM

GPU singletasking is becoming increasingly inefficient and unsustainable as hardware capabilities grow and workloads diversify. We are now at an inflection point where GPUs must embrace multitasking, much like CPUs did decades ago, to meet the demands of modern AI workloads. In this work, we highlight the key requirements for GPU multitasking, examine prior efforts, […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hpgu.org