Posts
Jul, 20
Specx: a C++ task-based runtime system for heterogeneous distributed architectures
Parallelization is needed everywhere, from laptops and mobile phones to supercomputers. Among parallel programming models, task-based programming has demonstrated strong potential and is widely used in high-performance scientific computing. Not only does it allow efficient parallelization across distributed heterogeneous computing nodes, but it also enables elegant source-code structuring by describing hardware-independent algorithms. […]
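To make the task-based model concrete, here is a minimal sketch of a two-task dependency chain using standard C++ futures. This is illustrative only and not the Specx API; runtimes like Specx build the task graph automatically from declared read/write accesses on shared data rather than from explicit futures.

```cpp
// Illustrative only: a task-graph flavor with standard C++ futures, not Specx.
#include <future>
#include <iostream>
#include <vector>

int main() {
    // Task A: produce a block of data (runs asynchronously).
    std::future<std::vector<int>> a = std::async(std::launch::async, [] {
        return std::vector<int>(1000, 1);
    });

    // Task B: consumes A's output; the dependency is carried by the future.
    std::future<long> b = std::async(std::launch::async, [&a] {
        std::vector<int> data = a.get();   // waits for task A
        long sum = 0;
        for (int v : data) sum += v;
        return sum;
    });

    std::cout << "sum = " << b.get() << "\n";  // waits for task B
}
```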
Jul, 13
Mutual-Supervised Learning for Sequential-to-Parallel Code Translation
The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet the inherent complexity of parallel programming creates demand for automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still […]
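A minimal illustration of the translation task itself (not an example from the paper): the canonical saxpy loop as a serial C++ source and a hand-written CUDA target, where one GPU thread takes over one loop iteration. Learned translators aim to produce the latter from the former.

```cpp
// Serial source.
void saxpy_serial(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// Parallel CUDA target: one thread per loop iteration.
__global__ void saxpy_kernel(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```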
Jul, 13
KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, […]
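For context on what "reactive and threshold-based" means here, the sketch below implements the scaling rule documented for Kubernetes' HPA, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). It only reacts after a metric has already drifted, which is the behavior KIS-S replaces with RL-based proactive scaling. This is a generic illustration, not project code.

```cpp
#include <algorithm>
#include <cmath>

// Reactive HPA-style rule (per the Kubernetes docs); divisor must be > 0.
int hpa_desired_replicas(int currentReplicas, double currentMetric,
                         double targetMetric, int minReplicas, int maxReplicas) {
    double ratio = currentMetric / targetMetric;
    int desired = static_cast<int>(std::ceil(currentReplicas * ratio));
    return std::clamp(desired, minReplicas, maxReplicas);
}
```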
Jul, 13
Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, its internal design remains largely opaque. The orchestration of communication channels, selection of protocols, and handling of memory movement across devices and nodes are not well understood, making it […]
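For reference, here is a minimal single-process, multi-GPU all-reduce through NCCL's public API (error handling omitted; assumes at most 8 devices). Everything below the API line, such as which ring or tree algorithm and which transfer protocol get selected, is exactly the opaque machinery the paper analyzes.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

void all_reduce_demo(float** sendbuf, float** recvbuf, size_t count, int nDev) {
    ncclComm_t comms[8];
    cudaStream_t streams[8];
    int devs[8] = {0, 1, 2, 3, 4, 5, 6, 7};
    ncclCommInitAll(comms, nDev, devs);  // one communicator per GPU
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamCreate(&streams[i]);
    }
    ncclGroupStart();  // fuse the per-GPU calls into one collective
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();
    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
    }
    for (int i = 0; i < nDev; ++i) ncclCommDestroy(comms[i]);
}
```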
Jul, 13
Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs
This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 […]
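The headline metric, throughput per watt, is simple to compute; the sketch below uses made-up numbers purely to show the arithmetic, not the paper's measurements.

```cpp
#include <cstdio>

int main() {
    double tokens = 100000.0;  // tokens generated during the window (assumed)
    double seconds = 250.0;    // wall-clock duration (assumed)
    double avg_watts = 75.0;   // mean device power draw (assumed)
    double tps = tokens / seconds;          // throughput: 400 tok/s
    double tps_per_watt = tps / avg_watts;  // ~5.33 tok/s/W
    std::printf("%.1f tok/s, %.2f tok/s/W\n", tps, tps_per_watt);
}
```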
Jul, 13
Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems
As GPU-using tasks become more common in embedded, safety-critical systems, efficiency demands necessitate sharing a single GPU among multiple tasks. Unfortunately, existing ways to schedule multiple tasks onto a GPU often result in either a lost ability to meet deadlines or a loss of efficiency. In this work, we develop a system-level spatial compute […]
Jul, 6
Efficient GPU Implementation of Multi-Precision Integer Division
Efficient arithmetic on multi-precision integers is a cornerstone of many scientific and cryptographic applications that require computations on integers exceeding the native sizes supported by modern processors. While GPU-efficient addition and multiplication have been well explored, division has received less attention due to its greater algorithmic complexity. This thesis attempts to bridge […]
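As a baseline for what multi-precision division involves, here is the sequential schoolbook "short division" of a multi-limb integer by a single 32-bit word, using 64-bit intermediates. This is the base case that general multi-word algorithms (e.g., Knuth's Algorithm D) build on; it is a generic sketch, not the thesis's GPU implementation.

```cpp
#include <cstdint>
#include <vector>

// Divide a little-endian multi-limb integer in place by a nonzero 32-bit
// word; returns the remainder. Each step divides (carry:limb) by the divisor.
uint32_t div_by_word(std::vector<uint32_t>& limbs, uint32_t divisor) {
    uint64_t rem = 0;
    for (size_t i = limbs.size(); i-- > 0; ) {
        uint64_t cur = (rem << 32) | limbs[i];
        limbs[i] = static_cast<uint32_t>(cur / divisor);
        rem = cur % divisor;
    }
    return static_cast<uint32_t>(rem);
}
```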
Jul, 6
P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code
We present P4OMP, a retrieval-augmented framework for transforming serial C/C++ code into OpenMP-annotated parallel code using large language models (LLMs). To our knowledge, this is the first system to apply retrieval-based prompting for OpenMP pragma correctness without model fine-tuning or compiler instrumentation. P4OMP leverages Retrieval-Augmented Generation (RAG) with structured instructional knowledge from OpenMP tutorials to […]
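The flavor of transformation involved, shown as an illustrative before/after pair (not an example from the paper): a serial reduction loop and a correctly annotated OpenMP version. Pragma correctness is the crux; inserting a bare `#pragma omp parallel for` without the reduction clause would race on `sum`.

```cpp
#include <cstddef>

double dot_serial(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}

double dot_openmp(const double* x, const double* y, std::size_t n) {
    double sum = 0.0;
    // The reduction clause gives each thread a private accumulator and
    // combines them safely at the end of the parallel region.
    #pragma omp parallel for reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) sum += x[i] * y[i];
    return sum;
}
```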
Jul, 6
ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks
GPGPU architectures have become significantly diverse in recent years, which has led to the emergence of a variety of specialized programming models and software stacks to support them. While portable execution models exist, they still require significant developer effort to port to and optimize for different hardware architectures. Recent advances in large language models (LLMs) […]
Jul, 6
Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication
Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor cores and CUDA cores to accelerate sparse operators. The former brings superior computing power but only for structured matrix multiplication, while the latter has relatively lower performance but with higher programming […]
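For readers unfamiliar with the operator, the reference semantics of SpMM (sparse A in CSR format times dense B, yielding dense C) are shown below as plain CPU code. It defines what Libra's kernels must compute, not how Libra maps the work onto Tensor and CUDA cores.

```cpp
#include <vector>

// C (m x n, row-major, pre-zeroed) += A (m x k, CSR) * B (k x n, row-major).
void spmm_csr(int m, int n,
              const std::vector<int>& rowptr,   // size m+1
              const std::vector<int>& colidx,   // size nnz
              const std::vector<float>& vals,   // size nnz
              const std::vector<float>& B,
              std::vector<float>& C) {
    for (int row = 0; row < m; ++row)
        for (int p = rowptr[row]; p < rowptr[row + 1]; ++p) {
            int k = colidx[p];    // column of A = row of B
            float a = vals[p];
            for (int j = 0; j < n; ++j)
                C[row * n + j] += a * B[k * n + j];
        }
}
```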
Jul, 6
Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing
Rare earth (RE)-free permanent magnets, as alternatives to RE-containing magnets for sustainable energy technologies and modern electronics, have attracted considerable interest. We performed a comprehensive search for new hard magnetic materials in the ternary Fe-Co-Zr space by leveraging a scalable, machine learning-assisted materials discovery framework running on GPU-enabled exascale computing resources. This framework integrates […]
Jun, 29
WiLLM: An Open Wireless LLM Communication System
The rapid evolution of LLMs threatens to overwhelm existing wireless infrastructure, necessitating architectural innovations for burgeoning mobile LLM services. This paper introduces WiLLM, the first open-source wireless system specifically designed for these services. First, we establish a new paradigm by deploying LLMs in core networks (CNs) with abundant GPUs. This enables distributed inference services, strategically […]

