Posts
Jul, 13
Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs
This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 […]
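To make the headline metric concrete: energy efficiency here means throughput per watt, which is equivalent to tokens per joule. A worked example with illustrative numbers (not the paper's measurements):

    \text{efficiency} = \frac{\text{throughput}}{\text{power}} = \frac{\text{tokens/s}}{\text{W}} = \text{tokens/J},
    \qquad \text{e.g. } \frac{1200~\text{tokens/s}}{480~\text{W}} = 2.5~\text{tokens/J}.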
Jul, 13
Hardware Compute Partitioning on NVIDIA GPUs for Composable Systems
As GPU-using tasks become more common in embedded, safety-critical systems, efficiency demands necessitate sharing a single GPU among multiple tasks. Unfortunately, existing ways to schedule multiple tasks onto a GPU often result in either a loss of the ability to meet deadlines or a loss of efficiency. In this work, we develop a system-level spatial compute […]
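For readers new to GPU spatial sharing, here is a minimal sketch of one stock, coarse-grained mechanism: NVIDIA's MPS active-thread-percentage cap. This is not the paper's system; it only caps a client's compute occupancy rather than guaranteeing SM-level isolation, which is part of what motivates finer-grained approaches.

    #include <cstdio>
    #include <cstdlib>

    int main() {
        // Under NVIDIA MPS, a client process can cap the fraction of the GPU's
        // execution resources it may occupy by setting this variable before
        // its CUDA context is created. Two tasks capped at 50% each share the
        // GPU coarsely; this is occupancy limiting, not true partitioning.
        setenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", "50", 1);
        // ... initialize CUDA and launch this task's kernels as usual ...
        printf("MPS client limited to ~50%% of the GPU's compute resources\n");
        return 0;
    }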
Jul, 6
Efficient GPU Implementation of Multi-Precision Integer Division
Efficient arithmetic on multi-precision integers is a cornerstone of many scientific and cryptographic applications that require computations on integers that exceed the native sizes supported by modern processors. While GPU-efficient addition and multiplication have been well explored, division has received less attention due to its greater algorithmic complexity. This thesis attempts to bridge […]
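As a concrete reference point, here is the classic "short division" building block: dividing a multi-limb integer by a single machine word. This is a minimal CPU sketch only; full multi-word division (e.g. Knuth's Algorithm D, or Newton-Raphson reciprocal methods, which tend to map better to GPUs) builds on steps like this, and its sequential carry chain is what makes GPU parallelization hard.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Divide a little-endian multi-limb integer by a 32-bit divisor,
    // writing the quotient in place and returning the remainder.
    uint32_t short_div(std::vector<uint32_t>& limbs, uint32_t d) {
        uint64_t rem = 0;
        for (size_t i = limbs.size(); i-- > 0; ) {   // most significant limb first
            uint64_t cur = (rem << 32) | limbs[i];
            limbs[i] = static_cast<uint32_t>(cur / d);
            rem = cur % d;
        }
        return static_cast<uint32_t>(rem);
    }

    int main() {
        std::vector<uint32_t> n = {0x89ABCDEFu, 0x01234567u}; // 0x0123456789ABCDEF
        uint32_t r = short_div(n, 10);
        std::printf("quotient = 0x%08X%08X, remainder = %u\n", n[1], n[0], r);
    }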
Jul, 6
Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication
Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor cores and CUDA cores to accelerate sparse operators. The former brings superior computing power but only for structured matrix multiplication, while the latter has relatively lower performance but with higher programming […]
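To pin down the operator itself, here is a scalar CSR SpMM reference, C = A * B with A sparse and B dense. This is a CPU baseline with toy sizes, not the paper's kernels; Libra's contribution is in how such irregular loops are mapped onto Tensor cores and CUDA cores.

    #include <cstdio>
    #include <vector>

    // Reference CSR SpMM: C (m x n) += A (sparse, m x k) * B (dense, k x n).
    void spmm_csr(int m, int n,
                  const std::vector<int>& rowptr,
                  const std::vector<int>& colidx,
                  const std::vector<float>& vals,
                  const std::vector<float>& B,   // k x n, row-major
                  std::vector<float>& C) {       // m x n, row-major
        for (int i = 0; i < m; ++i)
            for (int p = rowptr[i]; p < rowptr[i + 1]; ++p)
                for (int j = 0; j < n; ++j)
                    C[i * n + j] += vals[p] * B[colidx[p] * n + j];
    }

    int main() {
        // 2x3 sparse A with nonzeros (0,1)=2 and (1,2)=3; B is 3x2 ones.
        std::vector<int>   rowptr = {0, 1, 2};
        std::vector<int>   colidx = {1, 2};
        std::vector<float> vals   = {2.0f, 3.0f};
        std::vector<float> B(3 * 2, 1.0f), C(2 * 2, 0.0f);
        spmm_csr(2, 2, rowptr, colidx, vals, B, C);
        std::printf("C = [%g %g; %g %g]\n", C[0], C[1], C[2], C[3]);
    }

Tensor cores only consume dense tiles, so exploiting them for an arbitrary sparsity pattern requires gathering nonzeros into structured blocks first; that is precisely the trade-off against flexible but slower CUDA-core execution described above.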
Jul, 6
Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing
Rare-earth (RE)-free permanent magnets, as substitutes for RE-containing magnets in sustainable energy technologies and modern electronics, have attracted considerable interest. We performed a comprehensive search for new hard magnetic materials in the ternary Fe-Co-Zr space by leveraging a scalable, machine learning-assisted materials discovery framework running on GPU-enabled exascale computing resources. This framework integrates […]
Jul, 6
P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code
We present P4OMP, a retrieval-augmented framework for transforming serial C/C++ code into OpenMP-annotated parallel code using large language models (LLMs). To our knowledge, this is the first system to apply retrieval-based prompting for OpenMP pragma correctness without model fine-tuning or compiler instrumentation. P4OMP leverages Retrieval-Augmented Generation (RAG) with structured instructional knowledge from OpenMP tutorials to […]
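An illustrative input/output pair for the kind of transformation P4OMP targets (a hypothetical example, not actual system output):

    // Serial input:
    void saxpy(int n, float a, const float* x, float* y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    // Desired OpenMP-annotated output (compile with -fopenmp):
    void saxpy_omp(int n, float a, const float* x, float* y) {
        #pragma omp parallel for schedule(static)   // iterations are independent
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

Pragma correctness hinges on details such as shared/private clauses and loop-carried dependences, which is where the retrieved tutorial knowledge is meant to help.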
Jul, 6
ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks
GPGPU architectures have become significantly more diverse in recent years, which has led to the emergence of a variety of specialized programming models and software stacks to support them. While portable execution models exist, they still require significant developer effort to port to and optimize for different hardware architectures. Recent advances in large language models (LLMs) […]
Jun, 29
WiLLM: An Open Wireless LLM Communication System
The rapid evolution of LLMs threatens to overwhelm existing wireless infrastructure, necessitating architectural innovations for burgeoning mobile LLM services. This paper introduces WiLLM, the first open-source wireless system specifically designed for these services. First, we establish a new paradigm by deploying LLMs in core networks (CNs) with abundant GPUs. This enables distributed inference services, strategically […]
Jun, 29
Survey of HPC in US Research Institutions
The rapid growth of AI, data-intensive science, and digital twin technologies has driven an unprecedented demand for high-performance computing (HPC) across the research ecosystem. While national laboratories and industrial hyperscalers have invested heavily in exascale and GPU-centric architectures, university-operated HPC systems remain comparatively under-resourced. This survey presents a comprehensive assessment of the HPC landscape across […]
Jun, 29
Omniwise: Predicting GPU Kernels Performance with LLMs
In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful architectures have transformed a wide range of downstream applications, tackling tasks beyond human reach. In this paper, we introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline […]
Jun, 29
GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis
To design next-generation Graphics Processing Units (GPUs), architects rely on performance analyses to identify key bottlenecks and explore design spaces. Unfortunately, existing GPU performance analysis mechanisms make it difficult to conduct such analyses both quickly and accurately, and they can provide misleading insights into GPU […]
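Stall cycle accounting generalizes the classic CPI-stack decomposition, in which total execution cycles are attributed to a base component plus per-cause stall components (the standard formulation, not necessarily the paper's exact model):

    \text{Cycles}_{\text{total}} \;=\; \text{Cycles}_{\text{base}} \;+\; \sum_{c} \text{Cycles}_{\text{stall},\,c},
    \qquad c \in \{\text{memory latency},\ \text{synchronization},\ \dots\}

Finer-grained accounting means splitting those stall causes more precisely, so that each lost cycle is charged to the bottleneck actually responsible for it.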
Jun, 29
No More Shading Languages: Compiling C++ to Vulkan Shaders
Graphics APIs have traditionally relied on shading languages; however, these languages have a number of fundamental defects and limitations. By contrast, GPU compute platforms offer powerful, feature-rich languages suitable for heterogeneous compute. We propose reframing shading languages as embedded domain-specific languages, layered on top of a more general language like C++, doing away with traditional […]
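To give a flavor of what "shaders as ordinary C++" means, here is a hypothetical fragment-shader-style function written in plain C++ (illustrative only; under the proposed approach such code would be compiled to SPIR-V for Vulkan rather than rewritten in GLSL or HLSL):

    #include <algorithm>

    struct vec3 { float x, y, z; };

    inline float dot(vec3 a, vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

    // Lambertian diffuse term. Nothing here is shader-specific: the same
    // function can be unit-tested on the host with a normal C++ compiler
    // and, in principle, compiled for the GPU by such a toolchain.
    float lambert(vec3 normal, vec3 light_dir) {
        return std::max(0.0f, dot(normal, light_dir));
    }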