Posts
Jun 29
Survey of HPC in US Research Institutions
The rapid growth of AI, data-intensive science, and digital twin technologies has driven an unprecedented demand for high-performance computing (HPC) across the research ecosystem. While national laboratories and industrial hyperscalers have invested heavily in exascale and GPU-centric architectures, university-operated HPC systems remain comparatively under-resourced. This survey presents a comprehensive assessment of the HPC landscape across […]
Jun 29
Omniwise: Predicting GPU Kernels Performance with LLMs
In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful architectures have transformed a wide range of downstream applications, tackling tasks previously out of reach. In this paper, we introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline […]
Jun 29
GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis
To design next-generation Graphics Processing Units (GPUs), architects rely on performance analyses to identify key bottlenecks and explore design spaces. Unfortunately, existing analysis mechanisms make it difficult for GPU architects to conduct fast and accurate performance analyses. These mechanisms can provide misleading insights into GPU […]
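The excerpt stops before the mechanism, but the title points at fine-grained stall cycle accounting. As a rough illustration of what cycle-stack accounting means in general, here is a minimal sketch; the stall categories and per-cycle attribution rule below are hypothetical, not GCStack's actual taxonomy:

```cpp
// Illustrative sketch only: a generic cycle-stack accounting pass over a
// trace of per-cycle issue states. Categories are hypothetical examples.
#include <array>
#include <cstdio>
#include <vector>

enum class Stall { None, MemData, MemStruct, Sync, Control, Idle };

// Count how many cycles fall into each category to form a "cycle stack":
// total cycles = productive cycles + sum of attributed stall cycles.
std::array<long, 6> buildCycleStack(const std::vector<Stall>& trace) {
    std::array<long, 6> stack{};
    for (Stall s : trace) ++stack[static_cast<int>(s)];
    return stack;
}

int main() {
    std::vector<Stall> trace = {Stall::None, Stall::MemData, Stall::MemData,
                                Stall::Sync, Stall::None, Stall::Idle};
    auto stack = buildCycleStack(trace);
    const char* names[] = {"productive", "mem-data", "mem-struct",
                           "sync", "control", "idle"};
    for (int i = 0; i < 6; ++i)
        std::printf("%-10s %ld cycles\n", names[i], stack[i]);
}
```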
Jun 29
No More Shading Languages: Compiling C++ to Vulkan Shaders
Graphics APIs have traditionally relied on shading languages; however, these languages have a number of fundamental defects and limitations. By contrast, GPU compute platforms offer powerful, feature-rich languages suitable for heterogeneous compute. We propose reframing shading languages as embedded domain-specific languages, layered on top of a more general language like C++, doing away with traditional […]
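To make the embedded-DSL idea concrete, here is a hypothetical sketch of a shader written as plain C++; the vec4 type and the notion of a C++-to-SPIR-V compiler lowering fragmentMain to a Vulkan fragment stage are illustrative assumptions, not the paper's actual toolchain:

```cpp
// Hypothetical sketch of the embedded-DSL idea: a "shader" that is just
// ordinary C++. A (hypothetical) C++-to-SPIR-V compiler would mark and
// lower fragmentMain; a host compiler treats it as a normal function.
struct vec4 { float x, y, z, w; };

inline vec4 mix(vec4 a, vec4 b, float t) {
    return {a.x + (b.x - a.x) * t, a.y + (b.y - a.y) * t,
            a.z + (b.z - a.z) * t, a.w + (b.w - a.w) * t};
}

// Entry point: on device this would become the fragment stage.
vec4 fragmentMain(vec4 baseColor, vec4 tintColor, float amount) {
    return mix(baseColor, tintColor, amount);  // same code both ways
}

int main() {
    vec4 c = fragmentMain({1, 0, 0, 1}, {0, 0, 1, 1}, 0.5f);
    return c.x == 0.5f ? 0 : 1;  // host-side unit test of the "shader"
}
```

The appeal the excerpt hints at: the same function can be unit-tested on the host and compiled for the GPU, with no separate shading-language dialect to maintain.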
Jun 29
WiLLM: An Open Wireless LLM Communication System
The rapid evolution of LLMs threatens to overwhelm existing wireless infrastructure, necessitating architectural innovations for burgeoning mobile LLM services. This paper introduces WiLLM, the first open-source wireless system specifically designed for these services. First, we establish a new paradigm by deploying LLMs in core networks (CNs) with abundant GPUs. This enables distributed inference services, strategically […]
Jun 22
LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters
Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those involving large language models (LLMs). To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity. However, this long-standing assumption has notable limitations, particularly in large-scale, […]
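For context, the long-standing baseline the excerpt questions can be sketched as a greedy proximity-based allocator; the hop-count matrix and the greedy search below are illustrative assumptions, not LiteGD's dispatching algorithm:

```cpp
// Sketch of proximity-based GPU allocation: pick k GPUs minimizing total
// pairwise "distance" (e.g. NVLink/PCIe/NIC hops). Illustrative only.
#include <algorithm>
#include <cstdio>
#include <vector>

// hops[i][j]: communication distance between GPU i and GPU j.
using HopMatrix = std::vector<std::vector<int>>;

// Greedily grow a group of k GPUs around each seed; keep the cheapest group.
std::vector<int> allocateByProximity(const HopMatrix& hops, int k) {
    int n = static_cast<int>(hops.size());
    std::vector<int> best;
    long bestCost = -1;
    for (int seed = 0; seed < n; ++seed) {
        std::vector<int> group{seed};
        while (static_cast<int>(group.size()) < k) {
            int pick = -1; long pickCost = 0;
            for (int g = 0; g < n; ++g) {
                if (std::find(group.begin(), group.end(), g) != group.end())
                    continue;
                long c = 0;
                for (int m : group) c += hops[g][m];
                if (pick < 0 || c < pickCost) { pick = g; pickCost = c; }
            }
            group.push_back(pick);
        }
        long cost = 0;
        for (int a : group) for (int b : group) cost += hops[a][b];
        if (bestCost < 0 || cost < bestCost) { best = group; bestCost = cost; }
    }
    return best;
}

int main() {
    // 4 GPUs: pairs {0,1} and {2,3} share a fast link (1 hop), else 3 hops.
    HopMatrix hops = {{0,1,3,3},{1,0,3,3},{3,3,0,1},{3,3,1,0}};
    for (int g : allocateByProximity(hops, 2)) std::printf("GPU %d\n", g);
}
```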
Jun 22
A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs
Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs are a common platform for machine learning practitioners, but running compact data structures on these devices often leads to […]
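The irregularity the excerpt describes is easy to see in the standard CSR format, where the input-vector read is indirect through the column-index array. This is generic CSR SpMV for illustration, not the paper's compiler transformation:

```cpp
// Why compact sparse formats cause irregular access: in CSR SpMV the read
// x[A.col[j]] is data-dependent, so successive iterations (or GPU threads)
// touch scattered addresses and use the memory hierarchy poorly.
#include <cstdio>
#include <vector>

struct CSR {
    std::vector<int> rowPtr;   // size rows+1: start of each row in col/val
    std::vector<int> col;      // column index of each nonzero
    std::vector<float> val;    // value of each nonzero
};

void spmv(const CSR& A, const std::vector<float>& x, std::vector<float>& y) {
    for (size_t i = 0; i + 1 < A.rowPtr.size(); ++i) {
        float sum = 0.0f;
        for (int j = A.rowPtr[i]; j < A.rowPtr[i + 1]; ++j)
            sum += A.val[j] * x[A.col[j]];  // gather: the irregular access
        y[i] = sum;
    }
}

int main() {
    // 2x3 matrix [[1,0,2],[0,3,0]] in CSR form.
    CSR A{{0, 2, 3}, {0, 2, 1}, {1.f, 2.f, 3.f}};
    std::vector<float> x{1.f, 1.f, 1.f}, y(2);
    spmv(A, x, y);
    std::printf("%g %g\n", y[0], y[1]);  // prints: 3 3
}
```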
Jun 22
A CPU+FPGA OpenCL Heterogeneous Computing Platform for Multi-Kernel Pipeline
Over the past decades, Field-Programmable Gate Arrays (FPGAs) have become a popular choice for heterogeneous computing due to their flexibility, energy efficiency, and processing speed. OpenCL is used in FPGA heterogeneous computing for its high-level abstraction and cross-platform compatibility. Previous works have introduced OpenCL optimization techniques for FPGAs to leverage FPGA-specific advantages. However, the multi-kernel […]
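As background for the multi-kernel setting, a two-stage pipeline on the host side looks roughly like the sketch below, with an event chaining stage2 onto stage1. This shows only the standard OpenCL pattern, not the paper's CPU+FPGA platform or its optimizations; error checking is omitted for brevity:

```cpp
// Generic two-kernel OpenCL pipeline: stage2 consumes stage1's output buffer.
// The event makes the dependency explicit (an in-order queue also implies it).
#define CL_TARGET_OPENCL_VERSION 120
#include <CL/cl.h>
#include <cstdio>

static const char* src = R"(
__kernel void stage1(__global float* buf) { buf[get_global_id(0)] += 1.0f; }
__kernel void stage2(__global float* buf) { buf[get_global_id(0)] *= 2.0f; }
)";

int main() {
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, nullptr);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &dev, nullptr, nullptr, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    float host[4] = {0, 1, 2, 3};
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof host, host, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
    cl_kernel k1 = clCreateKernel(prog, "stage1", &err);
    cl_kernel k2 = clCreateKernel(prog, "stage2", &err);
    clSetKernelArg(k1, 0, sizeof buf, &buf);
    clSetKernelArg(k2, 0, sizeof buf, &buf);

    size_t gsz = 4; cl_event done1;
    clEnqueueNDRangeKernel(q, k1, 1, nullptr, &gsz, nullptr, 0, nullptr, &done1);
    clEnqueueNDRangeKernel(q, k2, 1, nullptr, &gsz, nullptr, 1, &done1, nullptr);
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof host, host, 0, nullptr, nullptr);
    std::printf("%g %g %g %g\n", host[0], host[1], host[2], host[3]); // 2 4 6 8
}
```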
Jun 22
A First Look at Bugs in LLM Inference Engines
Large language model-specific inference engines (hereafter LLM inference engines) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities […]
Jun 22
Engineering Supercomputing Platforms for Biomolecular Applications
A range of computational biology software (GROMACS, AMBER, NAMD, LAMMPS, OpenMM, Psi4, and RELION) was benchmarked on a representative selection of HPC hardware, including AMD EPYC 7742 CPU nodes, NVIDIA V100 and AMD MI250X GPU nodes, and an NVIDIA GH200 testbed. The raw performance, power efficiency, and data storage requirements of the software were evaluated […]
Jun 15
CUDA-LLM: LLMs Can Write Efficient CUDA Kernels
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. However, generating code that is deeply hardware-specific, architecture-aware, and performance-critical, especially for massively parallel GPUs, remains a complex challenge. In this work, we explore the use of LLMs for the automated generation and optimization of CUDA programs, with the goal of producing […]
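Pipelines of this kind typically close the loop with compiler feedback. A minimal sketch, assuming a placeholder requestKernelFromLLM stub and nvcc on the PATH; neither the stub nor this loop is taken from the paper:

```cpp
// Hedged sketch of a generate-compile-check loop an LLM-driven pipeline
// might use: write the candidate kernel to disk, compile it with nvcc,
// and keep only candidates that build (benchmarking would follow).
#include <cstdlib>
#include <fstream>
#include <string>

// Placeholder assumption: a real pipeline would call an LLM API here with
// the task description plus compiler feedback from previous attempts.
std::string requestKernelFromLLM(const std::string& /*feedback*/) {
    return "__global__ void saxpy(int n, float a, const float* x, float* y) {\n"
           "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
           "  if (i < n) y[i] = a * x[i] + y[i];\n"
           "}\n";
}

bool compiles(const std::string& cudaSource) {
    std::ofstream("candidate.cu") << cudaSource;
    // -c: compile only; requires the CUDA toolkit to be installed.
    return std::system("nvcc -c candidate.cu -o candidate.o") == 0;
}

int main() {
    std::string feedback;
    for (int attempt = 0; attempt < 3; ++attempt) {
        if (compiles(requestKernelFromLLM(feedback))) return 0;  // accept
        feedback = "compilation failed";  // would carry the real nvcc log
    }
    return 1;  // no valid kernel produced
}
```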
Jun 15
HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration
The rapid growth of deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, significantly alleviating computational bottlenecks. Meanwhile, owing to entrenched user programming habits and the high performance of GPUs, the CUDA ecosystem has established a dominant […]