
Posts

Jul, 6

Efficient GPU Implementation of Multi-Precision Integer Division

Efficient arithmetic on multi-precision integers is a cornerstone of many scientific and cryptographic applications that require computations on integers exceeding the native word sizes supported by modern processors. While GPU-efficient addition and multiplication have been well explored, division has received less attention due to its greater algorithmic complexity. This thesis attempts to bridge […]
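As a minimal host-side sketch of what multi-precision arithmetic looks like in practice (not the thesis' GPU algorithm), the C++ snippet below stores a big integer as 32-bit limbs and divides it by a single machine word using schoolbook long division; all names and the limb layout are illustrative assumptions.

```cpp
// Minimal sketch, not the thesis' GPU method: a multi-precision integer held
// as little-endian 32-bit limbs, divided by one 32-bit word.
#include <cstdint>
#include <vector>
#include <iostream>

// value = sum(limbs[i] * 2^(32*i))
using BigInt = std::vector<uint32_t>;

// Divide a multi-precision integer by a single 32-bit word.
// Returns the quotient; 'rem' receives the remainder.
BigInt divmod_word(const BigInt& a, uint32_t d, uint32_t& rem) {
    BigInt q(a.size(), 0);
    uint64_t r = 0;
    for (size_t i = a.size(); i-- > 0;) {            // most significant limb first
        uint64_t cur = (r << 32) | a[i];
        q[i] = static_cast<uint32_t>(cur / d);
        r    = cur % d;
    }
    rem = static_cast<uint32_t>(r);
    return q;
}

int main() {
    BigInt a = {0x89ABCDEFu, 0x01234567u};           // 0x0123456789ABCDEF
    uint32_t rem;
    BigInt q = divmod_word(a, 1000u, rem);
    std::cout << std::hex << q[1] << " " << q[0]
              << "  rem=" << std::dec << rem << "\n";
    return 0;
}
```

Dividing by a full multi-limb divisor (and doing so in parallel on a GPU) is considerably harder than this single-word case, which is the gap the thesis addresses.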
Jul, 6

P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

We present P4OMP, a retrieval-augmented framework for transforming serial C/C++ code into OpenMP-annotated parallel code using large language models (LLMs). To our knowledge, this is the first system to apply retrieval-based prompting for OpenMP pragma correctness without model fine-tuning or compiler instrumentation. P4OMP leverages Retrieval-Augmented Generation (RAG) with structured instructional knowledge from OpenMP tutorials to […]
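For illustration only (the example is not taken from the paper), the C++ snippet below shows the kind of transformation P4OMP targets: a serial reduction loop and an OpenMP-annotated version whose reduction(+:sum) clause is exactly the sort of pragma detail an LLM must get right to preserve correctness.

```cpp
// Illustrative serial-to-OpenMP transformation; compile with -fopenmp.
#include <vector>
#include <cstdio>

double dot_serial(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    for (size_t i = 0; i < x.size(); ++i)
        sum += x[i] * y[i];
    return sum;
}

double dot_openmp(const std::vector<double>& x, const std::vector<double>& y) {
    double sum = 0.0;
    // The clause an LLM must get right: 'reduction(+:sum)' avoids a data race
    // on the shared accumulator.
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < static_cast<long>(x.size()); ++i)
        sum += x[i] * y[i];
    return sum;
}

int main() {
    std::vector<double> x(1 << 20, 1.0), y(1 << 20, 2.0);
    std::printf("%f %f\n", dot_serial(x, y), dot_openmp(x, y));
    return 0;
}
```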
Jul, 6

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

GPGPU architectures have become significantly diverse in recent years, which has led to the emergence of a variety of specialized programming models and software stacks to support them. While portable execution models exist, they still require significant developer effort to port to and optimize for different hardware architectures. Recent advances in large language models (LLMs) […]
Jul, 6

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication

Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor cores and CUDA cores to accelerate sparse operators. The former brings superior computing power but only for structured matrix multiplication, while the latter has relatively lower performance but with higher programming […]
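As a hedged reference sketch (sequential CPU code, not the paper's GPU kernels), the C++ snippet below computes SDDMM over a CSR sparsity mask, one of the two sparse operators Libra targets; the data layout and names are illustrative assumptions.

```cpp
// Reference SDDMM: for each nonzero (i, j) of a sparse mask in CSR form,
// compute the dot product of row i of A and row j of B.
#include <vector>
#include <cstdio>

struct CSR {
    int rows;
    std::vector<int>   row_ptr;  // size rows + 1
    std::vector<int>   col_idx;  // size nnz
    std::vector<float> val;      // size nnz (receives the SDDMM output)
};

// A is rows x k, B is cols x k, both row-major; S supplies the sparsity pattern.
void sddmm(CSR& S, const std::vector<float>& A, const std::vector<float>& B, int k) {
    for (int i = 0; i < S.rows; ++i) {
        for (int p = S.row_ptr[i]; p < S.row_ptr[i + 1]; ++p) {
            int j = S.col_idx[p];
            float dot = 0.f;
            for (int t = 0; t < k; ++t)
                dot += A[i * k + t] * B[j * k + t];
            S.val[p] = dot;                 // only sampled positions are computed
        }
    }
}

int main() {
    // 2x3 mask with nonzeros at (0,0), (0,2), (1,1); k = 2.
    CSR S{2, {0, 2, 3}, {0, 2, 1}, {0, 0, 0}};
    std::vector<float> A = {1, 2, 3, 4};        // 2 x 2
    std::vector<float> B = {1, 0, 0, 1, 1, 1};  // 3 x 2
    sddmm(S, A, B, 2);
    for (float v : S.val) std::printf("%g ", v); // expected: 1 3 4
    std::printf("\n");
    return 0;
}
```

The inner dense dot products are what Tensor cores accelerate well, while the irregular outer loops over nonzeros are what CUDA cores handle more flexibly.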
Jul, 6

Accelerated discovery and design of Fe-Co-Zr magnets with tunable magnetic anisotropy through machine learning and parallel computing

Rare earth (RE)-free permanent magnets, as substitutes for RE-containing magnets in sustainable energy technologies and modern electronics, have attracted considerable interest. We performed a comprehensive search for new hard magnetic materials in the ternary Fe-Co-Zr space by leveraging a scalable, machine learning-assisted materials discovery framework running on GPU-enabled exascale computing resources. This framework integrates […]
Jun, 29

WiLLM: An Open Wireless LLM Communication System

The rapid evolution of LLMs threatens to overwhelm existing wireless infrastructure, necessitating architectural innovations for burgeoning mobile LLM services. This paper introduces WiLLM, the first open-source wireless system specifically designed for these services. First, we establish a new paradigm by deploying LLMs in core networks (CNs) with abundant GPUs. This enables distributed inference services, strategically […]
Jun, 29

Survey of HPC in US Research Institutions

The rapid growth of AI, data-intensive science, and digital twin technologies has driven an unprecedented demand for high-performance computing (HPC) across the research ecosystem. While national laboratories and industrial hyperscalers have invested heavily in exascale and GPU-centric architectures, university-operated HPC systems remain comparatively under-resourced. This survey presents a comprehensive assessment of the HPC landscape across […]
Jun, 29

Omniwise: Predicting GPU Kernels Performance with LLMs

In recent years, the rapid advancement of deep neural networks (DNNs) has revolutionized artificial intelligence, enabling models with unprecedented capabilities in understanding, generating, and processing complex data. These powerful architectures have transformed a wide range of downstream applications, tackling tasks beyond human reach. In this paper, we introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline […]
Jun, 29

GCStack+GCScaler: Fast and Accurate GPU Performance Analyses Using Fine-Grained Stall Cycle Accounting and Interval Analysis

To design next-generation Graphics Processing Units (GPUs), GPU architects rely on performance analyses to identify key performance bottlenecks and explore GPU design spaces. Unfortunately, existing GPU performance analysis mechanisms make it difficult for architects to conduct fast and accurate analyses. These mechanisms can provide misleading insights into GPU […]
Jun, 29

No More Shading Languages: Compiling C++ to Vulkan Shaders

Graphics APIs have traditionally relied on shading languages; however, these languages have a number of fundamental defects and limitations. By contrast, GPU compute platforms offer powerful, feature-rich languages suitable for heterogeneous compute. We propose reframing shading languages as embedded domain-specific languages, layered on top of a more general language like C++, doing away with traditional […]
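As a hypothetical illustration of the idea (not the paper's actual frontend or API), the C++ snippet below writes fragment-shader-style logic as an ordinary function that a C++-to-SPIR-V compiler could lower for Vulkan, while the same source still compiles and runs on the CPU for testing; every name here is a placeholder.

```cpp
// Hypothetical, placeholder types and names: shader logic as plain C++.
#include <algorithm>
#include <cstdio>

struct vec3 { float x, y, z; };

inline float dot(vec3 a, vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline vec3  scale(vec3 v, float s) { return {v.x * s, v.y * s, v.z * s}; }

// A Lambertian diffuse term written without any shading-language dialect.
vec3 lambert(vec3 albedo, vec3 normal, vec3 light_dir) {
    float intensity = std::max(dot(normal, light_dir), 0.0f);
    return scale(albedo, intensity);
}

int main() {
    vec3 c = lambert({0.8f, 0.2f, 0.2f}, {0.f, 1.f, 0.f}, {0.f, 1.f, 0.f});
    std::printf("%g %g %g\n", c.x, c.y, c.z);   // expected: 0.8 0.2 0.2
    return 0;
}
```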
Jun, 22

LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters

Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity. However, this long-standing assumption has notable limitations, particularly in large-scale, […]
Jun, 22

A Novel Compiler Transformation for Fast Sparse Matrix Multiplication in GPUs

Sparse data structures are commonly used in neural networks to reduce the memory footprint. These data structures are compact but cause irregularities such as random memory accesses, which prevent efficient use of the memory hierarchy. GPUs are a common platform for machine learning practitioners, but running compact data structures on these devices often leads to […]
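As a minimal sketch of the irregularity the paper describes (not its compiler transformation), the C++ snippet below stores a small matrix in CSR form and multiplies it by a dense vector; the gather through col_idx is the data-dependent, random access pattern that hinders efficient use of the memory hierarchy.

```cpp
// CSR sparse matrix times dense vector; the indirect load x[col_idx[p]] is
// the irregular access that defeats coalescing and caching on GPUs.
#include <vector>
#include <cstdio>

int main() {
    // 3x4 matrix with nonzeros: (0,1)=2, (1,0)=1, (1,3)=5, (2,2)=3
    std::vector<int>   row_ptr = {0, 1, 3, 4};
    std::vector<int>   col_idx = {1, 0, 3, 2};
    std::vector<float> val     = {2, 1, 5, 3};
    std::vector<float> x = {1, 2, 3, 4};          // dense input vector
    std::vector<float> y(3, 0.f);

    for (int i = 0; i < 3; ++i)
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; ++p)
            y[i] += val[p] * x[col_idx[p]];       // data-dependent gather

    std::printf("%g %g %g\n", y[0], y[1], y[2]);  // expected: 4 21 9
    return 0;
}
```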
