Posts

May 18

Comparing Parallel Functional Array Languages: Programming and Performance

Parallel functional array languages are an emerging class of programming languages that promise to combine low-effort parallel programming with good performance and performance portability. We systematically compare the designs and implementations of five different functional array languages: Accelerate, APL, DaCe, Futhark, and SaC. We demonstrate the expressiveness of functional array programming by means of four […]
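The core idioms these languages share can be sketched outside any of the five languages themselves, in NumPy-flavored Python (an illustration only, not code from the paper): computations are built by composing a parallel map with a parallel reduce, and the same composition generalizes over array ranks.

```python
import numpy as np

# A dot product written as the composition of a parallel map (elementwise
# multiply) and a parallel reduce (summation) -- the two building blocks
# that functional array languages expose directly.
def dot(xs, ys):
    return np.sum(xs * ys)          # map (*) then reduce (+)

# The same composition generalizes over ranks: applied to a matrix and a
# vector, it yields a matrix-vector product, one dot product per row.
def matvec(m, v):
    return np.array([dot(row, v) for row in m])

print(matvec(np.array([[1, 2], [3, 4]]), np.array([10, 1])))  # [12 34]
```

In the actual languages the compiler, not the programmer, decides how these map/reduce compositions are fused and mapped onto parallel hardware.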
May 18

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Graph embeddings provide continuous vector representations of nodes in a graph, which are widely applicable in community detection, recommendations, and various scientific fields. However, existing graph embedding systems either face scalability challenges due to the high cost of RAM and multiple GPUs, or rely on disk storage at the expense of I/O efficiency. In this […]
May 18

Can Large Language Models Predict Parallel Code Performance?

Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware — an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a […]
May 18

GPU Performance Portability needs Autotuning

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today’s reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art […]
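The kernel-parameter autotuning idea can be sketched in plain Python, with a toy CPU "kernel" standing in for a GPU kernel and a chunk size playing the role of a block size (a minimal illustration of the technique, not code from the paper):

```python
import time

def autotune(run_kernel, candidates):
    """Return the tuning-parameter value with the lowest measured runtime.

    run_kernel(param) executes the kernel once with the given parameter;
    candidates is the search space (here a simple grid).
    """
    best_param, best_time = None, float("inf")
    for param in candidates:
        run_kernel(param)                  # warm-up run (JIT, caches)
        start = time.perf_counter()
        run_kernel(param)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param

# Toy "kernel": sum a list in chunks; chunk size stands in for a block size.
data = list(range(100_000))
def chunked_sum(chunk):
    return sum(sum(data[i:i + chunk]) for i in range(0, len(data), chunk))

best = autotune(chunked_sum, [64, 256, 1024, 4096])
print("best chunk size:", best)
```

Real autotuners search far larger spaces (tile sizes, unroll factors, thread-block shapes) and cache the winning configuration per device, but the measure-and-select loop is the same.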
May 18

Exploration of Cryptocurrency Mining-Specific GPUs in AI Applications: A Case Study of CMP 170HX

This study systematically tests a computational-power reuse scheme proposed by the open-source community: disabling specific instruction sets (Fused Multiply-Add instructions) through CUDA source-code modifications on the NVIDIA CMP 170HX platform. Experimental results validate the effectiveness of this approach, partially restoring the GPU’s computational capabilities in artificial intelligence (AI) tasks. Performance evaluations […]
May 4

LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning

FPGAs are increasingly adopted in datacenter environments for their reconfigurability and energy efficiency. High-Level Synthesis (HLS) tools have eased FPGA programming by raising the abstraction level from RTL to untimed C/C++, yet attaining high performance still demands expert knowledge and iterative manual insertion of optimization pragmas to modify the microarchitecture. To address this challenge, we […]
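For readers unfamiliar with HLS pragmas, a toy example of the kind of directive insertion the paper automates, using Vitis-style pragma syntax (illustrative only; the pragma choices here are not from LIFT, and standard CPU compilers simply ignore unknown pragmas):

```c
// Vector addition kernel with manually inserted optimization pragmas.
// PIPELINE overlaps successive loop iterations in hardware; UNROLL
// replicates the loop body to expose more parallelism. Choosing such
// pragmas well is the expert task that LIFT aims to automate.
void vadd(const int a[1024], const int b[1024], int out[1024]) {
    for (int i = 0; i < 1024; i++) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=4
        out[i] = a[i] + b[i];
    }
}
```

The functional behavior is unchanged by the pragmas; only the synthesized microarchitecture (latency, resource use) differs, which is why pragma selection is a search problem.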
May 4

Mìmir: A real-time interactive visualization library for CUDA programs

Real-time visualization of computational simulations running on graphics processing units (GPUs) is a valuable feature in modern science and technological research, as it allows researchers to visually assess the quality and correctness of their computational models during the simulation. Due to the high throughput involved in GPU-based simulations, classical visualization approaches such as ones based […]
May 4

Scaling On-Device GPU Inference for Large Generative Models

Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, […]
May 4

Efficient deep learning inference on end devices

Deep Learning (DL) has become a cornerstone of modern Artificial Intelligence (AI), powering applications across healthcare, computer vision, and autonomous systems. However, executing DL inference on resource-constrained end devices—such as smartphones and IoT hardware—poses challenges due to limited computational resources, energy constraints, and real-time requirements. This thesis addresses the optimization of DL inference on Heterogeneous […]
May 4

Dynamic Memory Management on GPUs with SYCL

Dynamic memory allocation is not traditionally available in kernels running on GPUs. This work aims to build on Ouroboros, an efficient dynamic memory management library for CUDA applications, by porting the code to SYCL, a cross-platform accelerator API. Since SYCL can be compiled to a CUDA backend, it is possible to compare the performance of […]
Apr 27

MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications

Modern cutting-edge AI applications are being developed over fast-evolving, heterogeneous, nascent hardware devices. This requires frequent reworking of the AI software stack to adopt bottom-up changes from new hardware, which takes time for general-purpose software libraries. Consequently, real applications often develop custom software stacks optimized for their specific workloads and hardware. Custom stacks help in […]
Apr 27

InteropUnityCUDA: A Tool for Interoperability Between Unity and CUDA

Introduction: Unity is a powerful and versatile tool for creating real-time experiments. It includes a built-in compute shader language, a C-like programming language designed for massively parallel General-Purpose GPU (GPGPU) computing. However, as Unity is primarily developed for multi-platform game creation, its compute shader language has several limitations, including the lack of multi-GPU computation support […]

* * *

HGPU group © 2010-2025 hgpu.org

All rights belong to the respective authors

Contact us:

contact@hgpu.org