high performance computing on graphics processing units: hgpu.org

Posts

May, 25

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

We introduce CASS, the first large-scale dataset and model suite for cross-architecture GPU code transpilation, targeting both source-level (CUDA<->HIP) and assembly-level (Nvidia SASS<->AMD RDNA3) translation. The dataset comprises 70k verified code pairs across host and device, addressing a critical gap in low-level GPU code portability. Leveraging this resource, we train the CASS family of domain-specific […]

CUDA • OpenCL

May, 25

FLASH: Fast All-to-All Communication in GPU Clusters

Scheduling All-to-All communications efficiently is fundamental to minimizing job completion times in distributed systems. Incast and straggler flows can slow down All-to-All transfers, and GPU clusters bring additional straggler challenges due to highly heterogeneous link capacities between technologies like NVLink and Ethernet. Existing schedulers all suffer high overheads relative to theoretically optimal transfers. Classical, simple […]

May, 25

Low-cost edge computing using upcycled smartphones

Smartphone users often replace their devices prematurely for newer models, contributing to the growing issue of waste electrical and electronic equipment (WEEE). Repurposing these devices to extend their life cycle by assigning them new roles can help mitigate this problem. This thesis explores the feasibility of creating a cluster using upcycled smartphones deployed with the […]

May, 18

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Graph embeddings provide continuous vector representations of nodes in a graph, which are widely applicable in community detection, recommendations, and various scientific fields. However, existing graph embedding systems either face scalability challenges due to the high cost of RAM and multiple GPUs, or rely on disk storage at the expense of I/O efficiency. In this […]

CUDA

May, 18

Can Large Language Models Predict Parallel Code Performance?

Accurate determination of the performance of parallel GPU code typically requires execution-time profiling on target hardware — an increasingly prohibitive step due to limited access to high-end GPUs. This paper explores whether Large Language Models (LLMs) can offer an alternative approach for GPU performance prediction without relying on hardware. We frame the problem as a […]

CUDA

May, 18

Comparing Parallel Functional Array Languages: Programming and Performance

Parallel functional array languages are an emerging class of programming languages that promise to combine low-effort parallel programming with good performance and performance portability. We systematically compare the designs and implementations of five different functional array languages: Accelerate, APL, DaCe, Futhark, and SaC. We demonstrate the expressiveness of functional array programming by means of four […]

CUDA • OpenCL

May, 18

GPU Performance Portability needs Autotuning

As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today’s reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with kernel parameter autotuning to enable portable, state-of-the-art […]

CUDA

May, 18

Exploration of Cryptocurrency Mining-Specific GPUs in AI Applications: A Case Study of CMP 170HX

This study systematically tests a computational power reuse scheme proposed by the open source community, which disables specific instructions (Fused Multiply-Add instructions) through CUDA source code modifications on the NVIDIA CMP 170HX platform. Experimental results validate the effectiveness of this approach, partially restoring the GPU’s computational capabilities in artificial intelligence (AI) tasks. Performance evaluations […]

CUDA • OpenCL

May, 4

LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning

FPGAs are increasingly adopted in datacenter environments for their reconfigurability and energy efficiency. High-Level Synthesis (HLS) tools have eased FPGA programming by raising the abstraction level from RTL to untimed C/C++, yet attaining high performance still demands expert knowledge and iterative manual insertion of optimization pragmas to modify the microarchitecture. To address this challenge, we […]

May, 4

Mìmir: A real-time interactive visualization library for CUDA programs

Real-time visualization of computational simulations running over graphics processing units (GPU) is a valuable feature in modern science and technological research, as it allows researchers to visually assess the quality and correctness of their computational models during the simulation. Due to the high throughput involved in GPU-based simulations, classical visualization approaches such as ones based […]

CUDA

May, 4

Scaling On-Device GPU Inference for Large Generative Models

Driven by the advancements in generative AI, large machine learning models have revolutionized domains such as image processing, audio synthesis, and speech recognition. While server-based deployments remain the locus of peak performance, the imperative for on-device inference, necessitated by privacy and efficiency considerations, persists. Recognizing GPUs as the on-device ML accelerator with the widest reach, […]

CUDA • OpenCL

May, 4

Efficient deep learning inference on end devices

Deep Learning (DL) has become a cornerstone of modern Artificial Intelligence (AI), powering applications across healthcare, computer vision, and autonomous systems. However, executing DL inference on resource-constrained end devices—such as smartphones and IoT hardware—poses challenges due to limited computational resources, energy constraints, and real-time requirements. This thesis addresses the optimization of DL inference on Heterogeneous […]

OpenCL


Recent source codes

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

SYCL Container

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

CFAL-bench

Efficient Graph Embedding at Scale: Optimizing CPU-GPU-SSD Integration

Can Large Language Models Predict Parallel Code Performance?

Most viewed papers (last 30 days)