high performance computing on graphics processing units: hgpu.org

Posts

Apr, 27

InteropUnityCUDA: A Tool for Interoperability Between Unity and CUDA

Introduction: Unity is a powerful and versatile tool for creating real-time experiments. It includes a built-in compute shader language, a C-like programming language designed for massively parallel General-Purpose GPU (GPGPU) computing. However, as Unity is primarily developed for multi-platform game creation, its compute shader language has several limitations, including the lack of multi-GPU computation support […]

CUDA • OpenGL

Apr, 27

Data-efficient LLM Fine-tuning for Code Generation

Large language models (LLMs) have demonstrated significant potential in code generation tasks. However, there remains a performance gap between open-source and closed-source models. To address this gap, existing approaches typically generate large amounts of synthetic data for fine-tuning, which often leads to inefficient training. In this work, we propose a data selection strategy in order […]

CUDA

Apr, 27

LithOS: An Operating System for Efficient Machine Learning on GPUs

The surging demand for GPUs in datacenters for machine learning (ML) has made efficient GPU utilization crucial. However, meeting the diverse needs of ML models while optimizing resource usage is challenging. To enable transparent, fine-grained GPU management that maximizes utilization and energy efficiency while maintaining strong isolation, an operating system (OS) approach is needed. This […]

CUDA

Apr, 27

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training

The increasing scale of deep learning models has led to the development of various parallelization strategies for distributed training across accelerators. For example, fully sharded approaches like DeepSpeed ZeRO-3 and FSDP partition the parameters of each layer across multiple GPUs and gather them through communication when needed. These methods rely on optimizations such as prefetching, […]

CUDA

Apr, 13

Scalability Evaluation of HPC Multi-GPU Training for ECG-based LLMs

Training large language models requires extensive processing, made possible by many high-performance computing resources. This study compares multi-node and multi-GPU environments for training large language models of electrocardiograms. It provides a detailed mapping of current frameworks for distributed deep learning in multinode and multi-GPU settings, including Horovod from Uber, DeepSpeed from Microsoft, and the built-in […]

CUDA

Apr, 13

Large Language Model Powered C-to-CUDA Code Translation: A Novel Auto-Parallelization Framework

CUDA (Compute Unified Device Architecture) parallel programming significantly improves computational efficiency across multiple fields. However, converting serial C code to CUDA poses challenges for non-experts, and traditional tools struggle with complex patterns. While LLMs (Large Language Models) enable automatic parallelization of complex patterns, they may generate CUDA code with synchronization and memory management issues. There […]

CUDA

Apr, 13

GigaAPI for GPU Parallelization

GigaAPI is a user-space API that simplifies multi-GPU programming, bridging the gap between the capabilities of parallel GPU systems and the ability of developers to harness their full potential. The API offers a comprehensive set of functionalities, including fundamental GPU operations, image processing, and complex GPU tasks, abstracting away the intricacies of low-level CUDA and […]

CUDA

Apr, 13

GPU-centric Communication Schemes for HPC and ML Applications

Compute nodes on modern heterogeneous supercomputing systems comprise CPUs, GPUs, and high-speed network interconnects (NICs). Parallelization is identified as a technique for effectively utilizing these systems to execute scalable simulation and deep learning workloads. The resulting inter-process communication from the distributed execution of these parallel workloads is one of the key factors contributing to its […]

Apr, 13

A Power-Efficient Scheduling Approach in a Cpu-Gpu Computing System by Thread-Based Parallel Programming

Due to their high computing performance, CPU-GPU heterogeneous computing platforms are widely used in mobile devices such as smart phones, tablet computers, and unmanned aerial vehicles. Because a mobile device is often powered by a battery, how to elegantly design a power-efficient real-time computing system becomes an important problem. In this paper, we propose a […]

Mar, 30

Advances in Semantic Patching for HPC-oriented Refactorings with Coccinelle

Currently, the most energy-efficient hardware platforms for floating point-intensive calculations (also known as High Performance Computing, or HPC) are graphical processing units (GPUs). However, porting existing scientific codes to GPUs can be far from trivial. This article summarizes our recent advances in enabling machine-assisted, HPC-oriented refactorings with reference to existing APIs and programming idioms available […]

CUDA

Mar, 30

PyGraph: Robust Compiler Support for CUDA Graphs in PyTorch

CUDA Graphs — a recent hardware feature introduced for NVIDIA GPUs — aim to reduce CPU launch overhead by capturing and launching a series of GPU tasks (kernels) as a DAG. However, deploying CUDA Graphs faces several challenges today due to the static structure of a graph. It also incurs performance overhead due to data […]

CUDA

Mar, 30

Efficient allocation of image recognition and LLM tasks on multi-GPU system

This work is concerned with the evaluation of the performance of parallelization of learning and tuning processes for image classification and large language models. For machine learning models in image recognition, various parallelization methods are developed based on different hardware and software scenarios: simple data parallelism, distributed data parallelism, and distributed processing. A detailed description […]

CUDA

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

* * *

Recent source codes

WiLLM: An Open Wireless LLM Communication System

Vcc: the Vulkan Clang Compiler

hpcbench: A set of benchmarking utilities for biomolecular simulation tools

HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration

chemtrain: Training Molecular Dynamics Potentials in JAX

microSYCL: SYCL micro-benchmarks repository

XaaS containers

CASS: Cuda-Amd aSSembly

Cluster of smartphones for edge computing application using TensorFlow

SYCL Container
