
Posts

Jul, 14

Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models

In High-Level Synthesis (HLS), converting a regular C/C++ program into its HLS-compatible counterpart (HLS-C) still requires tremendous manual effort. Various program scripts have been introduced to automate this process, but the resulting code usually contains many issues that must be manually repaired by developers. Since Large Language Models (LLMs) have the ability to automate code […]
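The repair loop this points toward can be sketched as generate-and-check: compile the candidate with the HLS front end, feed the diagnostics back to the model, and retry. A minimal Python sketch follows, assuming a hypothetical query_llm helper and a placeholder check command; it illustrates the general shape of LLM-in-the-loop repair, not the paper's specific method.

import subprocess

def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with any chat-completion client."""
    raise NotImplementedError

def repair_for_hls(source: str, check_cmd: list, max_rounds: int = 5) -> str:
    """Iteratively ask an LLM to fix HLS-compatibility issues until the
    synthesis front end accepts the code or the round budget runs out."""
    for _ in range(max_rounds):
        with open("candidate.cpp", "w") as f:
            f.write(source)
        result = subprocess.run(check_cmd + ["candidate.cpp"],
                                capture_output=True, text=True)
        if result.returncode == 0:  # the tool accepted the code
            return source
        # feed the diagnostics back to the model and retry
        source = query_llm(
            "Rewrite this C++ so it is HLS-synthesizable "
            "(no dynamic allocation, recursion, or unbounded loops).\n"
            "Diagnostics:\n" + result.stderr + "\n\nCode:\n" + source)
    return source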
Jul, 7

Supercharging Federated Learning with Flower and NVIDIA FLARE

Several open-source systems, such as Flower and NVIDIA FLARE, have been developed in recent years, each focusing on different aspects of federated learning (FL). Flower is dedicated to implementing a cohesive approach to FL, analytics, and evaluation. Over time, Flower has cultivated extensive strategies and algorithms tailored for FL application development, fostering a vibrant FL […]
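At the heart of both frameworks sits the same aggregation rule, federated averaging. A minimal NumPy sketch of FedAvg is shown below; it is framework-agnostic and only illustrates the weighting-by-sample-count idea, not either library's API.

import numpy as np

def fedavg(client_updates):
    """Federated averaging: weight each client's parameters by its
    local sample count. client_updates is a list of (weights, n_samples),
    where weights is a list of np.ndarray, one per model layer."""
    total = sum(n for _, n in client_updates)
    n_layers = len(client_updates[0][0])
    return [sum(w[i] * (n / total) for w, n in client_updates)
            for i in range(n_layers)]

# two toy clients with a single-layer "model"
a = ([np.array([1.0, 2.0])], 10)
b = ([np.array([3.0, 4.0])], 30)
print(fedavg([a, b]))  # -> [array([2.5, 3.5])]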
Jul, 7

Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services

The increasing adoption of large language models (LLMs) has created a pressing need for an efficient, secure, and private serving infrastructure, one that allows researchers to run open-source or custom fine-tuned LLMs and assures users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs […]
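A Slurm-native design means the inference server is just another batch job, so the scheduler itself owns GPU allocation and tenancy. The sketch below submits such a job from Python; sbatch and the flags shown are standard Slurm, while my_llm_server is a placeholder for any inference runtime and not part of the paper's actual stack.

import subprocess

def submit_llm_job(model_path: str, port: int) -> str:
    """Submit an inference server as a regular Slurm batch job and
    return the job id printed by sbatch --parsable."""
    script = ("#!/bin/bash\n"
              "#SBATCH --gres=gpu:1\n"
              "#SBATCH --time=04:00:00\n"
              f"my_llm_server --model {model_path} --port {port}\n")
    out = subprocess.run(["sbatch", "--parsable"], input=script,
                         capture_output=True, text=True, check=True)
    return out.stdout.strip()

job_id = submit_llm_job("/models/my-model", 8000)
print("serving job:", job_id)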
Jul, 7

Towards Unified Analysis of GPU Consistency

After more than 30 years of research, there is a solid understanding of the consistency guarantees given by CPU systems. Unfortunately, the same is not yet true for GPUs. The growing popularity of general-purpose GPU programming has been a call to action, which industry players like Nvidia and Khronos have answered by formalizing their […]
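What "consistency guarantees" means in practice is usually pinned down with litmus tests. The message-passing test below shows the canonical shape; under a weak memory model without acquire/release fences the outcome r1 == 1, r2 == 0 is allowed. CPython's interpreter lock hides such reorderings, so this Python version is purely illustrative of the pattern.

import threading

data, flag = 0, 0
r1 = r2 = 0

def producer():
    global data, flag
    data = 42   # store the payload
    flag = 1    # store the "ready" signal (needs a release fence on GPUs)

def consumer():
    global r1, r2
    r1 = flag   # load the "ready" signal (needs an acquire fence)
    r2 = data   # load the payload

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
print(r1, r2)  # under sequential consistency, r1 == 1 implies r2 == 42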
Jul, 7

Automatic Code Rewriting for Performance Portability

Rewriting code for cleanliness, API changes, and new programming models is a common yet time-consuming task. This is particularly important for HPC applications that require performance portability, since these applications are usually very long-lived and must run on many architectures, so they need to be written such that they can make good […]
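The mechanical core of such rewriting is a source-to-source transformation over the program's syntax tree. As a small illustration using Python's standard ast module (the function names being rewritten are made up for the example), here is a transformer that migrates calls from one API name to another:

import ast

class RenameCall(ast.NodeTransformer):
    """Rewrite calls to one function name into another, the kind of
    mechanical edit a rewriting tool automates across a codebase."""
    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Call(self, node):
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func = ast.copy_location(ast.Name(self.new, ast.Load()),
                                          node.func)
        return node

tree = RenameCall("old_kernel", "new_kernel").visit(ast.parse("y = old_kernel(x, 4)"))
print(ast.unparse(ast.fix_missing_locations(tree)))  # y = new_kernel(x, 4)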
Jul, 7

PSCToolkit: solving sparse linear systems with a large number of GPUs

In this chapter, we describe the Parallel Sparse Computation Toolkit (PSCToolkit), a suite of libraries for solving large-scale linear algebra problems in an HPC environment. In particular, we focus on the tools provided for the solution of symmetric and positive-definite linear systems using up to 8192 GPUs on the EuroHPC-JU Leonardo supercomputer. PSCToolkit is an […]
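The workhorse for symmetric positive-definite systems is a Krylov iteration such as conjugate gradients, which PSCToolkit pairs with preconditioners and distributes across GPUs. An unpreconditioned, single-node NumPy sketch of the underlying iteration, for illustration only:

import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive-definite Ax = b."""
    x = np.zeros_like(b)
    r = b - A @ x          # initial residual
    p = r.copy()           # initial search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])  # small SPD test matrix
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))          # approx [0.0909, 0.6364]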
Jun, 30

CATBench: A Compiler Autotuning Benchmarking Suite for Black-box Optimization

Bayesian optimization is a powerful method for automating the tuning of compilers. The complex landscape of autotuning presents a myriad of rarely considered structural challenges for black-box optimizers, and the lack of standardized benchmarks has limited the study of Bayesian optimization within the domain. To address this, we present CATBench, a comprehensive benchmarking suite that captures […]
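The black-box setting CATBench targets reduces to: pick a configuration, build, run, measure, repeat. The Python sketch below uses random search as a stand-in for the Bayesian optimizer; the flag space, compiler invocation, and bench.c are illustrative placeholders, not part of the suite.

import random
import subprocess
import time

SPACE = {
    "opt":    ["-O1", "-O2", "-O3"],
    "unroll": ["", "-funroll-loops"],
    "math":   ["", "-ffast-math"],
}

def evaluate(cfg):
    """Build with the chosen flags and time one run of the benchmark."""
    flags = [v for v in cfg.values() if v]
    subprocess.run(["cc", *flags, "-o", "bench", "bench.c"], check=True)
    t0 = time.perf_counter()
    subprocess.run(["./bench"], check=True)
    return time.perf_counter() - t0

best_cfg, best_t = None, float("inf")
for _ in range(20):
    cfg = {k: random.choice(v) for k, v in SPACE.items()}
    t = evaluate(cfg)
    if t < best_t:
        best_cfg, best_t = cfg, t
print(best_cfg, best_t)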
Jun, 30

Adapting database components to heterogeneous environments

Data management has evolved rapidly in recent years, influenced by factors such as the data explosion, the prevalence of machine and deep learning, the slowdown of Moore’s law, and the popularity of hardware accelerators. Data processing systems are trying to adapt to all these trends by building monolithic and highly specialized systems, which are […]
Jun, 30

A Survey of General-purpose Polyhedral Compilers

Since the 1990s, many implementations of polyhedral compilers have been written and distributed, either as source-to-source translators or integrated into broader compilers. This paper provides a survey of the various implementations available as of 2024. We list and describe the most commonly available polyhedral schedulers and compiler implementations. Then, we compare the general-purpose […]
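The archetypal transformation these schedulers derive automatically is loop tiling: reordering an iteration space for locality without changing the computed result. A hand-written Python illustration (real polyhedral tools perform this on C loop nests):

import numpy as np

def matmul_naive(A, B, C, n):
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i, j] += A[i, k] * B[k, j]

def matmul_tiled(A, B, C, n, T=32):
    # same iteration set, executed tile by tile for cache locality
    for ii in range(0, n, T):
        for jj in range(0, n, T):
            for kk in range(0, n, T):
                for i in range(ii, min(ii + T, n)):
                    for j in range(jj, min(jj + T, n)):
                        for k in range(kk, min(kk + T, n)):
                            C[i, j] += A[i, k] * B[k, j]

n = 64
A, B = np.random.rand(n, n), np.random.rand(n, n)
C1, C2 = np.zeros((n, n)), np.zeros((n, n))
matmul_naive(A, B, C1, n)
matmul_tiled(A, B, C2, n)
assert np.allclose(C1, C2)  # different schedule, identical result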
Jun, 30

Composing Distributed Computations Through Task and Kernel Fusion

We introduce Diffuse, a system that dynamically performs task and kernel fusion in distributed, task-based runtime systems. The key component of Diffuse is an intermediate representation of distributed computation that enables the necessary analyses for the fusion of distributed tasks to be performed in a scalable manner. We pair task fusion with a JIT compiler […]
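The payoff of fusion is easiest to see in miniature: two elementwise tasks run separately materialize an intermediate that the fused version never allocates. The toy Python below shows only that idea; Diffuse performs the analogous rewrite on distributed task graphs and GPU kernels.

def unfused(xs):
    tmp = [x * 2.0 for x in xs]      # task 1 materializes an intermediate
    return [t + 1.0 for t in tmp]    # task 2 re-reads it

def fused(xs):
    return [x * 2.0 + 1.0 for x in xs]  # one pass, no intermediate buffer

xs = [0.0, 1.0, 2.0]
assert unfused(xs) == fused(xs)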
Jun, 30

How to Rent GPUs on a Budget

The explosion in Machine Learning (ML) over the past ten years has led to a dramatic increase in demand for GPUs to train ML models. Because it is prohibitively expensive for most users to build and maintain a large GPU cluster, large cloud providers (Microsoft Azure, Amazon AWS, Google Cloud) have seen explosive growth in […]
Jun, 23

Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

Transformers and LLMs have seen rapid adoption across all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes on the order of weeks or months. Thanks to 3D model parallelism (data, pipeline, and tensor-level parallelism), the training can […]
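Hybrid offloading keeps the large optimizer state in host memory and ships only gradients and updates across PCIe each step. A NumPy sketch of the idea (everything here actually lives on the host; to_cpu/to_gpu merely mark where transfers would occur, and this is not the paper's implementation):

import numpy as np

def to_cpu(x): return x  # placeholder for a device-to-host copy
def to_gpu(x): return x  # placeholder for a host-to-device copy

class OffloadedAdam:
    """Adam whose moment vectors stay in CPU memory."""
    def __init__(self, n, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        self.m = np.zeros(n)  # first moment, host-resident
        self.v = np.zeros(n)  # second moment, host-resident
        self.lr, self.b1, self.b2, self.eps, self.t = lr, b1, b2, eps, 0

    def step(self, params_gpu, grads_gpu):
        g = to_cpu(grads_gpu)  # ship gradients to the host
        self.t += 1
        self.m = self.b1 * self.m + (1 - self.b1) * g
        self.v = self.b2 * self.v + (1 - self.b2) * g * g
        m_hat = self.m / (1 - self.b1 ** self.t)
        v_hat = self.v / (1 - self.b2 ** self.t)
        update = self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return params_gpu - to_gpu(update)  # apply the update on the device

opt = OffloadedAdam(3)
params = np.ones(3)
params = opt.step(params, np.array([0.1, -0.2, 0.3]))
print(params)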

* * *


HGPU group © 2010-2024 hgpu.org

All rights belong to the respective authors
