Posts
Jul, 28
Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs
Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD’s HIP. We do so by extending Kernel Tuner, an open-source Python […]
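Kernel Tuner is the tool this post builds on; below is a minimal sketch of what tuning a trivial HIP kernel with it can look like. It assumes a Kernel Tuner release that ships the HIP backend (the lang="HIP" switch and the chosen block sizes are assumptions here, not details from the post).

```python
# Minimal, illustrative Kernel Tuner sketch (not the paper's experiments).
# Assumption: a Kernel Tuner release with the HIP backend, selected via lang="HIP".
import numpy as np
from kernel_tuner import tune_kernel

kernel_source = """
__global__ void vector_add(float *c, const float *a, const float *b, int n) {
    int i = blockIdx.x * block_size_x + threadIdx.x;  // block_size_x is injected by the tuner
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}
"""

size = 10_000_000
a = np.random.randn(size).astype(np.float32)
b = np.random.randn(size).astype(np.float32)
c = np.zeros_like(a)
n = np.int32(size)

# The tuner compiles and benchmarks one kernel variant per listed block size.
tune_params = {"block_size_x": [64, 128, 256, 512, 1024]}

results, env = tune_kernel("vector_add", kernel_source, size,
                           [c, a, b, n], tune_params, lang="HIP")
```

The same script, with the language switched, is essentially what tuning the identical kernel on an Nvidia GPU looks like, which is what makes a cross-vendor comparison of tuning impact practical.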
Jul, 28
Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs
Large Language Models (LLMs) are increasingly trained with extended context lengths to enable more creative applications. However, long-context training poses great challenges given the constraint of GPU memory: it not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long-context training, existing frameworks have […]
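To get a feel for the scale of the memory pressure, here is a rough back-of-the-envelope sketch; the model shape used (32 layers, hidden size 4096, fp16) is a generic assumption for a 7B-class model, not a figure from the post.

```python
# Back-of-the-envelope only: the model shape below is an assumed, generic 7B-class
# configuration (32 layers, hidden size 4096, fp16), not a figure from the post.
layers, hidden, seq_len, bytes_per_elem = 32, 4096, 1_000_000, 2

# Count just ONE saved hidden-state tensor per layer; real training saves far more.
activation_bytes = layers * seq_len * hidden * bytes_per_elem
print(f"~{activation_bytes / 1e9:.0f} GB of activations in total")    # ~262 GB
print(f"~{activation_bytes / 8 / 1e9:.0f} GB per GPU across 8 GPUs")  # ~33 GB each
```

Even this deliberate lower bound consumes a large slice of a typical 80 GB accelerator before attention intermediates, gradients, optimizer state, or fragmentation are accounted for, which is the constraint the excerpt refers to.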
Jul, 28
RBMD: A molecular dynamics package enabling to simulate 10 million all-atom particles in a single graphics processing unit
This paper introduces a random-batch molecular dynamics (RBMD) package for fast simulations of particle systems at the nano/micro scale. Unlike existing packages, RBMD uses random batch methods for the nonbonded interactions of particle systems. The long-range part of the Coulomb interactions is calculated in Fourier space by the random batch Ewald algorithm, which achieves linear […]
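For background, the Ewald decomposition behind that statement splits the Coulomb kernel into a short-range real-space part and a smooth long-range part summed in Fourier space; a minimal sketch, up to prefactors and unit conventions, with splitting parameter α, box volume V, charges q_j, and positions r_j, is:

$$
\frac{1}{r} \;=\; \frac{\operatorname{erfc}(\alpha r)}{r} \;+\; \frac{\operatorname{erf}(\alpha r)}{r},
\qquad
U_{\mathrm{long}} \;\propto\; \frac{1}{V}\sum_{\mathbf{k}\neq \mathbf{0}} \frac{e^{-|\mathbf{k}|^{2}/4\alpha^{2}}}{|\mathbf{k}|^{2}}\,\Bigl|\sum_{j} q_{j}\, e^{i\mathbf{k}\cdot\mathbf{r}_{j}}\Bigr|^{2}.
$$

As I read it, the random batch Ewald idea is to estimate the sum over $\mathbf{k}$ from a small random batch of modes sampled according to the Gaussian weight, rather than evaluating every mode, which is what brings the cost of the long-range part down to roughly linear in the number of particles.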
Jul, 14
Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems
Sparse matrix-vector multiplication (SpMV) is one of the key kernels in many iterative algorithms for solving sparse linear systems. The limited storage and computational resources of an individual GPU restrict both the scale and the speed of SpMV computations. As real-world engineering problems continue to grow in complexity, the imperative for collaborative execution of iterative […]
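As a concrete, simplified picture of what spreading SpMV over several GPUs involves, here is a small CuPy sketch with a plain row-wise partition. This is a generic baseline for illustration, not the optimization scheme proposed in the paper, and the helper multi_gpu_spmv is hypothetical.

```python
# Generic illustration of row-partitioned SpMV across GPUs with CuPy; a simple
# baseline, not the paper's method. The helper multi_gpu_spmv is hypothetical.
import numpy as np
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cusp

def multi_gpu_spmv(A_host, x_host, n_gpus):
    """y = A @ x with A split into contiguous row blocks, one block per GPU."""
    n_rows = A_host.shape[0]
    bounds = np.linspace(0, n_rows, n_gpus + 1, dtype=int)
    partials = []
    for dev in range(n_gpus):                      # assumes n_gpus devices are present
        with cp.cuda.Device(dev):
            A_block = cusp.csr_matrix(A_host[bounds[dev]:bounds[dev + 1]])
            x_dev = cp.asarray(x_host)             # each GPU keeps the full input vector
            partials.append(A_block @ x_dev)
    # Gather the partial results on the host and stitch them together.
    return np.concatenate([cp.asnumpy(y) for y in partials])

A = sp.random(100_000, 100_000, density=1e-4, format="csr", dtype=np.float64)
x = np.random.rand(100_000)
y = multi_gpu_spmv(A, x, n_gpus=2)
```

Keeping a full copy of x on every GPU is the simplest choice; the communication and load-balancing trade-offs around that replication are exactly where multi-GPU SpMV work tends to focus.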
Jul, 14
Harnessing Integrated CPU-GPU System Memory for HPC: a first look into Grace Hopper
Memory management across discrete CPU and GPU physical memory is traditionally achieved through explicit GPU allocations and data copies, or through unified virtual memory. The Grace Hopper Superchip, for the first time, supports an integrated CPU-GPU system page table, hardware-level addressing of system-allocated memory, and a cache-coherent NVLink-C2C interconnect, bringing an alternative solution for enabling a […]
Jul, 14
Automating Heterogeneous Parallelism in Numerical Differential Equations
Scientific computing is an amalgamation of numerical methods and computer science. Developments in numerical analysis have yielded stable and accurate numerical schemes, while computer algorithms have been successfully adapted to today's standard multicore systems, enabling parallelism. Combining efficient numerical algorithms with efficient parallelism presents a challenge, mainly due to the independent development of these […]
Jul, 14
The Impact of Modern Consumer GPUs on Commonly Used Secure Password Standards
As home network-based devices and servers become more accessible [1], the need for cybersecurity awareness and best practices to secure wireless networks is increasingly important. With the growing affordability of advanced hardware, such as modern gaming PCs equipped with powerful graphics processing units (GPUs), which can facilitate brute-force password cracking on a wider […]
Jul, 14
Automated C/C++ Program Repair for High-Level Synthesis via Large Language Models
In High-Level Synthesis (HLS), converting a regular C/C++ program into its HLS-compatible counterpart (HLS-C) still requires tremendous manual effort. Various program scripts have been introduced to automate this process, but the resulting code usually contains many issues that must be repaired manually by developers. Since Large Language Models (LLMs) have the ability to automate code […]
Jul, 7
Supercharging Federated Learning with Flower and NVIDIA FLARE
Several open-source systems, such as Flower and NVIDIA FLARE, have been developed in recent years, each focusing on different aspects of federated learning (FL). Flower is dedicated to implementing a cohesive approach to FL, analytics, and evaluation. Over time, Flower has cultivated extensive strategies and algorithms tailored for FL application development, fostering a vibrant FL […]
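For a flavor of how a Flower client is structured, here is a minimal, purely illustrative skeleton using the NumPyClient interface; the toy "model", the server address, and the training logic are placeholders, and the exact entry point varies between flwr releases.

```python
# Purely illustrative Flower client skeleton; the toy "model", the address, and
# the training logic are placeholders, not anything from Flower's or FLARE's docs.
import numpy as np
import flwr as fl

class ToyClient(fl.client.NumPyClient):
    def __init__(self):
        self.weights = [np.zeros(10, dtype=np.float32)]  # stand-in for model parameters

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        # Stand-in for a local training step on this client's data.
        self.weights = [w + 0.1 for w in parameters]
        return self.weights, 10, {}        # updated params, num_examples, metrics

    def evaluate(self, parameters, config):
        loss = float(np.abs(parameters[0]).sum())
        return loss, 10, {}                # loss, num_examples, metrics

if __name__ == "__main__":
    # Entry point name differs across flwr releases; start_numpy_client is the classic one.
    fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=ToyClient())
```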
Jul, 7
Chat AI: A Seamless Slurm-Native Solution for HPC-Based Services
The increasing adoption of large language models (LLMs) has created a pressing need for an efficient, secure, and private serving infrastructure that allows researchers to run open-source or custom fine-tuned LLMs and assures users that their data remains private and is not stored without their consent. While high-performance computing (HPC) systems equipped with state-of-the-art GPUs […]
Jul, 7
Towards Unified Analysis of GPU Consistency
After more than 30 years of research, there is a solid understanding of the consistency guarantees given by CPU systems. Unfortunately, the same is not yet true for GPUs. The growing popularity of general-purpose GPU programming has been a call to action, which industry players like Nvidia and Khronos have answered by formalizing their […]
Jul, 7
Automatic Code Rewriting for Performance Portability
Rewriting code for cleanliness, API changes, and new programming models is a common yet time-consuming task. This is particularly important for HPC applications that seek performance portability: such applications are usually very long-lived and need to run on many architectures, so they must be written such that they can make good […]