Posts
Aug, 14
In-Situ Techniques on GPU-Accelerated Data-Intensive Applications
The computational power of High-Performance Computing (HPC) systems is constantly increasing; however, their input/output (IO) performance grows relatively slowly, and their storage capacity is also limited. This imbalance presents significant challenges for applications such as Molecular Dynamics (MD) and Computational Fluid Dynamics (CFD), which generate massive amounts of data for further visualization or analysis. At […]
Aug, 4
LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs
As deep neural networks (DNNs) become increasingly large and complicated, pruning techniques are proposed for lower memory footprint and more efficient inference. The most critical kernel to execute pruned sparse DNNs on GPUs is Sparse-dense Matrix Multiplication (SpMM). To maximize the performance of SpMM, despite the high-performance implementation generated from advanced tensor compilers, they often […]
Aug, 4
Springald: GPU-Accelerated Window-Based Aggregates Over Out-of-Order Data Streams
An increasing number of application domains require high-throughput processing to extract insights from massive data streams. The Data Stream Processing (DSP) paradigm provides formal approaches to analyze structured data streams considered as special, unbounded relations. The most widely used class of stateful operators in DSP is the one running sliding-window aggregation, which continuously extracts insights from […]
Aug, 4
Lectures on Parallel Computing
These lecture notes are designed to accompany an imaginary, virtual, undergraduate, one or two semester course on fundamentals of Parallel Computing as well as to serve as background and reference for graduate courses on High-Performance Computing, parallel algorithms and shared-memory multiprocessor programming. They introduce theoretical concepts and tools for expressing, analyzing and judging parallel algorithms […]
Aug, 4
Data-driven Forecasting of Deep Learning Performance on GPUs
Deep learning kernels exhibit predictable memory accesses and compute patterns, making GPUs’ parallel architecture well-suited for their execution. Software and runtime systems for GPUs are optimized to better utilize the stream multiprocessors, on-chip cache, and off-chip high-bandwidth memory. As deep learning models and GPUs evolve, access to newer GPUs is often limited, raising questions about […]
Aug, 4
Scheduling Deep Learning Jobs in Multi-Tenant GPU Clusters via Wise Resource Sharing
Deep learning (DL) has demonstrated significant success across diverse fields, leading to the construction of dedicated GPU accelerators within GPU clusters for high-quality training services. Efficient scheduler designs for such clusters are vital to reduce operational costs and enhance resource utilization. While recent schedulers have shown impressive performance in optimizing DL job performance and cluster […]
Jul, 28
A Comparison of OpenCL, CUDA, and HIP as Compilation Targets for a Functional Array Language
This paper compares OpenCL, CUDA, and HIP as compilation targets for Futhark, a functional array language. We compare the performance of OpenCL versus CUDA, and OpenCL versus HIP, on the code generated by the Futhark compiler on a collection of 48 application benchmarks on two different GPUs. Despite the generated code in most cases being […]
Jul, 28
Data-driven Performance Optimization for Data-intensive Applications
Data-intensive applications have attracted considerable attention from researchers in information sciences and enterprises, as these applications have made evolutionary breakthroughs in scientific fields and are extremely valuable for improving productivity in businesses. Recently, with the rapid growth of newly generated data, researchers have begun to leverage the useful knowledge hidden in such huge volume […]
Jul, 28
Bringing Auto-tuning to HIP: Analysis of Tuning Impact and Difficulty on AMD and Nvidia GPUs
Many studies have focused on developing and improving auto-tuning algorithms for Nvidia Graphics Processing Units (GPUs), but the effectiveness and efficiency of these approaches on AMD devices have hardly been studied. This paper aims to address this gap by introducing an auto-tuner for AMD’s HIP. We do so by extending Kernel Tuner, an open-source Python […]
Jul, 28
Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs
Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing frameworks have […]
Jul, 28
RBMD: A molecular dynamics package enabling to simulate 10 million all-atom particles in a single graphics processing unit
This paper introduces a random-batch molecular dynamics (RBMD) package for fast simulations of particle systems at the nano/micro scale. Different from existing packages, the RBMD uses random batch methods for nonbonded interactions of particle systems. The long-range part of Coulomb interactions is calculated in Fourier space by the random batch Ewald algorithm, which achieves linear […]
Jul, 14
Optimization of Large-Scale Sparse Matrix-Vector Multiplication on Multi-GPU Systems
Sparse matrix-vector multiplication (SpMV) is one of the important kernels of many iterative algorithms for solving sparse linear systems. The limited storage and computational resources of individual GPUs restrict both the scale and speed of SpMV computing in problem-solving. As real-world engineering problems continue to increase in complexity, the imperative for collaborative execution of iterative […]