Posts
Nov, 17
GPUVM: GPU-driven Unified Virtual Memory
Graphics Processing Units (GPUs) leverage massive parallelism and large memory bandwidth to support high-performance computing applications, such as multimedia rendering, crypto-mining, deep learning, and natural language processing. These applications require models and datasets that are growing in size and now challenge the memory capacity of a single GPU, causing substantial performance overheads. To address […]
Nov, 17
Context Parallelism for Scalable Million-Token Inference
We present context parallelism for long-context large language model inference, which achieves near-linear scaling for long-context prefill latency with up to 128 H100 GPUs across 16 nodes. In particular, our method achieves 1M context prefill with the Llama3 405B model in 77s (93% parallelization efficiency, 63% FLOPS utilization) and 128K context prefill in 3.8s. We develop two […]
Nov, 17
Kokkidio: Fast, expressive, portable code, based on Kokkos and Eigen
Kokkidio is a newly developed C++ template library that combines the performance portability framework Kokkos and its strength in utilising GPUs with the expressive syntax and CPU optimisations of the linear algebra library Eigen. Its unified abstractions enable both simple data management and clear, succinct compute code in kernel functors, where a novel […]
Nov, 17
Improving Parallel Program Performance Through DSL-Driven Code Generation with LLM Optimizers
Mapping computations to processors and assigning data to memory are critical for maximizing performance in parallel programming. These mapping decisions are managed through the development of specialized low-level system code, called mappers, crafted by performance engineers. Each mapper is tailored to a specific application and optimized for the underlying machine architecture, a process that requires […]
Nov, 17
The Rewriting of DataRaceBench Benchmark for OpenCL Program Validations
Effective detection of data races in parallel computing environments is essential for ensuring the correctness and performance of multi-threaded applications. This paper addresses data race analysis for OpenCL. For data race research, there is currently a well-established benchmark, DataRaceBench, designed for OpenMP. In our research, we rewrite the OpenMP DataRaceBench benchmark for […]
Nov, 10
LLM-Inference-Bench: Inference Benchmarking of Large Language Models on AI Accelerators
Large Language Models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. However, the computational demands of these complex models pose significant challenges, requiring efficient hardware acceleration. Benchmarking the performance of LLMs across diverse hardware platforms is crucial to understanding their scalability and throughput characteristics. We introduce LLM-Inference-Bench, […]
Nov, 10
On a Simplified Approach to Achieve Parallel Performance and Portability Across CPU and GPU Architectures
This paper presents software advances to easily exploit computer architectures consisting of a multi-core CPU and CPU+GPU to accelerate diverse types of high-performance computing (HPC) applications using a single code implementation. The paper describes and demonstrates the performance of the open-source C++ matrix and array (MATAR) library that uniquely offers: (1) a straightforward syntax for […]
Nov, 10
Over-synchronization in GPU Programs
The performance of GPU (Graphics Processing Unit)-accelerated functions affects a large spectrum of modern software. Efficiently synchronizing across thousands of concurrent threads is critical to the performance of GPU programs. GPU vendors have introduced advanced programming constructs, e.g., scopes, for efficiently synchronizing within a chosen subset of threads. However, programmers must explicitly employ them, where […]
Nov, 10
Profile Util library: A quick and easy way to get MPI, OpenMP and GPU runtime information
We present profile_util, a quick and simple way of profiling codes. It is an MPI-, OpenMP-, and GPU-enabled C++17 library. The GPU interface is compatible with both HIP and CUDA and supports more than one GPU per MPI process. It provides a means of logging MPI, OpenMP and GPU-related information, […]
Nov, 10
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Online LLM inference powers many exciting applications such as intelligent chatbots and autonomous agents. Modern LLM inference engines widely rely on request batching to improve inference throughput, aiming to make it cost-efficient when running on expensive GPU accelerators. However, limited GPU memory has largely constrained the batch size achieved in practice, leaving significant GPU […]
Nov, 3
Data-Driven Dynamic Autotuning: Optimizing Autotuning Overhead with Prior Tuning Data
Modern high performance computing applications often rely on heterogeneous hardware resources to achieve maximum performance. This approach presents obvious benefits, combining the processing power of multiple different processors and allowing them to be more specialized. However, since HPC applications typically need to be programmed in a hardware-aware manner to achieve maximum performance, this places more […]
Nov, 3
LLload: An Easy-to-Use HPC Utilization Tool
The increasing use and cost of high performance computing (HPC) require new easy-to-use tools to enable HPC users and HPC systems engineers to transparently understand the utilization of resources. The MIT Lincoln Laboratory Supercomputing Center (LLSC) has developed a simple command, LLload, to monitor and characterize HPC workloads. LLload plays an important role in identifying […]